Xen is a classic Type-1 bare-metal hypervisor that is widely deployed in cloud computing and data center scenarios. Yet when highly concurrent network services run on it, the performance of the TCP accept() system call can fall off a cliff, sometimes dropping more than 50% below that of a physical machine. This loss is not simple resource contention; it spans the whole path from the virtualized network stack, through the kernel's accept() code path, down to CPU scheduling. The sections below trace it layer by layer.
Inherent limitations of the virtualized network model
Xen's Split Driver Model divides network traffic processing into a Frontend and a Backend. The virtual NIC driver in DomainU (the guest) acts as the frontend and communicates with the Domain0 backend driver through an Event Channel and a Shared Memory Ring. When the physical NIC receives a packet, the Domain0 backend driver copies it into the shared ring and triggers an event to notify DomainU. This pipeline is particularly sensitive in the accept() scenario: the SYN packet of every new connection has to traverse the full cross-domain transport chain.
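For orientation, the following is a simplified sketch of the frontend half of that flow, modeled on the request-ring pattern from Xen's public ring.h macros; tx_ring, ref, offset, len and tx_irq are placeholders of this sketch rather than the actual xen-netfront variables:

/* Frontend side, simplified: queue one TX request on the shared ring and
 * notify the backend in Domain0 through the event channel. */
struct xen_netif_tx_request *req;
int notify;

req = RING_GET_REQUEST(&tx_ring, tx_ring.req_prod_pvt++);
req->gref   = ref;              /* grant reference covering the packet page */
req->offset = offset;
req->size   = len;
req->flags  = 0;

RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&tx_ring, notify);
if (notify)
        notify_remote_via_irq(tx_irq);  /* raises the event channel into Domain0 */

Every such notification is what later shows up as a virtual interrupt that the Hypervisor must route to the other domain.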
According to critical-path analysis of the kernel source, when accept() is executed in DomainU and no established connection is waiting in the accept queue, inet_csk_wait_for_connect() is entered and the caller joins the wait queue:
error = inet_csk_wait_for_connect(sk, timeo);
The scheduler may put the process to sleep at this point, and wake-up depends on the arrival of network events. In the Xen environment, event delivery has to traverse the Hypervisor layer, which significantly increases interrupt latency. A flame graph captured with the perf tool shows that more than 30% of CPU time is consumed in backend processing threads such as xen_netbk_kthread.
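This kind of measurement can be reproduced with a minimal user-space micro-benchmark: time each accept() call under a steady flood of short-lived connections, run it once in DomainU and once on bare metal, and compare the distributions. The sketch below uses only standard POSIX sockets; port 8080 and the iteration count are arbitrary choices of this example:

/* accept_bench.c: print the latency of each accept() call in nanoseconds.
 * Build: gcc -O2 -o accept_bench accept_bench.c
 * Drive it from another machine with a steady stream of short-lived connects. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in addr;
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1, i;

        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);            /* arbitrary test port */

        if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(lfd, 1024) < 0) {
                perror("bind/listen");
                return 1;
        }

        for (i = 0; i < 100000; i++) {
                struct timespec t0, t1;
                int cfd;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                cfd = accept(lfd, NULL, NULL);  /* may sleep in inet_csk_wait_for_connect() */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                if (cfd < 0)
                        break;
                close(cfd);

                printf("%ld\n", (t1.tv_sec - t0.tv_sec) * 1000000000L +
                                (t1.tv_nsec - t0.tv_nsec));
        }
        close(lfd);
        return 0;
}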
The hidden tax of memory isolation and data copying
Xen's isolation model forbids Domain0 and DomainU from sharing memory directly. When a TCP handshake packet arrives, the backend driver must map the data pages into the guest's address space through the Grant Table mechanism. For accept()-heavy workloads dominated by small packets, this per-page authorization overhead is dramatically magnified. Experimental data shows that a single DomainU has to perform more than 500,000 Grant Table operations per second to keep a 10Gb NIC at full load:
grant_ref_t ref = gnttab_claim_grant_reference(&gref_head);
gnttab_grant_foreign_access_ref(ref, domid, mfn, 1 /* read-only */);
These operations not only consume CPU cycles but also cause frequent TLB flushes. When multiple vCPUs compete for the Grant Table lock, spinlock contention degrades performance further. Partitioning the Grant Table by modifying the Xen source code raised accept() throughput by 22%, but it also exposed scalability limits at the architectural level.
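The grant is also only half of the cycle: once the backend has consumed the page, the frontend must revoke access and recycle the reference. A simplified sketch of the reverse half, using the classic Linux grant-table helpers (exact signatures have shifted across kernel versions, so treat it as illustrative):

/* Revoke the backend's access and return the grant reference to the pool. */
if (gnttab_end_foreign_access_ref(ref, 1 /* read-only */))
        gnttab_release_grant_reference(&gref_head, ref);

Each small packet therefore pays for both a grant and a revoke, which is exactly where the per-page cost piles up under a connection flood.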
Interrupt storms and vCPU scheduling contention
Physical-machine network drivers rely on the NAPI mechanism to coalesce interrupts, but Xen's virtual interrupts must be dispatched by the Hypervisor. As new connection requests flood in, every packet generates a virtual interrupt. In a test running eight DomainU instances on a 64-core physical machine, xl dmesg showed more than 800,000 event-channel send operations (evtchn: send event to remote) per second:
void notify_remote_via_evtchn(evtchn_port_t port)
{
        struct evtchn_send send;

        send.port = port;
        HYPERVISOR_event_channel_op(EVTCHNOP_send, &send);
}
These interrupts force the vCPU into frequent VM exits, and under its default configuration Xen's Credit Scheduler cannot respond in time. When the vCPU handling network I/O is descheduled because its credits are exhausted, the backlog in the interrupt queue creates a vicious circle. Switching to the real-time scheduler (RTDS) alleviates the problem, but it breaks the fair allocation of CPU resources.
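For contrast, the NAPI pattern that bare-metal drivers use to coalesce interrupts looks roughly like the skeleton below; mydev_priv, mydev_process_rx and the IRQ mask/unmask helpers are placeholder names, and exact NAPI signatures vary between kernel versions:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mydev_priv {
        struct napi_struct napi;
        /* device registers, RX ring, ... */
};

/* Hard IRQ handler: mask the device interrupt and defer all packet processing
 * to the softirq poll loop, so a burst of packets costs one interrupt rather
 * than one per packet. */
static irqreturn_t mydev_interrupt(int irq, void *data)
{
        struct mydev_priv *priv = data;

        mydev_disable_rx_irq(priv);             /* placeholder register write */
        napi_schedule(&priv->napi);
        return IRQ_HANDLED;
}

static int mydev_poll(struct napi_struct *napi, int budget)
{
        struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
        int done = mydev_process_rx(priv, budget);      /* placeholder: drain RX ring */

        if (done < budget && napi_complete_done(napi, done))
                mydev_enable_rx_irq(priv);              /* placeholder register write */
        return done;
}

On Xen, the equivalent of "one interrupt per burst" is undermined because every cross-domain notification still has to pass through the Hypervisor's event-channel dispatch.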
Impedance mismatch between kernel stack and virtual device
The Linux kernel's TCP implementation is optimized for physical devices, assuming the network device has stable DMA capability and low-latency interrupt response. Xen's virtual NIC (xen-netfront) is in reality an abstraction over memory queues, and the difference is especially visible during the three-way handshake. When accept() is woken up on the listening socket, the following key step has to be performed:
struct sock *child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, dst);
This function chain involves memory-intensive operations such as SKB allocation and protocol-control-block initialization. In the Xen environment, the SKB's DMA mapping needs extra translation because the frontend driver's memory pool is isolated from the backend. With CONFIG_XEN_DEBUG_FS enabled for tracing, the latency of the gnttab_map_refs() call is observed to account for 37% of total handshake time.
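The gnttab_map_refs() hot spot is the backend's batched mapping of guest-granted pages before the payload can be touched. A simplified sketch of that pattern, loosely modeled on xen-netback, where map_ops, pages, grefs, count and otherend_id are placeholders:

/* Backend side, simplified: build one map operation per granted page, then
 * map the whole batch with a single call into the grant-table machinery. */
unsigned int i;
int ret;

for (i = 0; i < count; i++)
        gnttab_set_map_op(&map_ops[i],
                          (unsigned long)page_address(pages[i]),
                          GNTMAP_host_map | GNTMAP_readonly,
                          grefs[i], otherend_id);

ret = gnttab_map_refs(map_ops, NULL, pages, count);
if (ret)
        pr_warn("grant mapping failed: %d\n", ret);     /* packets cannot be delivered */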
Engineering practices for breaking through the performance dilemma
In response to these problems, the industry has developed optimizations on several fronts. At the driver level, using PVH mode instead of pure PV reduces the number of traps into the Hypervisor. The scheduler parameters can be adjusted with xl sched-credit:
xl sched-credit -s -t 1
Reducing the credit time slice from the default 30ms to 1ms cut interrupt response latency by 40%. On the kernel-stack side, raising the net.core.somaxconn parameter is only a stopgap; further optimization requires reworking the locking of the TCP accept() queue. TCP Fast Open (TFO), contributed to the mainline kernel by Google, removes one round trip of handshake latency in the Xen environment, but it demands strict security policies.
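On the server side, TFO needs the net.ipv4.tcp_fastopen sysctl to include the server bit (2, or 3 for both client and server) plus a per-socket option set before listen(). A minimal sketch, where make_tfo_listener, the queue length of 16 and the port handling are arbitrary choices of this example:

/* Enable TCP Fast Open on a listening socket so that data carried in the SYN
 * can be delivered without waiting for the final ACK of the handshake. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int make_tfo_listener(uint16_t port)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int qlen = 16;                  /* max queued TFO requests (arbitrary) */
        struct sockaddr_in addr;

        if (fd < 0)
                return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);

        setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 1024) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}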
The ultimate solution is to bypass the kernel stack with DPDK, or to use SR-IOV pass-through. Assign a Virtual Function (VF) of the physical NIC to DomainU:
xl pci-assignable-add 0000:0b:00.0
xl pci-attach vm1 0000:0b:00.0
With a user-space driver and TCP stack running on the passed-through VF, the accept() path is completed entirely in user mode, and measured throughput reaches 7 times that of the Xen virtual network. However, this approach sacrifices resource elasticity, the core value of virtualization, so performance and flexibility have to be traded off carefully.
Xen's accept() performance problem is, at its root, a trade-off between security isolation and computing efficiency. Understanding this bottleneck is not just about optimizing one system call; it offers insight into the deeper logic of how virtualization technology evolves.