Understanding KVM and QEMU in a Few Diagrams
Figure 1
Figure 2
1. How QEMU and KVM Are Implemented (general interaction flow, explained from Figure 1)
sequenceDiagram
participant GuestApp as Guest Application/OS
participant GuestCPU as Guest vCPU
participant KVM as KVM Kernel Module
participant QEMU as QEMU Process
participant HostCPU as Host Physical CPU
QEMU->>KVM: 1. ioctl(KVM_CREATE_VM)
activate KVM
KVM-->>QEMU: VM File Descriptor
deactivate KVM
QEMU->>KVM: 2. ioctl(KVM_CREATE_VCPU, vm_fd)
activate KVM
KVM-->>QEMU: vCPU File Descriptor
deactivate KVM
QEMU->>QEMU: 3. Setup VM: Memory, Device Emulation, Load Guest Image
QEMU->>KVM: 4. ioctl(vCPU_fd, KVM_RUN)
activate KVM
KVM->>HostCPU: 5. Switch to Guest Mode, Execute Guest Code
activate HostCPU
HostCPU->>GuestCPU: Runs Guest Instructions
activate GuestCPU
GuestCPU->>GuestApp: Normal Guest Operation
GuestApp->>GuestCPU: Attempts Privileged Operation (e.g., I/O, HLT)
GuestCPU->>HostCPU: 6. Triggers VM Exit (Hardware-assisted)
deactivate GuestCPU
HostCPU-->>KVM: 7. Returns to Host Mode, KVM handles exit
deactivate HostCPU
KVM-->>QEMU: 8. KVM_RUN returns with Exit Reason
deactivate KVM
QEMU->>QEMU: 9. Analyze Exit Reason
alt I/O Exit
QEMU->>QEMU: Emulate I/O Device Access
else Other Exit (e.g., HLT, MMIO)
QEMU->>QEMU: Handle specific exit type
end
QEMU->>KVM: 10. Loop: ioctl(vCPU_fd, KVM_RUN) to resume Guest
Core of the interaction flow:
1. QEMU talks to KVM through ioctl system calls on /dev/kvm.
2. QEMU asks KVM to create a virtual machine (KVM_CREATE_VM), which returns an fd (file descriptor) representing that VM.
3. QEMU configures memory for the VM (KVM_SET_USER_MEMORY_REGION).
4. QEMU creates virtual CPUs (vCPUs) for the VM (KVM_CREATE_VCPU); each vCPU gets its own fd as well.
5. QEMU sets up the vCPU state (registers, for example).
6. QEMU starts vCPU execution (KVM_RUN).
7. KVM takes over and lets the vCPU run Guest code directly on the physical CPU (in Guest mode).
8. VM Exit: when Guest code attempts a privileged operation (such as I/O access, special instructions, or interrupts) or an event occurs that requires hypervisor intervention, the hardware triggers a "VM Exit". Guest execution is paused, control returns to KVM, and KVM hands control plus the exit reason back to QEMU (the KVM_RUN call returns).
9. QEMU handles the exit according to the reason KVM reports:
- I/O Exit: if the Guest accessed an I/O port or a memory-mapped I/O (MMIO) region of an emulated device, QEMU emulates that I/O operation, e.g. sending data out through the emulated VirtIO NIC.
- Interrupts/signals: QEMU may need to inject an interrupt into the Guest.
- Other events: handling the HLT instruction, debug events, and so on.
10. After handling the exit event, QEMU calls KVM_RUN again so KVM resumes Guest execution.
This Run -> VM Exit -> Handle -> Run loop is the core of how QEMU+KVM operates.
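Below is a minimal C sketch of this lifecycle against the raw KVM API (ioctls on /dev/kvm, the VM fd, and the vCPU fd). Error handling, register setup, and loading of real Guest code are omitted, and the memory size and guest physical address are arbitrary assumptions; the point is the Run -> VM Exit -> Handle -> Run loop of steps 6-10.

```c
/* Minimal KVM lifecycle sketch: create VM, memory, vCPU, then the run/exit loop.
 * Error handling, register setup and real Guest code are intentionally omitted. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);              /* talk to KVM via /dev/kvm */
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);          /* fd representing the VM */

    /* Back 4 KiB of Guest physical memory with anonymous host memory (assumed size/address). */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0x1000,
        .memory_size = 0x1000, .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);        /* fd representing the vCPU */
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0); /* shared area that carries exit info */

    /* vCPU register setup (KVM_SET_REGS / KVM_SET_SREGS) omitted. */

    for (;;) {                                       /* Run -> VM Exit -> Handle -> Run */
        ioctl(vcpu, KVM_RUN, 0);                     /* switch to Guest mode */
        switch (run->exit_reason) {                  /* why did the Guest exit? */
        case KVM_EXIT_IO:                            /* emulate port I/O here */
            break;
        case KVM_EXIT_MMIO:                          /* emulate the MMIO access here */
            break;
        case KVM_EXIT_HLT:                           /* Guest executed HLT */
            return 0;
        default:
            fprintf(stderr, "unhandled exit reason %d\n", run->exit_reason);
            return 1;
        }
    }
}
```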
2. Notifications (vmexit & vcpu irq) in Detail (explained from Figure 1)
2a. Guest notifies Host (QEMU) - via VM Exit
sequenceDiagram
participant GuestDriver as Guest VirtIO Driver
participant GuestCPU as Guest vCPU Context
participant KVM as KVM Kernel Module
participant QEMU_vCPUThread as QEMU vCPU Thread
participant QEMU_MemSS as QEMU Memory Subsystem
participant QEMU_VirtIO_Backend as QEMU VirtIO Device Backend
GuestDriver->>GuestDriver: 1. Prepares data in vring (available ring)
GuestDriver->>GuestCPU: 2. Writes queue_index to VirtIO Notify Register (MMIO/PIO)
activate GuestCPU
GuestCPU->>KVM: 3. Executes MMIO Write / PIO 'OUT' instruction
deactivate GuestCPU
activate KVM
KVM->>KVM: 4. Traps access, Hardware triggers VM Exit
KVM-->>QEMU_vCPUThread: 5. KVM_RUN returns with exit_reason (KVM_EXIT_MMIO / KVM_EXIT_IO)
deactivate KVM
activate QEMU_vCPUThread
QEMU_vCPUThread->>QEMU_MemSS: 6. If MMIO: address_space_rw; if PIO: kvm_handle_io
activate QEMU_MemSS
QEMU_MemSS->>QEMU_VirtIO_Backend: 7. Routes access to registered MemoryRegionOps or PIO handler
deactivate QEMU_MemSS
activate QEMU_VirtIO_Backend
QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 8. Extracts queue_index from address/data
QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 9. Calls virtio_queue_notify(vdev, queue_idx)
QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 10. Backend processes available descriptors from vring
deactivate QEMU_VirtIO_Backend
QEMU_vCPUThread->>KVM: 11. Loop: Resumes Guest via KVM_RUN
deactivate QEMU_vCPUThread
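A sketch of steps 5-6 above: once KVM_RUN returns, the vCPU thread inspects the shared kvm_run area and forwards the trapped access to device emulation. In QEMU this dispatch is done by address_space_rw() for MMIO and kvm_handle_io() for PIO; the handler below only prints what it would forward.

```c
/* Sketch of decoding a VirtIO notify that arrived as KVM_EXIT_MMIO or KVM_EXIT_IO.
 * Real device emulation (QEMU's address_space_rw / kvm_handle_io) is replaced by printf. */
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>

void handle_exit(struct kvm_run *run)
{
    switch (run->exit_reason) {
    case KVM_EXIT_MMIO:
        if (run->mmio.is_write) {
            /* Guest wrote a device register, e.g. the VirtIO notify register:
             * phys_addr selects the MemoryRegion, data[] holds the written value. */
            printf("MMIO write: addr=0x%llx len=%u first byte=0x%02x\n",
                   (unsigned long long)run->mmio.phys_addr,
                   run->mmio.len, run->mmio.data[0]);
        }
        break;
    case KVM_EXIT_IO:
        if (run->io.direction == KVM_EXIT_IO_OUT) {
            /* PIO 'OUT': the written bytes live inside the kvm_run mapping, at data_offset. */
            const uint8_t *data = (const uint8_t *)run + run->io.data_offset;
            printf("PIO OUT: port=0x%x size=%u first byte=0x%02x\n",
                   run->io.port, run->io.size, data[0]);
        }
        break;
    }
    /* The vCPU thread then loops back into ioctl(vcpu_fd, KVM_RUN, 0). */
}
```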
2b. Host (QEMU) notifies Guest - via vCPU IRQ
sequenceDiagram
participant QEMU_VirtIO_Backend as QEMU VirtIO Device Backend
participant QEMU_KVM_IF as QEMU KVM Interface
participant KVM_Core as KVM Core Logic
participant KVM_vIRQ_Chip as KVM Virtual IRQ Chip (e.g., vAPIC, vGIC)
participant Guest_vCPU as Guest vCPU
participant Guest_Driver_IRQ_Handler as Guest VirtIO Driver IRQ Handler
QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 1. Data processed, results in vring (used ring)
QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 2. Needs to notify Guest: Calls virtio_notify() / virtio_irq(vq)
QEMU_VirtIO_Backend->>QEMU_KVM_IF: 3. Determines IRQ (GSI/MSI vector), calls e.g., kvm_set_irq() / kvm_send_msi()
activate QEMU_KVM_IF
QEMU_KVM_IF->>KVM_Core: 4. ioctl(vm_fd, KVM_IRQ_LINE, &irq_level) or KVM_SIGNAL_MSI
deactivate QEMU_KVM_IF
activate KVM_Core
KVM_Core->>KVM_vIRQ_Chip: 5. Updates state of virtual interrupt controller (e.g., sets bit in IRR)
activate KVM_vIRQ_Chip
KVM_vIRQ_Chip-->>KVM_Core: Interrupt marked as PENDING
deactivate KVM_vIRQ_Chip
deactivate KVM_Core
note over KVM_Core, Guest_vCPU: Later, when KVM_RUN is called or vCPU is about to re-enter Guest mode...
KVM_Core->>Guest_vCPU: 6. Pre-Guest-entry check: are there pending & unmasked IRQs for this vCPU?
activate KVM_Core
KVM_Core->>KVM_vIRQ_Chip: Query vIRQ Chip state
activate KVM_vIRQ_Chip
KVM_vIRQ_Chip-->>KVM_Core: Yes, IRQ X is pending and deliverable
deactivate KVM_vIRQ_Chip
KVM_Core->>Guest_vCPU: 7. Injects IRQ into Guest vCPU state (modifies vCPU context for IRQ entry)
deactivate KVM_Core
activate Guest_vCPU
Guest_vCPU->>Guest_Driver_IRQ_Handler: 8. Guest CPU diverts execution to its IRQ handler for IRQ X
activate Guest_Driver_IRQ_Handler
Guest_Driver_IRQ_Handler->>Guest_Driver_IRQ_Handler: 9. Processes used ring, consumes data
deactivate Guest_Driver_IRQ_Handler
deactivate Guest_vCPU
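The two injection ioctls from steps 3-4 look roughly as follows. Both go to the VM fd; the GSI number, the MSI address/data values, and the helper names raise_gsi()/send_msi() are illustrative assumptions.

```c
/* Sketch of how a userspace device backend asks KVM to raise a Guest interrupt. */
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* "Wire" interrupt routed through the virtual interrupt controller (GSI). */
void raise_gsi(int vm_fd, unsigned gsi)
{
    struct kvm_irq_level irq = { .irq = gsi, .level = 1 };
    ioctl(vm_fd, KVM_IRQ_LINE, &irq);      /* mark the line asserted -> PENDING in the vIRQ chip */
    irq.level = 0;
    ioctl(vm_fd, KVM_IRQ_LINE, &irq);      /* de-assert again for an edge-style pulse */
}

/* Message Signaled Interrupt, as virtio-pci uses with MSI-X. */
void send_msi(int vm_fd, unsigned vector_data)
{
    struct kvm_msi msi = {
        .address_lo = 0xfee00000,          /* x86 LAPIC MSI address window (illustrative) */
        .address_hi = 0,
        .data       = vector_data,
    };
    ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);    /* KVM marks the interrupt pending for the target vCPU */
}
```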
3. Differences between vhost-user and the standard mode for notify/irq
3a. Guest notifies Host/DPDK (Kick)
- Standard QEMU/KVM path (simplified recap)
sequenceDiagram
participant GuestDriver
participant KVM
participant QEMU_Backend
GuestDriver->>KVM: Writes Notify Register (MMIO/PIO)
KVM->>QEMU_Backend: VM Exit (KVM_EXIT_MMIO/IO) -> QEMU handles
QEMU_Backend->>QEMU_Backend: Processes vring
- vhost-user/DPDK path (using ioeventfd)
Setup Phase
sequenceDiagram
participant QEMU
participant KVM
participant DPDK_App as DPDK Application (Vhost-user Backend)
QEMU->>QEMU: 1. Creates an eventfd (kick_fd) for a vring
QEMU->>KVM: 2. ioctl(KVM_IOEVENTFD, guest_mmio_addr, kick_fd, KVM_IOEVENTFD_FLAG_DATAMATCH)
activate KVM
KVM->>KVM: Associates Guest MMIO Notify Addr with kick_fd signal
KVM-->>QEMU: Success
deactivate KVM
QEMU->>DPDK_App: 3. VHOST_USER_SET_VRING_KICK (sends kick_fd via Unix socket)
activate DPDK_App
DPDK_App->>DPDK_App: 4. Stores kick_fd, Adds it to epoll/monitoring set
DPDK_App-->>QEMU: Ack
deactivate DPDK_App
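On the QEMU side, setup steps 1-2 boil down to creating an eventfd and registering it with KVM_IOEVENTFD. In the sketch below, notify_addr, the 2-byte access length, the datamatch on the queue index, and register_kick_fd() itself are illustrative assumptions.

```c
/* Sketch: register an ioeventfd so a Guest write to the VirtIO notify address
 * signals kick_fd inside the kernel instead of exiting to QEMU userspace. */
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

int register_kick_fd(int vm_fd, uint64_t notify_addr, uint16_t queue_index)
{
    int kick_fd = eventfd(0, EFD_NONBLOCK);          /* 1. eventfd for this vring */

    struct kvm_ioeventfd ioev = {
        .addr      = notify_addr,                    /* Guest physical MMIO address to trap */
        .len       = 2,                              /* the modern notify register is a 16-bit write */
        .fd        = kick_fd,
        .datamatch = queue_index,                    /* only signal when this queue index is written */
        .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &ioev);              /* 2. KVM now signals kick_fd on a matching write */

    /* 3. kick_fd is then handed to the vhost-user backend via VHOST_USER_SET_VRING_KICK. */
    return kick_fd;
}
```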
Runtime Phase
sequenceDiagram
participant GuestDriver
participant KVM
participant kick_fd as Eventfd (Kernel object)
participant DPDK_App as DPDK Application
GuestDriver->>KVM: 1. Writes queue_idx to VirtIO Notify MMIO Address
activate KVM
KVM->>kick_fd: 2. Matches MMIO write to ioeventfd rule, Signals kick_fd (Kernel signals eventfd)
note right of KVM: No VM Exit to QEMU for this!
deactivate KVM
activate kick_fd
kick_fd-->>DPDK_App: 3. kick_fd becomes readable, epoll_wait() in DPDK_App returns
deactivate kick_fd
activate DPDK_App
DPDK_App->>DPDK_App: 4. Reads from kick_fd (to clear signal)
DPDK_App->>DPDK_App: 5. Identifies corresponding vring, Processes vring directly (shared memory)
deactivate DPDK_App
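On the backend side, runtime steps 3-5 are just an epoll loop over kick_fd. In this sketch, process_vring() is a hypothetical placeholder for the real descriptor processing (DPDK's vhost library implements this internally).

```c
/* Sketch of the backend event loop: wait on kick_fd, drain it, process the shared-memory vring. */
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

void process_vring(int queue_index);   /* hypothetical: walk the available ring and consume descriptors */

void backend_poll_loop(int epoll_fd)
{
    struct epoll_event ev;
    for (;;) {
        if (epoll_wait(epoll_fd, &ev, 1, -1) <= 0)    /* 3. wakes up when kick_fd is signaled */
            continue;
        uint64_t count;
        read(ev.data.fd, &count, sizeof(count));      /* 4. read the eventfd to clear the signal */
        process_vring(0);                             /* 5. vring looked up from the fd; processed directly */
    }
}
```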
3b. Host/DPDK notifies Guest (IRQ/Call)
- Standard QEMU/KVM path (simplified recap)
sequenceDiagram
participant QEMU_Backend
participant KVM
participant GuestOS
QEMU_Backend->>KVM: ioctl(KVM_IRQ_LINE / KVM_SIGNAL_MSI)
KVM->>KVM: Marks IRQ pending in vIRQ Chip
KVM->>GuestOS: Injects IRQ on KVM_RUN
- vhost-user/DPDK path (using irqfd)
Setup Phase
sequenceDiagram
participant QEMU
participant KVM
participant DPDK_App as DPDK Application
QEMU->>QEMU: 1. Creates an eventfd (call_fd) for a vring
QEMU->>KVM: 2. ioctl(KVM_IRQFD, call_fd, guest_irq_line_gsi_or_msi_vector)
activate KVM
KVM->>KVM: Associates call_fd signal with Guest IRQ injection for the specified line
KVM-->>QEMU: Success
deactivate KVM
QEMU->>DPDK_App: 3. VHOST_USER_SET_VRING_CALL (sends call_fd via Unix socket)
activate DPDK_App
DPDK_App->>DPDK_App: 4. Stores call_fd for the vring
DPDK_App-->>QEMU: Ack
deactivate DPDK_App
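Setup steps 1-2 map onto KVM_IRQFD roughly as in the sketch below; the GSI value and register_call_fd() are illustrative assumptions (with MSI-X, QEMU first installs an IRQ routing entry and passes the resulting virtual GSI here).

```c
/* Sketch: bind call_fd to a Guest interrupt line so that writing the eventfd injects the IRQ. */
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

int register_call_fd(int vm_fd, uint32_t gsi)
{
    int call_fd = eventfd(0, EFD_NONBLOCK);    /* 1. eventfd for this vring's interrupt */

    struct kvm_irqfd irqfd = {
        .fd  = call_fd,
        .gsi = gsi,                            /* Guest interrupt line / routed MSI vector */
    };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);           /* 2. KVM injects IRQ 'gsi' whenever call_fd is written */

    /* 3. call_fd is then sent to the backend via VHOST_USER_SET_VRING_CALL. */
    return call_fd;
}
```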
Runtime Phase
sequenceDiagram
participant DPDK_App as DPDK Application
participant call_fd as Eventfd (Kernel object)
participant KVM
participant KVM_vIRQ_Chip as KVM Virtual IRQ Chip
participant GuestOS
DPDK_App->>DPDK_App: 1. Finishes processing, updates used vring, needs to IRQ Guest
DPDK_App->>call_fd: 2. eventfd_write(call_fd, 1)
activate call_fd
call_fd-->>KVM: 3. Kernel's eventfd/irqfd mechanism is triggered by write
deactivate call_fd
activate KVM
KVM->>KVM_vIRQ_Chip: 4. (via irqfd_wakeup workqueue) Calls internal kvm_set_irq for the associated GSI/vector. Interrupt marked PENDING.
note right of KVM: No QEMU ioctl call needed from userspace here!
activate KVM_vIRQ_Chip
KVM_vIRQ_Chip-->>KVM: IRQ pending
deactivate KVM_vIRQ_Chip
KVM->>GuestOS: 5. On next KVM_RUN (or if already running & interruptible), KVM injects the pending IRQ
deactivate KVM
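At runtime the backend side is a single eventfd write, as in this small sketch; everything after it happens inside the kernel.

```c
/* Sketch: signal call_fd so the kernel's irqfd path injects the interrupt into the Guest. */
#include <sys/eventfd.h>

void notify_guest(int call_fd)
{
    eventfd_write(call_fd, 1);   /* irqfd marks the GSI pending; KVM injects it on the next Guest entry */
}
```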
4. The role of notify_ops (MMIO) in QEMU virtio-pci and when it is invoked
- Registration phase (virtio_pci_modern_regions_init)
classDiagram
class VirtIOPCIProxy {
  notify.mr
  Creates and initializes the device
}
class memory_region_init_io {
  Function
  Creates the memory region
}
class notify_ops {
  read: virtio_pci_notify_read
  write: virtio_pci_notify_write
}
class MemoryRegion {
  Used for MMIO notifications
}
class GuestPhysicalAddressSpace {
  Where the MMIO region is mapped
}
VirtIOPCIProxy --> memory_region_init_io : calls
notify_ops --> memory_region_init_io : provides callbacks
memory_region_init_io --> MemoryRegion : creates
MemoryRegion ..> GuestPhysicalAddressSpace : mapped into
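The sketch below paraphrases that registration pattern with QEMU's memory API (memory_region_init_io() plus a MemoryRegionOps table). It is not the literal virtio_pci_modern_regions_init() code: NOTIFY_OFF_MULTIPLIER, the region size, and the opaque pointer choice are simplifying assumptions, and it would only compile inside the QEMU source tree.

```c
/* Simplified, QEMU-tree-only sketch of registering the notify MMIO region.
 * The real virtio_pci_modern_regions_init() differs in details. */
#include "qemu/osdep.h"
#include "hw/virtio/virtio.h"
#include "hw/virtio/virtio-pci.h"

#define NOTIFY_OFF_MULTIPLIER 4   /* bytes of notify space per queue (assumption) */

static uint64_t virtio_pci_notify_read(void *opaque, hwaddr addr, unsigned size)
{
    return 0;                     /* the notify region is effectively write-only */
}

static void virtio_pci_notify_write(void *opaque, hwaddr addr,
                                    uint64_t val, unsigned size)
{
    VirtIODevice *vdev = opaque;
    unsigned queue_idx = addr / NOTIFY_OFF_MULTIPLIER;  /* offset -> which vring was kicked */
    virtio_queue_notify(vdev, queue_idx);               /* hand the queue to the backend */
}

static const MemoryRegionOps notify_ops = {
    .read  = virtio_pci_notify_read,
    .write = virtio_pci_notify_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* Creates proxy->notify.mr with notify_ops attached; the region is later
 * mapped into the Guest physical address space through a PCI BAR. */
static void init_notify_region(VirtIOPCIProxy *proxy)
{
    memory_region_init_io(&proxy->notify.mr, OBJECT(proxy), &notify_ops,
                          virtio_bus_get_device(&proxy->bus),
                          "virtio-pci-notify", 0x1000);
}
```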
- Invocation scenario A: notify_ops.write is called (standard path, no ioeventfd bypass)
sequenceDiagram
participant GuestDriver
participant KVM
participant QEMU_vCPUThread as QEMU vCPU Thread
participant QEMU_MemSS as QEMU Memory Subsystem
participant NotifyMR_Ops as "proxy->notify.mr.ops (notify_ops)"
participant virtio_pci_notify_write as "virtio_pci_notify_write()"
participant virtio_queue_notify as "virtio_queue_notify()"
GuestDriver->>KVM: 1. Writes to VirtIO Notify MMIO Address
activate KVM
KVM-->>QEMU_vCPUThread: 2. KVM_EXIT_MMIO (phys_addr, data, len, is_write=true)
deactivate KVM
activate QEMU_vCPUThread
QEMU_vCPUThread->>QEMU_MemSS: 3. address_space_rw(phys_addr, data, ...)
activate QEMU_MemSS
QEMU_MemSS->>NotifyMR_Ops: 4. Locates 'proxy->notify.mr', dispatches to its '.write' op
deactivate QEMU_MemSS
activate NotifyMR_Ops
NotifyMR_Ops->>virtio_pci_notify_write: 5. Calls ops->write(opaque, offset_in_region, data_val, size)
deactivate NotifyMR_Ops
activate virtio_pci_notify_write
virtio_pci_notify_write->>virtio_pci_notify_write: 6. Calculates queue_index based on offset
virtio_pci_notify_write->>virtio_queue_notify: 7. Calls virtio_queue_notify(vdev, queue_idx)
activate virtio_queue_notify
virtio_queue_notify->>virtio_queue_notify: 8. Triggers backend processing for the queue
deactivate virtio_queue_notify
deactivate virtio_pci_notify_write
deactivate QEMU_vCPUThread
- Invocation scenario B: notify_ops.write is bypassed (vhost-user uses ioeventfd)
sequenceDiagram
participant GuestDriver
participant KVM
participant ioeventfd
participant DPDK_App
Note over GuestDriver, KVM: QEMU has registered ioeventfd
GuestDriver->>KVM: Writes to VirtIO Notify MMIO
KVM->>ioeventfd: MMIO write signals eventfd
Note right of KVM: No KVM_EXIT_MMIO to QEMU
ioeventfd-->>DPDK_App: Eventfd notifies DPDK
DPDK_App->>DPDK_App: Process notification directly
Note over DPDK_App: No QEMU involvement