An Illustrated Guide to How notify Works

Understanding KVM and QEMU through a few diagrams

Figure 1

Figure 2

1. How QEMU and KVM work together (general interaction flow, based on the first figure)

sequenceDiagram
    participant GuestApp as Guest Application/OS
    participant GuestCPU as Guest vCPU
    participant KVM as KVM Kernel Module
    participant QEMU as QEMU Process
    participant HostCPU as Host Physical CPU

    QEMU->>KVM: 1. ioctl(KVM_CREATE_VM)
    activate KVM
    KVM-->>QEMU: VM File Descriptor
    deactivate KVM

    QEMU->>KVM: 2. ioctl(KVM_CREATE_VCPU, vm_fd)
    activate KVM
    KVM-->>QEMU: vCPU File Descriptor
    deactivate KVM

    QEMU->>QEMU: 3. Setup VM: Memory, Device Emulation, Load Guest Image

    QEMU->>KVM: 4. ioctl(vCPU_fd, KVM_RUN)
    activate KVM

    KVM->>HostCPU: 5. Switch to Guest Mode, Execute Guest Code
    activate HostCPU

    HostCPU->>GuestCPU: Runs Guest Instructions
    activate GuestCPU

    GuestCPU->>GuestApp: Normal Guest Operation
    GuestApp->>GuestCPU: Attempts Privileged Operation (e.g., I/O, HLT)

    GuestCPU->>HostCPU: 6. Triggers VM Exit (Hardware-assisted)
    deactivate GuestCPU

    HostCPU-->>KVM: 7. Returns to Host Mode, KVM handles exit
    deactivate HostCPU

    KVM-->>QEMU: 8. KVM_RUN returns with Exit Reason
    deactivate KVM

    QEMU->>QEMU: 9. Analyze Exit Reason

    alt I/O Exit
        QEMU->>QEMU: Emulate I/O Device Access
    else Other Exit (e.g., HLT, MMIO)
        QEMU->>QEMU: Handle specific exit type
    end

    QEMU->>KVM: 10. Loop: ioctl(vCPU_fd, KVM_RUN) to resume Guest

Core interaction flow:
1. QEMU talks to KVM through ioctl system calls on /dev/kvm.
2. QEMU asks KVM to create a virtual machine (KVM_CREATE_VM), which returns an fd (file descriptor) representing that VM.
3. QEMU configures memory for the VM (KVM_SET_USER_MEMORY_REGION).
4. QEMU creates virtual CPUs (vCPUs) for the VM (KVM_CREATE_VCPU); each vCPU also gets its own fd.
5. QEMU sets up the vCPU state (e.g., registers).
6. QEMU starts vCPU execution (KVM_RUN).
7. KVM takes over and lets the vCPU run Guest code directly on the physical CPU (in Guest mode).
8. VM Exit: when the Guest attempts a privileged operation (such as I/O access, certain special instructions, or interrupts), or an event occurs that requires Hypervisor intervention, the hardware triggers a "VM Exit", pauses the Guest, and hands control back to KVM; KVM then returns control and the exit reason to QEMU (the KVM_RUN call returns).
9. QEMU handles the exit based on the reason KVM reports:

  • I/O Exit: if the Guest accessed an I/O port or memory-mapped I/O (MMIO) region of an emulated device, QEMU emulates that I/O operation, e.g., sending data through an emulated VirtIO NIC.
  • Interrupts/signals: QEMU may need to inject an interrupt into the Guest.
  • Other events: handling the HLT instruction, debug events, etc.

10. After handling the exit, QEMU calls KVM_RUN again so KVM resumes Guest execution.

This Run -> VM Exit -> Handle -> Run loop is the heart of how QEMU+KVM operates.

2. Notifications (vmexit & vcpu irq) in detail (based on the first figure)

2a. Guest notifies the Host (QEMU) via VM Exit

sequenceDiagram
    participant GuestDriver as Guest VirtIO Driver
    participant GuestCPU as Guest vCPU Context
    participant KVM as KVM Kernel Module
    participant QEMU_vCPUThread as QEMU vCPU Thread
    participant QEMU_MemSS as QEMU Memory Subsystem
    participant QEMU_VirtIO_Backend as QEMU VirtIO Device Backend

    GuestDriver->>GuestDriver: 1. Prepares data in vring (available ring)
    GuestDriver->>GuestCPU: 2. Writes queue_index to VirtIO Notify Register (MMIO/PIO)

    activate GuestCPU
    GuestCPU->>KVM: 3. Executes MMIO Write / PIO 'OUT' instruction
    deactivate GuestCPU

    activate KVM
    KVM->>KVM: 4. Hardware triggers VM Exit; KVM traps the access
    KVM-->>QEMU_vCPUThread: 5. KVM_RUN returns with exit_reason (KVM_EXIT_MMIO / KVM_EXIT_IO)
    deactivate KVM

    activate QEMU_vCPUThread
    QEMU_vCPUThread->>QEMU_MemSS: 6. If MMIO: address_space_rw; if PIO: kvm_handle_io

    activate QEMU_MemSS
    QEMU_MemSS->>QEMU_VirtIO_Backend: 7. Routes access to registered MemoryRegionOps or PIO handler
    deactivate QEMU_MemSS

    activate QEMU_VirtIO_Backend
    QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 8. Extracts queue_index from address/data
    QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 9. Calls virtio_queue_notify(vdev, queue_idx)
    QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 10. Backend processes available descriptors from vring
    deactivate QEMU_VirtIO_Backend

    QEMU_vCPUThread->>KVM: 11. Loop: Resumes Guest via KVM_RUN
    deactivate QEMU_vCPUThread

2b. Host (QEMU) notifies the Guest via vCPU IRQ

sequenceDiagram
    participant QEMU_VirtIO_Backend as QEMU VirtIO Device Backend
    participant QEMU_KVM_IF as QEMU KVM Interface
    participant KVM_Core as KVM Core Logic
    participant KVM_vIRQ_Chip as KVM Virtual IRQ Chip (e.g., vAPIC, vGIC)
    participant Guest_vCPU as Guest vCPU
    participant Guest_Driver_IRQ_Handler as Guest VirtIO Driver IRQ Handler

    QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 1. Data processed, results in vring (used ring)
    QEMU_VirtIO_Backend->>QEMU_VirtIO_Backend: 2. Needs to notify Guest: Calls virtio_notify() / virtio_irq(vq)
    QEMU_VirtIO_Backend->>QEMU_KVM_IF: 3. Determines IRQ (GSI/MSI vector), calls e.g., kvm_set_irq() / kvm_send_msi()

    activate QEMU_KVM_IF
    QEMU_KVM_IF->>KVM_Core: 4. ioctl(vm_fd, KVM_IRQ_LINE, &irq_level) or KVM_SIGNAL_MSI
    deactivate QEMU_KVM_IF

    activate KVM_Core
    KVM_Core->>KVM_vIRQ_Chip: 5. Updates state of virtual interrupt controller (e.g., sets bit in IRR)
    activate KVM_vIRQ_Chip
    KVM_vIRQ_Chip-->>KVM_Core: Interrupt marked as PENDING
    deactivate KVM_vIRQ_Chip
    deactivate KVM_Core

    note over KVM_Core, Guest_vCPU: Later, when KVM_RUN is called or vCPU is about to re-enter Guest mode...
    KVM_Core->>Guest_vCPU: 6. Pre-Guest-entry check: are there pending & unmasked IRQs for this vCPU?

    activate KVM_Core
    KVM_Core->>KVM_vIRQ_Chip: Query vIRQ Chip state
    activate KVM_vIRQ_Chip
    KVM_vIRQ_Chip-->>KVM_Core: Yes, IRQ X is pending and deliverable
    deactivate KVM_vIRQ_Chip
    KVM_Core->>Guest_vCPU: 7. Injects IRQ into Guest vCPU state (modifies vCPU context for IRQ entry)
    deactivate KVM_Core

    activate Guest_vCPU
    Guest_vCPU->>Guest_Driver_IRQ_Handler: 8. Guest CPU diverts execution to its IRQ handler for IRQ X
    activate Guest_Driver_IRQ_Handler
    Guest_Driver_IRQ_Handler->>Guest_Driver_IRQ_Handler: 9. Processes used ring, consumes data
    deactivate Guest_Driver_IRQ_Handler
    deactivate Guest_vCPU

3. vhost-user vs. the standard path for notify/irq

3a. Guest notifies the Host/DPDK (Kick)

  • Standard QEMU/KVM path (simplified recap)
    sequenceDiagram
    participant GuestDriver
    participant KVM
    participant QEMU_Backend
    
    GuestDriver->>KVM: Writes Notify Register (MMIO/PIO)
    
    KVM->>QEMU_Backend: VM Exit (KVM_EXIT_MMIO/IO) -> QEMU handles
    
    QEMU_Backend->>QEMU_Backend: Processes vring
  • vhost-user/DPDK path (using ioeventfd)
    Setup Phase
sequenceDiagram
    participant QEMU
    participant KVM
    participant DPDK_App as DPDK Application (Vhost-user Backend)

    QEMU->>QEMU: 1. Creates an eventfd (kick_fd) for a vring
    QEMU->>KVM: 2. ioctl(KVM_IOEVENTFD, guest_mmio_addr, kick_fd, KVM_IOEVENTFD_FLAG_DATAMATCH)
    activate KVM
    KVM->>KVM: Associates Guest MMIO Notify Addr with kick_fd signal
    KVM-->>QEMU: Success
    deactivate KVM
    QEMU->>DPDK_App: 3. VHOST_USER_SET_VRING_KICK (sends kick_fd via Unix socket)
    activate DPDK_App
    DPDK_App->>DPDK_App: 4. Stores kick_fd, Adds it to epoll/monitoring set
    DPDK_App-->>QEMU: Ack
    deactivate DPDK_App

Runtime Phase

sequenceDiagram
    participant GuestDriver
    participant KVM
    participant kick_fd as Eventfd (Kernel object)
    participant DPDK_App as DPDK Application

    GuestDriver->>KVM: 1. Writes queue_idx to VirtIO Notify MMIO Address
    activate KVM
    KVM->>kick_fd: 2. Matches MMIO write to ioeventfd rule, Signals kick_fd (Kernel signals eventfd)
    note right of KVM: No VM Exit to QEMU for this!
    deactivate KVM
    activate kick_fd
    kick_fd-->>DPDK_App: 3. kick_fd becomes readable, epoll_wait() in DPDK_App returns
    deactivate kick_fd
    activate DPDK_App
    DPDK_App->>DPDK_App: 4. Reads from kick_fd (to clear signal)
    DPDK_App->>DPDK_App: 5. Identifies corresponding vring, Processes vring directly (shared memory)
    deactivate DPDK_App

3b. Host/DPDK notifies the Guest (IRQ/Call)

  • Standard QEMU/KVM path (simplified recap)
sequenceDiagram
    participant QEMU_Backend
    participant KVM
    participant GuestOS

    QEMU_Backend->>KVM: ioctl(KVM_IRQ_LINE / KVM_SIGNAL_MSI)
    KVM->>KVM: Marks IRQ pending in vIRQ Chip
    KVM->>GuestOS: Injects IRQ on KVM_RUN
  • vhost-user/DPDK path (using irqfd)
    Setup Phase

    sequenceDiagram
    participant QEMU
    participant KVM
    participant DPDK_App as DPDK Application
    
    QEMU->>QEMU: 1. Creates an eventfd (call_fd) for a vring
    QEMU->>KVM: 2. ioctl(KVM_IRQFD, call_fd, guest_irq_line_gsi_or_msi_vector)
    activate KVM
    KVM->>KVM: Associates call_fd signal with Guest IRQ injection for the specified line
    KVM-->>QEMU: Success
    deactivate KVM
    QEMU->>DPDK_App: 3. VHOST_USER_SET_VRING_CALL (sends call_fd via Unix socket)
    activate DPDK_App
    DPDK_App->>DPDK_App: 4. Stores call_fd for the vring
    DPDK_App-->>QEMU: Ack
    deactivate DPDK_App

Runtime Phase

sequenceDiagram
    participant DPDK_App as DPDK Application
    participant call_fd as Eventfd (Kernel object)
    participant KVM
    participant KVM_vIRQ_Chip as KVM Virtual IRQ Chip
    participant GuestOS

    DPDK_App->>DPDK_App: 1. Finishes processing, updates used vring, needs to IRQ Guest
    DPDK_App->>call_fd: 2. eventfd_write(call_fd, 1)
    activate call_fd
    call_fd-->>KVM: 3. Kernel's eventfd/irqfd mechanism is triggered by write
    deactivate call_fd
    activate KVM
    KVM->>KVM_vIRQ_Chip: 4. (via irqfd_wakeup workqueue) Calls internal kvm_set_irq for the associated GSI/vector. Interrupt marked PENDING.
    note right of KVM: No QEMU ioctl call needed from userspace here!
    activate KVM_vIRQ_Chip
    KVM_vIRQ_Chip-->>KVM: IRQ pending
    deactivate KVM_vIRQ_Chip
    KVM->>GuestOS: 5. On next KVM_RUN (or if already running & interruptible), KVM injects the pending IRQ
    deactivate KVM

4. The role of notify_ops (MMIO) in QEMU virtio-pci and when it is invoked

  • Registration phase (virtio_pci_modern_regions_init)
    classDiagram
    class VirtIOPCIProxy {
        notify.mr
        creates and initializes the device
    }
    
    class memory_region_init_io {
        Function
        creates the MemoryRegion
    }
    
    class notify_ops {
        read: virtio_pci_notify_read
        write: virtio_pci_notify_write
    }
    
    class MemoryRegion {
        used for MMIO notifications
    }
    
    class GuestPhysicalAddressSpace {
        where the MMIO region is mapped
    }
    
    VirtIOPCIProxy --> memory_region_init_io : calls
    notify_ops --> memory_region_init_io : provides callbacks
    memory_region_init_io --> MemoryRegion : creates
    MemoryRegion ..> GuestPhysicalAddressSpace : mapped into
  • Scenario A: notify_ops.write is invoked (standard path, no ioeventfd bypass)
sequenceDiagram
    participant GuestDriver
    participant KVM
    participant QEMU_vCPUThread as QEMU vCPU Thread
    participant QEMU_MemSS as QEMU Memory Subsystem
    participant NotifyMR_Ops as "proxy->notify.mr .ops (notify_ops)"
    participant virtio_pci_notify_write as "virtio_pci_notify_write()"
    participant virtio_queue_notify as "virtio_queue_notify()"

    GuestDriver->>KVM: 1. Writes to VirtIO Notify MMIO Address
    activate KVM
    KVM-->>QEMU_vCPUThread: 2. KVM_EXIT_MMIO (phys_addr, data, len, is_write=true)
    deactivate KVM
    activate QEMU_vCPUThread
    QEMU_vCPUThread->>QEMU_MemSS: 3. address_space_rw(phys_addr, data, ...)
    activate QEMU_MemSS
    QEMU_MemSS->>NotifyMR_Ops: 4. Locates 'proxy->notify.mr', dispatches to its '.write' op
    deactivate QEMU_MemSS
    activate NotifyMR_Ops
    NotifyMR_Ops->>virtio_pci_notify_write: 5. Calls ops->write(opaque, offset_in_region, data_val, size)
    deactivate NotifyMR_Ops
    activate virtio_pci_notify_write
    virtio_pci_notify_write->>virtio_pci_notify_write: 6. Calculates queue_index based on offset
    virtio_pci_notify_write->>virtio_queue_notify: 7. Calls virtio_queue_notify(vdev, queue_idx)
    activate virtio_queue_notify
    virtio_queue_notify->>virtio_queue_notify: 8. Triggers backend processing for the queue
    deactivate virtio_queue_notify
    deactivate virtio_pci_notify_write
    deactivate QEMU_vCPUThread
  • Scenario B: notify_ops.write is bypassed (vhost-user with ioeventfd)
sequenceDiagram
    participant GuestDriver
    participant KVM
    participant ioeventfd
    participant DPDK_App

    Note over GuestDriver, KVM: QEMU has registered ioeventfd
    GuestDriver->>KVM: Writes to VirtIO Notify MMIO
    KVM->>ioeventfd: MMIO write signals eventfd
    Note right of KVM: No KVM_EXIT_MMIO to QEMU
    ioeventfd-->>DPDK_App: Eventfd notifies DPDK
    DPDK_App->>DPDK_App: Process notification directly
    Note over DPDK_App: No QEMU involvement