QEMU Memory Analysis (4): Building the EPT Page Tables
Introduction:
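For virtualized environments, Intel CPUs added processor-level support for memory virtualization in the form of Extended Page Tables (EPT); AMD has a similar feature called NPT. Before that, the main technique for memory virtualization was the shadow page table.
A virtual machine works with guest virtual addresses (GVA), and its own page tables can only translate a GVA into a guest physical address (GPA), i.e. perform the GVA->GPA step. A GPA cannot be used for the actual memory access, so it still has to be translated into a host physical address (HPA). Shadow page tables do this in a single step: they map a GVA directly to an HPA, and the VMM maintains one for every guest process. This article does not go further into shadow page tables; the focus is EPT. It has two parts: the first analyzes the EPT address translation mechanism based on the Intel manual; the second uses the HAXM source code to analyze how the EPT is built.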
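Part 1: EPT translation mechanism
When a logical CPU runs guest code in non-root mode, the addresses it uses are guest virtual addresses, and accessing one triggers address translation as usual. This first translation does not involve the VMM: just as on a normal system, CR3 supplies the base address and the guest page tables are walked. The result is a physical address, but it is a guest physical address rather than a host physical address, so it cannot be used for the access directly. A second translation is needed, and that is the EPT translation. Note that the EPT is maintained by the VMM while the translation itself is performed by hardware, which is why it is more efficient than shadow page tables.
Assume we have already obtained the guest physical address. The rest of this part shows how that address is resolved through the EPT.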
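Note that regardless of whether the guest is 32-bit or 64-bit, addressing here is uniformly treated as using 64-bit physical addresses. The EPT is a 4-level page table. Each table still occupies one 4 KB page and each entry is 8 bytes, so a table holds exactly 512 entries and 9 bits are needed to select one. The low 48 bits of the guest physical address are used for this. As the figure above shows, a 48-bit GPA is divided into five parts: four 9-bit indices followed by a 12-bit page offset. When a CPU in non-root mode accesses a guest virtual address, it first walks the guest page tables to obtain the GPA and then looks that GPA up in the EPT. The VMCS contains an EPT pointer (EPTP) whose bits 12-51 point to the top-level EPT table, the PML4 table. The first 9-bit index of the GPA (bits 39-47) therefore selects one PML4 entry. A single PML4 entry covers a 512 GB region, but that is not the focus here. The format of a PML4 entry is as follows: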
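1. All we really need to know is that bits 12-51 of a PML4 entry hold the address of the next-level table. Not all 40 of these bits are necessarily used; how many are used depends on the CPU implementation, as follows:
Intel uses MAXPHYADDR to denote the width of the largest supported physical address. It can be queried with the CPUID instruction, which is outside the scope of this article. What we need to know is:
When MAXPHYADDR is 36: Intel desktop processors, i.e. ordinary personal computers, commonly implement a 36-bit physical address, which can address 64 GB;
When MAXPHYADDR is 40: Intel server products and AMD platforms commonly implement a 40-bit physical address, which can address 1 TB;
When MAXPHYADDR is 52: this is the largest value defined by the x64 architecture; at the time of writing no processor implemented it.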
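For 4 KB paging, the physical address of the next-level table is stored according to the following rules:
① When MAXPHYADDR is 52, bits 12~51 of the upper-level table entry provide the high 40 bits of the next-level table's physical base address; the low 12 bits are zero, so the base address is aligned on a 4 KB boundary;
② When MAXPHYADDR is 40, bits 12~39 of the upper-level table entry provide the high 28 bits of the next-level table's physical base address; bits 40~51 are reserved and must be 0; the low 12 bits are zero, so the base address is aligned on a 4 KB boundary;
③ When MAXPHYADDR is 36, bits 12~35 of the upper-level table entry provide the high 24 bits of the next-level table's physical base address; bits 36~51 are reserved and must be 0; the low 12 bits are zero, so the base address is aligned on a 4 KB boundary.
A MAXPHYADDR of 36 is also the physical address width of PAE mode on ordinary 32-bit machines.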
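The three rules above boil down to one mask: keep bits 12 through MAXPHYADDR-1 of the entry and treat bits MAXPHYADDR through 51 as reserved. The following is a minimal sketch of that arithmetic; the function names are hypothetical and are not taken from HAXM or the Intel manual, and `maxphyaddr` is assumed to be between 13 and 52.

```c
#include <stdint.h>

/* Mask selecting bits 12..(maxphyaddr - 1) of a table entry, i.e. the bits
 * that hold the 4 KB-aligned physical base address of the next-level table.
 * Assumes 13 <= maxphyaddr <= 52. */
static uint64_t next_table_addr_mask(unsigned maxphyaddr)
{
    uint64_t below_max = (1ULL << maxphyaddr) - 1; /* bits 0..maxphyaddr-1 */
    return below_max & ~0xFFFULL;                  /* drop the low 12 bits */
}

/* Physical base address of the next-level table referenced by an entry. */
static uint64_t next_table_base(uint64_t entry, unsigned maxphyaddr)
{
    return entry & next_table_addr_mask(maxphyaddr);
}

/* Returns 1 if any reserved address bit (maxphyaddr..51) is set. */
static int has_reserved_addr_bits(uint64_t entry, unsigned maxphyaddr)
{
    uint64_t bits_12_51 = ((1ULL << 52) - 1) & ~0xFFFULL;
    return (entry & bits_12_51 & ~next_table_addr_mask(maxphyaddr)) != 0;
}
```

With MAXPHYADDR = 36, for example, an entry with bit 40 set would be flagged by `has_reserved_addr_bits`, while the same entry is acceptable when MAXPHYADDR = 52.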
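2. Using the applicable address bits, locate the base of the next-level table, the EPT Page-Directory-Pointer-Table, and use bits 30-38 of the guest physical address to select one entry in it, an EPT Page-Directory-Pointer-Table entry. Note that if bit 7 of this entry is 1, the entry maps a 1 GB page; if it is 0, the entry points to the next-level table. Below we only consider the case where it points to a table.
The Page-Directory-Pointer-Table entry looks as follows: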
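3. Then use bits 12-51 of that entry to locate the base of the third-level table, the EPT Page-Directory-Table, and use bits 21-29 of the guest physical address to select an EPT Page-Directory-Table entry. If bit 7 of this entry is 1, the entry maps a 2 MB page; if it is 0, it points to the next-level table.
4. Use bits 12-51 of that entry to locate the base of the fourth-level table, the EPT Page-Table, and use bits 12-20 of the guest physical address to select a page-table entry (PTE).
Bits 12-51 of the PTE point to a 4 KB physical page. Finally, the lowest 12 bits of the guest physical address are used as the offset within that page, which yields the final physical address. The attributes of the individual bits are shown in the accompanying figure.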
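The four steps of this walk can be sketched in software. The code below is only an illustration of what the hardware does, under several assumptions that are not part of the original article: "physical memory" is simulated by a flat array in which a physical address is simply a byte offset, an entry counts as present when any of its low three permission bits is set, and all names (`ept_index`, `ept_translate`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define EPT_ENTRIES     512u
#define EPT_PERM_MASK   0x7ULL                /* read/write/execute bits 0..2 */
#define EPT_LARGE_PAGE  (1ULL << 7)           /* bit 7: entry maps a large page */
#define EPT_ADDR_MASK   0x000FFFFFFFFFF000ULL /* bits 12..51 */

/* Index into the table at the given level: 3 = PML4 (GPA bits 39-47),
 * 2 = PDPT (30-38), 1 = PD (21-29), 0 = PT (12-20). */
static unsigned ept_index(uint64_t gpa, int level)
{
    return (unsigned)((gpa >> (12 + 9 * level)) & (EPT_ENTRIES - 1));
}

/* Walks simulated tables stored in mem[], where a table's "physical
 * address" is its byte offset into mem. Returns 0 and stores the result in
 * *hpa on success, or -1 if an entry is not present or out of range. */
static int ept_translate(const uint64_t *mem, size_t mem_bytes,
                         uint64_t root_pa, uint64_t gpa, uint64_t *hpa)
{
    uint64_t table_pa = root_pa;

    for (int level = 3; level >= 0; level--) {
        if (table_pa + 4096 > mem_bytes)
            return -1;
        const uint64_t *table = mem + table_pa / sizeof(uint64_t);
        uint64_t entry = table[ept_index(gpa, level)];

        if ((entry & EPT_PERM_MASK) == 0)
            return -1;                        /* would be an EPT violation */

        /* Bytes covered by one entry at this level: 4 KB, 2 MB, 1 GB, 512 GB */
        uint64_t span = 1ULL << (12 + 9 * level);
        int is_leaf = level == 0 ||
                      ((level == 1 || level == 2) && (entry & EPT_LARGE_PAGE));
        if (is_leaf) {
            *hpa = (entry & EPT_ADDR_MASK & ~(span - 1)) | (gpa & (span - 1));
            return 0;
        }
        table_pa = entry & EPT_ADDR_MASK;     /* descend to the next level */
    }
    return -1;
}
```

For instance, the GPA 0x8080604567 decomposes into the indices 1, 2, 3 and 4 for PML4, PDPT, PD and PT, with a page offset of 0x567.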
Part 2: EPT initialization
2.1 ept-tree initialization
hax_accel_init()   // hax-all.c
hax_init()         // hax-all.c
hax_vm_create()    // hax-all.c
hax_create_vm()    // vm.c
ept_tree_init()    // vm.c
At this point an empty EPT has been created. So when are its actual contents filled in?
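The EPT is built the same way as an ordinary page table: on demand, driven by faults. The tables start out empty; only when an access misses is a fault raised, and the fault handler then builds the missing part of the table.
Initially the EPT is empty. When the guest runs and a GVA has been translated to a GPA, the CPU still has to look the GPA up in the EPT to find the HPA. Because the EPT is empty at this point, the lookup fails with an EPT violation (the EPT counterpart of a page fault), which causes a VM exit. The CPU then enters root mode and runs the VMM (here, HAXM). HAXM defines an array of handler functions, indexed by exit reason, to process each kind of VM exit: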
static int (*handler_funcs[])(struct vcpu_t *vcpu,
                              struct hax_tunnel *htun) = {
    [VMX_EXIT_EPT_VIOLATION] = exit_ept_violation,
};
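The idea behind such a handler array can be shown with a small standalone sketch: an array of function pointers indexed by the basic exit reason, filled with C designated initializers, plus a fallback for reasons that have no handler. Everything here except the numeric value 48 (the basic exit reason for an EPT violation in the Intel SDM) is a hypothetical stand-in, not HAXM's actual types or functions.

```c
#include <stddef.h>

enum {
    EXIT_REASON_EPT_VIOLATION = 48, /* basic exit reason 48 per the Intel SDM */
    EXIT_REASON_COUNT = 64          /* hypothetical table size for this sketch */
};

/* Hypothetical stand-in for the per-vCPU state. */
struct vcpu_state {
    int ept_violations;
};

typedef int (*exit_handler_fn)(struct vcpu_state *vcpu);

static int handle_ept_violation(struct vcpu_state *vcpu)
{
    vcpu->ept_violations++;  /* a real handler would fill in the EPT here */
    return 1;
}

static int handle_unknown(struct vcpu_state *vcpu)
{
    (void)vcpu;
    return -1;
}

/* Slots without an initializer are NULL. */
static const exit_handler_fn exit_handlers[EXIT_REASON_COUNT] = {
    [EXIT_REASON_EPT_VIOLATION] = handle_ept_violation,
};

/* Looks up the handler for a basic exit reason and calls it. */
static int dispatch_vmexit(struct vcpu_state *vcpu, unsigned basic_reason)
{
    exit_handler_fn fn = NULL;

    if (basic_reason < EXIT_REASON_COUNT)
        fn = exit_handlers[basic_reason];
    if (!fn)
        fn = handle_unknown;
    return fn(vcpu);
}
```

The bounds check and the NULL fallback matter because the index comes from hardware-reported state, and most slots of a sparsely initialized table are empty.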
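2.2 Filling the EPT
The previous section described how the ept-tree is created and explained that the EPT is filled in through faults. This section looks at the code that does the filling. The fault occurs as a VM exit while VCPU_RUN is executing; once it is caught, the handler array introduced in the previous section dispatches it to the corresponding function. First, here is how the QEMU side ends up calling into HAXM to run the VCPU:
main_impl()                            // vl.c
machine_class->init()::pc_init1()      // pc_piix.c
pc_new_cpu()                           // pc_piix.c
x86_cpu_realizefn()                    // cpu.c
qemu_init_vcpu()                       // cpus.c
qemu_hax_start_vcpu()                  // cpus.c
qemu_hax_cpu_thread_fn()               // cpus.c
hax_init_vcpu()                        // hax-all.c
hax_vcpu_create()                      // hax-all.c
hax_host_create_vcpu()                 // hax-windows.c
hax_smp_cpu_exec()
hax_vcpu_exec()                        // hax-all.c
hax_vcpu_hax_exec()                    // hax-all.c
hax_vcpu_run()                         // hax-windows.c
hax_init_vcpu() ultimately creates the vcpu through HAXM, and hax_vcpu_run() ultimately makes HAXM run the VCPU.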
vcpu_execute()          // vcpu.c
cpu_vmx_execute()       // vcpu.c
cpu_vmexit_handler()    // cpu.c
vcpu_vmexit_handler()   // vcpu.c
handler_funcs[basic_reason] == exit_ept_violation
When basic_reason is VMX_EXIT_EPT_VIOLATION, exit_ept_violation is invoked to handle the EPT violation.
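static int exit_ept_violation(struct vcpu_t *vcpu, struct hax_tunnel *htun)
{
    // Note: the declarations of the locals used below (qual, gpa, ret,
    // fault_gfn, first_access) are omitted from this excerpt

    // Record the exit reason
    htun->_exit_reason = vmx(vcpu, exit_reason).basic_reason;

    // gla1 and gla2 are bits 7 and 8 of the exit qualification. Per the
    // Intel SDM, bit 8 is only defined when bit 7 (guest linear address
    // valid) is set, so gla1 == 0 together with gla2 == 1 is rejected
    if (qual->ept.gla1 == 0 && qual->ept.gla2 == 1) {
        vcpu_set_panic(vcpu);
        hax_log(HAX_LOGPANIC, "Incorrect EPT setting\n");
        dump_vmcs(vcpu);
        return HAX_RESUME;
    }

    // Read the faulting GPA from the VMCS
    gpa = vmx(vcpu, exit_gpa);

    // Core function for handling the EPT violation, analyzed in detail below
    ret = ept_handle_access_violation(&vcpu->vm->gpa_space,
                                      &vcpu->vm->ept_tree, *qual, gpa,
                                      &fault_gfn, &first_access);

    // Error handling omitted here

    // ret == 0: The EPT violation is due to MMIO
    // (vcpu_emulate_insn is analyzed later)
    return vcpu_emulate_insn(vcpu);
}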
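To make the role of the exit qualification bits more concrete, here is a small decoding sketch. The bit positions follow the Intel SDM's exit qualification layout for EPT violations (bits 0-2: the access was a read, write or instruction fetch; bits 3-5: the permissions the EPT granted; bit 7: guest linear address valid; bit 8: meaningful only when bit 7 is set). The struct and function names are hypothetical and are not HAXM's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of an EPT-violation exit qualification. */
struct ept_violation_info {
    bool read;            /* bit 0: the access was a data read */
    bool write;           /* bit 1: the access was a data write */
    bool fetch;           /* bit 2: the access was an instruction fetch */
    unsigned perm;        /* bits 5..3: permissions granted by the EPT (X,W,R) */
    bool gla_valid;       /* bit 7: guest linear address field is valid */
    bool gla_translated;  /* bit 8: only meaningful when bit 7 is set */
};

static struct ept_violation_info decode_ept_qualification(uint64_t qual)
{
    struct ept_violation_info info;

    info.read = (qual >> 0) & 1;
    info.write = (qual >> 1) & 1;
    info.fetch = (qual >> 2) & 1;
    info.perm = (unsigned)((qual >> 3) & 7);
    info.gla_valid = (qual >> 7) & 1;
    info.gla_translated = (qual >> 8) & 1;
    return info;
}

/* Bit 8 without bit 7 is an inconsistent combination. */
static bool ept_qualification_is_consistent(uint64_t qual)
{
    struct ept_violation_info info = decode_ept_qualification(qual);
    return info.gla_valid || !info.gla_translated;
}

/* All three permission bits clear means the GPA has no usable mapping yet. */
static bool ept_mapping_missing(uint64_t qual)
{
    return decode_ept_qualification(qual).perm == 0;
}
```

A write to a GPA that has no mapping yet would typically show up as a value such as 0x182: bit 1 (write), bits 7 and 8 set, and bits 3-5 all zero.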
The function ept_handle_access_violation works out which page the GPA belongs to:
int ept_handle_access_violation(hax_gpa_space *gpa_space, hax_ept_tree *tree,
                                exit_qualification_t qual, uint64_t gpa,
                                uint64_t *fault_gfn, bool *first_access)
{
    // (local variable declarations are omitted from this excerpt)

    // Shift right by 12 bits to get the 4 KB page index, the gfn
    // (guest frame number)
    gfn = gpa >> PG_ORDER_4K;
    hax_assert(gpa_space != NULL);

    // Look up the slot for gfn in gpa_space. In the author's observation a
    // slot is always found at this point.
    // Slots are created on the HAXM side via the HAX_VM_IOCTL_SET_RAM2 ioctl,
    // which QEMU issues from the region_add callback of its MemoryListener.
    // Open question: the author has not yet found where slots get assigned
    // for the whole GPA range, given that NULL was never seen here.
    slot = memslot_find(gpa_space, gfn);
    if (!slot) {
        // The faulting GPA is reserved for MMIO
        hax_log(HAX_LOGD, "%s: gpa=0x%llx is reserved for MMIO\n",
                __func__, gpa);
        return 0;
    }

    // Extract bits 5..3 from Exit Qualification
    // Check the EPT permissions. Because this is an EPT violation, the qual
    // value read from the vcpu's exit_qualification is interpreted through
    // the ept member of the union:
    /*
    struct {
        uint32_t r         : 1;
        uint32_t w         : 1;
        uint32_t x         : 1;
        uint32_t _r        : 1;
        uint32_t _w        : 1;
        uint32_t _x        : 1;
        uint32_t res1      : 1;
        uint32_t gla1      : 1;
        uint32_t gla2      : 1;
        uint32_t res2      : 3;
        uint32_t nmi_block : 1;
        uint32_t res3      : 19;
        uint32_t res4      : 32;
    } ept;
    */
    combined_perm = (uint) ((qual.raw >> 3) & 7);
    if (combined_perm != HAX_EPT_PERM_NONE) {
        if ((qual.raw & HAX_EPT_ACC_W) && !(combined_perm & HAX_EPT_PERM_W) &&
            (slot->flags == HAX_MEMSLOT_READONLY)) {
            // Handle a write to ROM/ROM device as MMIO
            hax_log(HAX_LOGD, "%s: write to a read-only gpa=0x%llx\n",
                    __func__, gpa);
            return 0;
        }
        // See IA SDM Vol. 3C 27.2.1 Table 27-7, especially note 2
        hax_log(HAX_LOGE, "%s: Cannot handle the case where the PTE "
                "corresponding to the faulting GPA is present: qual=0x%llx, "
                "gpa=0x%llx\n", __func__, qual.raw, gpa);
        return -EACCES;
    }

    // Ideally we should call gpa_space_is_page_protected() and ask user space
    // to unprotect just the host virtual page that |gfn| maps to. But since we
    // pin host RAM one chunk (rather than one page) at a time, if the chunk
    // that |gfn| maps to contains any other host virtual page that is protected
    // (by means of a VirtualProtect() or mprotect() call from user space), we
    // will not be able to pin the chunk when we handle the next EPT violation
    // caused by the same |gfn|.
    // For now, we ask user space to unprotect all host virtual pages in the
    // chunk, so our next hax_pin_user_pages() call will not fail. This is a
    // dirty hack.
    // TODO: Make chunks more flexible, so we can pin host RAM in finer
    // granularity (as small as one page) and hide chunks from user space.
    if (gpa_space_is_chunk_protected(gpa_space, gfn, fault_gfn)) {
        hax_log(HAX_LOGE, "%s: gfn = 0x%llx(fault_gfn = 0x%llx) "
                "is in protected chunk\n", __func__, gfn, *fault_gfn);
        return -EFAULT;
    }

    // The faulting GPA maps to RAM/ROM
    // Compute start_gpa and size here, to be passed to the next function.
    // They are derived from the offsets between gpa, slot, chunk and block.
    // Because the chunk size has been set to 4 KB in this setup, the
    // computation below could be simplified: a chunk always lies inside a
    // slot, and a slot never starts in the middle of a chunk.
    // See the figure below for how the ranges are laid out.
    is_rom = slot->flags & HAX_MEMSLOT_READONLY;
    offset_within_slot = gpa - (slot->base_gfn << PG_ORDER_4K);
    hax_assert(offset_within_slot < (slot->npages << PG_ORDER_4K));
    block = slot->block;
    hax_assert(block != NULL);
    offset_within_block = slot->offset_within_block + offset_within_slot;
    hax_assert(offset_within_block < block->size);
    chunk = ramblock_get_chunk(block, offset_within_block, true, first_access);

    // Compute the union of the UVA ranges covered by |slot| and |chunk|
    chunk_offset_low = chunk->base_uva - block->base_uva;
    // The slot was found via gfn, so gpa is guaranteed to fall inside it
    start_gpa = slot->base_gfn << PG_ORDER_4K;
    // With 4 KB chunks the else branch is effectively never taken.
    // The resulting start_gpa is the 4 KB-aligned start of one chunk.
    if (chunk_offset_low > slot->offset_within_block) {
        start_gpa += chunk_offset_low - slot->offset_within_block;
        offset_within_chunk = 0;
    } else {
        offset_within_chunk = slot->offset_within_block - chunk_offset_low;
    }
    chunk_offset_high = chunk_offset_low + chunk->size;
    slot_offset_high = slot->offset_within_block +
                       (slot->npages << PG_ORDER_4K);
    // The chunk is 4 KB, so size is also 4 KB
    size = chunk->size - offset_within_chunk;
    // Chunk and slot are both 4 KB-aligned and the chunk is a single page,
    // so this branch is not taken either
    if (chunk_offset_high > slot_offset_high) {
        size -= chunk_offset_high - slot_offset_high;
    }

    // Create the EPT entries for this one page
    ret = ept_tree_create_entries(tree, start_gpa >> PG_ORDER_4K,
                                  size >> PG_ORDER_4K, chunk,
                                  offset_within_chunk, slot->flags);
    return 1;
}
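The start_gpa/size computation near the end of this function is an interval intersection: slot and chunk are both ranges inside the same RAM block, and only their overlap can be mapped. The sketch below isolates that arithmetic with simplified, hypothetical structs (they are not HAXM's `hax_memslot` or `hax_chunk`), assuming the two ranges do overlap.

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12

/* Hypothetical, simplified views of a slot and a chunk; all offsets are
 * byte offsets from the start of the same RAM block. */
struct slot_range {
    uint64_t base_gfn;             /* first guest frame number of the slot */
    uint64_t npages;               /* slot length in 4 KB pages */
    uint64_t offset_within_block;  /* where the slot starts in the block */
};

struct chunk_range {
    uint64_t offset_low;           /* where the chunk starts in the block */
    uint64_t size;                 /* chunk length in bytes */
};

struct map_range {
    uint64_t start_gpa;            /* first GPA to map */
    uint64_t size;                 /* number of bytes to map */
    uint64_t offset_within_chunk;  /* where the mapping starts in the chunk */
};

/* Intersects slot and chunk and expresses the overlap as a GPA range.
 * Assumes the two ranges overlap. */
static struct map_range intersect_slot_chunk(const struct slot_range *slot,
                                             const struct chunk_range *chunk)
{
    struct map_range out;
    uint64_t slot_low = slot->offset_within_block;
    uint64_t slot_high = slot_low + (slot->npages << PAGE_SHIFT_4K);
    uint64_t chunk_low = chunk->offset_low;
    uint64_t chunk_high = chunk_low + chunk->size;

    out.start_gpa = slot->base_gfn << PAGE_SHIFT_4K;
    if (chunk_low > slot_low) {
        /* Chunk starts inside the slot: skip the part of the slot before it */
        out.start_gpa += chunk_low - slot_low;
        out.offset_within_chunk = 0;
    } else {
        /* Slot starts inside the chunk: skip the part of the chunk before it */
        out.offset_within_chunk = slot_low - chunk_low;
    }
    out.size = chunk->size - out.offset_within_chunk;
    if (chunk_high > slot_high)
        out.size -= chunk_high - slot_high;  /* trim the tail past the slot */
    return out;
}
```

When a 4 KB chunk lies entirely inside the slot, the result is exactly that one page. With a larger chunk that contains the whole slot, both the head and the tail of the chunk are trimmed and the result is the slot itself.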
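Next, the entries are created through ept_tree_create_entries. Descending from one table level to the next is handled by the helper ept_tree_get_next_table: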
// Given a GFN and a pointer (KVA) to an EPT page table at a non-leaf level
// (PML4, PDPT or PD) that covers the GFN, returns a pointer (KVA) to the next-
// level page table that covers the GFN. This function can be used to walk a
// |hax_ept_tree| from root to leaf.
// |tree|: The |hax_ept_tree| to walk.
// |gfn|: The GFN from which to obtain EPT page table indices.
// |current_level|: The EPT level to which |current_table| belongs. Must be a
//                  non-leaf level (PML4, PDPT or PD).
// |current_table|: The KVA of the current EPT page table. Must not be NULL.
// |kmap|: A buffer to store a host-specific KVA mapping descriptor, which may
//         be created if the next-level EPT page table is not a frequently-used
//         page. The caller must call hax_unmap_page_frame() to destroy the KVA
//         mapping when it is done with the returned pointer.
// |create|: If true and the next-level EPT page table does not yet exist,
//           creates it and updates the corresponding |hax_epte| in
//           |current_table|.
// |visit_current_epte|: An optional callback to be invoked on the |hax_epte|
//                       that belongs to |current_table| and covers |gfn|. May
//                       be NULL.
// |opaque|: An arbitrary pointer passed as-is to |visit_current_epte|.
static hax_epte * ept_tree_get_next_table(hax_ept_tree *tree, uint64_t gfn,
                                          int current_level,
                                          hax_epte *current_table,
                                          hax_kmap_phys *kmap, bool create,
                                          epte_visitor visit_current_epte,
                                          void *opaque)
{
    int next_level = current_level - 1;
    hax_ept_page_kmap *freq_page;
    uint index;
    hax_epte *epte;
    hax_epte *next_table = NULL;

    hax_assert(tree != NULL);
    hax_assert(next_level >= HAX_EPT_LEVEL_PT &&
               next_level <= HAX_EPT_LEVEL_PDPT);
    // Compute the index of the entry for this gfn in the current table from
    // gfn and the level. In terms of GPA bits, the index comes from bits
    // 12..20 for a PT, 21..29 for a PD, 30..38 for a PDPT and 39..47 for the
    // PML4
    index = (uint) ((gfn >> (HAX_EPT_TABLE_SHIFT * current_level)) &
                    (HAX_EPT_TABLE_SIZE - 1));
    hax_assert(current_table != NULL);
    // Get the entry in the current table
    epte = &current_table[index];
    // On the EPT violation path this callback is passed as NULL, so the
    // branch is skipped
    if (visit_current_epte) {
        visit_current_epte(tree, gfn, current_level, epte, opaque);
    }
    // On the EPT violation path create is true, so this branch is not taken
    if (epte->perm == HAX_EPT_PERM_NONE && !create) {
        return NULL;
    }

    // The tree caches the mappings of frequently used tables to speed up
    // access
    // Only HAX_EPT_FREQ_PAGE_COUNT EPT pages are considered frequently-used,
    // whose KVA mappings are cached in tree->freq_pages[]. They are:
    // a) The EPT PML4 table, covering the entire GPA space. Cached in
    //    freq_pages[0].
    // b) The first EPT PDPT table, pointed to by entry 0 of a), covering the
    //    first 512GB of the GPA space. Cached in freq_pages[1].
    // c) The first n EPT PD tables (n = HAX_EPT_FREQ_PAGE_COUNT - 2), pointed
    //    to by entries 0..(n - 1) of b), covering the first nGB of the GPA
    //    space. Cached in freq_pages[2..(n + 1)].
    freq_page = ept_tree_get_freq_page(tree, gfn, next_level);

    if (hax_cmpxchg64(0, INVALID_EPTE.value, &epte->value)) {
        // epte->value was 0, implying epte->perm == HAX_EPT_PERM_NONE, which
        // means the EPT entry pointing to the next-level page table is not
        // present, i.e. the next-level table does not exist
        hax_ept_page *page;
        uint64_t pfn;
        hax_epte temp_epte = { 0 };
        void *kva;

        // Allocate the page for the new table here; the backing physical
        // memory is also set up here (the implementation is analyzed later).
        // As understood here, this returns a hax_ept_page whose pmdl has been
        // allocated as well and which has been added to the tree's page_list.
        // In effect it returns the descriptor of a block of memory, which is
        // then installed at position index of the current table.
        page = ept_tree_alloc_page(tree);
        if (!page) {
            epte->value = 0;
            hax_log(HAX_LOGE, "%s: Failed to create EPT page table: gfn=0x%llx,"
                    " next_level=%d\n", __func__, gfn, next_level);
            return NULL;
        }
        page->level = next_level;
        // Get the pfn of the memory that ept_tree_alloc_page allocated for
        // the page
        pfn = hax_get_pfn_phys(&page->memdesc);
        hax_assert(pfn != INVALID_PFN);

        temp_epte.perm = HAX_EPT_PERM_RWX;
        // This is a non-leaf |hax_epte|, so ept_mt and ignore_pat_mt are
        // reserved (see IA SDM Vol. 3C 28.2.2 Figure 28-1)
        temp_epte.pfn = pfn;

        // Get the kernel virtual address that the allocated MDL is mapped to
        kva = hax_get_kva_phys(&page->memdesc);
        hax_assert(kva != NULL);
        if (freq_page) {
            // The next-level EPT table is frequently used, so initialize its
            // KVA mapping cache
            freq_page->page = page;
            freq_page->kva = kva;
        }

        // Create this non-leaf EPT entry
        epte->value = temp_epte.value;

        next_table = (hax_epte *) kva;
        hax_log(HAX_LOGD, "%s: Created EPT page table: gfn=0x%llx, "
                "next_level=%d, pfn=0x%llx, kva=%p, freq_page_index=%ld\n",
                __func__, gfn, next_level, pfn, kva,
                freq_page ? freq_page - tree->freq_pages : -1);
    } else {  // !hax_cmpxchg64(0, INVALID_EPTE.value, &epte->value)
        // epte->value != 0, which could mean epte->perm != HAX_EPT_PERM_NONE,
        // i.e. the EPT entry pointing to the next-level EPT page table is
        // present. But there is another case: *epte == INVALID_EPTE, which
        // means the next-level page table is being created by another thread
        void *kva;
        int i = 0;

        while (epte->value == INVALID_EPTE.value) {
            // Eventually the other thread will set epte->pfn to either a valid
            // PFN or 0
            if (!(++i % 10000)) {  // 10^4
                hax_log(HAX_LOGI, "%s: In iteration %d of while loop\n",
                        __func__, i);
                if (i == 100000000) {  // 10^8 (< INT_MAX)
                    hax_log(HAX_LOGE, "%s: Breaking out of infinite loop: "
                            "gfn=0x%llx, next_level=%d\n", __func__, gfn,
                            next_level);
                    return NULL;
                }
            }
        }
        if (!epte->value) {
            // The other thread has cleared epte->value, indicating it could not
            // create the next-level page table
            hax_log(HAX_LOGE, "%s: Another thread tried to create the same EPT "
                    "page table first, but failed: gfn=0x%llx, next_level=%d\n",
                    __func__, gfn, next_level);
            return NULL;
        }

        if (freq_page) {
            // The next-level EPT table is frequently used, so its KVA mapping
            // must have been cached
            kva = freq_page->kva;
            hax_assert(kva != NULL);
        } else {
            // The next-level EPT table is not frequently used, which means a
            // temporary KVA mapping needs to be created
            hax_assert(epte->pfn != INVALID_PFN);
            hax_assert(kmap != NULL);
            kva = hax_map_page_frame(epte->pfn, kmap);
            if (!kva) {
                hax_log(HAX_LOGE, "%s: Failed to map pfn=0x%llx into "
                        "KVA space\n", __func__, epte->pfn);
            }
        }
        next_table = (hax_epte *) kva;
    }
    return next_table;
}
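The compare-and-swap pattern in this function (claim an empty entry with a sentinel, build the table, then publish it; any other thread spins until the sentinel disappears) can be sketched in user space with C11 atomics. This is only an analogy under stated assumptions: `ENTRY_BUSY` is a hypothetical sentinel playing the role of INVALID_EPTE, the entry stores a plain heap pointer instead of a page frame number with permission bits, and the spin loop has no iteration limit.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRY_EMPTY 0ULL
#define ENTRY_BUSY  (~0ULL)  /* hypothetical sentinel: "being created" */

typedef _Atomic(uint64_t) table_entry;

/* Returns the next-level table referenced by *entry, creating it first if
 * the entry is still empty. Returns NULL if creation failed. */
static uint64_t *get_or_create_next_table(table_entry *entry)
{
    uint64_t expected = ENTRY_EMPTY;

    if (atomic_compare_exchange_strong(entry, &expected, ENTRY_BUSY)) {
        /* This thread won the race and owns the creation */
        uint64_t *table = calloc(512, sizeof(uint64_t));
        if (!table) {
            atomic_store(entry, ENTRY_EMPTY);  /* tell waiters it failed */
            return NULL;
        }
        atomic_store(entry, (uint64_t)(uintptr_t)table);  /* publish */
        return table;
    }

    /* The entry was not empty: either the table already exists, or another
     * thread is creating it right now. Wait until the sentinel is gone. */
    uint64_t value;
    while ((value = atomic_load(entry)) == ENTRY_BUSY) {
        /* spin */
    }
    if (value == ENTRY_EMPTY)
        return NULL;  /* the other thread gave up */
    return (uint64_t *)(uintptr_t)value;
}
```

The point of the sentinel is that the slow part (allocating and zeroing a page) happens outside any lock, while the entry itself is never observed in a half-initialized state.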
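Leaving aside the case where another thread is creating the next-level table, and the case where the next-level table already exists, assume the next-level table has not been created yet. The function then uses ept_tree_alloc_page to allocate a page that will hold the next-level entries. Let us look at that function next: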
// Allocates a |hax_ept_page| for the given |hax_ept_tree|. Returns the
// allocated |hax_ept_page|, whose underlying host page frame is filled with
// zeroes, or NULL on error.
static hax_ept_page * ept_tree_alloc_page(hax_ept_tree *tree)
{
    hax_ept_page *page;
    int ret;

    // Allocate the page descriptor; its memdesc member describes the actual
    // memory backing this page
    /*
    typedef struct hax_ept_page {
        hax_memdesc_phys memdesc;
        // Turns this object into a list node
        hax_list_node entry;
        int level;
    } hax_ept_page;
    */
    page = (hax_ept_page *) hax_vmalloc(sizeof(*page), 0);
    if (!page) {
        hax_log(HAX_LOGE, "%s: hax_vmalloc for page fail\n", __func__);
        return NULL;
    }
    // Allocate the 4 KB of memory described by memdesc
    ret = hax_alloc_page_frame(HAX_PAGE_ALLOC_ZEROED, &page->memdesc);
    if (ret) {
        hax_log(HAX_LOGE, "%s: hax_alloc_page_frame() returned %d\n",
                __func__, ret);
        hax_vfree(page, sizeof(*page));
        return NULL;
    }
    hax_assert(tree != NULL);
    ept_tree_lock(tree);
    // Link the page into the tree's page list
    hax_list_add(&page->entry, &tree->page_list);
    ept_tree_unlock(tree);
    return page;
}
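Keeping every allocated table page on a list makes teardown simple: the owner can free all pages without walking the tree. Below is a user-space sketch of that bookkeeping, assuming a C11 library that provides `aligned_alloc`. The types and function names are hypothetical, and unlike the HAXM code the sketch takes no lock, so it is only suitable for a single thread.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_PAGE_SIZE 4096

/* Hypothetical descriptor for one page-table page. */
struct table_page {
    void *frame;              /* 4 KB-aligned, zero-filled backing memory */
    int level;                /* table level this page is used for */
    struct table_page *next;  /* next page in the tracker's list */
};

struct page_tracker {
    struct table_page *head;
    size_t count;
};

/* Allocates a zeroed, page-aligned frame plus its descriptor and links the
 * descriptor at the head of the tracker's list. Returns NULL on failure. */
static struct table_page *tracker_alloc_page(struct page_tracker *t, int level)
{
    struct table_page *page = malloc(sizeof(*page));
    if (!page)
        return NULL;

    page->frame = aligned_alloc(TABLE_PAGE_SIZE, TABLE_PAGE_SIZE);
    if (!page->frame) {
        free(page);
        return NULL;
    }
    memset(page->frame, 0, TABLE_PAGE_SIZE);
    page->level = level;
    page->next = t->head;
    t->head = page;
    t->count++;
    return page;
}

/* Frees every tracked page and returns how many were freed. */
static size_t tracker_free_all(struct page_tracker *t)
{
    size_t freed = 0;

    while (t->head) {
        struct table_page *page = t->head;
        t->head = page->next;
        free(page->frame);
        free(page);
        freed++;
    }
    t->count = 0;
    return freed;
}
```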
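Here hax_alloc_page_frame is the function that actually allocates the 4 KB of memory: the space in which this table stores its entries, i.e. the pointers to the next-level tables.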
int hax_alloc_page_frame(uint8_t flags, hax_memdesc_phys *memdesc)
{
    PHYSICAL_ADDRESS low_addr, high_addr, skip_bytes;
    ULONG options;
    PMDL pmdl;

    // Fill in the PHYSICAL_ADDRESS structures
    low_addr.QuadPart = 0;
    high_addr.QuadPart = (int64_t)-1;
    skip_bytes.QuadPart = 0;
    // TODO: MM_ALLOCATE_NO_WAIT?
    options = MM_ALLOCATE_FULLY_REQUIRED;
    if (!(flags & HAX_PAGE_ALLOC_ZEROED)) {
        options |= MM_DONT_ZERO_ALLOCATION;
    }
    // This call may block
    // The allocation size is PAGE_SIZE_4K, enough for 512 table entries.
    // MmAllocatePagesForMdl or MmAllocatePagesForMdlEx allocates zero-filled,
    // nonpaged pages from main memory and returns an MDL describing the
    // allocation. A driver then uses MmGetSystemAddressForMdlSafe to map the
    // pages described by the MDL into kernel virtual address space.
    pmdl = MmAllocatePagesForMdlEx(low_addr, high_addr, skip_bytes,
                                   PAGE_SIZE_4K, MmCached, options);
    if (!pmdl) {
        hax_log(HAX_LOGE, "%s: Failed to allocate 4KB of nonpaged memory\n",
                __func__);
        return -ENOMEM;
    }
    memdesc->pmdl = pmdl;
    return 0;
}