6.033 2011 Lecture 8: Virtualization

today's plan
  last lecture: multiplexing a computer between multiple programs
  this lecture: multiplexing a computer between multiple operating systems

recall our original goal: to build complex programs, enforce modularity
  original sketch: OS kernel isolates processes, provides bounded buffer
  reality: OS kernels have gotten very complex, 13 MLOC for Linux
    if OS kernel bug is triggered, kernel might no longer provide isolation
    client and/or server might crash
    worse yet, processes might subtly break
    particularly worrisome: attacker exploits bug to gain control over kernel
      will look at security in more detail later in the course
    => would like to achieve stronger isolation between programs
  client and server programs may need different OS kernels (Linux vs Windows)
  client and server may need different versions (Windows XP; Linux 2.4 vs 2.6)

back to the obvious solution: run multiple machines.
  works well in part because each machine has its own independent kernel.
  quite widespread in practice.
  problem: expensive in terms of equipment cost, maintenance, power, etc.
  => we saw almost the same situation a few lectures ago!

would like to multiplex between multiple OS kernels on one computer.
  does the same technique work for context-switching between OSes?
    more things that need to be saved/restored by suspend and resume
    need to interpose on accesses to privileged hardware state
    instead of prohibiting privileged operations, need to make them work
  diagram
    hardware, virtual machine monitor (host), VM: guest kernel + guest processes
    show privileged state in hardware; one copy of such state for each VM
  state that needs to be saved:
    user/kernel bit
    interrupt descriptors
    interrupt masking flag
    + everything that we saved for a process: regs, SP, PC, page table register
    as before, no need to save memory because it's partitioned / protected

how does a virtual machine monitor run the guest OS?
  naive approach: interpret each instruction, could work but too slow.
  better approach: run most of the instructions directly on hardware,
    similar to how the OS kernel runs different processes on hardware.
  challenge: what to do about privileged instructions (paging, U/K bit, ..)?
    cannot allow guest kernel to access them directly: would not be isolated.
    we have to run the guest kernel with the U/K bit set to "user".
    but still need to implement their functionality.

virtualizing page tables
  draw diagram of paging in an OS kernel; virtual machine monitor below
    virtual machine monitor has its own translation that it wants to apply
    terminology for addresses: guest virtual, guest physical, host physical
  problem: guest page tables contain guest physical addresses
    cannot load the guest page table into hardware: will not work, no isolation
  solution: VMM constructs a shadow page table
    copy guest's page table, translate guest physical to host physical addrs
    if guest kernel issues instr to load page table reg, VMM loads shadow pt
  not quite enough: what if guest kernel modifies page table's contents?
    need to monitor the guest page table for changes
    mark all mappings for guest page table's pages as read-only, in shadow pt
    hardware will cause a trap into VMM if guest tries to modify page table
    VMM invalidates corresponding entry in shadow PT, restarts guest
  fill in shadow PT on demand
    if guest accesses address missing from shadow PT, check guest PT
    allows VMM to start running guest faster
    no need to fill in entire shadow PT, just start with empty shadow PT
  shadow paging can be slow, if guest modifies/switches page tables often
    recent hardware supports "nested paging"
    two sets of page tables implemented in hardware
    all "physical" addresses from 1st page table are translated using 2nd pt
    allow guest kernel to manipulate 1st page table in any way
    use 2nd page table to implement VMM's translation
    good for now, but not recursive -- what if we want to switch between VMMs?

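The shadow page table steps above (start empty, fill on demand, invalidate when the guest edits its page table) can be sketched as a toy model. This is only an illustration: page tables are modeled as dictionaries keyed by page number, and the names `ShadowPageTable`, `translate`, `guest_writes_pte`, and `PageFault` are made up for this sketch, not taken from any real VMM. It also assumes the VMM emulates the guest's store to its page table inside the trap handler.

```python
class PageFault(Exception):
    """A fault the VMM would forward to the guest kernel (hypothetical name)."""


class ShadowPageTable:
    """Toy VMM-side shadow page table, filled on demand."""

    def __init__(self, guest_to_host):
        # VMM's own translation: guest physical page -> host physical page.
        self.guest_to_host = dict(guest_to_host)
        # Guest's page table: guest virtual page -> guest physical page.
        self.guest_pt = {}
        # What the hardware would actually use: guest virtual -> host physical.
        self.shadow = {}
        # Counts on-demand fills, so the laziness is observable.
        self.fills = 0

    def load_page_table(self, guest_pt):
        """Guest issued 'load page table register': trap, start with empty shadow."""
        self.guest_pt = guest_pt
        self.shadow = {}

    def translate(self, vpage):
        """Model a guest memory access to virtual page `vpage`."""
        if vpage not in self.shadow:
            # Miss in the shadow PT: trap into the VMM and check the guest PT.
            gpage = self.guest_pt.get(vpage)
            if gpage is None or gpage not in self.guest_to_host:
                raise PageFault(vpage)
            self.shadow[vpage] = self.guest_to_host[gpage]
            self.fills += 1
        return self.shadow[vpage]

    def guest_writes_pte(self, vpage, new_gpage):
        """Guest stored to a read-only page-table page: trap into the VMM."""
        self.guest_pt[vpage] = new_gpage  # assumption: VMM emulates the store
        self.shadow.pop(vpage, None)      # invalidate the stale shadow entry
```

The shadow entry is rebuilt lazily on the next access rather than inside `guest_writes_pte`, which mirrors the "invalidate, then fill on demand" order in the notes.
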
virtualizing the user/kernel bit
  what does the user/kernel bit affect?
    whether privileged instructions can be executed
    whether kernel-only pages can be accessed via paging
  VMM keeps an extra bit, which determines whether to allow privileged instr
  maintain two shadow pts: one for virtual user, one for virtual kernel
    when the virtual U/K bit changes, VMM changes shadow pt accordingly

virtualizing interrupts
  VMM keeps virtual interrupt descriptor register
  when VMM wants to deliver interrupt to guest kernel, modify its stack..
  what about guest process executing an "int" instruction for system call?
    real CPU starts executing VMM's trap handler.
    VMM's trap handler must check the state of CPU when trap happened.
    if trap caused by guest process, simulate interrupt to guest kernel.

what happens to instructions that try to read privileged hardware state?
  read current U/K bit value, interrupt flag, page table register, ..
  the actual privileged state might be different than the "virtual" state
    e.g., guest OS thinks U/K bit is "K", but VMM sets U/K bit to "U"
  if these instructions are privileged, then hardware will trap to VMM.
    easy to emulate inside VMM, return desired value to guest.
  problem: on x86, some of these instructions are unprivileged.
    i.e., execute without trapping even when U/K bit is set to "user".
  step back for a second: why was this ok with regular user process?
    privileged state (e.g., pt register) was intended for use by OS kernel.
    well-behaved programs should not depend on its values for correctness;
      they were written to run on top of an OS kernel already.
    thus, programs shouldn't break if the OS kernel changes value.
    in contrast, the OS kernel expects specific behavior from hardware.
      e.g., if it writes a value to pt reg, read should return same value.
      kernels expect to access, modify privileged state.
      (most) kernels not written to run on top of VMM.

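For the case where the instruction does trap, the emulation described above might look roughly like the sketch below. It keeps one virtual copy of the privileged state per VM and answers reads and writes from that copy. The instruction names (`"read_ptr"`, `"write_ptr"`, `"read_uk"`, `"int"`, `"iret"`) and the `VirtualCPU` / `handle_trap` / `GuestFault` names are invented for illustration. Raising an exception stands in for reflecting a fault back to the guest kernel, which a real VMM would do instead.

```python
from dataclasses import dataclass, field


class GuestFault(Exception):
    """Stand-in for a fault that a real VMM would deliver to the guest kernel."""


@dataclass
class VirtualCPU:
    """Per-VM virtual copy of privileged state (illustrative fields only)."""
    kernel_mode: bool = True                 # virtual U/K bit; real bit stays "user"
    page_table_reg: int = 0                  # value the guest believes it loaded
    idt: dict = field(default_factory=dict)  # virtual interrupt descriptors
    pc: int = 0
    stack: list = field(default_factory=list)


def handle_trap(vcpu, instr, operand=None):
    """Hypothetical VMM trap handler: emulate `instr` against the virtual state."""
    if instr == "int":
        # System call from a guest process: simulate interrupt delivery by
        # saving the interrupted context and entering the guest kernel's handler.
        vcpu.stack.append((vcpu.pc, vcpu.kernel_mode))
        vcpu.kernel_mode = True
        vcpu.pc = vcpu.idt[operand]
        return None
    if not vcpu.kernel_mode:
        # Virtual U/K bit says "user": privileged instructions are not allowed.
        raise GuestFault(instr)
    if instr == "write_ptr":
        vcpu.page_table_reg = operand
        return None
    if instr == "read_ptr":
        # Return what the guest wrote, not the real (shadow) register value.
        return vcpu.page_table_reg
    if instr == "read_uk":
        return "K" if vcpu.kernel_mode else "U"
    if instr == "iret":
        # Return from the simulated interrupt to the saved context.
        vcpu.pc, vcpu.kernel_mode = vcpu.stack.pop()
        return None
    raise ValueError(f"unknown instruction: {instr}")
```

Because the guest only ever sees values from `VirtualCPU`, a write to the page table register followed by a read returns the same value, which is the behavior a guest kernel expects from hardware.
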
solution 1 for unprivileged instructions: para-virtualization
  change the guest kernel to make the VMM's job easier
  guest kernel knows it's inside a VMM, doesn't use these instructions
  is this still useful?
    probably yes: the needed changes can be pretty small.
    requires being able to change the guest kernel (hard for Windows).

solution 2 for unprivileged instructions: binary rewriting
  replace problematic instructions with another instruction that traps
  most archs provide such an instruction for debugging (e.g., "int3" on x86)
    int3 takes up 1 byte of memory, so it can go in place of any other instr
  when VMM receives a trap from the int3 instruction, look up original instr
    simulate effects, update guest's virtual state, resume execution
  need to be careful in case guest reads its own instruction memory
  in practice, more complicated rewriting used
    instead of int3, insert instructions to implement necessary operation
    no performance penalty due to extra traps

solution 3 for unprivileged instructions: hardware support
  recent Intel, AMD CPUs allow VMM to request traps for problematic instructions
  as an optimization, recent CPUs can also execute problematic instructions
    using a virtual copy of the privileged state, instead of the real one.
    avoids overhead of trapping into VMM for each such instruction.

virtualizing devices
  kernel provides to processes: file system, bounded buffers
  VMM provides to kernel: virtual versions of a disk, network card
  VMM traps on guest I/O instructions that talk to devices
  VMM supplies values from a virtual disk or network card device

summary
  different OS kernels, each kernel might be complex and buggy
  can use CPU virtualization again to multiplex between several OSes
  key problem: virtualizing privileged state
  main approaches: trap & emulate; binary translation; hardware support

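As a small illustration of the int3 variant of binary rewriting (solution 2), the sketch below patches one instruction in a byte buffer, looks up the original on a breakpoint trap, and hides the patch when the guest reads its own code. The class and method names are invented for this example, and `pushf` (opcode 0x9C, which exposes the interrupt flag without trapping) is used only as a plausible problematic instruction. Actually simulating the instruction's effects is left out.

```python
INT3 = 0xCC  # x86 breakpoint opcode, one byte long


class Int3Rewriter:
    """Toy model of int3-based rewriting over a buffer of guest code."""

    def __init__(self, code):
        self.code = bytearray(code)  # instruction memory as the real CPU sees it
        self.original = {}           # patched address -> original instruction bytes

    def patch(self, addr, length):
        """Overwrite the first byte of the instruction at `addr` with int3."""
        self.original[addr] = bytes(self.code[addr:addr + length])
        self.code[addr] = INT3

    def on_breakpoint(self, addr):
        """VMM trap handler: return the original instruction and the resume address."""
        instr = self.original[addr]
        return instr, addr + len(instr)

    def guest_read(self, addr):
        """Guest reads its own code: it must see the original byte, not int3."""
        if addr in self.original:
            return self.original[addr][0]
        return self.code[addr]


# Usage: nop, pushf, nop; patch the pushf at offset 1.
rewriter = Int3Rewriter(bytes([0x90, 0x9C, 0x90]))
rewriter.patch(1, 1)
```

Only the first byte of each patched instruction changes, which is why a one-byte trapping instruction can stand in for an instruction of any length.
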