The goal of the system is to support unikernel use cases, offering good real-time guarantees and good general performance while providing strong language-based safety guarantees.
The implementation language is Rust, which normally enforces rules ensuring that user code cannot violate memory safety or (certain kinds of) thread safety, while permitting explicitly marked 'unsafe' code to perform operations the compiler cannot prove are safe. At the same time, programs written to run on a conventional operating system should be portable to this system with minimal effort.
Compared to existing unikernels, this system offers an advantage on one or more of these goals. Unikernels built to run programs in high-level languages inherit unpredictable performance characteristics as a consequence: in Mirage or HaLVM, for example, the target languages (OCaml and Haskell, respectively) provide memory safety via a garbage-collected runtime.
Others, including OSv and the NetBSD-based "rump kernel" framework, are built on very large bases of existing unsafe (usually C or C++) code which is difficult to verify. Though that code is typically mature and free of major bugs, each of these systems likely contains undiscovered bugs. A system designed to minimize the volume of unsafe code in which bugs may lurk is less likely to contain surprises than one built entirely from unsafe code.
Non-Rust components of the system are kept to a minimum; there are currently only three: the boot code and two C libraries.
The boot code is used only for system initialization, so its correctness is not relevant to runtime correctness. The two C libraries are assumed to be correct and well-tested, since both come from mature projects and have relatively limited API surfaces (most if not all contained functions are pure and depend only on their inputs).
Though the "escape hatch" exists to bypass compiler-enforced safety, where possible the internal APIs are designed to provide similar guarantees: an API consumer cannot cause unexpected behavior through API misuse, since the compiler will not allow it.
In those portions of the system which must perform unsafe operations, the implementations include assertions regarding the observable properties of inputs, which should catch some bugs. While these alone are unable to assert the correctness of the caller, they can catch some particularly harmful misuses.
When combined with adherence to the principle that the footprint of unsafe code should be as small as possible, this provides good confidence that the system as a whole is correct. Additional verification through testing or other approaches can further establish that the system's assumptions hold in the cases where they cannot be automatically verified.
PhysPtr<T>
One example of an API designed to statically prohibit misuse is PhysPtr<T>, a wrapper around pointers which represent physical (rather than virtual) addresses. In typical C, such a type would usually be implemented as an alias of an untyped pointer (void *), which can be easily misused, such as by treating the pointer as a virtual address or changing the pointed type to an incompatible one.
PhysPtr<T> prevents both of those misuses in safe code. The type parameter T is invariant, so the pointed-to type cannot be changed. Retrieving the value of the physical pointer as a virtual address (which may then be dereferenced and is known to refer to the correct physical memory) is a safe operation, while extracting the literal value of the pointer without VA translation may only be done in unsafe code.
Dereferencing a raw pointer formed by translating a physical address to a virtual one must still be performed in unsafe code, because the compiler cannot track the validity of the target; but code written to minimize the amount of unsafe code is prevented from incorrectly skipping the translation step. Where required, it remains possible to extract the untranslated pointer value.
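A minimal sketch of how such a wrapper might look is shown below; the names, the constant, and the exact methods are illustrative rather than the actual implementation:

use core::marker::PhantomData;

// Hypothetical base of the ID-mapped view of physical memory; the real
// value depends on the memory map described later.
const ID_MAP_BASE: usize = 0xFFFF_8000_0000_0000;

/// A pointer holding a physical (not virtual) address.
pub struct PhysPtr<T> {
    addr: usize,
    // *mut T makes the parameter T invariant, so the pointed-to type
    // cannot be changed through variance.
    _marker: PhantomData<*mut T>,
}

impl<T> PhysPtr<T> {
    /// Unsafe: the caller must guarantee `addr` is a valid physical address.
    pub unsafe fn new(addr: usize) -> Self {
        PhysPtr { addr, _marker: PhantomData }
    }

    /// Safe: translate into the ID-mapped window, yielding a virtual
    /// pointer that refers to the correct physical memory.
    pub fn as_virt(&self) -> *mut T {
        (ID_MAP_BASE + self.addr) as *mut T
    }

    /// Unsafe: expose the raw physical value with no translation.
    pub unsafe fn into_phys(self) -> usize {
        self.addr
    }
}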
As is typical in unikernels, the software ignores hardware support for privilege isolation, instead opting to run all code at the highest privilege level, equivalent to that of the kernel in a typical monolithic OS design. The unikernel model is designed to never run arbitrary untrusted code, generally limiting itself to code included at compile-time which is assumed to be trustworthy.
As a consequence of all code being mutually trusted, there is no equivalent of process isolation. All threads execute in a single shared address space, and hardware interrupts do not require independent stacks: a typical OS must switch stacks when handling an interrupt to guard against malicious corruption by other threads in the same process (which implies allocating storage for that separate stack), but when all code is trusted this threat need not be guarded against.
The system memory map is uniform for all threads, using virtual memory only to hide the typically fragmented view of physical memory on x86 and to simplify preventing memory corruption through stack overflow. This approach also avoids the complexity of maintaining a record of allocated physical memory.
Since the majority of the memory map is static across all execution contexts, rescheduling of threads does not incur significant costs in TLB invalidation. Additionally, because the only dynamic portion of the virtual memory mappings is maintained on a per-thread basis, inter-CPU TLB shootdowns are completely avoided in SMP configurations.
At boot time, firmware or a boot loader loads the boot sector from disk with the CPU in real mode. The boot sector loads additional boot code and the long-mode kernel image, which is placed at the 2-megabyte mark in physical memory. This image contains all of the system's code (except the boot code) and is not relocatable; runtime loading of code is deliberately not supported.
Boot code acquires information about the system memory map from the firmware and stores it at a location in memory accessible to the kernel image for further initialization. It then switches into 64-bit (long) mode and jumps to kmain, the kernel entry point, with virtual memory configured to provide a 1:1 mapping to physical memory in the first 512GB of address space.
kmain
kmain uses the memory map provided in low memory by the boot code to set up the normal runtime memory map, which is laid out as follows.
The unmapped 2-megabyte region at address 0 exists out of an abundance of caution, permitting attempted null pointer dereferences to be caught. This should serve no purpose for 'safe' Rust code, but may be useful for detecting certain bugs in 'unsafe' code. It also simplifies the code which bootstraps this memory map: while still mapped during early boot, the region is used as scratch memory for setting up the other regions, holding page tables and the execution stack.
The memory pool is built from all available physical memory, excluding the parts occupied by the kernel code. This is where all non-static storage exists.
The thread stack region is mapped on a per-CPU basis, with the mapping changing according to the currently-executing thread such that the thread's stack (allocated from the memory pool) starts at the top of memory and grows downward. Past the end of the allocated stack, memory is unmapped to protect against stack overflows. Inside the thread stack region a page fault will trigger allocation of additional memory up to a limit. The unmapped region below thread stacks acts as a guard page, placing a hard size limit of 511GB on any single thread's stack.
Page tables for all but the stack region can be shared by all threads, so the minimum per-thread memory requirement is three pages: two for page tables, and one for stack. The thread's stack will be initialized some distance below the top of memory, using the area above it as storage for context-switching.
The stack pages are used in this way to minimize the memory footprint of threads, so threads which never have very deep call stacks require less memory. This makes applications utilizing large numbers of simple or short-lived threads more feasible than they otherwise would be.
With the standard memory map configured, the system allocates from the memory pool (with the default allocator) to hold the page tables, which to that point have been placed in the low-memory unmapped region. The active page tables may then be updated to point into the memory pool.
The ID-mapped region is used to access memory where only physical addresses are available, such as when modifying page tables (which contain physical addresses, rather than virtual ones). Translating a virtual address to a physical one requires only walking the page tables, but going the opposite direction requires a scan of all page tables. Since the code always operates with virtual addresses, the ID-mapped region allows constant-time access to arbitrary physical addresses: adding a fixed offset to a physical address yields a virtual address which maps to it.
Finally, a thread context is created for the bootstrap thread, which after this point is treated like any other thread. Necessary storage is allocated from the memory pool, including space for registers on context switch, stack space and page tables.
Finally the executing code's stack is moved into the stack region, eliminating the final user of "scratch" memory in the low-memory region. The page mappings for that region can be removed and remapped into the memory pool, or simply discarded.
The peripherals and external hardware to be used with any given instance of the system are specified at compile time via configuration. This exposes details of the hardware to the compiler, allowing better optimizations; compare this to a dynamic configuration, where the required dynamic dispatch through function pointers is by necessity opaque to the compiler.
Rust permits concrete implementations of a generic interface to be selected while ensuring compliance with the interface, through compile-time monomorphization: operations meant to work with any implementation of an interface can be made generic over that interface, and the compiler monomorphizes them for each concrete implementation used. It is also possible to request dynamic dispatch through a generated vtable, but this is only rarely necessary.
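As an illustration (with hypothetical names, not the system's actual interfaces), a console routine might be written against a trait describing any byte-oriented output device; the compiler then monomorphizes and typically inlines it for the concrete driver named in the configuration:

/// Illustrative interface for any byte-oriented output device.
pub trait CharDevice {
    fn write_byte(&mut self, byte: u8);
}

/// Hypothetical driver for the 8250 UART selected in the configuration.
pub struct Uart8250 {
    ioaddr: u16,
}

impl CharDevice for Uart8250 {
    fn write_byte(&mut self, byte: u8) {
        // A real driver would poll the line-status register and issue
        // port I/O here; elided in this sketch.
        let _ = (self.ioaddr, byte);
    }
}

/// Generic over any CharDevice; a separate, fully optimizable copy is
/// produced for each concrete device type used at compile time.
pub fn write_line<D: CharDevice>(dev: &mut D, line: &str) {
    for &b in line.as_bytes() {
        dev.write_byte(b);
    }
    dev.write_byte(b'\n');
}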
Other unikernels based on full operating systems must use dynamic dispatch to support a wide variety of hardware, and the language in use (generally C) lacks the expressiveness to eliminate that dispatch. Those supporting smaller sets of hardware are in practice the same ones which require runtime support for their language, which tends to preclude optimization to a similar level.
While a program running on Mirage, for example, must at minimum call into the runtime library before a request is placed in a disk queue, a program on my system may have all of the relevant operations inlined to the point where the only action actually taken by the optimized code is to copy parameters directly into the disk queue. For applications performing large amounts of I/O, the savings could be substantial.
The actual compile-time configuration for system hardware is taken from a plaintext specification which is translated into code and compiled into the binary. This configuration is envisioned to be not unlike Linux's device tree, where the presence of a particular piece of hardware is specified in a declarative manner. Software policies may be specified alongside the hardware.
A simple example might specify a system with two hardware serial ports, used for the system console and for debugging respectively, and a do-nothing network adapter which is declared as a plain resource rather than as the default network device:
[console.default]
type = "serial"
device = "COM1"

[debugger.default]
type = "serial"
device = "COM2"

[[resource]]
type = "network"
device = "nullnet"

[[device]]
name = "COM1"
type = "serial"
driver = "8250"
format = "9600,8n1"
ioaddr = 0x3F8

[[device]]
name = "COM2"
type = "serial"
driver = "8250"
format = "115200,8n1"
ioaddr = 0x2F8

[[device]]
name = "nullnet"
type = "network"
driver = "null"
Multiple software devices are not permitted to refer to the same hardware, and compiling the configuration into code involves verifying that the specified devices do not overlap, for example ensuring that the COM1 and COM2 devices above do not use overlapping regions of the I/O address space, as sketched below.
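A sketch of the kind of check the configuration compiler can perform, assuming each device declares the I/O address range it occupies (the names and range lengths here are illustrative):

fn overlaps(a: (u16, u16), b: (u16, u16)) -> bool {
    // Half-open ranges [start, start + len) overlap if each starts
    // before the other ends.
    a.0 < b.0 + b.1 && b.0 < a.0 + a.1
}

/// Each entry: (device name, first I/O address, length of the range).
fn check_no_overlap(devices: &[(&str, u16, u16)]) -> Result<(), String> {
    for (i, a) in devices.iter().enumerate() {
        for b in &devices[i + 1..] {
            if overlaps((a.1, a.2), (b.1, b.2)) {
                return Err(format!(
                    "devices {} and {} claim overlapping I/O addresses",
                    a.0, b.0
                ));
            }
        }
    }
    Ok(())
}

// For the example above, assuming an 8250 UART occupies eight consecutive
// ports: check_no_overlap(&[("COM1", 0x3F8, 8), ("COM2", 0x2F8, 8)]) succeeds.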
Interrupt vectors are automatically (statically) allocated as required by drivers. Once again, this static allocation exposes more information to the compiler, allowing better optimizations.
The current code lays groundwork for more useful capabilities, notably memory allocation and threading. Under the assumption that applications may require additional memory or CPUs at short notice (such as in the case of a service seeing a sudden increase in traffic), CPUs and memory are detected dynamically at boot-time. All other hardware is statically configured.
Pages are allocated with the coarsest granularity (largest page size) possible, both to minimize the number of TLB entries required to access any arbitrary region of memory (thus improving TLB hit rates) and to minimize the amount of "wasted" memory needed to store page tables.
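For instance, a mapping routine might greedily cover each region with the largest page size that both fits and is correctly aligned; a sketch assuming the x86-64 page sizes:

/// x86-64 page sizes, largest first: 1 GiB, 2 MiB, 4 KiB.
const PAGE_SIZES: [u64; 3] = [1 << 30, 1 << 21, 1 << 12];

/// Pick the largest page size usable at `addr` for a region with
/// `remaining` bytes left to map, or None if nothing fits.
fn largest_page(addr: u64, remaining: u64) -> Option<u64> {
    PAGE_SIZES
        .iter()
        .copied()
        .find(|&size| addr % size == 0 && remaining >= size)
}

// Mapping 1 GiB + 2 MiB starting at a 1 GiB boundary then takes one 1 GiB
// entry and one 2 MiB entry instead of hundreds of thousands of 4 KiB entries.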
Some concurrency primitives (mutexes) are implemented, but only as required for resources which cannot be shared (particularly hardware used for I/O). Atomic primitives are provided by Rust's libcore, so the hardware requirements for additional concurrency primitives are easy to meet.
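A minimal sketch of the kind of spinlock this implies, built only on libcore's atomics (the real implementation may differ, for instance by also masking interrupts while the lock is held):

use core::cell::UnsafeCell;
use core::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// Safety: the lock serializes all access to the protected data.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub const fn new(data: T) -> Self {
        SpinLock {
            locked: AtomicBool::new(false),
            data: UnsafeCell::new(data),
        }
    }

    /// Run `f` with exclusive access to the protected data.
    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Spin until the flag is atomically flipped from false to true.
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            core::hint::spin_loop();
        }
        // Safety: the flag is held, so no other thread can reach the data.
        let result = f(unsafe { &mut *self.data.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}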
To support the implementation of interrupt handlers, I have proposed two different approaches for inclusion in the Rust compiler. The first of these, support for naked functions, was initially deferred but is being reconsidered following the second proposal (additional calling conventions) and calls for the feature from others.
Support for naked functions allows the programmer to implement the entry and exit sequences for hardware ISRs without the compiler inserting its own stack-frame maintenance code, which would otherwise corrupt the state being saved.
The safety of such functions is difficult to enforce, since the compiler in general assumes the presence of a usable stack frame, the control of which is explicitly disabled in a naked function. This has been the subject of most discussion relating to this proposal.
Difficulty in addressing these concerns left the RFC unaccepted for a time, though it is now being considered again following demand for the same feature from other users.
Naked functions allow the programmer to write ISRs without involving external tools or resorting to unidiomatic constructs in their code, but are in some respects suboptimal since the compiler cannot be aware of the required actions and thus is unable to optimize effectively.
An alternative approach, giving the compiler (LLVM in this case) knowledge of the requirements placed on an ISR, allows better optimization and removes the room for user error inherent in writing an ISR as a naked function.
I implemented a proof-of-concept patch for x86 hardware interrupts in LLVM as a custom calling convention, which was met with interest; a more fleshed-out approach, inspired by my patch, was later proposed on the Clang developers' mailing list.
The Rust RFC for interrupt calling convention support was, after some discussion, deferred due to limited utility and the relatively wide-reaching changes required to support it in a portable fashion (the user details of doing so depend on the target system, so the feature has a large user-visible footprint).
Interest in this feature among LLVM developers has resurged somewhat since my initial patch, but in the short term I do not expect it to be useful to my project. In the longer term, having good support for writing ISRs in LLVM (which already implements similar support for some common embedded architectures) may trigger additional discussion regarding its inclusion in Rust.
Compiler-mediated ISRs like this are novel among systems targeting complex architectures like x86 (in contrast to ARM, for example, where the hardware manages most ISR state and some compilers offer annotations), and they give the optimizer appreciably better visibility, implying the capacity for better performance. As a simple example, the compiler can spill only the small number of registers an ISR actually uses on entry, whereas a naive implementation must conservatively spill all registers because it cannot know which are used.
The immediate goals are implementing an allocator backed by the memory pool and, subsequently, threading. These modules should be pluggable like other components such as the console and debug drivers, so that the default system allocator and thread scheduler can be specified at compile time.
Rust recently gained support for pluggable allocators (not requiring a custom build of the compiler or patches to the libraries), which allows libstd to target platforms that are not officially supported by providing a custom allocator. This allows parts of the standard library, such as common data structures, to be used even in a bare-metal environment, and consequently enables better portability of applications. Implementing this support is desirable, but requires some changes to my allocator APIs (which are currently experimental in any case).
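As a minimal illustration of the hook involved, the sketch below plugs a trivial bump allocator over a fixed buffer into Rust's #[global_allocator] mechanism; it is purely illustrative and not the planned pool-backed allocator, and the interface shown is the one in current Rust, which may differ in detail from the support referenced above:

use core::alloc::{GlobalAlloc, Layout};
use core::cell::UnsafeCell;
use core::sync::atomic::{AtomicUsize, Ordering};

const HEAP_SIZE: usize = 64 * 1024;

struct BumpAllocator {
    heap: UnsafeCell<[u8; HEAP_SIZE]>,
    next: AtomicUsize,
}

// Safety: the allocation cursor is managed through the atomic counter,
// and handed-out regions never overlap.
unsafe impl Sync for BumpAllocator {}

unsafe impl GlobalAlloc for BumpAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        loop {
            let current = self.next.load(Ordering::Relaxed);
            // Round the cursor up to the requested alignment.
            let offset = (current + layout.align() - 1) & !(layout.align() - 1);
            let end = offset + layout.size();
            if end > HEAP_SIZE {
                return core::ptr::null_mut(); // out of memory
            }
            if self
                .next
                .compare_exchange(current, end, Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
            {
                return (self.heap.get() as *mut u8).add(offset);
            }
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // A bump allocator never frees; a real allocator would return
        // the memory to the pool.
    }
}

#[global_allocator]
static ALLOCATOR: BumpAllocator = BumpAllocator {
    heap: UnsafeCell::new([0; HEAP_SIZE]),
    next: AtomicUsize::new(0),
};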
Implementing tests for unsafe code will strengthen the safety guarantees of the system. Conventional testing techniques could be used given appropriate API design, but more sophisticated tools such as KLEE might be useful for simplifying testing. KLEE is a good prospect because it operates on LLVM bitcode, which rustc can emit, but it may not be feasible due to mismatched LLVM versions between the two tools. Ideally all unsafe code would have tests, but in practice this will probably not be possible.
Some interfaces, particularly those with hardware, can be adapted from other codebases without appreciably weakening the safety guarantees. Most code that interacts directly with hardware depends on correctly interpreting hardware specifications, which cannot be verified automatically, and such code must in general be unsafe; reusing an existing, mature hardware abstraction layer such as DPDK for networking therefore makes a more convincing safety argument than reimplementing the same functionality in new unsafe code. The amount of code adopted this way must still be carefully controlled to minimize the total unsafe footprint; in effect, any code written in C should be treated as unsafe.