NeverSawUs

Understanding WASM

Part 3: You Are Here

(This is part 3 of a series. See part 1, "Virtualization" and part 2, "Whence WASM".)

If WASM+WASI existed in 2008, we wouldn't have needed to create Docker. That's how important it is. WebAssembly on the server is the future of computing. A standardized system interface was the missing link. Let's hope WASI is up to the task!

When we left off last time, we were just about to dig into what "Operating System Support" might mean for a virtual instruction set architecture. Solomon Hykes claimed that if WASI —the WebAssembly System Interface— had existed in 2008, he and his companions at dotCloud1 wouldn't have had to invent Docker. Given the accomplishments of Docker over the last decade, that's pretty wild!

To understand what a "standardized system interface" might do for us, I wanted to first understand system interfaces— how we virtualize them, how they came to be, and how they help us work together.

Does WASI represent a bridge to the future for existing applications, or a destination?

To understand the future of computing, let's look at its past.


Processes and Virtual Machines

Processes are a universal abstraction in modern operating systems. They comprise three capabilities:

  1. A continuous view of processing resources that initially includes a single advancing program state with the ability to spawn additional, concurrent states against the same program within the process ("threads".)
  2. A contiguous address namespace, into which memory may be allocated by the operating system upon request. This initially contains the program instructions (or "text"), environment variables and arguments, read-only (or "static") data, space reserved for structures to be initialized at startup ("block starting symbol", or "bss"2), and room for a "stack" of activation frames. A number of system libraries may also be loaded by the operating system into this address space.
  3. A system interface, made available through a set of supervisory calls ("system calls".) Typically this is implemented through use of dedicated syscall or system interrupt instructions (e.g., int 0x80, in x86 assembly) available through the instruction set architecture. The use of supervisory calls is dictated through calling conventions, including what registers must be saved, where parameters to the syscall are placed—in memory or on registers— and how results are retrieved after control returns to the process thread. These conventions describe an "application binary interface", or "ABI".
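To make that ABI concrete, here's a minimal sketch — x86-64 Linux assumed — of invoking the write system call directly through the dedicated syscall instruction. The convention puts the syscall number in rax and the arguments in rdi, rsi, and rdx; the result comes back in rax.

    /* Hand-rolled write(2), following the x86-64 Linux syscall convention. */
    static long raw_write(int fd, const void *buf, unsigned long len) {
        long ret;
        __asm__ volatile (
            "syscall"
            : "=a"(ret)                           /* result returned in rax */
            : "a"(1), "D"(fd), "S"(buf), "d"(len) /* 1 is SYS_write on x86-64 */
            : "rcx", "r11", "memory"              /* clobbered by syscall */
        );
        return ret;
    }

    int main(void) {
        const char msg[] = "hello from the system interface\n";
        raw_write(1, msg, sizeof msg - 1);        /* file descriptor 1 is stdout */
        return 0;
    }

In practice the C library hides all of this behind functions like write(), but this register-level convention is the contract the process and the kernel agree on.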

This is a virtual view of the physical machine's resources: an "extended machine"3. The operating system is responsible for enforcing the illusion that each process operates with independent, full access to the system's resources. This illusion is constructed through the operating system's careful orchestration of the machine's processor features — the processor's memory mapping hardware, interrupts and traps, mode and protection rings, and use of privileged instructions.

As widespread as the concept of a process is today, it wasn't always so: processes and operating systems emerged alongside virtual machines in the '60s.

Hardware and software design were, at the start of the 1960s, intertwined. It was common practice to design parallel computers —those with a number of processors— by assigning a program to each processor, along with a range of physical memory to be accessed by that processor. If, during development, programmers discovered that one process on the system needed additional space, all other processes had to be reprogrammed to accommodate the new memory partitions. Similarly, if one process went idle, no other processes could make use of that idle processor hardware. Programs had to be submitted in batches in order to maximize utilization of the hardware. Time-sharing systems would change that.

In 1963, DARPA, newly flush with cash4, funded research on time-sharing operating systems at MIT through "Project MAC"5. MIT, where the project was housed, brought its "Compatible Time Sharing System" ("CTSS"6) to the effort and, along with Bell Laboratories and General Electric, began work on a successor system called "MULTICS".

That term we introduced earlier, "extended machine", originated from this work. It described a machine that was "easier to program" than the underlying physical machine. The role of a "monitor" (or "nucleus", or "kernel") was to support these extended machines by using the "bare machine interface", or hardware, directly. The monitor scheduled user programs to minimize costly idle time.

The extended machine typically supported virtual memory, supervisory calls, and protection rings. Virtual memory was developed so that programs could be written in isolation, without advance knowledge of other programs running on the system.

This was accomplished by introducing indirection: the address namespace of a program no longer mapped directly to the physical address space of memory. Three techniques were used to accomplish this indirection: paging, segmentation, or the combination of the two.

More on virtual memory:

With paging, both the process and physical address spaces are subdivided into "pages" of memory. Each page represents several hundred "words" of memory7. Accesses to "non-resident" pages in the process address space trigger hardware faults. The kernel sets traps for those faults: the fault transfers control to the kernel, which loads the missing page into physical memory to complete the mapping. Control then transfers back to the process8.

Illustration from "Segmentation and the Design of Multiprogrammed Systems", Jack B. Dennis, 1965. "N" represents the process address namespace, "M" the physical namespace. Note how the contiguous "N" namespace maps to a discontiguous namespace in "M".
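Demand paging is visible from userland, too. A minimal sketch (Linux/POSIX assumed): mmap reserves a large anonymous region, but the kernel only backs each page with physical memory when it is first touched — every first touch raises a fault that the kernel services transparently, exactly the trap-and-fill dance described above.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;   /* reserve 1 GiB of address space */
        unsigned char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Touch one byte in each of the first 16 pages. Each first touch
         * faults a fresh physical page into residence; the untouched
         * remainder of the gigabyte never consumes physical memory. */
        for (size_t i = 0; i < 16; i++) {
            region[i * 4096] = 1;
        }

        printf("reserved %zu bytes, faulted in only a handful of pages\n", len);
        munmap(region, len);
        return 0;
    }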

Memory segmentation was also used to support virtual memory. Segmentation has the advantage of allowing larger address namespaces than the native computer word size would otherwise be capable of representing. Consider that a 16-bit word can only represent values from 0 to 65535. One popular segmentation scheme addresses 20-bit values. It accomplishes this by holding another 16-bit value in a "segment register", shifting it left by 4 bits (multiplying it by 16) and adding the resulting value to the base offset held in the operand register. This allows addressing up to 1MiB of memory. When the segment register changes, the entire segment of memory is made resident at once — which could be a single word of memory or a significant subset of physical memory. Some hardware supports transferring control to the kernel when the segment register changes, which allows the implementation of virtual memory purely through segmentation.

Illustration from "Segmentation and the Design of Multiprogrammed Systems", Jack B. Dennis, 1965. The left diagram illustrates using a word address to index into the segment. The right diagram illustrates the namespace of segments.
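The arithmetic is small enough to sketch directly — a worked example of the 16-bit segment:offset scheme described above, where the segment register is shifted left by 4 bits and added to the offset to form a 20-bit physical address.

    #include <stdint.h>
    #include <stdio.h>

    /* physical = segment * 16 + offset */
    static uint32_t physical_address(uint16_t segment, uint16_t offset) {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void) {
        /* 0x1234:0x5678 -> 0x12340 + 0x5678 = 0x179B8 */
        printf("0x1234:0x5678 -> 0x%05X\n", physical_address(0x1234, 0x5678));

        /* The largest reachable address, 0xFFFF:0xFFFF, lands just past the
         * 1 MiB boundary — the famous "A20 line" overhang. */
        printf("0xFFFF:0xFFFF -> 0x%05X\n", physical_address(0xFFFF, 0xFFFF));
        return 0;
    }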

The variable size of segments could lead to conflict —or thrashing— between processes whose segments happened to overlap. Given the much expanded range available to them, desktop 64-bit processors tend to use paging by itself to implement virtual memory. Processors of the era we're discussing used both techniques in conjunction to combine the advantages of paging with the ability to increase the addressable memory space.

Finally, protection rings change the instruction set architecture available to the program, trapping or disallowing use of privileged instructions having to do with memory mapping, input/output ("I/O") devices, or the manipulation of software timer interrupts. The operating system runs in a protection ring with higher privileges. Processes run in a lower ring. Processes may request services from the operating system, like memory allocation or I/O, by making supervisory calls. System requests are implemented using interrupt instructions or dedicated syscall instructions available in the instruction set architecture of the processor.

(For more on all of this, check out cpu.land and Phil Opp's "Writing an OS in Rust" series).

Project MAC's contemporaries included Project GENIE at UC Berkeley and the IBM System/360 and 370. These technologies were built in competition with MULTICS; IBM found that their university bids were being out-competed by time-sharing systems spun out from Project MAC. This led them to develop early whole-machine virtualization software.

While extended machines made it possible to safely create and run relocatable programs, much research revolved around the development of new operating systems. This was more difficult than developing a program for an extended machine, as only a single kernel could run at a time. This led to the development of "pseudo-machines" or "virtual machine monitors", which used the bare machine interface to provide many copies of that same bare machine interface for the purposes of kernel development. Further, these virtual machine monitors could be nested, so long as the appropriate resource mappings were set up in hardware ahead of time. There was some debate around "pure" virtual machine monitors as opposed to "impure" VMMs at this time! "Impure" VMMs presented an extended bare machine interface to the guest; the guest machine was aware of the virtualization.

IBM implemented virtual machine monitors on the System/370 using a feature of the processors' protection rings. Whenever a non-privileged program performed a privileged operation, the processor hardware would fault. The VMM trapped these privilege faults, mapping incoming requests for resources and operations to appropriate backing resources without the knowledge of the guest operating system. In his June 1974 article "A Survey of Virtual Machine Research", Robert P. Goldberg noted that the primary difficulty in implementing efficient virtualization of machines lay in the lack of comprehensive hardware support for trapping "privilege-sensitive" instructions. Indeed, the following month, he and Gerald J. Popek proposed a definition for "virtualizable architecture" in "Formal Requirements for Virtualizable Third Generation Architectures"9.

Illustration of VMM, extended machine, and interfaces from "A Survey of Virtual Machine Research", Robert P. Goldberg, COMPUTER magazine, June 1974.

So why did virtual machine monitor research halt for so many years?


The Modern Operating System

The Mansfield amendments, passed 1969 and 1973, narrowed the scope of Department of Defense funds to projects with direct military applications. This cut public funding for operating systems research. The General Electric 635 used to build MULTICS at MIT cost as much as a passenger jet. MULTICS had been over-budget and behind schedule for years; Bell Labs pulled out of the MULTICS project in 1969. The aftermath of the OS research era left many useful ideas floating in the ether, while virtual machine monitor research would freeze until the late 1990s10.

Many researchers from the Project MAC days joined up at Xerox PARC. There they developed the Xerox Alto and Smalltalk, which prefigured the modern personal computer. By the late 1970s, commercial personal computers were available to consumers: the Apple II, Tandy TRS-80, and others. The IBM PC launched in 1981, the Apple Lisa in 1983, and the Macintosh in 1984. The earliest versions of these computers only ran a single program at a time, typically in conjunction with a disk operating system ("DOS"); later they would run many processes cooperatively. There was only one user and thus, no need to worry about time-sharing — these systems could be written to assume cooperation between all programs running on the machine11. In 1982, Intel released the first commercial chips capable of protected mode operation & on-die memory mapping in the form of the 80286 processor12. Microsoft, Intel, and IBM's personal computers rapidly chipped away at the market for time-shared minicomputers from the consumer side, while high-performance workstations made inroads on the commercial side.

For the most part, those workstations ran a variant of UNIX.

Dennis Ritchie, Ken Thompson, Douglas McIlroy, and Joe Ossanna developed UNIX at Bell Laboratories in the aftermath of Bell's strategic retreat from the MULTICS project. UNIX benefited from its circumstances: a small team13 with little supervision working on a comparatively cheap computer (the PDP-7) proved able to incorporate the best ideas in operating systems research rapidly. UNIX did not start out as a time-sharing system, nor was it a preemptive operating system. It wasn't much more than a filesystem supporting a game at first.

Back around 1970-71, Unix on the PDP-11/20 ran on hardware that not only did not support virtual memory, but didn't support any kind of hardware memory mapping or protection, for example against writing over the kernel. This was a pain, because we were using the machine for multiple users. When anyone was working on a program, it was considered a courtesy to yell "A.OUT?" before trying it, to warn others to save whatever they were editing.

The initial system was multiprogrammed — that is, there were two time-shared shell processes running, one for each terminal connected to the machine. When the shell executed another program, it would read that program's file in over the top of the shell code and start executing it. The exit() syscall would reload the shell program over the top of the exiting program and restart execution. Support for a tree of processes was added rapidly, however.

UNIX inherited Conway's fork/join semantics from Project GENIE as fork/exec: one command to make a duplicate of the current process as a child of the forking process, and a second command run in that child process to replace the duplicate with the target code. This model of copying processes using prototypal inheritance directly enabled the container models we'll talk about shortly.
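Here's the shape of that model in a minimal C sketch: fork duplicates the calling process, and the child then uses exec to replace its copy of the program with a new one.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();                  /* duplicate the current process */
        if (child < 0) {
            perror("fork");
            return 1;
        }
        if (child == 0) {
            /* In the child: replace the duplicated image with a new program. */
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");                  /* only reached if exec fails */
            _exit(127);
        }
        /* In the parent: wait for the child to finish. */
        int status;
        waitpid(child, &status, 0);
        printf("child exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }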

By 1977 work was underway to port UNIX from the PDP-11 to the Interdata 8/32, as we noted way back in the first post in this series. (The work had been validated by porting a copy of UNIX to an IBM System/370 virtual machine.) With the support of Sun Microsystems, IBM, HP, and others, UNIX and C had swept through the industry by the late eighties. Consumer operating systems were converging on preemptive, protected time-sharing, buoyed by advances in Intel's commercial hardware. Hardware virtualization features, if supported, primarily existed to support older DOS programs. Entering the 90's, "virtual machine" came to refer primarily to language virtual machines, like Smalltalk and Erlang.


The Internet

The internet changed things (he said, blandly.)

The internet meant unbound connectivity, which meant servers had to support unprecedented numbers: of processes, of users, of connections. Contemporary operating system approaches were too rigid to bear the added pressure: scaling and consolidation were both difficult, and relocating processes was nearly impossible.

Going into the 90's, the ever-problematic, ever-popular x86 ISA was thought impossible to virtualize efficiently, but pressure was building behind that cork.

These problems spurred a renaissance of virtualization research.


Modern Virtual Machines

Scaling a website to meet traffic demands meant procuring and provisioning hardware. This could not be done quickly. Internet companies were required to expend considerable money well in advance of expected traffic and were largely unable to recoup costs if that traffic did not materialize. The industry needed a way to commoditize hardware.

x86 was the most popular processor architecture with the best economies of scale and broadest software support. However, it could not be virtualized directly: privileged instructions and memory operations would silently fail without triggering traps.

Unfortunately, many current architectures are not strictly virtualizeable. This may be because either their instructions are non-virtualizeable, or they have segmented architectures that are non-virtualizeable, or both.

Unfortunately, the all-but-ubiquitous Intel x86 processor family has both of these problematic properties, that is, both non-virtualizeable instructions and non-reversible segmentation. Consequently, no [Virtual Machine Monitor] based exclusively on direct execution can completely virtualize the x86 architecture.

The late 90's saw a growing interest in dynamic binary translation ("DBT".) The hegemony of the x86 processor was predicted to crumble. Efforts such as DAISY and Transmeta's Crusoe bet against the continued popularity of x86. Meanwhile, in the language virtual machine space, just-in-time compilation research blossomed. VMWare was founded in this context, submitting a patent early on for a dynamic-binary-translation method of implementing virtual machine monitors for the x86 architecture.

VMWare's virtual machine monitor invoked a translator whenever control entered or exited a protection ring. The translator watched for sensitive instructions that would not trap on their own and rewrote them into explicit calls into the VMM. This enabled virtualization but added overhead to every system call.

Because DBT added uneven overhead to kernel execution, it was difficult to associate system use with virtual machine execution. This prevented accurate billing systems from being constructed around such virtual machines.

"Xen and the Art of Virtualization" changed all of that.

By allowing 100 operating systems to run on a single server, we reduce the associated costs by two orders of magnitude. Furthermore, by turning the setup and configuration of each OS into a software concern, we facilitate much smaller-granularity timescales of hosting.

Xen achieved this through paravirtualization: instead of pure virtualization, it achieved "impure", cooperative virtualization by modifying the guest operating systems. In particular, Xen moved the operating system out of the most protected ring of the processor: from ring 0 to ring 1. Processes continued to run in ring 3. This gave the operating system its own "extended machine": while it lost access to privileged instructions, it gained access to its own supervisory call system ("hypervisor calls".)

Comparison of native Linux, Xen, VMWare Workstation, and User-mode Linux on various benchmarks, from "Xen and the Art of Virtualization". Note that OLTP represents relational database workloads and WEB99 represents web-serving.

In particular, Xen's performance on web application workloads was a breakthrough.

Cloud virtual machines commoditized hardware, severing the connection between "procuring and provisioning hardware" and "scaling a web service."

As with memory mapping and protection rings in the 80s, consumer hardware lagged behind the market's needs. Intel and AMD introduced hardware support for virtualizing x86(_64) in 2005 and 2006 through VT-x and SVM, respectively. ARM added hardware virtualization support in Cortex-A in 201114.

Hardware is really just software crystallized early. It is there to make program schemes run as efficiently as possible. But far too often the hardware has been presented as a given and it is up to software designers to make it appear reasonable. [...]

As Bob Barton used to say: "Systems programmers are high priests of a low cult."

Hardware virtualization support allowed operating systems to integrate hypervisor capabilities: Windows Server added Hyper-V (2008); Linux, Kernel Virtual Machines ("KVM", 2006); and macOS, Hypervisor.framework (2014). (EC2 started moving from Xen to KVM-based "Nitro" virtual machines in 2017. Brendan Gregg wrote more about the path from the first version of Xen to Nitro here.) In lieu of running Xen or VMWare, consumer operating systems themselves became capable of running guest operating systems, accelerating the development of new hypervisor software.


A few notes on timing: by 2003, the dotcom bubble had burst. The servers that internet startups had loaded up on flooded the market with cheap used hardware, causing Sun Microsystems to hemorrhage money. At Amazon, Benjamin Black circulated a document describing a standardized infrastructure; Bezos tasked Chris Pinkham with developing this in 2004. EC2 launched publicly in 2006, powered by Xen. According to Steve Yegge, sometime in 2002-2003 or so, Bezos issued his famous "services edict", stipulating that all Amazon engineering teams deliver their work in the form of networked services. By 2009, Netflix had moved their video encoding operations to AWS. Netflix finished transitioning to the cloud in 2011. And Docker launched in 2013.


Docker

The high-level metaphor of Docker is that of shipping containers. Standardized containers revolutionized the shipping industry: a standard form factor meant a standard way to load, unload, and transport material. The technology made shipping everything faster, cheaper, and more reliable. This is Docker's raison d'être: to do for computer applications and operations teams what standard containers did for the shipping industry.

The metaphor is pretty durable: the concrete problem that Docker solved was that of shared resources. Consider: given a large company, two developers may develop two different web serving applications ("services") that expect to listen on the same network port15. Before containers, these two developers would have to sync up: who gets to listen on port 80? They'd have to agree, up-front, before being able to deploy their software. Worse, if they couldn't come to agreement about the use of resources, they'd have to involve a third-party — someone from the operations department — who might solve the problem by introducing a third server, a reverse proxy layer, to resolve the conflict.

Docker solves this by giving each service a virtualized view of the operating system and its resources. Both developers may remain agnostic of each other AND the operations team. The operations team can map network ports as necessary -- or rely on orchestration software, like Nomad or Kubernetes, to do this for them. Standard containers yield standard tools.

Docker achieves16 virtualization through a combination of application programming interfaces ("APIs") supported by the kernel of the host operating system: secure computing mode ("seccomp"), namespaces, and control groups — on which, more later. Each of these APIs must be properly configured to make durable the illusion that the process is running on a virtualized system.
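As a rough illustration of the namespace half of that machinery, here's a minimal sketch (Linux assumed; run as root or with CAP_SYS_ADMIN): unshare moves the calling process into a fresh UTS namespace, so changing the hostname inside it leaves the rest of the system untouched.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        if (unshare(CLONE_NEWUTS) != 0) {      /* new UTS (hostname) namespace */
            perror("unshare");
            return 1;
        }
        const char name[] = "container-ish";
        if (sethostname(name, strlen(name)) != 0) {
            perror("sethostname");
            return 1;
        }
        char buf[64];
        gethostname(buf, sizeof buf);
        printf("hostname inside the namespace: %s\n", buf);
        return 0;                              /* the host's hostname is unchanged */
    }

Container runtimes do the same thing at larger scale, unsharing the mount, PID, network, and user namespaces (and more) before handing control to the containerized process.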

Windows supports containers natively through the hcsshim runtime, which provides namespace and cgroup support through Windows' native Host Compute Service. Apple's macOS, however, runs Docker containers by spinning up a virtualized Linux host to run a Docker daemon and container processes.

While the Docker/OCI17 container model is ubiquitous today, the capabilities underpinning it have their origins in the virtual machine renaissance of the late 90s.


Let's walk back towards the year 2000.

Docker was originally implemented using an earlier container model called "LXC" ("Linux Containers"), introduced in 2008. LXC, like many of its contemporary container models, was built to act like a virtual machine. Rather than Docker's model of taking a single process and giving it a virtual system environment, LXC containers typically virtualized an init process (like systemd, upstart, or SysV init18.) This made them "feel" much more like a virtual machine, with their own long-lived daemons, periodic tasks, and system logs. But, akin to Docker, LXC was built from cgroups and namespaces.

Google contributed cgroups (or "control groups") to the Linux kernel in 2007. Control groups allowed userland to communicate resource quotas for process subtrees to the kernel: effectively allowing container runtimes to dictate the maximum CPU time, parallelism, network, memory, and disk usage a set of sub-processes should be able to use. This doesn't affect the ability of those processes to see certain subsystems: control groups don't affect what the process has access to, only the available quality of service for that access19.
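Control groups are driven from userland by writing to a filesystem interface. A minimal sketch, assuming the cgroup v2 hierarchy is mounted at /sys/fs/cgroup with the memory controller enabled (and root privileges): create a group, cap its memory, and enroll the current process.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *value) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fprintf(f, "%s", value);
        fclose(f);
    }

    int main(void) {
        /* Creating a directory creates a cgroup. */
        if (mkdir("/sys/fs/cgroup/demo", 0755) != 0) perror("mkdir");

        /* Quota: this subtree may use at most 100 MiB of memory. */
        write_file("/sys/fs/cgroup/demo/memory.max", "104857600");

        /* Enroll ourselves (and any children we spawn) in the group. */
        char pid[32];
        snprintf(pid, sizeof pid, "%d", getpid());
        write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);

        printf("now running under a 100 MiB memory limit\n");
        return 0;
    }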

I consulted the local gurus about the security of a chroot environment. Their conclusion: it is not perfectly secure, but if compilers and certain programs are missing, it is very difficult to escape.

Namespaces are related to earlier jail and chroot capabilities20. This family of APIs controls which operating system subresources are visible to jailed processes. Linux began adding namespace support in 2002. The mount namespace for filesystems came first, inspired by Plan 921. Eric W. Biederman enumerated the necessary support in "Multiple Instances of the Global Linux Namespaces":

  • mnt: The filesystem namespace.
  • uts: The UNIX Time-sharing namespace (controlling what hostname is visible to the process.)
  • ipc: Inter-process communication namespace.
  • net: The network namespace.
  • pid: The process identifier namespace.
  • user: The user and group namespace.
  • time: The time namespace.

Linux added the user namespace in 2013 — the same year Docker was first released publicly. (This was later followed up with the cgroup namespace in 2016.)

Namespaces and cgroups form the basis of what we think of as containers today22 — and they exist to work around the lack of virtualization support in turn-of-the-century processors.


From the hardware perspective, processes and guest operating systems look nearly the same. Virtualizing the system at the process layer means sharing more code between the virtualized systems with a finer granularity of abstractions, but has the advantage of being achievable purely through software: hardware can't really tell the difference between a container and any other process.

Because virtualizing the system at the hardware interface layer is more coarse-grained with less shared code, VMMs are generally considered safer targets for multi-tenancy —that is, running untrusted code from third parties collocated on the same hardware23. On the other hand, VMMs introduce more startup overhead. However, in recent years, containers and VMMs have converged: Kata containers, AWS' Firecracker, and others use the container user interface but run the containers inside lightweight virtual machines, achieving remarkable performance, security, and density24.


Density

The more efficiently tasks can be collocated, the better the margins on equipment; this is a competitive edge for a hosting company. Processes, whether virtual machines or containers, have overhead. Switching between processes takes time and memory; Tyler McMullen, CTO of Fastly, noted that there were limits on the number of processes running on a single box in his 2018 "Software Fault Isolation". This was a natural place to look for improvements, and so content delivery networks — which handle some of the highest volume of traffic on the internet — started digging into the problem. Fastly and Cloudflare both landed on web technologies, launching products in 2018.

Cloudflare launched Workers, which co-locates many user tasks in a single process using V8 isolates. V8 is, as we discussed previously, the JIT JavaScript engine open-sourced by Google as part of their Chrome browser. Workers customers upload JavaScript applications that implement the ServiceWorker spec; Cloudflare deploys the application to their edge network, with points of presence across the globe. As a bonus, since WebAssembly support is available through the web platform API, users that really needed to could target WASM. (Though they'd be responsible for writing their own bindings to the ServiceWorker API.)

Fastly launched Compute@Edge, skipping JavaScript in favor of WebAssembly. But this posed a problem. While JS could rely on the Web platform as a system interface, WASM does not automatically provide one. In order to hook WASM up to their edge compute, Fastly had to define an interface: one that was stable and versioned.


WASI

(Finally.)

Fastly, Intel, and Mozilla teamed up in 2019 to form the Bytecode Alliance. Their goal was to define a specification for a standard system interface for WebAssembly25.

The WebAssembly ISA allows imports and exports: functions that the host can pass in to the WASM module for internal use, and functions that the WASM module can hand back to the host to be called by the host on demand. The host may also provide a chunk of memory for the WASM module to operate on. Imported and exported functions may only take primitive values — integer and floating point values of various widths. The host and module may cooperate to transfer more complicated types. For example, the host and module might agree that a string is represented as an integer pointer into the module's memory along with an integer length.
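Here's what that boundary looks like from inside a module — a minimal sketch assuming clang targeting wasm32, with made-up import names ("host"/"log") standing in for whatever the embedder actually provides. The string never crosses the boundary directly; only a pointer into the module's linear memory and a length do.

    #include <stddef.h>
    #include <stdint.h>

    /* Imported from the host: logs `len` bytes starting at `ptr` in our memory. */
    __attribute__((import_module("host"), import_name("log")))
    void host_log(const uint8_t *ptr, size_t len);

    /* Exported to the host: callable by name once the module is instantiated. */
    __attribute__((export_name("greet")))
    void greet(void) {
        static const uint8_t msg[] = "hello from inside the module";
        host_log(msg, sizeof msg - 1);
    }

On the host side, the embedder sees greet as an export it can call and host.log as an import it must supply; the (pointer, length) convention is something the two sides simply have to agree on.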

This is a lot like the sort of ABI a process's system interface uses.

Tools like emscripten and wasm-bindgen treat the interface between the host and WASM module as an internal detail. Those tools were designed to take an existing application and get it running in a browser with minimal changes, generating the WASM, HTML, and JavaScript needed to integrate the WASM with the Web Platform. If one were to recompile an application with a newer version of emscripten and try to drop the WASM onto an older version of the HTML and JS, it might not work. Likewise, the WASM from an application compiled with wasm-bindgen could not be dropped onto emscripten's HTML and JS, or vice versa. They are not ABI compatible.


The first preview of WASI launched in 2019. For the most part, it closely resembles POSIX with some light editorialization: for example, POSIX APIs whose behavior changed based on their parameters, like unlink, were split into separate functions. However, WASI preview 1 omitted several key capabilities: C++ stack unwinding and exception support, full network socket support, and fork/exec.
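In practice, that means ordinary POSIX-ish C mostly just works. A minimal sketch, assuming a WASI toolchain (e.g. clang --target=wasm32-wasi from the wasi-sdk) and a runtime that pre-opens the current directory: wasi-libc lowers these standard calls onto WASI preview 1 functions like path_open, fd_write, and path_unlink_file.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        FILE *f = fopen("notes.txt", "w");   /* lowered to path_open + fd_write */
        if (!f) {
            perror("fopen");                 /* fails without a preopened directory */
            return 1;
        }
        fputs("hello, wasi\n", f);
        fclose(f);

        unlink("notes.txt");                 /* lowered to path_unlink_file */
        puts("done");                        /* lowered to fd_write on stdout */
        return 0;
    }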

fork and exec form the bedrock of the POSIX process model. Their descendants, clone and unshare, form the basis for working with Linux cgroups and namespaces. In short, it is hard to support fork without also taking the traditional process model on board. And WebAssembly may not want to do that.

System interface ABIs don't just define the boundary between a process and the kernel. They're also used to enable software linking — the reuse of compiled artifacts: shared objects (.dylib, .so, or .dll.) Linking can be performed at compile time (static linking) or at process start (dynamic linking.) While shared object linking wasn't defined as part of the first WASI preview26, neither did the specifiers wish to preclude it in future versions of the specification.

The standard approach to sockets would have exposed too much power between linked modules. Sockets are typically represented as a "descriptor" or "handle" in the form of an integer the operating system hands to the process on request. Operations on the socket are performed by making a system call referencing that integer, and all linked modules are assumed to have access to the same system interface. This model is difficult to secure: malicious linked modules can "guess at" active descriptors by picking random integers27. The system can't differentiate between linked modules and the core application logic when receiving calls referencing the descriptors.

The problem with the POSIX model is that the system interface forms a monolithic wall between the kernel and the user process, leaving undifferentiated space on either side. The ABI emphasizes the importance of the system interface over the importance of the interface between modules.


A Component Model

Through the WASM Component Model, WASI preview 2 fundamentally rethinks the process model.

One of WebAssembly's unique attributes is the ability to run sandboxed without relying on OS process boundaries. Requiring a 1-to-1 correspondence between wasm instances and heavyweight OS processes would take away this key advantage for many use cases. Fork/exec are the obvious example of an API that's difficult to implement well if you don't have POSIX-style processes, but a lot of other things in POSIX are tied to processes too. So it isn't a simple matter to take POSIX, or even a simple subset of it, to WebAssembly.

The WASM Component Model defines a static linking approach for combining many WASM core modules into a single file, a "component". The Component Model also defines a new, modular interface definition language called "WebAssembly Interface Types" ("WIT".) WIT allows for the definition of named functions, groups of functions, and high-level types like strings and structs. WIT also allows for groupings-of-groupings with requirements around exports and imports, called "worlds". WIT worlds can be used to generate host code or as a target for compiling shared code — it is a language for defining contracts between modules.

Instead of one monolithic wall of functions representing all system services, the WASM Component Model proposes a system of smaller fences placed between every module. Each interface only specifies the functionality it needs, and may be fulfilled by any other module — or the host. Instead of exposing entire namespaces of functionality, like mnt, uts, or net, the interface can be described in a fine-grained way: "module A requires a function for reading input data."

This allows sockets to be represented as higher level objects whose full capabilities aren't transferred between linked modules, as opposed to an integer descriptor tracked by the host. (See Dan Gohman's excellent "No Ghosts!" for more on this.)

WASI preview 2 aims to stabilize the wasi-cli and wasi-http-proxy worlds first. However, at the time of writing, the Component Model proposal is only in stage 1, so many popular tools and runtimes do not support WASM components or WASI preview 2. And many platform-as-a-service ("PaaS") startups have appeared since Solomon Hykes' tweet.

Companies have proposed alternative system interfaces.

Deis Labs introduced WAGI in 2020. WAGI sidesteps the issue of missing POSIX support by using WASI preview 1's standard input and output instead of sockets. This is reminiscent of the venerable "common gateway interface" ("CGI") from the 1990s. Fermyon, a WASM PaaS company, supports WAGI through its Spin framework.
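A minimal sketch of that CGI-style convention: the request body arrives on standard input, request metadata arrives in environment variables, and the response (headers, a blank line, then the body) goes to standard output. The exact set of variables is WAGI's business; treat this as illustrative rather than a faithful WAGI handler.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* Response headers, a blank line, then the body. */
        printf("Content-Type: text/plain\n\n");

        printf("hello from a CGI-style handler\n");

        const char *path = getenv("PATH_INFO");  /* CGI-style request metadata */
        if (path) {
            printf("you asked for: %s\n", path);
        }
        return 0;
    }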

In 2023, Wasmer introduced WASIX. WASIX, a superset of WASI, aims to target more of the POSIX API, including support for fork. This is akin to an Emscripten for server applications, with the goal of lowering the barrier between existing web services and PaaS WASM companies. Like Emscripten, WASIX values easy portability of existing applications. As a result, it doesn't editorialize too much on the POSIX API. However, at the time of writing, WASIX doesn't support shared object linking. While WASIX doesn't preclude linking shared objects in the future, it seems likely that it would follow the existing linking model, whether for good or for ill.


Every technology we've talked about made it easier for developers to collaborate by reducing their need to coordinate ahead of time. Interfaces are a contract, a minimum viable agreement, between two parties. Interfaces build two-sided network effects: implementing the contract lets your program work with an unbound number of other programs which consume the contract and vice versa. Interfaces are living systems: they have internal motion. Inertia.

The inertia of an interface may even overcome deep flaws with that interface.

In this series, we've talked about how the inertia of the x86 ISA spurred an unexpected breakthrough in hardware design — despite the bet that we'd all be using VLIW RISC processors today. How the inertia of JavaScript, through the web platform, precluded competing web technologies, like Java, PNaCl, and Dart, and eventually gave rise to WebAssembly. This is the heart of "Worse is Better" and Gall's Law: "A complex system that works is invariably found to have evolved from a simple system that worked." It's the human side of computing: it is better to be in a room full of people gathered around the warmth of an imperfectly useful interface than it is to be out, alone in the cold, with a perfect interface.

Good interfaces are grown, not invented: coaxed out of usefully-bad interfaces by redirecting some of their inertia. Borrowing the energy of the inertia to achieve a new orbit, so to speak.

Processes and virtual machines are some of our oldest, most widely-used usefully-bad interfaces. Their use at scale — and our ability to reason about their capabilities — have begun to fail us, but they are the mass around which all of modern computing orbits. This is what's interesting about WASI and the WebAssembly Component model: it is a moonshot. The initial trajectory won't get WASI to the moon: WASI preview 2, if successful, still looks like embedded WASM in web proxies and command lines. But that might be a high enough apogee to slingshot to a new, higher orbit.

On the other hand, moonshots are not a sure thing: like Java before it, WASI doesn't represent a bridge, but a destination. That is the fundamental difference between the component approach and the WASIX approach.

For my part, I think this WASI moonshot warrants energy: Docker succeeded, in part, because it described a vision of the future that anyone could take part in building. Through npm, JavaScript had a similar trajectory. I think we're on the cusp of the same moment with WebAssembly.


Epilogue

So, whew. That's a lot.

I'm excited to announce that I'll be joining Dylibso as of this month to work on WASM materials and tools, and to smooth out friction in the ecosystem wherever I find it.

Working on these posts has been hugely educational for me on a number of levels, and I'd like to thank everyone who reviewed these posts (C J Silverio, Eric Sampson, and Aria Stewart), advised and helped source research (Ron Gee, Dan Gohman) and encouraged me. In particular I'd like to thank my family for their support: my wife, Krysten, and my parents, Mark and Sue. They dealt with an entire summer of me talking non-stop about WASM, writing, and computing history.


Bibliography and Timeline

So many PDFs this time around!

I'd like to call out "The Ideal Versus the Real: Revisiting the History of Virtual Machines and Containers" by Allison Randal, which dives more deeply into the history and interrelations between these technologies than my effort here. Give it a read!


1

Which would later be rebranded as "Docker", after their most famous product.


2

"BSS" is also known as "better save space".


3

Hey, there's that "machine" word again.


4

Which had nothing at all to do with Yuri Gagarin's recent orbital trip, I'm sure.


5

"MAC" expanded to "Mathematics and Computation" early on, later expanding to "Multiple Access Computer", "Machine Aided Cognitions", and "Man and Computer". It also funded the MIT AI lab.


6

MIT also originated the "Incompatible Timesharing System" ("ITS"), which would give us EMACS.


7

A "word" in this case refers to the number of bits the hardware was optimized to process. You're probably reading this on a device with a native word size of 64-bits, but in the past 32- and 16-bit word sizes were common. In the distant past, you might see 36-bit words!


8

There are all sorts of neat tricks that paging enables, including "copy on write" pages -- mapping the same physical memory into multiple address spaces (or multiple places within one address space), and only creating copies of pages when they're mutated.


9

"Third Generation Architecture" refers to the generation of computers designed in the 1960's using early integrated circuits; these are typically called "minicomputers". They were succeeded by fourth generation architecture in the early 1970's which began to use microprocessors.

Even today, virtualizable architectures are said to meet the "Popek and Goldberg virtualization requirements."


10

IBM continued to ship virtual machine monitor systems throughout this period. However, they were primarily focused on virtualizing earlier technologies, like mainframes and minicomputers, on top of newer hardware.


11

Now, that said, that assumption was frequently (and spectacularly) invalidated.


12

Gates called these processors "brain dead". At the time, IBM and Microsoft were co-developing OS/2 for the IBM PC, and IBM wanted to target the 286. However, the protection modes of the 286 were "one-way" -- once the processor entered protected mode, re-entering real mode required restarting the system.


13

The benefits of a small team are made evident through this quote, which is a prime example of Conway's Law:

Where under Unix one might say

ls >xx

to get a listing of the names of one's files in xx, on Multics the notation was

iocall attach user_output file xx
list
iocall attach user_output syn user_i/o

Even though this very clumsy sequence was used often during the Multics days, and would have been utterly straightforward to integrate into the Multics shell, the idea did not occur to us or anyone else at the time. I speculate that the reason it did not was the sheer size of the Multics project: the implementors of the IO system were at Bell Labs in Murray Hill, while the shell was done at MIT. We didn't consider making changes to the shell (it was their program); correspondingly, the keepers of the shell may not even have known of the usefulness, albeit clumsiness, of iocall. [...]

Because both the Unix IO system and its shell were under the exclusive control of Thompson, when the right idea finally surfaced, it was a matter of an hour or so to implement it.

"The Evolution of the Unix Time-sharing System", Dennis M. Ritchie, 1996


14

Indeed, ARM Cortex's virtualization support is specifically marketed as meeting "Goldberg and Popek virtualizability requirements"!


15

A port is a numbered resource, managed by an operating system, representing a stream of incoming or outgoing network requests.


16

Docker comprises a long-lived process (or "daemon", in this case called containerd) for running these virtualized containers (via runc or crun), a user interface for controlling those containers, a registry protocol for sharing container images, and a file format describing how to build those container images. The file format for building docker images, a Dockerfile, specifies a series of commands that construct a container image; each command forms a distinct, content-addressable "layer". Dockerfiles may build on images produced by other Dockerfiles, and common layers are reused between builds.


17

In 2013, dotCloud rebranded themselves as Docker in an attempt to capture some of the value of this ecosystem. To ensure trust that the ecosystem would outlive the company, Docker and CoreOS formed the "Open Container Initiative" in 2015. As a result, there are alternative OCI/Docker runtimes: CRI-O and Podman, for example.


18

Init processes are responsible for setting up the userland system: they are the root of the tree of userland processes, and they start the daemons that handle DNS resolution, networking, devices, filesystem mounts, and more.


19

This control allowed Google to more efficiently allocate shared resources using their internal orchestration software, AKA "Borg".


20

The earliest jail-like capability, chroot, was added to AT&T Unix in 1979. It was also added to the Berkeley Software Distribution ("BSD"28) in 1982. chroot allowed a process to "pivot" the root directory to a subdirectory, effectively hiding parent directories from that process. This offered incomplete protection, so FreeBSD introduced the jail system call in 2000.

In the case of the chroot(2) call, a process's visibility of the file system name-space is limited to a single subtree. However, the compartmentalisation does not extend to the process or networking spaces and therefore both observation of and interference with processes outside their compartment is possible.

To this end, we describe the new FreeBSD 'Jail' facility, which provides a strong partitioning solution, leveraging existing mechanisms, such as chroot(2), to what effectively amounts to a virtual machine environment.

Jails did not address resource management or scheduling concerns. Sun addressed this in 2004 with Solaris's "Zones", which provided a Docker-like experience years in advance. However, as we mentioned, Sun fell on hard times during the 2000's, eventually meeting its demise in 2010 after being acquired by Oracle. As a result, Solaris didn't experience the widespread adoption that various Linux distributions enjoyed during this time.

We have been gratified when casual users mistake the technology for a virtual machine.


21

Plan 9! Which, as you'll recall from the last article, Java killed! In the 80s and 90s, "virtual machine" in the sense of "system emulation" had withered so far as to be supplanted by "virtual machine" meaning "language model runtime."


22

Ok, I'm adding this one as a footnote because this is a long post already. I'm only going to gesture at linux-vserver, which was focused on scaling webservers through containerized networking. However, Virtuozzo (and its successor project, OpenVZ) approached containers for an entirely different reason: to enable checkpoint/restore of work on high-performance batch clusters. This would allow relocation of processes between computers ("nodes") in a cluster by namespacing all of their system resources. This required maintaining patches against the linux kernel at the time, so as far as I can tell it never really took off, but it did spawn the "Checkpoint and Restore in Userspace" ("CRIU") project.


23

This is notwithstanding the Meltdown and Spectre vulnerabilities. Meltdown exploits a race condition between memory access and privilege checking and affects operating systems and hypervisors. Exploits allow processes and VMs to read memory across security boundaries, effectively breaking the illusion of virtual memory. Spectre exploits speculative execution — a property of modern superscalar processors. Speculative execution guesses which way a branch will go and runs ahead along the predicted path, throwing away the results if the guess turns out to be wrong. However, that speculative work still affects caches, so the abandoned path can be observed by measuring operation timings after the fact.


24

For an in-depth look at how container and virtualization approaches compare in terms of performance on various axes, check out "A Fresh Look at the Architecture and Performance of Contemporary Isolation Platforms".


25

And to develop the Cranelift, Wasmtime, and WASM Micro Runtime ("WAMR") projects.


26

WASI preview 1 instead classes modules as "reactors" or "commands". You can read more on this on Dylibso's blog.


27

This is known as "forging" a descriptor.


28

It's out of scope for this post, but suffice it to say that UNIX split in the 80's and 90's: roughly, into Linux, BSD, and SysV lineages. Linux provides the kernel of popular distributions like Red Hat, CentOS, Ubuntu, Debian, and Android. BSD underpins several other operating systems: most conspicuously, Apple's modern operating systems and the original SunOS. SunOS's successor, Solaris, was based on AT&T's UNIX System V, as were HP's HP-UX and IBM's AIX.