Bootstrapping a compiler

Often high level language compilers are implemented using some other languages, not always very suitable for such a task, but reasonable in terms of build dependencies and maintenance. One notorious example is Clojure, with a huge part of it implemented in Java (almost 60kloc of Java vs. 7kloc in core.clj).

Another typical approach to bootstrap is to write a fully functional compiler in the same language, and then keep relying on the previous version of this compiler or on alternative implementations of the language forever. There is nothing wrong with this approach, besides some minor maintenance complications, but I wanted something different. I wanted to bootstrap a powerful, very high level language compiler using a tiny interpreter, as dumb as possible, and with only a tiny bit of a seed code. This interpreter can be implemented in any language, not necessarily very suitable for compilers and such.

I ended up implementing MBase using a dumb interpreter written in C# (could have been anything else, really), which runs a very simple language – no variables, no closures, no recursion. This language (creatively called L0) has only the following structures:
  • Function block
  • Numbered variable or argument reference
  • Global symbol reference
  • Constant
  • Function application
  • Sequence
  • If

The only relatively complicated bit here is the function block. It creates a record containing a given number of context variable slots, a given number of argument slots, and an expression to be executed with this context. The context variables are filled with values upon function block instantiation. To add more confusion, I called this thing ‘lambda’, which has nothing to do with the proper lambda functions.

For example, (lambda (3 (arg 0) (env 1)) ...) will create a function block with 3 argument slots (to be referred as (arg-ref 0), (arg-ref 1), etc.) and two context variable blocks, which will immediately get values from the current context (arg 0) and (env 0). Inside the function block expression they will be referred as (env-ref 0) and (env-ref 1).

Of course it would have been a horrible and traumatising experience to code manually in such a language. It’s hilariously easy to interpret it with an acceptable efficiency, and it is a good starting point for bootstrapping, but we have to make another step in order to make it any useful.

The next step would be to compile a much more pleasant language into this L0. Let’s call this next language L1. It features proper, lexically scoped lambda functions with symbolic arguments, and, the most important thing, it provides macros – this will be our main tool for growing the language into something practical.

L1 compilation sequence is following:

First, macro-expand the input:
  • For a list starting with a symbol, check if this symbol is in the current macros context. If it is there, fetch a function for this macro and apply it to this list. Repeat.
  • Optional: If it is a local macro definition form, update the local environment with the new definitions and carry on.
  • Do the same for all the list elements, unless it is a lambda form – then do not macro-expand the first two elements.

Now we have a pure L1 code with no macros. It is translated into L0 as follows:
  • For each lambda form, calculate a list of free variables and find an intersection with a current list of bound variables.
  • Emit a function block with a given number of arguments and use the previously calculatedintersection to initialise its contet variables.
  • Update a list of bound variables with this lambda arguments and captured context variables, and translate the inner expressions
  • For each symbol, look it up in the bound variables list. If found, replace wih an (arg N) or (ref N) expression, otherwise consider it to be a reference to a global symbol.
  • Leave all the constants as is.

Using a reasonable set of runtime functions provided by the interpreter (with the most complicated being functions for accessing hash maps), such a compiler fits just a few trivial L1 functions. Of course, we are also relying on the underlying runtime functionality, such as garbage collection, but for the bootstrap compiler we could have avoided this altogether.

And now we’d need some batteries to make things simple. Most notable thing needed is quasiquotation. It is exceptionally tricky to write macros without one. And this is the ugliest piece of the bootstrap code. Luckily, we only had to do it once.

Another missing piece is recursion. Neither L0 nor L1 provide any iteration, while the compilation algorithms described above are iterative (or recursive) indeed. We do not want to complicate our bootstrap interpreter, it must stay as dumb as possible, so this is where an obscure theoretical construct comes to help. A fixed-point combinator is usually considered to be an ivory-tower thing, suitable for torturing unexpecting undergraduates, but useless in a wild. Well, it is only useless if you do not want to bootstrap a high level language implementation with as little effort as possible.

Very early in the bootstrap compiler code we have this definition:
(def 'bootlib::Y (lambda (x)
            ((lambda (procedure)
               (x (lambda (arg) ((procedure procedure) arg))))
             (lambda (procedure)
               (x (lambda (arg) ((procedure procedure) arg)))))))

And, of course, a timeless classic: let binding. So far, the only variable binding feature we’ve had was lambda with its arguments. Transforming a let binding into it is trivial:
(let ((a ...) (b ...)) ...) should become ((lambda (a b) ...) ... ...)

And, for convenience:
(let* ((a ...) (b ...)) ...) becomes ((lambda (a) ((lambda (b) ...) ...)) ...)

Equipped with all these features, which are already equivalent to a good portion of a functional subset of Scheme, we can easily code the compiler from L1 to L0 described above.

I mentioned Scheme for a reason: the very first L1 to L0 compiler was coded in Scheme. Ever since then, we only have to keep an L0 version of this compiler to bootstrap its next version. It is small, and it is possible to manually modify it if something goes wrong with the bootstrap. Nothing as huge and terrifying as keeping the binaries or a bytecode for your previous version of compiler around.

What’s next? Growing a language all the way up to a level when it’s more suitable for writing compilers. It involves creating multiple eDSLs – one for defining abstract syntax trees, another for defining simple transforms over such trees. All the utility things – pattern matching, various parsing tools, sets, graph colouring, etc.

Once all the required tools are stuffed into the language, they’re used to implement a compiler for an extended version of L1 (called L1′ to divert from our boring linear numbering). L1 was designed for interpretation over L0, and it’s not very suitable for efficient compilation. L1′ fixes this by adding many of the previously macro-defined things as low level, first-class citizens – let bindings are now introducing proper local variables, for example, and we’ve got real labels and goto statements to make it easier to optimise things, as well as an embedded CLR assembler to make this backend truly flexible.

Here comes another bootstrapping round: the interpreted L1’→CLR compiler is used to build a compiled version of L1’→CLR and all the libraries. Once the implementation is fully compiled, a new range of language growth possibilities opens, which is further exploited in the additional libraries (ML and Prolog compilers, Packrat+Pratt parsing, PFront language, etc.).
Bootstrapping a compiler