I spent some time tonight working on the compiler I am writing for my own education. I started this project while only understanding bits and pieces of what it takes to write a compiler, but I am quickly learning more. Various texts on compilers start with creating a simple calculator program that takes a string like “10 + 2 * 6 – 4 / 2″ and calculates the result. For this example the problem can be broken down into a few simple steps which are merely transformations:
- Transform the input into a string of discrete tokens. “10 + 3 * 4″ becomes “10″, “+”, “3″, “*”, “4″. This is achieved with a scanner/lexer like lex, or the newer flex.
- Transfom the token stream into a parse tree using a BNF description of the grammar with a parser tool like yacc or the newer bison. Using infix notation, our stream becomes (“+”, “10″, (“*”, “3″, “4″)).
- Reduce the tree by walking it and calculating the value of each node, depth first, until you have calculated the value of the root node which is the value of the whole tree.
- Evaluate left side of “+” operation. It is the constant value 10.
- Evaluate right side of “+” operation.
- Evaluate left side of “*” operation. It is the constant value 3.
- Evaluate right side of “*” operation. It is the constant value 4.
- Reduce: 3 * 4 is 12
It is the numerical value 12.
- Reduce: 10 + 12 is 22
The whole thing is straight-forward and almost insultingly easy. The tools I’ve been using are Flex and Bison which generate the scanner and parser, respectively, in C. While writing these parts to generate a parse tree, which could also be called an abstract syntax tree or AST, I had a realization. Code generation, whether it is byte-code or machine-code, is simple. The code does the same tree-walk, in the same order, but instead of evaluating and reducing the elements, just generate byte-code or machine-code which will do those simple operations at runtime. The program spits out a flat sequence of code. You are merely writing code that flattens the tree.
To put it another way, the sequence of mathematical operations that reduce the parse tree to a single value is the same ordered sequence of operations that the compiled code should represent. It’s so obvious, it’s painful. I just hadn’t thought about it before. I thought there would be some deep voodoo involved in compilation, but there isn’t. Your program is likely to be a series of expressions, not just one. All you have to do is evaluate them in order, tacking the byte-code or machine code for each expression on to the end of the code.
Here is a series of transformations that does what I am talking about:
- Input : “x = 10 + y * 3″
- Scanner : “10″, “+”, “y”, “*”, “3″
- Parser : (“=”, “x”, (“+”, “10″, (“*”, “y”, “3″)))
- Tree walk :
- Store the location of “x” in R0
- Store “10″ in R1.
- Store the value at location “y” in R2
- Store “3″ in R3
- Store the product of R2 and R3 in R4
- Store the product of R1 and R4 in R5
- Store R5 at the location specified in R0
It’s just one more step to take the list of operations and turn them into a list of coded instructions. At that point you are done.
One of the things I wanted with my compiler is type inference, so when I get to an “=” or “assignment” node of my tree, I evaluate the right side first, which gives me a type, then I know what the type of the identifier on the left should be. The first time an identifier is used should always be an assignment, and the type of the identifier is set at that time, if another assignment to the same identifier is made of a different type, the program should fail to compile. Here is the messy / incomplete code that handles this right now:
int genAssignByteCode(simpleblock b, simplenode n) { if(n->R == NULL) { printf("No R-value in assignment!\n"); exit(EXIT_FAILURE); } if(n->L == NULL) { printf("No L-value in assignment!\n"); exit(EXIT_FAILURE); } if(n->L->type != IDENTIFIER) { printf("L-value is not an identifier!\n"); exit(EXIT_FAILURE); } int reg_num = genByteCode(b, n->R); if(reg_num == -1) { printf("The r-value has no value. Lame.\n"); exit(EXIT_FAILURE); } enum reg_type rval_type = get_reg_type(b, reg_num); simplesymbol inmap = in_map(b, n->L->data.string); struct byte_code bc; if(inmap == NULL) { // new value with big, scary TYPE INFERENCE int newr = new_reg(b,rval_type); new_symbol(b, n->data.string, newr); if(rval_type == rt_int) { bc.t = BCI_STORE_INT; } else if(rval_type == rt_string) { bc.t = BCI_STORE_STR; } bc.reg_a = newr; bc.reg_b = reg_num; } else { // existing value, better do some static type checks enum reg_type lval_type = get_reg_type(b,inmap->reg_num); if(rval_type != lval_type) { printf("You got the wrong type, pal.\n"); exit(EXIT_FAILURE); } if(rval_type == rt_int) { bc.t = BCI_STORE_INT; } else if (rval_type == rt_string) { bc.t = BCI_STORE_STR; } bc.reg_a = inmap->reg_num; bc.reg_b = reg_num; } new_code(b,bc); return bc.reg_a; }
Basically, simpleblock is a typedef’ed pointer to a struct that represents a block of code. It keeps an array of virtual registers allocated, array of instructions, and linked list of symbols (identifiers). This code generates the byte-code for the assignment operation. It is called from the genByteCode function which it calls recursively to evaluate the right side of the statement. The line new_code(b,bc); actually appends the byte-code bc to the array of instructions in block instance b. This is C code, and as such, it is really ugly. Also, the error messages are unhelpful and quite rude. This is a rough-draft of sorts. There are a lot of parts not in the code that would help explain it, too.
While constructing this part, I realized that it is very easy to do type inference. Java and C# took years and years to acquire type inference of any kind, but I don’t see what all the fuss is about, it is quite simple to implement, really. Compilers are not scary, go write one of your own already. Go create that perfect programming language. Do it now, or else you’ll spend the rest of your life writing software in programming languages that you don’t like and you’ll only have yourself to blame.
Comments 2
Try implementing the C++ language specification and tell us what you think then.
Compilers are pretty straightforward for small languages but as the language gets more complex the compiler becomes more complex. Because the compiler must be highly reliable this makes it very risky to add big new features, especially if you want them to work in a way that makes sense to developers who are not compiler-writers.
I do agree that everyone should work with compiler tech so they understand their development stack better.
Posted 09 Feb 2010 at 10:39 am ¶No, thank you, to implementing the C++ spec. It’s an over-complicated spec, by far, and an over-used language.
If you can create new features in the language itself with macros, than you don’t have to add them to the compiler, keeping it relatively simple. You just have to have a parse tree that can be manipulated by the language itself. I’m still thinking about how to get that to happen. I’m not at that point yet, though.
Posted 09 Feb 2010 at 12:01 pm ¶Post a Comment