A Simple Compiler, Part 1: Parsing with Python and Ply
Posted by postfuturist on 2010-03-24 23:55:37

In an earlier blog post I encouraged folks to write their own programming language for fun and education. It may not end up being useful, but you'll probably have fun and almost certainly learn something along the way. That said, I've been making my own baby-steps towards creating a programming language of my own. I don't have a name for it yet (other than "simple," because that's what it is) but it is coming along. And, yes, this is the fun part, designing the syntax. A warning for all the ninja hackers who write DSL's in their sleep: this is noob territory, so don't expect any new theories on tracing JIT's or whatever is hip in the world of programming language nerds.

I'm using the terrific PLY, a Python implementation of Lex and Yacc. At first I started with Flex and Bison, the spiritual successors to Lex and Yacc, but I got bogged down with picking the right C data structures and related memory management. What a waste of time. The Python workflow is much faster. I ported my Flex and Bison code to PLY in an evening and then added code which builds a simple AST and code to "run" a sequence of expressions--all in the same evening. I am not really compiling anything, per se. I am just building an abstract syntax tree and then interpreting the tree. There is no machine code or even byte-code. Since it is Python interpreting the code, it is probably a couple orders of magnitude slower than would be reasonable for a production programming language. But it is a toy language and I'm just having fun, so here goes.

This is simplelex.py, the scanner (or lexer) which merely interprets a string of characters and emits a string of tokens based mostly on some regular expressions. Nothing here is new, most of it is straight out of the PLY documentation:

# simplelex.ply | a simple lexer in python using ply
import ply.lex as lex

reserved = {
    'print' : 'PRINT',
    'func' : 'FUNC',
}

tokens = [
    'IDENTIFIER',
    'STRING',
    'NUMBER',
    'EQUALS',
    'STAR',
    'SLASH',
    'PLUS',
    'LPAREN',
    'RPAREN',
    'LCURLY',
    'RCURLY',
] + list(reserved.values())

t_EQUALS        = r'='
t_STAR          = r'\*'
t_SLASH         = r'/'
t_PLUS          = r'\+'
t_LPAREN        = r'\('
t_RPAREN        = r'\)'
t_LCURLY        = r'\{'
t_RCURLY        = r'\}'

def t_STRING(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value,'IDENTIFIER')
    return t

def t_NUMBER(t):
    r'-?\d+'
    t.value = int(t.value)
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()

OK, I'm using some pretty naive string matching and silly naming of "*" and "/" operators, for fun. Remember, programming language design is fun. If you don't know what's going on in any of this code, look at the relevant PLY docs (the section on lexing) and it will all make sense. Basically, you can define tokens with a simple regular expression or a function that starts with a regular expression and can munge the input a bit before spitting it out, like stripping the double quotes off of a string literal.

Step 2 is a little more interesting, thought still relatively trivial. That's the parsing stage. I'll give one to you in chunks:

# simpleyacc.py | a simple parser in Python using PLY
import ply.yacc as yacc

from simplelex import tokens

precedence = (
    ('right', 'EQUALS'),
    ('left', 'PLUS'),
    ('left', 'STAR', 'SLASH'),
)

def p_error(e):
    print('error: %s'%e)

def p_explist(p):
    '''explist :
    | explist expression'''
    if(len(p) < 3):
        p[0] = ('EXPLIST', [])
    else:
        p[1][1].append(p[2])
        p[0] = p[1]

The parser needs to know about the tokens defined by the lexer, so those get pulled in from simplelex.py. The precedence bit gives PLY some knowledge about the binary operators (the ones that operate with two values). They are listed from lowest to highest priority. Enough about that.

The "explist" function defines the top-level nonterminal. A file or string of text will essentially be reduced to a list of expression. In this example, they are formed left recursively, starting with the empty set and adding expressions on to the right. The doc-string for the function is basically Yacc/Bison-style BNF. The function itself gets passed an array (or indexable sequence of some sort) which represents the bits of the rule. p[0] always represents the left side of the rule, and you can assign any value to it that you like, it is what the parser will return. For this code, I just am having the parser rules return tuples where the first element is a name and the other parts are the important pieces of the code. In this first rule, this code either constructs a tuple with an empty list, or adds the expression to the existing explist and assigns that as the value of the outer explist.

What's interesting about this is that I am not defining the need for any kind of delimiter between expressions (save whitespace). I may retract this later, requiring new-lines (which will have to be reported instead of ignored by the lexer) or semicolons if you want to stack multiple expression on the same line. That's a decision for later, right now it is perfectly legal to do something like :

a = 10 print(a).

I don't use the term "statement" to describe expressions. I may recant this later, too, but I honestly don't see the difference. It makes things simpler if everything has a value, even if that value is null. Here is some more of the simpleyacc.py file:

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = ('NUMBER', p[1])

def p_expression_string(p):
    'expression : STRING'
    p[0] = ('STRING', p[1])

def p_expression_identifier(p):
    'expression : IDENTIFIER'
    p[0] = ('IDENTIFIER', p[1])

def p_expression_binaryop(p):
    '''expression : expression STAR expression
    | expression SLASH expression 
    | expression PLUS expression 
    | expression EQUALS expression'''
    if(p[2] == '*'):
        p[0] = ('STAR', p[1], p[3])
    elif(p[2] == '/'):
        p[0] = ('SLASH', p[1], p[3])
    elif(p[2] == '+'):
        p[0] = ('PLUS', p[1], p[3])
    elif(p[2] == '='):
        p[0] = ('EQUALS', p[1], p[3])

def p_expression_print(p):
    'expression : PRINT LPAREN expression RPAREN'
    if(p[1] == 'print'):
        p[0] = ('PRINT', p[3])

def p_expression_parenthesized(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

All of these rules create the same thing, expressions. There is plenty of recursion going on, as expressions are defined in terms of other expressions as well as terminals (tokens from the lexer). It's a lot of BNF and constructing a tree of tuples. So far so good, this code compiles and let's see the output from some input code. Here's the input: a = 10 + 2 b = a * 3 print(b) c = (b + 10) * 3 and this is the parse tree that is constructed:

('EXPLIST', 
    [('EQUALS', 
        ('IDENTIFIER', 'a'), 
        ('PLUS', ('NUMBER', 10), ('NUMBER', 2))),
    ('EQUALS', 
        ('IDENTIFIER', 'b'), 
        ('STAR', ('IDENTIFIER', 'a'), ('NUMBER', 3))),
    ('PRINT', ('IDENTIFIER', 'b')),
    ('EQUALS', 
        ('IDENTIFIER', 'c'), 
        ('STAR', 
            ('PLUS', ('IDENTIFIER', 'b'), ('NUMBER', 10)), 
            ('NUMBER', 3)))
    ]
)

The code to "run" this AST is pretty straight forward, iterate through expressions of the "explist" and get the values recursively, storing variables in a dictionary. That code will appear in later installments of this series when it is in a less embarrassing state. Let's look at some other things we can create with the parser and a problem I am having with my syntax.

When I got to this point in my code, I realized that I wanted to add functions. The astute reader would have already noticed the 'LCURLY', 'RCURLY' and 'FUNC' tokens in the lexer that aren't used by the parser. I don't want C-style, stand-alone functions, I want functions to be first class. I want them to be expressions, too. So I came up with this pretty straight-forward syntax:

f = func (x y) { z = 10 print(x + y + z) }

In this example, "func" is a reserved keyword kind of like "function" in JavaScript or "lambda" in Python. It is followed by a list of identifiers inside of parentheses which is followed by a curly-brace delimited block which contains a list of expressions. Here's the code for that:

def p_identlist(p):
    '''identlist :
    | identlist IDENTIFIER'''
    if(len(p) < 3):
        p[0] = ('IDENTLIST', [])
    else:
        p[1][1].append(p[2])
        p[0] = p[1]

def p_expression_block(p):
    'block : LCURLY explist RCURLY'
    p[0] = ('BLOCK', p[2])

def p_expression_function(p):
    'expression : FUNC LPAREN identlist RPAREN block'
    p[0] = ('FUNC', p[3], p[5])

The "block" could have just been part of the definition of function expression, but I made it separate as there is more than one use for a block, and I'll probably need it later. Also, if I want to complicate my life I may replace the curly brace syntax with a Python-style indentation system, or make it optional. Here's the parse tree for the trivial case (func () {}) :

('EXPLIST', [('FUNC', ('IDENTLIST', []), ('BLOCK', ('EXPLIST', [])))])

Here's where my initial plan shows some cracks. Let's say I add the following rule to add function calls to the language syntax:

def p_expression_func_call(p):
    'expression : IDENTIFIER LPAREN explist RPAREN'
    p[0] = ('FUNCCALL', p[1], p[3])

PLY complains: Generating LALR tables WARNING: 1 shift/reduce conflict The problem is obvious: ambiguous grammar. Right now this statement can be parsed two ways: f (x) As a single function call or two separate expressions, one of which is just parenthesized. One obvious solution is to have something that separates different expressions in an expression list other than just white space, like a new line or semicolon, that is how Python solves this dilemma. I can think of other solutions:

  • Use something other than parentheses to delimit function args : f<x>
  • Make a requirement that separate expressions have to have some white space between them : f(x) vs. f (x).
  • Some extra symbol or keyword for function calls : f->(x) or f.call(x).

I'll go with the arrow "->" operator. I'll leave adding that symbol to the lexer as an exercise for the reader. Here's the parse rule, redone:

def p_expression_func_call(p):
    'expression : expression ARROW LPAREN explist RPAREN'
    p[0] = ('FUNCCALL', p[1], p[4])

In addition to the new syntax, which is unambiguous, I changed it so you can make a function call from any expression, since any expression may contain the value of a function. That means that this: f = func (x) { print(x) } f->("Hello, World") can be replaced with this: func (x) { print(x) } -> ("Hello, World") I really need to change the "print" builtin for consistency to this: func (x) { print->(x) } -> ("Hello, World")

This is all good fun, however syntax is just the surface of a language. If Python had a C-like syntax on the surface, I'd still use it. It would still be duck typed, with powerful, built-in data structures, garbage collection, and a strong standard library. With Python you come for the syntax and stay for everything else. I don't think PHP is a poorer language because it is C-like or uses dollar signs at the beginning of variable names. I think it is poorer because of the weak typing, inconsistencies, and gotchas that permeate the language.

The programming language I am creating needs control flow, a type system, a runtime of some sort, and a basic library of essential functionality. Stay tuned for more on this.


blog comments powered by Disqus