A Simple Compiler, Part 2: Write My Own Lexer in Python
Posted by postfuturist on 2010-04-01 22:14:02

As you may remember from Part 1 of this series, I have embarked on the task of creating my own programming language. I wrote a lexer and parser for my theoretical language in Python using PLY. That is all well and good. The next part I need is a good runtime. Here are some of my options:

  • Interpret the parse tree in Python.
  • Generate machine code and build native executables.
  • Target LLVM.
  • Generate C code to be compiled by GCC.
  • Target the JVM.
  • Target the .NET/Mono CLR.
  • Build it with Parrot.
  • Build it with the Dynamic Language Runtime which sits on top of the CLR.
There are plenty of other options. I've already implemented a subset of the language directly in Python, the first one. I don't really want to spend a lot of time doing that because being interpreted inside Python is going to be laughably slow. I still might, as a reference implementation, but it doesn't sound like much fun. I don't really want to build a language that compiles natively, I want the language to be more dynamic than that. The best options look like JVM (weak dynamic language support, still), Parrot (a bit of a moving target), and the DLR. Now, I'm not a huge Microsoft fan, but they have certainly put a lot of effort into the .NET framework, and the DLR, to their credit, seems to be a fantastic bit of engineering. Since it is open source and available in the repos of some bleeding-edge Linux distros (I'm using Ubuntu 10.04 beta), I decided to go that route, or at least try.

I need to build my parse tree in some .NET language. Fortunately, IronPython exists to run my Python code inside the Mono environment. Unfortunately, PLY chokes when run on IronPython. I tried debugging the issue, and once I got past one, I found another. So, I've had to start over a bit. The first thing I did was rewrite the lexer in a subset of Python that also runs on CPython as well as IronPython. Here it is:

# lexer2.py | a simple lexer in python
import re

# Token class.  This class is stolen from PLY
class LexToken(object):
    def __str__(self):
        return "LexToken(%s,%r,%d,%d)" % (self.type,self.value,self.lineno,self.lexpos)
    def __repr__(self):
        return str(self)

class lexer():
    def __init__(self,input):
        self.input = input
        self.pos = 0
        self.input_length = len(input)
        self.lineno = 0
        self.charno = 0
        self.reserved = {
            'print': 'PRINT',
            'func': 'FUNC',
            'if': 'IF',
            'elif': 'ELIF',
            'else': 'ELSE',
            'for': 'FOR',
            'in': 'IN',
        }
        self.symbols = [
            ('->', 'ARROW'),
            ('=', 'EQUALS'),
            ('*', 'STAR'),
            ('/', 'SLASH'),
            ('+', 'PLUS'),
            ('-', 'MINUS'),
            ('(', 'LPAREN'),
            (')', 'RPAREN'),
            ('{', 'LCURLY'),
            ('}', 'RCURLY'),
        ]
        self.complex = [
            (r'[a-zA-Z_][a-zA-Z0-9_]*', 'IDENTIFIER'),
            (r'"[^"]*"', 'STRING'),
            (r'-?\d+', 'NUMBER'),
        ] 
        self.tokens = self.reserved.values() + [x[1] for x in self.symbols + self.complex]
        self.complex_compiled = [(re.compile(r), t) for r,t in self.complex]
        self.symbols_compiled = [(s,t,len(s)) for s,t in self.symbols]
        self.ignore = " \t\n"

    def getTokens(self):
        return self.tokens

    def token(self):
        # skip ignored characters
        while(self.pos < self.input_length and self.input[self.pos] in self.ignore):
            if(self.input[self.pos] == "\n"):
                self.lineno += 1
                self.charno = 0
            else:
                self.charno += 1
            self.pos += 1
        if(self.pos == self.input_length):
            return None
        for cr,ttype in self.complex_compiled:
            mo = cr.match(self.input, self.pos)
            if(mo):
                value = mo.group()
                pvalue, pttype = self.process(value, ttype)
                tok = LexToken()
                tok.value = pvalue
                tok.type = pttype
                tok.lineno = self.lineno
                tok.lexpos = self.pos
                self.pos += len(value)
                self.charno += len(value)
                return tok
        for sym,ttype,length in self.symbols_compiled:
            if self.pos + length < self.input_length and self.input[self.pos:self.pos + length] == sym:
                tok = LexToken()
                tok.value = sym
                tok.type = ttype
                tok.lineno = self.lineno
                tok.lexpos = self.pos
                self.pos += length
                self.charno += length
                return tok
        print "invalid character : ", self.input[self.pos], " at ", self.lineno, " , " , self.charno
        return None
    
    def process(self, text, ttype):
        if(ttype == 'IDENTIFIER'):
            ttype = self.reserved.get(text, 'IDENTIFIER')
        elif(ttype == 'NUMBER'):
            text = int(text)
        elif(ttype == 'STRING'):
            text = text[1:-1]
        return text, ttype
        

if __name__ == "__main__":
    data = '''
    a = 3 + 4 * 10
    b = a + -20 *2
    { print(300) }
    c = "hello"
    print(b)
    there->(now)
    '''

    # Give the lexer some input
    lex = lexer(data)

    # Tokenize
    while True:
        tok = lex.token()
        if not tok: break      # No more input
        print tok
It's a pretty simple lexer, but it produces the same exact output as the last one, so its good enough for now. I've had my head buried in compiler books and articles for the last few days. Parsing is more complex, so stay tuned for that.


blog comments powered by Disqus