As you may remember from Part 1 of this series, I have embarked on the task of creating my own programming language. I wrote a lexer and parser for my theoretical language in Python using PLY. That is all well and good. The next part I need is a good runtime. Here are some of my options:
- Interpret the parse tree in Python.
- Generate machine code and build native executables.
- Target LLVM.
- Generate C code to be compiled by GCC.
- Target the JVM.
- Target the .NET/Mono CLR.
- Build it with Parrot.
- Build it with the Dynamic Language Runtime which sits on top of the CLR.
I need to build my parse tree in some .NET language. Fortunately, IronPython exists to run my Python code inside the Mono environment. Unfortunately, PLY chokes when run on IronPython. I tried debugging the issue, and once I got past one, I found another. So, I've had to start over a bit. The first thing I did was rewrite the lexer in a subset of Python that also runs on CPython as well as IronPython. Here it is:
# lexer2.py | a simple lexer in python
import re
# Token class. This class is stolen from PLY
class LexToken(object):
def __str__(self):
return "LexToken(%s,%r,%d,%d)" % (self.type,self.value,self.lineno,self.lexpos)
def __repr__(self):
return str(self)
class lexer():
def __init__(self,input):
self.input = input
self.pos = 0
self.input_length = len(input)
self.lineno = 0
self.charno = 0
self.reserved = {
'print': 'PRINT',
'func': 'FUNC',
'if': 'IF',
'elif': 'ELIF',
'else': 'ELSE',
'for': 'FOR',
'in': 'IN',
}
self.symbols = [
('->', 'ARROW'),
('=', 'EQUALS'),
('*', 'STAR'),
('/', 'SLASH'),
('+', 'PLUS'),
('-', 'MINUS'),
('(', 'LPAREN'),
(')', 'RPAREN'),
('{', 'LCURLY'),
('}', 'RCURLY'),
]
self.complex = [
(r'[a-zA-Z_][a-zA-Z0-9_]*', 'IDENTIFIER'),
(r'"[^"]*"', 'STRING'),
(r'-?\d+', 'NUMBER'),
]
self.tokens = self.reserved.values() + [x[1] for x in self.symbols + self.complex]
self.complex_compiled = [(re.compile(r), t) for r,t in self.complex]
self.symbols_compiled = [(s,t,len(s)) for s,t in self.symbols]
self.ignore = " \t\n"
def getTokens(self):
return self.tokens
def token(self):
# skip ignored characters
while(self.pos < self.input_length and self.input[self.pos] in self.ignore):
if(self.input[self.pos] == "\n"):
self.lineno += 1
self.charno = 0
else:
self.charno += 1
self.pos += 1
if(self.pos == self.input_length):
return None
for cr,ttype in self.complex_compiled:
mo = cr.match(self.input, self.pos)
if(mo):
value = mo.group()
pvalue, pttype = self.process(value, ttype)
tok = LexToken()
tok.value = pvalue
tok.type = pttype
tok.lineno = self.lineno
tok.lexpos = self.pos
self.pos += len(value)
self.charno += len(value)
return tok
for sym,ttype,length in self.symbols_compiled:
if self.pos + length < self.input_length and self.input[self.pos:self.pos + length] == sym:
tok = LexToken()
tok.value = sym
tok.type = ttype
tok.lineno = self.lineno
tok.lexpos = self.pos
self.pos += length
self.charno += length
return tok
print "invalid character : ", self.input[self.pos], " at ", self.lineno, " , " , self.charno
return None
def process(self, text, ttype):
if(ttype == 'IDENTIFIER'):
ttype = self.reserved.get(text, 'IDENTIFIER')
elif(ttype == 'NUMBER'):
text = int(text)
elif(ttype == 'STRING'):
text = text[1:-1]
return text, ttype
if __name__ == "__main__":
data = '''
a = 3 + 4 * 10
b = a + -20 *2
{ print(300) }
c = "hello"
print(b)
there->(now)
'''
# Give the lexer some input
lex = lexer(data)
# Tokenize
while True:
tok = lex.token()
if not tok: break # No more input
print tok
It's a pretty simple lexer, but it produces the same exact output as the last one, so its good enough for now. I've had my head buried in compiler books and articles for the last few days. Parsing is more complex, so stay tuned for that.

