Category > geekstuff

Serving up Django with Tornado
Posted by postfuturist on 2010-06-06 20:35:55

This article is mainly of interest to web developers who might want to publish a site built with the Django web framework in a lightweight fashion (without involving Apache).

First, let me enumerate the stack that is serving this page to you right now.

  1. Hosting: VPS Hosting from ARP Networks
  2. Operating System: Ubuntu Server Edition 10.04
  3. Reverse proxy server (and file server): Nginx
  4. Web server: Tornado
  5. Web framework: Django
  6. Database: MySQL

Here is my Tornado script which I partially plagiarized from somewhere:

#! /usr/bin/env python

import os
import tornado.httpserver
import tornado.ioloop
import tornado.wsgi
import sys
import django.core.handlers.wsgi
sys.path.append('/some/folder/here/myapp')

def main():
    os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
    application = django.core.handlers.wsgi.WSGIHandler()
    container = tornado.wsgi.WSGIContainer(application)
    http_server = tornado.httpserver.HTTPServer(container)
    http_server.listen(8001, "127.0.0.1")
    tornado.ioloop.IOLoop.instance().start()

if __name__ == "__main__":
    main()

I'm serving up this script with Supervisor by creating a script in /etc/supervisor/conf.d/ called myapp.conf and it looks like this:

[program:myapp]
command=/some/folder/here/scripts/mytornadoscript.py

This is the nginx config:

server {
    listen    80;
    server_name deliciousrobots.com;
    rewrite    ^ http://blog.deliciousrobots.com$request_uri?;
}
server {
        listen       80;
    
        server_name blog.deliciousrobots.com;
        root /some/folder/here;

    	location / {
            proxy_pass      http://127.0.0.1:8001;
            proxy_set_header X-Real-IP $remote_addr;
        }

        location /static/ {
            expires 30d;
        }

        location /wp-content/ {
            expires 30d;
        }

        location /media/ {
            root /usr/share/pyshared/django/contrib/admin;
        }
}

I needed the wp-content from my old wordpress blog because of some image I had uploaded. That's it!

This Blog Now Django-Powered
Posted by postfuturist on 2010-06-03 23:53:25

This blog is now my first public site that I've built with the Python programming language and Django web framework. I needed some motivation to build a project that I could show people, so for the last week or so I've used a few bits of my personal time to build a Django-powered replacement for my WordPress / PHP blog. Django has proved to be simpler and more powerful than I had guessed. I spent likely no more than 15 or 20 hours total working on it, including time spent creating the style sheet and writing a database migration to get my old posts, categories, and comments into the new database.

It's still a bit rough around the edges, and missing a lot of the features of WordPress, but that just gives me things to work on when I'm on the bus to and from work (where most of the code for this blog was written.) I haven't added category pages, just the front page with the latest 10 posts, individual pages for each post with working comments (all comments are moderated for the time being so they will not show up right away), a sitemap.xml file for google's sake, and an RSS Feed.

Again, I've decided to shy away from Apache, even though it is the "easy" way to deploy webpages, especially Django apps. For the WordPress version of the blog, I had been running Nginx on the front, delegating requests to php-cgi instances. I kept Nginx as the frontend server, so all static files are served super fast, and regular page requests are proxied to a Tornado instance. It's bloody fast--even on this $10 a month VPS, and I'm not caching anything, yet. I'm using Supervisor to manage my Tornado script.

I know some things are broken, like images on old posts. I'll get to those, eventually. It's enough that I got it running for right now.

The Reluctant Sysadmin
Posted by postfuturist on 2010-04-25 12:12:04

As a software developer, I work on top of abstraction layers. There are a number of black boxes I build on top of. Compilers and servers fall into this category. The less time I think about the mundane details of how the code I write gets run, the more time I can spend dealing with the higher level abstractions, like "what is this application supposed to do, exactly?" That level of ignorance is helpful at times, but ultimately not healthy to maintain absolutely. That's one reason why I've been steeping myself in the black arts of compiler construction. The other seedy underworld I have placed myself in recently is that of server administration.

I'll cop to this: I suck as a sysadmin. My first act, after putting myself in charge of my very own server (a VPS, actually) was to lock myself out of administrative access. Well, the best way to learn something is by taking it apart and sometimes you break things this way. When I was just a young thing, I would quickly grow tired of playing with toys. Phillips screw drivers were my favorite tool, they allowed me to take apart almost any electronic toy or piece of equipment. Sometimes I broke things. I loved electric motors, I would take them out of my toys, and wire them up to batteries for fun or slightly nefarious purposes. But mostly I turned perfectly good toys into piles of parts. Once I figured out that my desk lamp had quite a bit of electricity flowing through it. I learned this by taking out the bulb and poking my finger into the bare socket. I had seen someone causing water to separate into hydrogen and oxygen (O2) gas by applying electricity, so I placed to wires into a glass of water and connected the other two to the parts of the same lamp socket. This produced a terrific flash of light, and melted some of the wires into balls of molten copper.

Me having a server is like a ten year old boy having 110 volts of electricity at his disposal. I don't really have a lot of experience with server security and I might get a shock here or there, but I probably won't burn the house down. Since you are reading this very blog, I have managed to successfully migrate it to the new server, as well as my wife's blog and a private git repo. I've even made some improvements. Apache isn't even installed on my server. I'm running Nginx which is directly interacting with php running through a CGI interface. I'm using normal WP Cache for page caching and APC for PHP opcode caching and it seems to be pretty snappy. Now that I have a server playground to work with, I can experiment with other technologies, like CouchDB, MongoDB, Node.js, Django, Rails, the oddly-named Hunchentoot, and whatever else I'd like.

All in all, I've had fun with the new server, setting up file permissions, making sure only the services I want running are running, no extra open ports, and all that. Getting Nginx to do my bidding is a bit of a challenge, but worth it compared to the massive hulk that is Apache. All this is possible through extremely inexpensive Linux VPS services provided by ARP Networks. Apparently, these guys don't spend any money advertising, they just are awesome and get business through word of mouth. That's probably why they are so inexpensive. The introductory level VPS is only $10 a month, compared to $20 for Linode or Slicehost for similar service.

Don't believe every blog you read - a cautionary tale.
Posted by postfuturist on 2010-04-19 19:45:02

Right now, I am using DreamHost shared hosting to host my blog. They pretty much do what they say they do. There is an advantage to easy-click installs in the land of web hosting. I'm really not a server administrator. I'm a programmer. I don't like to think about deploying to servers, or monitoring servers, or configuring servers. However, I decided that I wanted a VPS to host my stuff on. I want to run different things like a git-server, maybe a node-js server, OpenID and possibly even email. There is the idea that I shouldn't be beholden to any company for my personal data. I should be able to host the services that are important to me on a server that, for the most part, I control. After shopping around, I decided to go with ARP Networks. They seem to have really competitive pricing for Linux and Free/OpenBSD VPS hosting. So far, I feel like that is probably a good decision.

I didn't really feel like running full blown Apache on the VPS, and according to a little research, I could even port my Wordpress blog over to a VPS just with Nginx, PHP, and MySQL. I found a few blog entries detailing how they got this combination running. Well, there is one that I had gleamed some information from, but it turns out that information was wrong. Here was the command I was supposed to run, assuming that "jsmith" is my user name.

sudo usermod -G webmasters jsmith
That command is supposed to add my user to the webmasters group. Only problem is that it also removes you from every other group. The proper command is:
sudo usermod -a -G webmasters jsmith
That makes the group assignment an append, not a replace. Here is the important bit of the man page for usermod:
-G, --groups GROUP1[,GROUP2,...[,GROUPN]]] A list of supplementary groups which the user is also a member of. Each group is separated from the next by a comma, with no intervening whitespace. The groups are subject to the same restrictions as the group given with the -g option. If the user is currently a member of a group which is not listed, the user will be removed from the group. This behaviour can be changed via the -a option, which appends the user to the current supplementary group list.
OK, now here is the page with the faulty instructions. It's been up since August of 2008, that's nearly two years.

Hopefully, the folks at ARP Networks will be kind and help me out, or else I'm locked out of my VPS permanently. Don't believe every blog you read, even if looks legit and there are a bunch of comments at the bottom stating how useful and awesome the information is. Really, though, it is my fault for running commands that I didn't understand fully. Anything run as superuser should be carefully inspected.

Update: The ARP Networks folks bailed me out of my ignorance. Apparently, I could have fixed it myself with their out-of-band access tools. Today, I wear the scarlet N (for noob.)

A Simple Compiler, Part 3: Back to Flex and Bison and C and then Lisp?
Posted by postfuturist on 2010-04-15 19:29:52

My quixotic quest to build my own programming language or die trying has lead my in a few different directions. In my last installment, I was going to bring my Python-based lexer and parser into the .NET world with IronPython. After pursuing that angle for awhile I soon grew tired of waiting 10 seconds every time I wanted to run a Python script under IronPython. Every time. I don't really know what was going on each time I ran "ipy" in a terminal, but it would take about 10 seconds to do it. Regular old CPython might take a second the first time you fire it up, but after that you can run scripts from the command line with negligible delay. The more I thought about it, the less I wanted to tie myself to the DLR if it was going to be such a hog. I instead ported my lexer and parser back to Flex and Bison C-based code.

Here is the lexer. The syntax highlighter doesn't really handle flex/lex syntax, so it might be pretty bad.

%option noyywrap
%{
#include "gc/gc.h"
#include "simple.tab.h"
#include "simple.h"
%}
ID      [a-zA-Z_][a-zA-Z0-9_]*
QUOTE   \"[^\"\\]*(?:\\.[^\"\\]*)*\"
%%
-?[0-9]+    { yylval.i = atoi(yytext); return(NUMBER);   }
{QUOTE}     { yylval.c = GC_strdup(yytext); return(STRING);   }
print       { return(PRINT);    }
{ID}        { yylval.c = GC_strdup(yytext); return(getkeyword(yytext,IDENTIFIER));}
[ \t\n]+ /* blank, tab, new line: eat up whitespace */
->          { return(ARROW); }
\.          { return(DOT); }
\*          { return(STAR); }
\/          { return(SLASH); }
\+          { return(PLUS); }
\-          { return(MINUS); }
\(          { return(LPAREN); }
\)          { return(RPAREN); }
\{          { return(LCURLY); }
\}          { return(RCURLY); }
=           { return(EQUALS); }
%%

And the Bison parser.

%{
#include 
#include 
#include "simple.h"
// #include "block.h"

void yyerror(const char *str)
{
    fprintf(stderr,"error: %s\n",str);
}

%}

%union {
    struct node * n;
    int i;
    const char * c;
}

%token <i> NUMBER 
%token <c> STRING IDENTIFIER
%token PRINT FUNC IF ELIF ELSE FOR IN
%token EQUALS STAR SLASH PLUS MINUS LPAREN RPAREN LCURLY RCURLY ARROW DOT
%type <n> script expression explist block ifexp identlist

%right EQUALS
%left PLUS MINUS
%left STAR SLASH
%left DOT
%%

script : explist
    { scriptnode = $1 }
;
explist : /* empty */ 
    { $$ = newlistnode(nt_EXPLIST); }
| explist expression
    {  listnode_add($1,$2); $$ = $1; }
;
expression : NUMBER 
    { $$ = newnumber($1); }
| STRING 
    { $$ = newstring($1); }
| IDENTIFIER
    { $$ = newident($1); }
| expression STAR expression
    { $$ = newnode(nt_STAR,$1,$3); }
| expression SLASH expression
    { $$ = newnode(nt_SLASH,$1,$3); }
| expression PLUS expression
    { $$ = newnode(nt_PLUS,$1,$3); }
| expression MINUS expression
    { $$ = newnode(nt_MINUS,$1,$3); }
| expression EQUALS expression
    { $$ = newnode(nt_EQUALS,$1,$3); }
| LPAREN expression RPAREN
    { $$ = $2 }
| FUNC LPAREN identlist RPAREN block
    { $$ = newnode(nt_FUNC,$3,$5); }
| expression DOT LPAREN explist RPAREN
    { $$ = newnode(nt_FUNCCALL,$1,$4) }
| PRINT DOT LPAREN explist RPAREN
    { $$ = newnode(nt_PRINT,$4,NULL) }
| ifexp ELSE block
    { $$ = newnode(nt_ELSE, $1, $3); }
| ifexp
    { $$ = $1; }
| FOR identlist IN expression block
    { $$ = newnode(nt_FOR, newnode(nt_FOR_ITER, $2, $4), $5); }
;
identlist : /* empty */
    { $$ = newlistnode(nt_IDENTLIST); }
| identlist IDENTIFIER
    { listnode_add($1,newident($2)); $$ = $1; }
;
block : LCURLY explist RCURLY
    { $$ = newnode(nt_BLOCK, $2, NULL); }
;
ifexp : IF expression block
    { $$ = newnode(nt_IF, $2, $3); }
| ifexp ELIF expression block
    { $$ = newnode(nt_ELIF, $1, newnode(nt_IF,$3, $4)); }
;

There is a bunch of other C code that this parser is calling to build nodes in the syntax tree, mainly with the newnode() function which looks like this:

simplenode nn()
{
    simplenode n;
    n = GC_MALLOC(sizeof(struct node));
    n->L = NULL;
    n->R = NULL;
    return n;
}

simplenode newnode(int type,simplenode L, simplenode R)
{
    simplenode n = nn();
    n->type = type;
    n->L = L;
    n->R = R;
    return n;
}

There is of course a bunch of other code, but honestly, it is a bit tiring to do all this tree manipulation in C. It's a fine language, but a bit verbose and fiddly, what with all the pointer madness, which I've attempted to simplify as much as possible. In the preceding code, the simplenode type is really just a typedef of struct node *. And this is what a node really is:

struct node
{
    int type;
    union {
        simplenode L;
        Array arr;
        int number;
        simplestring string;
    };
    union {
        simplenode R;
        Block block;
    };
};

As you can see, I really just want a node to be a dynamic container that can hold other tree nodes, or data based on type, but then I feel like I am trying to turn C into something it isn't. I'm declaring types so I can use them, but it turns out that very little is enforced at compile time but more importantly run-time. I like my dynamic languages to have strong typing so when the code is running an integer isn't happily treated as a pointer, corrupting some random memory leading to hard to reproduce bugs. I have found some useful tools.

One of those tools is gdb the Gnu Debugger. As long as I compile with -g, I can set breakpoints or run the code from gdb and see what is happening when the program segfaults. Development of this code has lead to a lot of segfaults. It's a painful development style. I guess it stopped being fun, so I've set this whole thing aside to take a look at other code. That lead me to another runtime that is already doing something very similar, Lua. The Lua runtime is dynamic with strong typing, written in portable C, fast, as is a pretty decent language, with first class functions and other nice features. It has a lot of moving parts, like it's own lexer, parser, and garbage collection. I had already punted on all three, utilizing flex, bison, and Boehm GC.

Then, something else happened. I remembered all those unbelievably over-confident Lisp guys, even Paul Graham raves about Lisp. I remembered vaguely that it had garbage collection and bunch of other things decades before they became used in mainstream programming. I thought I had better see what all the fuss is about before I go off and try to build my own programming language. Well, I got more than I bargained for. Happily, all you need is an internet connection and a computer to find some highly recommended books on Lisp. I am currently working my way through Practical Common Lisp by Peter Seibel. The complete text is available on Seibel's webpage. Neat, huh?

I am about 9 chapters in and starting to get a feel for the language, but it hasn't been easy. I have this to say about Lisp before I say anything else, it is not very approachable. Even though Seibel is a great writer (loved Coders at Work, I own an ink-on-wood-pulp version), it takes intense concentration to understand what the heck is going on. Lisp is different. It's different in the way that animal species on Australia are different. It's like it grew up as a language in isolation from other species of language. The genetic lines are so different between Lisp and the Algol/C family of languages that I feel intensely uncomfortable at times. I have to stop, go back and reread a section when I get to certain code samples, or sit and stare and try to puzzle out what is going on in the code.

Most blobs of code are basically prose. You can read through them and get a feel for what is happening quickly, but Lisp is more like poetry. There is a lot of meaning packed into each line and symbol. The ability to stack up abstractions is mind boggling. The Common Lisp flavor of Lisp, which is basically a standardization of a bunch of production Lisps is not a pure functional language. It is far from it, though it does have a lot of functional bits and with some restraint can be used like one might use a functional language. It's data structures are all very mutable. Lisp allows a variety of programming styles and paradigms. The crazy syntax is just a generic nested list (tree) syntax making the data and code pretty much the same thing. This has some heavy implications for meta-programming. I'll leave it at that for now. I'm going to keep investigating this crazy Lisp world for awhile before I do anything else.

A Simple Compiler, Part 2: Write My Own Lexer in Python
Posted by postfuturist on 2010-04-01 22:14:02

As you may remember from Part 1 of this series, I have embarked on the task of creating my own programming language. I wrote a lexer and parser for my theoretical language in Python using PLY. That is all well and good. The next part I need is a good runtime. Here are some of my options:

  • Interpret the parse tree in Python.
  • Generate machine code and build native executables.
  • Target LLVM.
  • Generate C code to be compiled by GCC.
  • Target the JVM.
  • Target the .NET/Mono CLR.
  • Build it with Parrot.
  • Build it with the Dynamic Language Runtime which sits on top of the CLR.
There are plenty of other options. I've already implemented a subset of the language directly in Python, the first one. I don't really want to spend a lot of time doing that because being interpreted inside Python is going to be laughably slow. I still might, as a reference implementation, but it doesn't sound like much fun. I don't really want to build a language that compiles natively, I want the language to be more dynamic than that. The best options look like JVM (weak dynamic language support, still), Parrot (a bit of a moving target), and the DLR. Now, I'm not a huge Microsoft fan, but they have certainly put a lot of effort into the .NET framework, and the DLR, to their credit, seems to be a fantastic bit of engineering. Since it is open source and available in the repos of some bleeding-edge Linux distros (I'm using Ubuntu 10.04 beta), I decided to go that route, or at least try.

I need to build my parse tree in some .NET language. Fortunately, IronPython exists to run my Python code inside the Mono environment. Unfortunately, PLY chokes when run on IronPython. I tried debugging the issue, and once I got past one, I found another. So, I've had to start over a bit. The first thing I did was rewrite the lexer in a subset of Python that also runs on CPython as well as IronPython. Here it is:

# lexer2.py | a simple lexer in python
import re

# Token class.  This class is stolen from PLY
class LexToken(object):
    def __str__(self):
        return "LexToken(%s,%r,%d,%d)" % (self.type,self.value,self.lineno,self.lexpos)
    def __repr__(self):
        return str(self)

class lexer():
    def __init__(self,input):
        self.input = input
        self.pos = 0
        self.input_length = len(input)
        self.lineno = 0
        self.charno = 0
        self.reserved = {
            'print': 'PRINT',
            'func': 'FUNC',
            'if': 'IF',
            'elif': 'ELIF',
            'else': 'ELSE',
            'for': 'FOR',
            'in': 'IN',
        }
        self.symbols = [
            ('->', 'ARROW'),
            ('=', 'EQUALS'),
            ('*', 'STAR'),
            ('/', 'SLASH'),
            ('+', 'PLUS'),
            ('-', 'MINUS'),
            ('(', 'LPAREN'),
            (')', 'RPAREN'),
            ('{', 'LCURLY'),
            ('}', 'RCURLY'),
        ]
        self.complex = [
            (r'[a-zA-Z_][a-zA-Z0-9_]*', 'IDENTIFIER'),
            (r'"[^"]*"', 'STRING'),
            (r'-?\d+', 'NUMBER'),
        ] 
        self.tokens = self.reserved.values() + [x[1] for x in self.symbols + self.complex]
        self.complex_compiled = [(re.compile(r), t) for r,t in self.complex]
        self.symbols_compiled = [(s,t,len(s)) for s,t in self.symbols]
        self.ignore = " \t\n"

    def getTokens(self):
        return self.tokens

    def token(self):
        # skip ignored characters
        while(self.pos < self.input_length and self.input[self.pos] in self.ignore):
            if(self.input[self.pos] == "\n"):
                self.lineno += 1
                self.charno = 0
            else:
                self.charno += 1
            self.pos += 1
        if(self.pos == self.input_length):
            return None
        for cr,ttype in self.complex_compiled:
            mo = cr.match(self.input, self.pos)
            if(mo):
                value = mo.group()
                pvalue, pttype = self.process(value, ttype)
                tok = LexToken()
                tok.value = pvalue
                tok.type = pttype
                tok.lineno = self.lineno
                tok.lexpos = self.pos
                self.pos += len(value)
                self.charno += len(value)
                return tok
        for sym,ttype,length in self.symbols_compiled:
            if self.pos + length < self.input_length and self.input[self.pos:self.pos + length] == sym:
                tok = LexToken()
                tok.value = sym
                tok.type = ttype
                tok.lineno = self.lineno
                tok.lexpos = self.pos
                self.pos += length
                self.charno += length
                return tok
        print "invalid character : ", self.input[self.pos], " at ", self.lineno, " , " , self.charno
        return None
    
    def process(self, text, ttype):
        if(ttype == 'IDENTIFIER'):
            ttype = self.reserved.get(text, 'IDENTIFIER')
        elif(ttype == 'NUMBER'):
            text = int(text)
        elif(ttype == 'STRING'):
            text = text[1:-1]
        return text, ttype
        

if __name__ == "__main__":
    data = '''
    a = 3 + 4 * 10
    b = a + -20 *2
    { print(300) }
    c = "hello"
    print(b)
    there->(now)
    '''

    # Give the lexer some input
    lex = lexer(data)

    # Tokenize
    while True:
        tok = lex.token()
        if not tok: break      # No more input
        print tok
It's a pretty simple lexer, but it produces the same exact output as the last one, so its good enough for now. I've had my head buried in compiler books and articles for the last few days. Parsing is more complex, so stay tuned for that.

A Simple Compiler, Part 1: Parsing with Python and Ply
Posted by postfuturist on 2010-03-24 23:55:37

In an earlier blog post I encouraged folks to write their own programming language for fun and education. It may not end up being useful, but you'll probably have fun and almost certainly learn something along the way. That said, I've been making my own baby-steps towards creating a programming language of my own. I don't have a name for it yet (other than "simple," because that's what it is) but it is coming along. And, yes, this is the fun part, designing the syntax. A warning for all the ninja hackers who write DSL's in their sleep: this is noob territory, so don't expect any new theories on tracing JIT's or whatever is hip in the world of programming language nerds.

I'm using the terrific PLY, a Python implementation of Lex and Yacc. At first I started with Flex and Bison, the spiritual successors to Lex and Yacc, but I got bogged down with picking the right C data structures and related memory management. What a waste of time. The Python workflow is much faster. I ported my Flex and Bison code to PLY in an evening and then added code which builds a simple AST and code to "run" a sequence of expressions--all in the same evening. I am not really compiling anything, per se. I am just building an abstract syntax tree and then interpreting the tree. There is no machine code or even byte-code. Since it is Python interpreting the code, it is probably a couple orders of magnitude slower than would be reasonable for a production programming language. But it is a toy language and I'm just having fun, so here goes.

This is simplelex.py, the scanner (or lexer) which merely interprets a string of characters and emits a string of tokens based mostly on some regular expressions. Nothing here is new, most of it is straight out of the PLY documentation:

# simplelex.ply | a simple lexer in python using ply
import ply.lex as lex

reserved = {
    'print' : 'PRINT',
    'func' : 'FUNC',
}

tokens = [
    'IDENTIFIER',
    'STRING',
    'NUMBER',
    'EQUALS',
    'STAR',
    'SLASH',
    'PLUS',
    'LPAREN',
    'RPAREN',
    'LCURLY',
    'RCURLY',
] + list(reserved.values())

t_EQUALS        = r'='
t_STAR          = r'\*'
t_SLASH         = r'/'
t_PLUS          = r'\+'
t_LPAREN        = r'\('
t_RPAREN        = r'\)'
t_LCURLY        = r'\{'
t_RCURLY        = r'\}'

def t_STRING(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value,'IDENTIFIER')
    return t

def t_NUMBER(t):
    r'-?\d+'
    t.value = int(t.value)
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()

OK, I'm using some pretty naive string matching and silly naming of "*" and "/" operators, for fun. Remember, programming language design is fun. If you don't know what's going on in any of this code, look at the relevant PLY docs (the section on lexing) and it will all make sense. Basically, you can define tokens with a simple regular expression or a function that starts with a regular expression and can munge the input a bit before spitting it out, like stripping the double quotes off of a string literal.

Step 2 is a little more interesting, thought still relatively trivial. That's the parsing stage. I'll give one to you in chunks:

# simpleyacc.py | a simple parser in Python using PLY
import ply.yacc as yacc

from simplelex import tokens

precedence = (
    ('right', 'EQUALS'),
    ('left', 'PLUS'),
    ('left', 'STAR', 'SLASH'),
)

def p_error(e):
    print('error: %s'%e)

def p_explist(p):
    '''explist :
    | explist expression'''
    if(len(p) < 3):
        p[0] = ('EXPLIST', [])
    else:
        p[1][1].append(p[2])
        p[0] = p[1]

The parser needs to know about the tokens defined by the lexer, so those get pulled in from simplelex.py. The precedence bit gives PLY some knowledge about the binary operators (the ones that operate with two values). They are listed from lowest to highest priority. Enough about that.

The "explist" function defines the top-level nonterminal. A file or string of text will essentially be reduced to a list of expression. In this example, they are formed left recursively, starting with the empty set and adding expressions on to the right. The doc-string for the function is basically Yacc/Bison-style BNF. The function itself gets passed an array (or indexable sequence of some sort) which represents the bits of the rule. p[0] always represents the left side of the rule, and you can assign any value to it that you like, it is what the parser will return. For this code, I just am having the parser rules return tuples where the first element is a name and the other parts are the important pieces of the code. In this first rule, this code either constructs a tuple with an empty list, or adds the expression to the existing explist and assigns that as the value of the outer explist.

What's interesting about this is that I am not defining the need for any kind of delimiter between expressions (save whitespace). I may retract this later, requiring new-lines (which will have to be reported instead of ignored by the lexer) or semicolons if you want to stack multiple expression on the same line. That's a decision for later, right now it is perfectly legal to do something like :

a = 10 print(a).

I don't use the term "statement" to describe expressions. I may recant this later, too, but I honestly don't see the difference. It makes things simpler if everything has a value, even if that value is null. Here is some more of the simpleyacc.py file:

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = ('NUMBER', p[1])

def p_expression_string(p):
    'expression : STRING'
    p[0] = ('STRING', p[1])

def p_expression_identifier(p):
    'expression : IDENTIFIER'
    p[0] = ('IDENTIFIER', p[1])

def p_expression_binaryop(p):
    '''expression : expression STAR expression
    | expression SLASH expression 
    | expression PLUS expression 
    | expression EQUALS expression'''
    if(p[2] == '*'):
        p[0] = ('STAR', p[1], p[3])
    elif(p[2] == '/'):
        p[0] = ('SLASH', p[1], p[3])
    elif(p[2] == '+'):
        p[0] = ('PLUS', p[1], p[3])
    elif(p[2] == '='):
        p[0] = ('EQUALS', p[1], p[3])

def p_expression_print(p):
    'expression : PRINT LPAREN expression RPAREN'
    if(p[1] == 'print'):
        p[0] = ('PRINT', p[3])

def p_expression_parenthesized(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

All of these rules create the same thing, expressions. There is plenty of recursion going on, as expressions are defined in terms of other expressions as well as terminals (tokens from the lexer). It's a lot of BNF and constructing a tree of tuples. So far so good, this code compiles and let's see the output from some input code. Here's the input: a = 10 + 2 b = a * 3 print(b) c = (b + 10) * 3 and this is the parse tree that is constructed:

('EXPLIST', 
    [('EQUALS', 
        ('IDENTIFIER', 'a'), 
        ('PLUS', ('NUMBER', 10), ('NUMBER', 2))),
    ('EQUALS', 
        ('IDENTIFIER', 'b'), 
        ('STAR', ('IDENTIFIER', 'a'), ('NUMBER', 3))),
    ('PRINT', ('IDENTIFIER', 'b')),
    ('EQUALS', 
        ('IDENTIFIER', 'c'), 
        ('STAR', 
            ('PLUS', ('IDENTIFIER', 'b'), ('NUMBER', 10)), 
            ('NUMBER', 3)))
    ]
)

The code to "run" this AST is pretty straight forward, iterate through expressions of the "explist" and get the values recursively, storing variables in a dictionary. That code will appear in later installments of this series when it is in a less embarrassing state. Let's look at some other things we can create with the parser and a problem I am having with my syntax.

When I got to this point in my code, I realized that I wanted to add functions. The astute reader would have already noticed the 'LCURLY', 'RCURLY' and 'FUNC' tokens in the lexer that aren't used by the parser. I don't want C-style, stand-alone functions, I want functions to be first class. I want them to be expressions, too. So I came up with this pretty straight-forward syntax:

f = func (x y) { z = 10 print(x + y + z) }

In this example, "func" is a reserved keyword kind of like "function" in JavaScript or "lambda" in Python. It is followed by a list of identifiers inside of parentheses which is followed by a curly-brace delimited block which contains a list of expressions. Here's the code for that:

def p_identlist(p):
    '''identlist :
    | identlist IDENTIFIER'''
    if(len(p) < 3):
        p[0] = ('IDENTLIST', [])
    else:
        p[1][1].append(p[2])
        p[0] = p[1]

def p_expression_block(p):
    'block : LCURLY explist RCURLY'
    p[0] = ('BLOCK', p[2])

def p_expression_function(p):
    'expression : FUNC LPAREN identlist RPAREN block'
    p[0] = ('FUNC', p[3], p[5])

The "block" could have just been part of the definition of function expression, but I made it separate as there is more than one use for a block, and I'll probably need it later. Also, if I want to complicate my life I may replace the curly brace syntax with a Python-style indentation system, or make it optional. Here's the parse tree for the trivial case (func () {}) :

('EXPLIST', [('FUNC', ('IDENTLIST', []), ('BLOCK', ('EXPLIST', [])))])

Here's where my initial plan shows some cracks. Let's say I add the following rule to add function calls to the language syntax:

def p_expression_func_call(p):
    'expression : IDENTIFIER LPAREN explist RPAREN'
    p[0] = ('FUNCCALL', p[1], p[3])

PLY complains: Generating LALR tables WARNING: 1 shift/reduce conflict The problem is obvious: ambiguous grammar. Right now this statement can be parsed two ways: f (x) As a single function call or two separate expressions, one of which is just parenthesized. One obvious solution is to have something that separates different expressions in an expression list other than just white space, like a new line or semicolon, that is how Python solves this dilemma. I can think of other solutions:

  • Use something other than parentheses to delimit function args : f<x>
  • Make a requirement that separate expressions have to have some white space between them : f(x) vs. f (x).
  • Some extra symbol or keyword for function calls : f->(x) or f.call(x).

I'll go with the arrow "->" operator. I'll leave adding that symbol to the lexer as an exercise for the reader. Here's the parse rule, redone:

def p_expression_func_call(p):
    'expression : expression ARROW LPAREN explist RPAREN'
    p[0] = ('FUNCCALL', p[1], p[4])

In addition to the new syntax, which is unambiguous, I changed it so you can make a function call from any expression, since any expression may contain the value of a function. That means that this: f = func (x) { print(x) } f->("Hello, World") can be replaced with this: func (x) { print(x) } -> ("Hello, World") I really need to change the "print" builtin for consistency to this: func (x) { print->(x) } -> ("Hello, World")

This is all good fun, however syntax is just the surface of a language. If Python had a C-like syntax on the surface, I'd still use it. It would still be duck typed, with powerful, built-in data structures, garbage collection, and a strong standard library. With Python you come for the syntax and stay for everything else. I don't think PHP is a poorer language because it is C-like or uses dollar signs at the beginning of variable names. I think it is poorer because of the weak typing, inconsistencies, and gotchas that permeate the language.

The programming language I am creating needs control flow, a type system, a runtime of some sort, and a basic library of essential functionality. Stay tuned for more on this.

Please, you wretched cookie-cutter forums of the internet, you can do so much better.
Posted by postfuturist on 2010-02-15 02:10:26

Once upon a time I answered a question on LinuxQuestions, a slow, ugly website that uses vBulletin--your average, run-of-the-mill, commercial, PHP-based forum software. I couldn't just answer the question, I had to create an account, and give them an email address. Without warning, they started sending me weekly emails, reminding me how awesome they are, what's "happening" with the site and why I should return. Well, I've never been back to the site. There are a few reasons why I haven't been back to the site.

  1. It's ugly.
  2. It requires me to remember a password, or go through the hassle of resetting the password through email verification.
  3. I don't like to hang out on forums. Or, at least, I don't anymore.
  4. I actively avoid the site's links when they show up in search results.

vBulletin is bad. It powers a lot of forums in the world, and you can recognize it a mile away. I looked up vBulletin. Amazingly, people pay money for it. I could enumerate its sins for a long time, but I need to focus on one. In order to not receive weekly emails, you have to change your account settings. To change your account settings, you have to log into your account. If, like me, you don't remember passwords for sites you rarely visit, you have to reset the password through email. This is a familiar song and dance to many, but it angers me immensely. It is trivially easy to generate a single, safe, instant unsubscribe link for mailing list emails. Why don't LinuxQuestions emails have a one-click unsubscribe? Easy, they are idiots. They are stupid for using crappy, commercial forum software. They are idiots for spamming my inbox and forcing me to jump through a bunch of hoops to unsubscribe.

Sites like Stack Overflow use my OpenID login, which I know well. I can remember a single username / password for multiple sites without the fear of one site secretly stealing my password and using it to log in to another site as me. Trust me, more sites than you want to know store your password as plain text in their databases. Why am I picking on this one website? They are trying to get more visitors who provide more content, so they can extract advertising money or satisfaction that they are helping Linux users everywhere. They want to be a popular destination for a certain niche, but they completely fail to create a compelling experience in any way at all. It is so bad, I actively ignore search hits for their site.

Software matters. Usability matters. You can't just punt on the software. Invest in it. Pay a real designer and get a real programmer on your team. Unless you are independently wealthy, you need a software developer on your team. This software developer will want to make the site awesome, since he is personally invested in the success of the site. People want to know what makes a website successful. Ultimately, its the folks writing the code who make or break a site. The software developer needs access to the code, so don't even dream of buying a closed-source, commercial solution. Start with the best open source solution (let the software developer pick), pay for a good design (don't do a design competition, that's exploitative), and let the software developer merge the two. Your software developer uses social web software all the time. He knows what's good. Force him to eat his own dogfood (use the site himself), and he'll work harder at improving the experience. Elicit feedback from your users and implement popular features.

The tragedy is that these old, broken forums end up loaded with useful information donated by the users despite the shortcomings of the software the site rests on. Don't post useful information on these sites. Please, don't encourage them.

Compiler code generation: just flatten the tree
Posted by postfuturist on 2010-02-09 01:53:09

I spent some time tonight working on the compiler I am writing for my own education. I started this project while only understanding bits and pieces of what it takes to write a compiler, but I am quickly learning more. Various texts on compilers start with creating a simple calculator program that takes a string like "10 + 2 * 6 - 4 / 2" and calculates the result. For this example the problem can be broken down into a few simple steps which are merely transformations:

  1. Transform the input into a string of discrete tokens. "10 + 3 * 4" becomes "10", "+", "3", "*", "4". This is achieved with a scanner/lexer like lex, or the newer flex.
  2. Transfom the token stream into a parse tree using a BNF description of the grammar with a parser tool like yacc or the newer bison. Using infix notation, our stream becomes ("+", "10", ("*", "3", "4")).
  3. Reduce the tree by walking it and calculating the value of each node, depth first, until you have calculated the value of the root node which is the value of the whole tree.
    1. Evaluate left side of "+" operation. It is the constant value 10.
    2. Evaluate right side of "+" operation.
      1. Evaluate left side of "*" operation. It is the constant value 3.
      2. Evaluate right side of "*" operation. It is the constant value 4.
      3. Reduce: 3 * 4 is 12
      It is the numerical value 12.
    3. Reduce: 10 + 12 is 22

The whole thing is straight-forward and almost insultingly easy. The tools I've been using are Flex and Bison which generate the scanner and parser, respectively, in C. While writing these parts to generate a parse tree, which could also be called an abstract syntax tree or AST, I had a realization. Code generation, whether it is byte-code or machine-code, is simple. The code does the same tree-walk, in the same order, but instead of evaluating and reducing the elements, just generate byte-code or machine-code which will do those simple operations at runtime. The program spits out a flat sequence of code. You are merely writing code that flattens the tree.

To put it another way, the sequence of mathematical operations that reduce the parse tree to a single value is the same ordered sequence of operations that the compiled code should represent. It's so obvious, it's painful. I just hadn't thought about it before. I thought there would be some deep voodoo involved in compilation, but there isn't. Your program is likely to be a series of expressions, not just one. All you have to do is evaluate them in order, tacking the byte-code or machine code for each expression on to the end of the code.

Here is a series of transformations that does what I am talking about:

  1. Input : "x = 10 + y * 3"
  2. Scanner : "10", "+", "y", "*", "3"
  3. Parser : ("=", "x", ("+", "10", ("*", "y", "3")))
  4. Tree walk :
    1. Store the location of "x" in R0
    2. Store "10" in R1.
    3. Store the value at location "y" in R2
    4. Store "3" in R3
    5. Store the product of R2 and R3 in R4
    6. Store the product of R1 and R4 in R5
    7. Store R5 at the location specified in R0

It's just one more step to take the list of operations and turn them into a list of coded instructions. At that point you are done.

One of the things I wanted with my compiler is type inference, so when I get to an "=" or "assignment" node of my tree, I evaluate the right side first, which gives me a type, then I know what the type of the identifier on the left should be. The first time an identifier is used should always be an assignment, and the type of the identifier is set at that time, if another assignment to the same identifier is made of a different type, the program should fail to compile. Here is the messy / incomplete code that handles this right now:

int genAssignByteCode(simpleblock b, simplenode n)
{
    if(n->R == NULL) {
        printf("No R-value in assignment!\n");
        exit(EXIT_FAILURE);
    }
    if(n->L == NULL) {
        printf("No L-value in assignment!\n");
        exit(EXIT_FAILURE);
    }
    if(n->L->type != IDENTIFIER) {
        printf("L-value is not an identifier!\n");
        exit(EXIT_FAILURE);
    }
    int reg_num = genByteCode(b, n->R);
    if(reg_num == -1) {
        printf("The r-value has no value. Lame.\n");
        exit(EXIT_FAILURE);
    }
    enum reg_type rval_type = get_reg_type(b, reg_num);
    simplesymbol inmap = in_map(b, n->L->data.string);
    struct byte_code bc;
    if(inmap == NULL) {
        // new value with big, scary TYPE INFERENCE
        int newr = new_reg(b,rval_type);
        new_symbol(b, n->data.string, newr);
        if(rval_type == rt_int) {
            bc.t = BCI_STORE_INT;
        } else if(rval_type == rt_string) {
            bc.t = BCI_STORE_STR;
        }
        bc.reg_a = newr;
        bc.reg_b = reg_num;
    } else {
        // existing value, better do some static type checks
        enum reg_type lval_type = get_reg_type(b,inmap->reg_num);
        if(rval_type != lval_type) {
            printf("You got the wrong type, pal.\n");
            exit(EXIT_FAILURE);
        }
        if(rval_type == rt_int) {
            bc.t = BCI_STORE_INT;
        } else if (rval_type == rt_string) {
            bc.t = BCI_STORE_STR;
        }
        bc.reg_a = inmap->reg_num;
        bc.reg_b = reg_num;
    }
    new_code(b,bc);
    return bc.reg_a;
}

Basically, simpleblock is a typedef'ed pointer to a struct that represents a block of code. It keeps an array of virtual registers allocated, array of instructions, and linked list of symbols (identifiers). This code generates the byte-code for the assignment operation. It is called from the genByteCode function which it calls recursively to evaluate the right side of the statement. The line new_code(b,bc); actually appends the byte-code bc to the array of instructions in block instance b. This is C code, and as such, it is really ugly. Also, the error messages are unhelpful and quite rude. This is a rough-draft of sorts. There are a lot of parts not in the code that would help explain it, too.

While constructing this part, I realized that it is very easy to do type inference. Java and C# took years and years to acquire type inference of any kind, but I don't see what all the fuss is about, it is quite simple to implement, really. Compilers are not scary, go write one of your own already. Go create that perfect programming language. Do it now, or else you'll spend the rest of your life writing software in programming languages that you don't like and you'll only have yourself to blame.

Create your own programming language for great justice
Posted by postfuturist on 2010-02-07 21:52:12

In an earlier post I wrote my wish-list for the perfect programming language (for me.) I believe most programmers have a perfect programming language in mind. It usually goes something like this: "I really like the object model of language W and the functional aspects of language X and the syntax of language Y. It would be great if it ran on Z runtime / virtual machine, too." Then there are the language zealots who say that language Foo is perfect, or at least better than all the other commonly used languages and with such a large user base and library of reusable code, that even if they could fix the bits of the language they don't like, it wouldn't be worth it because it would break existing code. In a sense, they've settled, put their tent pegs in the ground and encourage others to set up camp where they are, so they can all share code. There is certainly a lot of value is just spending a lot of time one language so you can go deep with it and develop a strong set of idioms and patterns. It also leads to a large number of very exclusive camps. To be fair, many programmers dabble in multiple languages and paradigms, but most do not.

I often am offended by programming language zealots. Just like religious zealots, they are myopic, self-important and insulting towards outsiders. I love much of the writings of Paul Graham, but I find that his statements about Lisp and all that which is not Lisp to be difficult and often insulting. In his article "Beating the Averages," Graham talks about a hypothetical language Blub that a programmer gets attached to. This programmer sees less powerful programming languages for what they are, but doesn't understand the abstractions and power of better programming languages. The programmer thinks he is using the best programming language, but he isn't. He isn't using the best programming language, because... he isn't using Lisp. Period. Graham states explicitly, "Lisp is so great not because of some magic quality visible only to devotees, but because it is simply the most powerful language available. And the reason everyone doesn't use it is that programming languages are not merely technologies, but habits of mind as well, and nothing changes slower." Again, I like Paul Graham's writings for the most part, but saying that he is correct about this because he seems to be correct about so many other things would be a fallacy as much as saying that he is wrong because everything else he says is wrong.

I agree with many points in Graham's article, more or less, but I don't think that most programmers are small-minded Blub programmers. I agree that Lisp macros are powerful, that it is a powerful concept to write programs that write programs etc, but Lisp is hardly the only language that does that. Boo, a language that actually fulfills my wish-list, has macros which are very similar in functionality to Lisp macros. You get access to the actual parse tree and can create or change or augment language features in the language itself. It is a powerful feature, but it is not the most powerful feature, at least not for me. It's not that I have a small mind, either. I understand that Lisp macros are awesome, but it is not an abstraction I need in day to day programming to get things done. I'm a big fan of getting things done, and building systems to help get them done faster and better. Graham argues that he used Lisp with lots of macros in a business that did well, with software that delivered functionality not present in the competitor's projects, and that this was all because of Lisp macros, which comprised at least 20% of his product's code. I'm sure there was a correlation between having features and being successful, but the 20% macros thing may just be significant of the fact that he had to add a lot of things to Lisp to be able to build the software he wanted. It could also be significant of the fact that he knows how to write macros. Just because the competitors did not deliver the same features does not mean that they didn't or couldn't because they used some less powerful programming language. The software he wrote could likely have been written in any number of languages by any number of programmers. The programmers that toil away in so-called Blub languages, oftentimes are quite good at what they do, and can deliver features if they can think them up or are allowed to add them or whatever. Look at Facebook, they rock the world with the power of PHP, a pretty awful language by the estimation of many. I just doubt that Lisp was the reason for Paul Graham's success.

I said all that to say this: Lisp worship aside, it is a good idea for programmers to learn things about programming languages that they don't understand. It is healthy to stretch your mind with different perspectives. I believe that it is a universally accepted truth that looking at a problem from different mental perspectives causes us to have breakthroughs of new ideas and understanding. One excellent way to expand your mind as a programmer is to create your own programming language--not just a theoretical grammar, but an actual working compiler and/or interpreter.

I have to admit, I've been a little lackadaisical regarding hacking code in my free time. I think I haven't been able to get excited about a project. I've announced, prematurely, some projects in the past that have been since abandoned. Perhaps it is that I spend so much time at work "getting things done" that I relish the freedom to not get things done outside work. It's true, I like messing around with different programming languages and not committing to projects. However, since the programming language wish-list post, I've been drawn to my newest project over and over. I am working on a new programming language. It is really in the very earliest of toy language stages right now. I am really just learning things as I go. I am learning how to use flex and bison, the descendants of lex and yacc, what abstract syntax tree is, and many other things. I've read a lot of articles about different aspects of compiler / interpreter design and techniques. Some things have inspired me a great deal, like V8, the JavaScript engine that Google created. I've been interested and inspired by byte-code interpreters, JIT compilation, garbage collection, stack-less interpreters, coroutines, closures, and many other things.

As you can imagine, I have a laundry-list of things I would like to implement in my very own programming language, but this project is mostly about learning, not the end product. I know most folks with a BSCS have taken a course on compilers and have written a compiler of some form or other. With all the instruction on compilers that has taken place, why do we settle for such mediocre languages. Python 3, was a very small step forward from Python 2.x. It was such a small step, that I wonder what the point was. They cleaned up the syntax and some of the libraries a little bit, but for a backwards compatibility breaking change, they sure didn't do much to fix the inadequacies of the language. The "hot" programming languages these days are incredibly old--at least in internet years. Ruby is 15 years old. Python is 19 years old. Why, after developing a science around compilers, do we still use languages that require semicolons on nearly every line? If Ruby and Python are each at least a decade and a half old, do so many companies write code in C#, Java, PHP, and so many other languages that have vestigial syntax from C? Perhaps it is the slowness of businesses. I don't think so. I think it is laziness.

Software developers have the ability to write their own compilers. Yet, they do not. It is intellectual laziness. We are blacksmiths. We can forge our own tools, yet we use the crappy ones handed down to us from old. There are established techniques for creating new tools, yet we forge on with the tools given to us. For shame. What if every developer wrote their own programming language? Sure, most would fall by the way-side and we need developers to write libraries, too and to research other concerns like concurrency, that don't necessarily need to be solved at the language level. There are millions of software developers in the world. Python is dead. If you have the chance, fire up the python interpreter some time and type in "import this" and hit enter. You will discover a creed, a philosophy that underpins many of the decisions surrounding the creation and maintenance of Python. It is a philosophy I disagree with on many points. Not only that, but the Python language fails to fulfill a good number of them. "In the face of ambiguity, refuse the temptation to guess." Python does not allow type annotations, everything is guesswork. "foo = bar" could mean a lot of things, it could be changing the value referenced by "bar" or it could be creating a new reference. "bar" could be anything, a number, a string, an object. Looking at Python code is often a lot of guesswork. Duck typing is the only typing available. When you create a function, you can only name the inputs, not define the type. It seems to fail here. Here is another : "There should be one-- and preferably only one --obvious way to do it." That line gets a lot of deserved criticism. It's not even true. Python is a very powerful language that offers a very large number of way to solve any problem, and usually several of them could be considered obvious, depending on the style of code you normally write. Regardless, even if you could write a language that only had one obvious way to solve each problem, you would have a very poor language. "If the implementation is hard to explain, it's a bad idea." CPython has the GIL, both hard to explain, and a very bad idea.

I encourage every developer to write their own programming language. It will break us out of the complacency that allows us to toil day-in and day-out using the same leaky, broken abstractions. We can write our own tools. I've seen the HTTP 1.1 RFC, it isn't that complicated, and I think the tools we use now to implement server-side code for web development are quite broken. PHP is just awful. Python is a little better, as is Ruby. V8 / JavaScript / Node.js has some promise, but JavaScript is a pretty broken language, too. It's got that semi-colon problem, plus wierd object / array / function rules that lead to ambiguity and confusion. At the end of the day, you can't have everything in one language, though they are certainly trying to do that with Perl 6. We can have better choices, or at least break through the complacency.

I don't want to sound completely critical. Change is happening. People who are stuck with certain runtimes, like the JVM or .NET have created or ported better languages for the platform. There is quite a bit of excitement about new compiler tools, runtimes, front-ends, back-end etc. The LLVM project has gained a lot of attention, lately, for being completely self-hosting.

My own project is just barely getting going. I have the code constructing a complete AST, and I am in the process of writing the part that generates byte-code. I've learned so much already about grammars, tree manipulation, and parsing. I have ideas about register allocation and other things. It is quite the experience to get my hands dirty writing C code again. I am taking advantage of tools I didn't know about in the past, like the Boehm garbage collector. I've spent too long in garbage collected languages to have to worry about freeing every bit of memory by hand now. That and generous use of typedefs make the experience not entirely unpleasant.