How tosh is parsed

Nov 5 2015 • Filed under tosh

I mentioned that parsing is hard on the Scratch forums, and I was asked the following about tosh:

MegaApuTurkUltra wrote:

Lol, are you using some sort of parser generator or just making it from scratch? pun intended

The answer is complicated, and seemed like a good excuse for a longer post here.

Architecture

As I’ve said before, tosh is a difficult language to parse.

In particular, the syntax highlighter has to be based on the same parser as the language itself. Most languages use a simpler system for highlighting, based on regular expressions and/or state machines. But I want the different operations to reflect the colour of the block in Scratch, so I need a full parser.

The code editor pane uses the excellent CodeMirror library. To implement syntax highlighting, tosh presents a custom mode to CodeMirror. CodeMirror gives me a single line at a time; I split the line into tokens and return the colour of each one.

The syntax highlighting is nearly as colourful as the project itself.

So this informs the architecture of the parser: it must operate on a single line at a time, and it must use a tokenizer.

Here’s what it looks like at the moment:

The language is defined using a context-free grammar, split into two parts:
- The core grammar is defined by hand, and contains things like arithmetic.
- It’s then augmented with automatically-generated rules to define the rest of the Scratch blocks.
I did it this way to make sure I got operator precedence right, which turned out to require great care!

The grammar may also be augmented at runtime with rules for variables, parameters, and custom blocks.
Each line is first run through a tokenizer (so we can do syntax highlighting easily).
Here’s the fun part: I use Earley’s algorithm to parse the tokens using the grammar defined at initialisation. I would have used hardmath123’s nearley, which implements an Earley parser; but I need a tokenizer, and nearley is quite slow (sorry, hardmath!), so I wrote one myself instead.

Earley parsers have the nice property that they can parse any context-free grammar, with reasonable time complexity for most grammars. Writing the grammar for tosh was hard enough without trying to force it into LALR form suitable for yacc! And being able to add grammar rules at runtime is really nice; this would be harder with a parser generator.

The downside is that there’s no way to make sure that your grammar isn’t ambiguous. This is a bigger problem than it sounds: you really don’t want your language to have ambiguous parses!

It’s made worse since tosh allows user-defined variable names (which can include spaces!); it’s impossible to detect precisely whether including a variable name will interact with the rest of the Scratch blocks in a way which causes ambiguous parses. I have a plan for dealing with this, but it’s awkward and messy.
For syntax highlighting: Once I have a parse for the line, I walk down the AST and set the correct colour for each token. Then I pass the list of tokens back to CodeMirror.
For compiling: I collect up all the lines, and then run a simple recursive-descent parser over the list of ASTs. This deals with separating scripts, and builds up nested control structures.

So there’s an overview of this part of tosh’s implementation. Writing an Earley parser, and designing the language of tosh, have definitely been the most interesting bits of tosh: all I have left is boring UI programming!

Autocomplete is where things get super-interesting… but that’s a story for next time :-)