Making a Razor Compiler - The Tokeniser

21 November 2013

I recently evaluated SemanticMerge at work during a large merge. It was very good at understanding C# code files and determining which methods had been moved around and changed, but unfortunately it had no idea about WebForms or Razor markup files. A brief discussion with a co-worker about the difficulties involved in parsing a Razor file led to me wanting to implement a parser myself. I decided to start it in TypeScript just because that's my flavour of the month. It also means I can gear it towards being a compiler for a templating engine.

The components

There are three main components involved in the compiler: the tokeniser, the parser, and the compiler itself. The tokeniser is responsible for presenting the Razor markup as individual tokens ready for the parser to consume and understand. The parser constructs a logical view of the Razor markup, which the compiler then takes to generate the executable code that will produce the final rendered markup.

I've hit a snag

At the moment the tokeniser, or token iterator, consumes whitespace and presents everything else as individual tokens.

Given the razor: <div>Hello. I'm @model.name! Nice to meet you.</div>

The tokens presented are:
< div > Hello . I ' m @ model . name ! Nice to meet you . < / div >
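As a rough sketch of that behaviour (not the real implementation: the `tokenise` name and the regular expression are assumptions for illustration), a whitespace-consuming tokeniser could look something like this:

```typescript
// Hypothetical helper: splits markup into tokens and throws whitespace away.
// A run of letters, digits or underscores becomes one token; any other
// non-whitespace character becomes a token on its own.
function tokenise(markup: string): string[] {
    return markup.match(/\w+|[^\w\s]/g) || [];
}

const tokens = tokenise("<div>Hello. I'm @model.name! Nice to meet you.</div>");
console.log(tokens.join(" "));
// < div > Hello . I ' m @ model . name ! Nice to meet you . < / div >
```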

For the HTML and code side of things this is great, because whitespace is pretty much meaningless there. Unfortunately, in blocks of text it does have meaning, and that meaning is lost in this whitespace-consuming token iterator. I naively tried to resolve this by having the parser join the tokens together with a single space character, but then the output becomes Hello . I'm with some extra spaces in there - not great.

I started to think about how I could get the iterator to include the full text phrase so that the following tokens would be presented:
< div > Hello. I'm  @ model . name ! Nice to meet you. < / div >

The problem with this is that the iterator can't judge whether or not to include more than one word in a token. As far as it's aware, meet could be a variable name, or model could just be a bit of plain text. It's not the iterator's job to know what things mean, but rather to present small tokens to the parser, which has a full understanding of each token's context in relation to the rest of the markup.

Therefore a solution is to have the iterator present the following:
< div > Hello .   I ' m   @ model . name !   Nice   to   meet   you . < / div >
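One way this might be done, as a sketch only, is to add a whitespace alternative to the tokenising pattern so that whitespace comes out as a token instead of being skipped. The `tokeniseWithWhitespace` and `isWhitespace` names are made up for this example, and treating a run of whitespace as a single token is my assumption rather than a settled decision.

```typescript
// Hypothetical helper: like a whitespace-skipping tokeniser, except that
// each run of whitespace is emitted as a token of its own.
function tokeniseWithWhitespace(markup: string): string[] {
    return markup.match(/\s+|\w+|[^\w\s]/g) || [];
}

// Hypothetical helper: true when a token consists only of whitespace.
function isWhitespace(token: string): boolean {
    return /^\s+$/.test(token);
}

const markup = "<div>Hello. I'm @model.name! Nice to meet you.</div>";
const allTokens = tokeniseWithWhitespace(markup);

// Nothing is lost: joining the tokens with no separator gives back the input.
console.log(allTokens.join("") === markup); // true

// The parser can still ignore whitespace where it has no meaning.
console.log(allTokens.filter(t => !isWhitespace(t)).join(" "));
```

Because every character ends up in exactly one token, the parser is free to decide later which whitespace matters.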

The interface to the iterator will have to be extended to allow the parser to ask for either the next token or the next non-whitespace token, but this is a straightforward change. The next question I need to answer is whether all whitespace should be preserved in the output, or only whitespace that has meaning. I'm leaning towards the latter.
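A minimal sketch of what that extended interface might look like is below. The class name, the method names `next` and `nextNonWhitespace`, and the use of `null` to signal the end of input are all assumptions for illustration, not the final design.

```typescript
// Hypothetical iterator: tokenises up front and keeps whitespace as tokens,
// then lets the caller choose whether to see it.
class TokenIterator {
    private readonly tokens: string[];
    private position = 0;

    constructor(markup: string) {
        this.tokens = markup.match(/\s+|\w+|[^\w\s]/g) || [];
    }

    // Returns the next token, whitespace included, or null at the end.
    next(): string | null {
        return this.position < this.tokens.length
            ? this.tokens[this.position++]
            : null;
    }

    // Skips over any whitespace tokens and returns the first token that
    // is not whitespace, or null at the end.
    nextNonWhitespace(): string | null {
        let token = this.next();
        while (token !== null && /^\s+$/.test(token)) {
            token = this.next();
        }
        return token;
    }
}

const iterator = new TokenIterator("Nice to meet you");
console.log(iterator.next());              // "Nice"
console.log(iterator.next());              // " "
console.log(iterator.nextNonWhitespace()); // "to"
console.log(iterator.nextNonWhitespace()); // "meet"
```

The parser would call `next` while inside a block of text, where whitespace matters, and `nextNonWhitespace` while inside HTML tags or code, where it doesn't.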