Grammar Parser Control Flow

These blog posts are getting larger all the time. What I’m essentially doing here is a brain dump of my thought process. I blog about it because anyone attempting the same would surely appreciate learning from what someone else did before them. Very rarely do I find people blogging about their work as it progresses. Think of this as my current journal; it’s very much a snapshot of what happens as the thing unfolds in my mind.

I’ve found it very helpful to approach the problem from both ends. I constantly switch between what it would look like in code and what representation would be most suitable to solve this other problem over here. The design that I want needs to make the problem simpler to deal with; otherwise it’s a useless design.

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

That’s a quote from Albert Einstein, commonly paraphrased as: everything should be as simple as it can be, but not simpler. I believe it is important to emphasize the not simpler part, since this, along with Occam’s razor, is often popularized as “the simplest solution is usually the right one”, which is just absurd.

What follows here is me spending my evening trying to work out a data representation, along with an algorithm, to reject ambiguous grammars but more importantly to disambiguate production rules that only appear ambiguous but are in fact not.

a b
a (b c)+
a b? c d*
a (b | c | d)

The above four simple rules are arbitrary ones that I made up to work this out in my mind. They probably don’t make much practical sense but they will do for now.

The first observation to make is that there’s an implied logical AND operation occurring in between these rules.

a & b
a & (b & c)+
a & b? & c & d*
a & (b | c | d)

This logical AND operation is what we also refer to as a short-circuit evaluation operator. The implied meaning is that if the expression on the left side is not true we do not evaluate the right side.

The second observation to make is that the production rules are Boolean expressions. This observation is only possible if we consider the original simple expression grammar and note that the recursive descent parser implementation only has two types of function calls: Accept and production evaluation (where Accept is simply a production rule for a single token).

For this reason, we use a Boolean return value for production rules and do the syntax tree construction elsewhere (on a private stack). Given what I’ve learned here, I’d say this is the better way to build a recursive descent parser because the grammar then has the most straightforward representation in code.
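
Roughly, a production rule under this scheme ends up looking something like the sketch below. This is not the actual parc code; Accept and Reduce are the helpers described in the C# example further down, and the token and node names are made up.

bool Expression() {
  // The production only answers yes/no; the syntax tree is assembled on a
  // private stack by Accept (pushing tokens) and Reduce (grouping the topmost
  // nodes under a new parent node).
  if (!PrimaryExpression()) {
    return false;
  }
  if (Accept(TokenType.Operator)) {
    if (!Expression()) {
      throw new InvalidOperationException("expected an expression after the operator");
    }
    Reduce(3, "BinaryExpression");
  }
  return true;
}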

The third observation is about control flow and ambiguity.

a & b
a & (b & c)+
a & b? & c & d*
a & (b | c | d)

Let’s say we were to implement this as a recursive descent parser (this is where nothing makes sense any longer). Well, the first is easy.

bool Test() {
  if (a()) {
    // ...
  }
  return false;
}

OK, that’s a. For simplicity we just use single letters like these to represent the production rules. Let’s see if we can fit b into this.

bool Test() {
  if (a()) {
    if (b()) {
      Reduce(2, "1");
      return true;
    }
  }
  return false;
}

OK, that took care of the first production a & b. Since this is the first production we label it "1" in the call Reduce(2, "1"). Now what? This is going to spiral out of control.

bool Test() {
  if (a()) {
    if (b()) {
      if (c()) {
        int n = 3;
        while (b()) {
          if (!c()) {
            throw SyntaxError();
          }
          n += 2;
        }
        Reduce(n, "2");
        return true;
      }
      Reduce(2, "1");
      return true;
    }
  }
  return false;
}

That was really messy, and there’s more: take a look at the third and fourth rules. They are ambiguous.

Given the input “a b” the following two rules are matches:

a & b
a & (b | c | d)

Given the input “a b c” the following two rules are matches (since d is optional):

a & (b & c)+
a & b? & c & d*

Given the input “a c” the following two rules are matches (since b and d are optional):

a & b? & c & d*
a & (b | c | d)

The issue here isn’t so much to understand that this can happen (that it won’t work if we do it like this) but that we have to have a representation that allows us to quickly see that this is going to be an issue, and that isn’t something that is inherently obvious right now.

OK, the only thing I can conclude from the above is that it is nonsensical (thus it represents a grammar that we’d have to reject), moving on.

How do we figure out if we are to reject a grammar then?

Again, this is Boolean algebra in some form or another. If we start out with the simple rule a and it is the only rule, then we can say that the production is sound. It has a single unambiguous interpretation.

Now, let’s extend the original example with a second rule a & b. We now have two rules.

a & b

If we work our way from left to right we can see that applying a results in an accepting state and the choice of applying b. We need to try b before we can return, but it’s easy to see how we do this.

a & b
a & c

Still clear what to do. This only adds another choice to the first accepting state that follows a.

a & b
a & c
a & (b | c)

Now what? How do we identify that this last construction is impossible (as is)?

This is what we have.

a & b
a & c

Get rid of the a (for reference, the way we get rid of the a is of course the same way we get rid of every other term).

(b | c)

We now have b and c and a candidate rule (b | c).

If the current candidate rule (b | c) matches any of the other rules that we’ve already accepted we must reject the candidate rule if it results in an accepting state.

OK, that’s easy with a pair of simple expressions. What about when things get more involved?

Note: at this point, I’m fairly sure that I’m about to invent my own algebra. I apologize in advance for the headaches that this will inevitably create.

I need to change up the notation a bit. We need to define a few things before we continue. When we talk about applying a production rule we’re talking about a function call in a recursive descent parser. This function call can call other functions that represent other production rules or it may accept (conditionally read) a token. These are the only two ways we can advance the parsing.

To continue this reasoning, we can generalize the accept operation as a production rule: a production rule that can accept only a single kind of token. And this is important because a production rule (as with a function) will hide some behavior underneath that may be inherently complex. The generalization here is that a production rule will only be successfully applied on a given set of token kinds. Let’s take this idea and apply it to our small expression grammar.

number = ("0" - "9") + ;
operator = "+" | "-" | "*" | "/" ;
lparen = "(" ;
rparen = ")" ;

Expression = PrimaryExpression
  | PrimaryExpression operator Expression
PrimaryExpression = number
  | lparen Expression rparen

Let’s take the two production rules Expression and PrimaryExpression. The pre-condition for them to be successfully applied is the set of tokens that they accept at the top level (currently we only examine the top level). Expression just applies PrimaryExpression, so we need to perform dependency resolution and then start with PrimaryExpression. PrimaryExpression accepts either number or lparen. The set of tokens that gate PrimaryExpression is thus number and lparen.

Let’s get back to the definition part. Any production rule has an associated set of the tokens that it accepts at the top-level. These tokens form the top-level guard set. A lonely accept operation is an implicit production rule with a single guard token.

Now, let’s get back to our example.

As we build our grammar we have the following two production rules:

A [X, Y]
B [Z]

We now introduce a third production rule that we imaginatively call C [Y]. A and B may both be infinitely complicated, but we can tell from our guard sets that A and C share the guard token Y. If C would result in a new accepting state (a state that is a complete match for a production rule) this grammar would be ambiguous.

This process is applicable regardless of optional or repeated constructs since they don’t change the contents of the guard sets.
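
A sketch of what that check might look like in code (a hypothetical fragment, not parc itself; I’m using single characters for the guard tokens and a plain dictionary for the accepted rules):

using System;
using System.Collections.Generic;
using System.Linq;

// The guard sets for the rules we've accepted so far: A [X, Y] and B [Z].
var guards = new Dictionary<string, HashSet<char>> {
  { "A", new HashSet<char> { 'X', 'Y' } },
  { "B", new HashSet<char> { 'Z' } },
};

// Rules whose guard set overlaps the candidate's. An overlap alone isn't fatal;
// the grammar is ambiguous only if the candidate can also reach an accepting
// state from the shared token.
IEnumerable<string> OverlappingRules(HashSet<char> candidate) {
  return guards.Where(rule => rule.Value.Overlaps(candidate)).Select(rule => rule.Key);
}

// Introducing C [Y] flags A for closer inspection, since A's guard set contains Y.
foreach (var name in OverlappingRules(new HashSet<char> { 'Y' })) {
  Console.WriteLine(name); // "A"
}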

There’s another reason why we care about this as well and that’s unification. By this we mean the ability to realize that two otherwise distinct production rules share some progression through a shared token. Here’s an example of what unification may look like.

Here we have two production rules A and B that share the same top-level guard set but are otherwise not ambiguous since accepting x does not result in an accepting state.

A = x y w ;
B = x z w ;
C = A | B ;

From the grammatical construction we have two productions A and B that disambiguate by their second term. If we initially thought of them as disjoint we’d probably not notice this, which is why unification is necessary. The name that we associate with a production rule is first and foremost a reference.

Let’s introduce a different notation that’s probably more representative of what we aim to achieve here. Note that C = A | B is just an alias for A or B (it’s not important).

x y w => A;
  z w => B;

This way we work our way from the far left to the far right and when we reach the accepting state => we apply our projection. This projection is as of right now inferred from the production rule but can be used to build any syntax tree of choice (abstract or otherwise).

We could very well end up with a grammar that’s like this.

x y w => A;
  z w => B;
    w => C;

Where w as a single token is a valid production somewhere. The way this would work is that we would have a control flow where a production is simply an entry point into the directed control flow graph. For each directed edge in the graph there is an associated guard set.

(x | y) z => A
x       w => B
y       w => C

The above three rules exemplify the way guard sets associated with edges in the control flow graph can change as we build up the grammar control flow graph.

Initially, we have a root node with an edge going towards z with the associated guard set [x, y]. We then introduce B, and since each path must lead to a unique production we need to split the guard set associated with the existing edge. The existing edge remains but its guard set no longer contains x. A new edge is introduced with the guard set [x] that goes to a new node with edges for z and w. Only a node that is an accepting state results in the application of a production.

Remember that the guard set is the accept (or apply) actions that we need to do in order to advance the parser. If we ever end up in a node where no edge can be taken, we have a syntax error. As soon as we reach an accept state we can yield a syntax tree (but only if there are no other choices to be made, our algorithm is a greedy one).

(figure: the grammar control flow graph, with guard sets labeling the edges and the accepting states marked)

You can see this in the drawing here. The accepting states are labeled as such and the intermediate states are just labeled sequentially (those don’t matter, other than if we get stuck there we have a syntax error). Note how the existing edges have to be copied when we split a guard set in two. When s1 is introduced we have to duplicate the [z] edge going back to the production A. And hey, what do you know, no new algebra necessary. All we needed was some good old graph theory. This also happens to be one of the first instances where I’ve found labeled edges in a directed graph to be of use.
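
A rough sketch of the data structure I have in mind (the names are mine, nothing final): labeled edges carry guard sets, and an accepting node knows which production it completes.

using System.Collections.Generic;

// A directed control flow graph where each edge carries a guard set.
class GraphEdge {
  public HashSet<string> Guard = new HashSet<string>(); // tokens that allow this transition
  public GraphNode Target;
}

class GraphNode {
  public string Accepts;                                // production name if this is an accepting state, otherwise null
  public List<GraphEdge> Edges = new List<GraphEdge>();
}

Splitting a guard set then amounts to removing the token from the existing edge’s guard, adding a new edge for that token, and copying the outgoing edges of the old target onto the new node, which is exactly the duplication of the [z] edge mentioned above.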

The goal here is not to be able to support superfluous grammars. People are not going to write grammars that run into these gnarly corner cases all the time but when they do we have a singular interpretation of what to do. The theory is used to define constraints on the solution to successfully arrive at the answer. And honestly, I cannot say that I’d arrive at the correct answer without it.

Syntax Tree Construction

This update will be about the syntax tree construction. In this case I’m trying to map out how to do it without relying on return values. I put up a quick working example in C#. Take a look.

I’m introducing a new concept which I call reduction, or just reduce, where I pop a number of syntax tree nodes from a specific stack and then add them as child nodes to a new syntax tree node that replaces them on the stack. The twist is that if we unwind the stack as-is, we’ll end up with the syntax tree nodes in reverse order, so we use a list with random-access capabilities to reverse the reversal in place. Take a look at the Reduce method in the C# example.
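
The whole reduction boils down to something like this sketch (not the exact code from the example; _stack is the List<SyntaxNode> used as the evaluation stack and SyntaxNode is a plain node type with a child list):

using System.Collections.Generic;

class SyntaxNode {
  public string Name;
  public List<SyntaxNode> Children = new List<SyntaxNode>();
  public SyntaxNode(string name) { Name = name; }
}

class Parser {
  private readonly List<SyntaxNode> _stack = new List<SyntaxNode>();

  public void Reduce(int count, string name) {
    int start = _stack.Count - count;
    var parent = new SyntaxNode(name);
    for (int i = start; i < _stack.Count; i++) {
      parent.Children.Add(_stack[i]); // random access keeps the children in left-to-right order
    }
    _stack.RemoveRange(start, count); // pop the children...
    _stack.Add(parent);               // ...and push the node that replaces them
  }
}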

In past efforts, I’ve used the return value to propagate syntax tree nodes. Optional (or conditional) production rules have then been implemented like this:

var result = SomeProduction();
if (result != null) {
  // combine with some other production rule
}

If the return value is not null we can proceed with additional production rules that expect some other production rule to precede them. Something which isn’t yet clear to me is how to treat the ambient stack when a production rule does not end up modifying it. As an example, we could have reductions that end up not changing the stack for the casual observer. A way around that is to use a version scheme: any stack modification increments the version, and comparing versions gives us what return value propagation would otherwise tell us.
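
As a sketch (again hypothetical, as additional members of the parser class above; Action here is System.Action), the version idea could be as simple as this:

private int _version;

private void Push(SyntaxNode node) { _stack.Add(node); _version++; } // every mutation, including Reduce, bumps _version

private bool Applied(Action production) {
  int before = _version;
  production();
  return before != _version; // the stack version moved, so the production did something
}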

The whole point behind this is to eliminate stack frames but this is not going to be possible if we need to support optional or repeated production rules. These would modify the stack in ways that are not predictable. Lists are a primary example, where we need to count each production that’s applied in a loop. Lists appear in some variations but generally follow this pattern:

for (;;) {
  var result = SomeProduction();
  if (result == null) {
    break;
  }
}

Looking at the Read/Accept pair in the C# example there are a couple of things I’d like to point out.

private Token _token;

private void Read() {
  if (_inputStream.MoveNext()) {
    _token = _inputStream.Current;
  } else {
    _token = default(Token);
  }
}

protected bool Accept(TokenType type) {
  if (_token.Type == type) {
    _stack.Add(new TokenNode(_token)); // push
    return true;
  }
  return false;
}

The tokens get pushed onto the syntax tree evaluation stack as we accept the tokens. Traditionally, I’ve written these kinds of recursive descent parsers like this:

var token1 = token_;
if (Accept(Token.SomeToken)) {
  var token2 = token_;
  if (Accept(Token.SomeOtherToken)) {
    // ...
  }
}

That’s rather annoying because you have to introduce a lot of local variables to keep track of each token within the respective production rule. Since the parc execution environment does not have support for local variables we’ll maintain a separate evaluation stack for syntax tree construction.

Here’s the primary rule from the same C# example. It’s very clean in that the way we reduce the syntax tree stack is to group together a bunch of syntax tree nodes based on a specific state in the parser. This works really well in practice (and we don’t depend on return value propagation). The trick is of course how to determine where to put the reduce calls.

public void Primary() {
  if (Accept(TokenType.Number)) {
    Reduce(1, "NumberLiteral");
  }
  if (Accept(TokenType.LeftParen)) {
    Expression(); // the expression between the parentheses
    if (!Accept(TokenType.RightParen)) {
      throw new InvalidOperationException();
    }
    Reduce(3, "EvalExpression");
  }
}

As things stand right now the instruction set (or byte code) we use is composed of the following instructions:

  • accept <token> push 1 if token is read otherwise push 0.
  • beq <target> if top of evaluation stack is non-zero, transfer control to target
  • jmp <target> transfer control to target
  • call <target> push current address as return address, transfer control to target
  • ret pop return address, transfer control to return address
  • err raise syntax error

I’m going to add a reduce instruction which does what the C# example does, and I’m going to add additional side effects to accept (those that modify the syntax tree evaluation stack).

I’m also introducing a fixed-size stack frame that changes with respect to the call and ret instructions. As things stand, there will be a need for some local state to keep track of production rules that read a variable number of tokens.
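
To give an idea of how little machinery this takes, here’s a hypothetical sketch of a dispatch loop over that instruction set (this is not the parc VM; the instruction encoding is made up, and the accept callback stands in for the token reader):

using System;
using System.Collections.Generic;

enum OpCode { Accept, Beq, Jmp, Call, Ret, Err }

struct Instr { public OpCode Op; public int Token; public int Target; }

class Machine {
  private readonly Func<int, bool> _accept; // conditionally reads a token of the given kind

  public Machine(Func<int, bool> accept) { _accept = accept; }

  public void Run(Instr[] code) {
    var eval = new Stack<int>();  // the 0/1 results pushed by accept
    var calls = new Stack<int>(); // return addresses pushed by call
    int pc = 0;
    while (pc >= 0 && pc < code.Length) {
      var i = code[pc];
      switch (i.Op) {
        case OpCode.Accept: eval.Push(_accept(i.Token) ? 1 : 0); pc++; break;
        case OpCode.Beq:    pc = eval.Pop() != 0 ? i.Target : pc + 1; break;
        case OpCode.Jmp:    pc = i.Target; break;
        case OpCode.Call:   calls.Push(pc + 1); pc = i.Target; break;
        case OpCode.Ret:    if (calls.Count == 0) return; pc = calls.Pop(); break;
        case OpCode.Err:    throw new InvalidOperationException("syntax error");
      }
    }
  }
}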

Control Flow And Code Generation

I’ve been here before.

What do you do when you want to write code to execute more code? How do you (in code) represent code that you can execute inside your other code and what does that look like? Essentially what we’re talking about is this. How does a compiler take a text file and produce something that the microprocessor can execute and can we write that (as well as the microprocessor) in code?

Another way to pose this question is, how do you run a syntax tree?

I wrote this grammar:

void Expression() {
  PrimaryExpression();
  if (Accept(kTokenOperator)) Expression();
}

void PrimaryExpression() {
  if (Accept(kTokenNumber)) return;
  if (Accept(kTokenLeftParenthesis)) {
    Expression();
    if (!Accept(kTokenRightParenthesis)) throw SyntaxError();
  }
}

Drew this control flow graph:

(figure: the control flow graph for the Expression and PrimaryExpression productions)

And packed everything in a data structure:

enum { kOpCodeAccept, kOpCodeApply, kOpCodeReturn };

enum { kTokenNumber, kTokenOperator, kTokenLeftParenthesis, kTokenRightParenthesis };

struct ControlFlowNode {
  int op_code_;
  int token_;
  ControlFlowNode* next_;
  ControlFlowNode* tran_;
};

And modeled the data as such:

auto primary = new ControlFlowNode;
auto lparen = new ControlFlowNode;
auto eval = new ControlFlowNode;
auto rparen = new ControlFlowNode;
auto expr = new ControlFlowNode;
auto binary = new ControlFlowNode;
auto expr2 = new ControlFlowNode;

rparen->op_code_ = kOpCodeAccept;
rparen->token_ = kTokenRightParenthesis;
rparen->tran_ = // new kOpCodeReturn node...;

eval->op_code_ = kOpCodeApply;
eval->tran_ = expr;
eval->next_ = rparen;

lparen->op_code_ = kOpCodeAccept;
lparen->token_ = kTokenLeftParenthesis;
lparen->tran_ = eval;

primary->op_code_ = kOpCodeAccept; 
primary->token_ = kTokenNumber;
primary->tran_ = // new kOpCodeReturn node...
primary->next_ = lparen;

expr2->op_code_ = kOpCodeApply;
expr2->tran_ = expr;
expr2->next_ = // new kOpCodeReturn node...

binary->op_code_ = kOpCodeAccept;
binary->token_ = kTokenOperator;
binary->tran_ = expr2;
binary->next_ = // new kOpCodeReturn node...

expr->op_code_ = kOpCodeApply;
expr->tran_ = primary;
expr->next_ = binary;

And eventually had a eureka moment.

What if I simply followed the edges (the tran_ pointers) in the control flow graph and emitted byte code until all nodes were visited? I would follow branches (kOpCodeAccept) or function calls (kOpCodeApply) first, then see if there was a next_ pointer.

bool Bind(node, map, va) {
  if (map->TryGet(node, va)) {
    return false;
  }
  map->Add(node, *va = map->GetSize());
  return true;
}

void Emit(node, map) {
  int node_va;
  Bind(node, map, &node_va);
  int tran_va;
  if (node->tran_ && Bind(node->tran_, map, &tran_va)) {
    Emit(node->tran_, map);
  }
  switch (node->op_code_) {
    case kOpCodeAccept: {
      // emit: accept node->token_, then a beq to the tran_ target
      break;
    }
    case kOpCodeApply: {
      // emit: call the tran_ target
      break;
    }
    case kOpCodeReturn: {
      // emit: ret
      break;
    }
  }
  if (node->next_) {
    int next_va;
    Bind(node->next_, map, &next_va);
    if (node_va + 1 != next_va) {
      // next_ is not laid out right after this node: emit a jmp to it
    }
    Emit(node->next_, map);
  }
}

I implemented this algorithm to generate the byte code (ran it on paper), then took the design and wrote a stack machine for it in roughly 100 lines of code. This made me fairly confident that the direction this is going in is actually going to pan out.

The result was something like this:

  accept   kTokenNumber
  beq      3
  jmp      4
  call     Expression
  jmp      6
  accept   kTokenRightParenthesis
  beq      7
  jmp      8
  accept   kTokenLeftParenthesis
  beq      5
  jmp      9
  call     PrimaryExpression
  jmp      10
  call     Expression
  jmp      12
  accept   kTokenOperator
  beq      11
  jmp      13

A complete mess, I know, but the microprocessor doesn’t care. What’s interesting is that there’s no explicit function definition. A function simply happens where it’s used and is then referred to from everywhere else. What’s even more interesting is that this byte code (or intermediate representation) can be rewritten in the form of the LLVM instruction set. If you do that, we can ask LLVM to compile (and optimize) the byte code for whatever target is needed.

A second pass over this byte code to reorganize and remove the unnecessary labels might be prudent but it’s completely optional (this will work).

DFA Construction

A couple of things before I get started.

parc is my parser compiler framework. It’s a virtual execution environment for parc grammar files. parc takes a grammar file and transforms text input into a syntax tree. You can read more about it here.

I’ve done this sort of thing in the past by hand and I find myself going back to this problem every now and then. This time, I’m attempting a more formal approach using deterministic finite automaton (DFA) construction for the lexical analysis and deterministic pushdown automaton (DPA) for recursive descent parsing.

Note: for the unfamiliar reader, lexical analysis is the step in which a sequence of characters gets grouped together as tokens. The actual parsing is then done based on a stream of tokens. Any sequence of characters that does not conform to a token is illegal.

Much to my frustration I find that there’s a gap between the formal mathematical definition and the thing that I can implement in code. I’m not sure why that is, but to me a lot of math is this ambiguous mess (crazy, right?). Part of it has to do with the fact that I never enjoyed math on its own. Many mathematical constructions are just nonsensical to me but the code isn’t; I’m much more comfortable reverse engineering an idea from code than from math. Anyway, the formal definitions for these state machines are pretty much the same (small variations of the same concept).

The formal definition of a state machine is essentially five things:

  • A finite set of states
  • A finite alphabet
  • A transition function
  • An initial state
  • A set of accept states

Math vs computation. All things math run in O(1), constant time, and disregard the fact that practical computation is based on a finite amount of bandwidth and/or memory. The theory of computation is in itself a mathematical model, but the bigger the divide between the formal definition and the practical computation, the harder it is to utilize.

Let’s start with an example regarding the definition of the alphabet. An alphabet is a convenient way of saying: we have a finite set of acceptable characters, we know which they are, and we label each individually. OK, great. Let’s define a finite universe as the Unicode code space (is this reasonable? most programming languages use ASCII only). As long as the number of characters we use is finite we’re good, but as soon as we say “accept anything but this character” we invite the entire Unicode code space into our transition table (that is, if we assume that we have a transition table that we can use with direct addressing). We’re talking megabytes of overhead per state in that table. The math disregards this problem entirely (I’m assuming that that’s exactly the point from a mathematician’s standpoint, but I’d love to see more examples of math and practical computation working together).

Here’s the first example using parc grammar (you can look up the parc grammar grammar file for reference):

token1 = "A" - "Z" ;

Which is a single character token using the label token1 that accepts any upper case ASCII letter. Now, this is easy because we know that our language only accepts these characters, but what about your typical delimited string?

string = "\"" ( ^ "\"" ) * "\"" ;

This second example poses a very real practical problem. It specifies the allowed character set (or alphabet) as the inverse of the delimiter, effectively making the whole Unicode code space part of our alphabet. That’s a lot of symbols.

If we choose to represent our DFAs as transition tables we need to have the input alphabet in one dimension and each state in the other. In this case that’s 1 MiB (of basically worthless state information) for each state (we ought to end up with very sparse transition tables).

Math is easy because it allows you to simplify at any point. Running speed and memory constraints are of little concern when you just dictate the rules.

Now, we can choose an arbitrary lambda function as the transition function and simply “abstract” the problem (which is what math does), but if we do so we end up with a linear search for the transition function and we lose the ability to analyze the finite state nature of our lexical analysis.

I guess we need to talk a bit about the actual computation. The computation models we use simply work better on a nice continuous (linear) memory space and this is why transition tables are appealing (fast lookup). However, they fail to represent negation expressions efficiently.

We can implement the lexer ourselves and write the code; it’s just a lot of boilerplate. If we do, we can work backwards from that goal to get a basic idea of the representation we may want.

if (ch == '\"') {
  while ((ch = next()) != '\"') {
  }
  ch = next();
}

The above code is a direct translation of the regular expression string. In total, we have three branches so it’s reasonable to think that we have to come up with a minimal DFA with at least three states in it.

if (('A' <= ch) & (ch <= 'Z')) {
  // ...
}

Then there’s this, the equivalent token1 code. How do we combine them?

Intuitively, we can establish that there’s no overlap between the A-Z range and the quote character, but we need a representation that allows us to easily compute set operations over character classes (borrowing from regular expression vocabulary, a character class is simply a range of characters).
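
One way to get those set operations cheaply is to treat a character class as a handful of closed intervals rather than as a table row per code point. A minimal sketch of the idea (the names and layout are mine):

using System;

struct Interval {
  public readonly int Lo, Hi; // inclusive bounds, e.g. 'A'..'Z'
  public Interval(int lo, int hi) { Lo = lo; Hi = hi; }
  public bool Contains(int ch) { return Lo <= ch && ch <= Hi; }
  public bool Overlaps(Interval other) { return Lo <= other.Hi && other.Lo <= Hi; }
}

// e.g. new Interval('A', 'Z').Overlaps(new Interval('"', '"')) is false, so token1
// and the string delimiter never compete for the same character.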

This is where we either construct artificial edge cases or think of a better example to test our assumptions with. I believe the latter is of greater value.

Let’s take the ECMAScript 5 NumericLiteral as an example (here in parc grammar).

numericLiteral = decimalLiteral
  | hexIntegerLiteral ;
decimalLiteral = decimalIntegerLiteral "." decimalDigits* exponentPart?
  | "." decimalDigits* exponentPart?
  | decimalIntegerLiteral exponentPart? ;
decimalIntegerLiteral = "0"
  | nonZeroDigit decimalDigits* ;
decimalDigits = "0" - "9" ;

The actual definition of the Numeric Literals (section 7.8.3) is a bit verbose (I’ve left out some parts). Interestingly, decimal numbers cannot have leading zeroes but integers can (tested in Chrome).

The hexIntegerLiteral definition is also a nonsensical recursive mess (it’s either that or supposed to signify repetition).

hexIntegerLiteral = "0x" hexDigit
  | "0X" hexDigit
  | hexIntegerLiteral hexDigit ;

Yeah, OK. Whatever. That’s just downright confusing.

Note (added 2015-05-15): I just learned that this is called left recursion. Left recursive grammars are more difficult to manage (for example, they cannot be parsed by just a recursive descent parser) but in my experience there is no need for left recursion. You simply rewrite your grammar in a right-recursive form which has a straightforward implementation.

You’d never do that in a parc grammar. You’d write:

hexIntegerLiteral = "0x" hexDigit+
  | "0X" hexDigit+ ;

or simply:

hexIntegerLiteral = "0" ("x" | "X") hexDigit+ ;

It’s the same thing.

Let’s list some examples that need to match the above numeric literal grammar.


What we want to be able to do is to look at the first character of any rule and check that it represents an unambiguous choice.

0 . -> decimalDigits
  x -> hexDigit
.   -> decimalDigits

So far so good, we can start with either a zero or period and then proceed with digits, period or x. Which is a small decision tree.

The thing is that this decision tree only happens if we combine the token rules. Initially we have many distinct token rules (even implicit ones that you don’t define).

Let’s simplify some.

integer = ("0" - "9")+ ;
decimal = ("0" - "9")+ "." ("0" - "9")+ ;
hex     = "0x" ("0" - "9" | "A" - "F")+ ;

What we have here are three distinct state machines that (when combined) represents the tokenization rules for our regular language (of numbers).

  = ("0" - "9")+ => integer
  | ("0" - "9")+ "." ("0" - "9")+ => decimal
  | "0x" ("0" - "9" | "A" - "F")+ => hex

Product machine construction is crazy. We get N-dimensional tuples as lookup keys and exponential growth of the number of states (just saying that the state transition table representation is bad for combining state machines).

The regular graph builder algorithm.

Initially the graph is empty.

First we add the integer token. It has one transition and one accept state.

a :: ['0', '9'] => integer
a -> a

Then we add the hex token. This requires us to split the closed interval 0-9 and add two transitions.

b :: '0' => integer
b -> a
b -> c
a :: ['0', '9'] => integer
a -> a
c :: 'x'
c -> ['0', '9'], ['A', 'F'] => hex
c -> c

This all maps very well to code but it does not describe how to dynamically execute this. We need to render a state transition table to be able to run this.

I guess we need an artificial example to actually get past this.

  = "\"" a : (^ "\"")* "\"" => string

The above expression cannot be translated into a transition table because it would blow up size-wise. However, on closer inspection the expression only has two outcomes: either the character is acceptable or it isn’t. In this case that rule is inequality. In truth, it only defines two alphabet characters, acceptable and non-acceptable (our graph or tree representations can represent this exactly).

Given s0 and "\"" we enter s1, from here we apply a predicate to determine if next character is acceptable or not. If it is, we consume that character otherwise it’s an error. If the character is "\"" we transition to s2 which is our accept state.
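
Running that three-state machine directly is only a few lines of code. Here’s a hypothetical sketch (not parc output), where the predicate ch != '"' stands in for the “anything but the delimiter” alphabet:

static class StringToken {
  public static bool Match(string input, ref int pos) {
    if (pos >= input.Length || input[pos] != '"') return false; // s0 -> s1 on the opening quote
    int i = pos + 1;
    while (i < input.Length && input[i] != '"') i++;            // s1 loops while the predicate holds
    if (i >= input.Length) return false;                        // ran out of input: syntax error
    pos = i + 1;                                                // s1 -> s2 (accept) on the closing quote
    return true;
  }
}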

Somewhere around here I ran into issues with the DFA construction. Eventually I switched to NFA construction, which was more straightforward, but the only thing that happened computationally was that I created two new problems that needed solving. The formal nature of this problem is a nice property but I can’t roll with it effectively.

I’ve already decided that parc is going to be deterministic. Thus any tokenization rule that is non-deterministic is illegal. And from experience writing lexers I know that it’s mostly about non-overlapping intervals. parc tokenization rules map very well to a non-contiguous set of closed intervals. Sometimes the interval is the complement of some closed interval (negation) but otherwise this representation is computationally sound.

Instead of doing NFA/DFA construction and minimization we can construct a minimal decision tree (and turn the lexical analysis into a search problem). This operation is pretty much a linear ordeal and can then be used as a basis for further optimization.

Running the lexical analysis from this representation is slow but straightforward. Computationally the constraints are very easy to reason about.



Reflection cache -- faster reflection with caching

I’ve been working on extending the .NET type system to support customization of data contracts. I ended up doing something like this.

class Metadata : Attribute {
    // metadata goes here...
}

abstract class Base<T> {
    public static readonly Metadata M;
    static Base() {
        M = (Metadata)Attribute.GetCustomAttribute(typeof(T), typeof(Metadata), false);
    }
}

class Concrete : Base<Concrete> {
}

You can then write code like this to access the metadata:

var m = Concrete.M;

While this all happens at run-time, it does allow you to attach validation logic to the reflection part of getting at the metadata. The validation will occur when someone tries to use the Concrete type in code (not before that).

While I used a metadata attribute to illustrate the point, it can be used with any kind of reflection. Conveniently, the strongly typed model is then used to reflect based upon some type.

You can, for example, create a generic helper method that relies on this information to do things:

public static void Do<T>()
    where T : Base<T>
{
    var m = Base<T>.M; // access metadata
    // ...
}

This ought to be faster than raw reflection since the cost of the reflection is paid once, when the type initializer for Base<T> is run, which happens once for each distinct type parameter T.

You can’t do this with interfaces (it would defeat the purpose entirely) because using an interface implies that you have an instance of a type to work with. Inheritance isn’t required but it does allow inherited base members to be accessed in the context of the derived class, which has some usefulness (depending on what you’re doing).

Lua grammar shenanigans

I’ve been working on some Visual Studio tooling for Lua over the week and just ran into my pet peeve of language design. The first time I ran into this problem was back at the university while writing my first parser. It took me a long time to understand why this didn’t work…

The Lua 5.2 grammar in question is this

var  ::= Name | pexp '[' exp ']' | pexp '.' Name
pexp ::= var | fcall | '(' exp ')'

From the Lua 5.2 Reference Manual. I’ve changed prefixexp to pexp and functioncall to fcall for brevity.

What’s so nasty about this? pexp is defined in terms of var, and var in turn is defined in terms of pexp, all without consuming input. To a human this may make sense; to a computer this is an infinite loop (or stack overflow) bug waiting to happen. Moreover, since both fcall and '(' exp ')' start with a left parenthesis, there is an ambiguity. This is a bad grammar to implement a parser from.

When you design your grammar, keep in mind that you want to be able to determine the next state of the parser by simply looking at the next symbol; grammars that allow this are commonly referred to as LL(1) grammars.

Now, var and pexp may be reused throughout the grammar, but rewriting them isn’t necessarily impossible and I suppose this is what any decent parser generator will figure out to ensure that the parser isn’t ambiguous or never ending (like here).

This sort of recursive grammar is really dubious. What’s more is that it’s meant to convey that you can reapply it to allow syntax such as a.b["c"].d() to be accepted. Is that easy to understand? What we need here is a non-left-recursive grammar.

Taking a closer look at the Lua C source you’ll find this grammar (and not the one in the reference manual)

pexp = pexp2 {pexp3}
pexp2 = '(' exp ')' | Name
pexp3 = '.' Name | '[' exp ']' | ':' Name args | Name args

Notice how each alternative here begins by consuming an input token. In this form there can be no infinite loops (unless you introduce infinite input). Notice the order of evaluation. I couldn’t tell how Lua resolved the perceived ambiguity between a function call and eval until I realized that a function call must be preceded by a name, which isn’t at all clear from the BNF grammar.
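
In recursive descent form the rewritten grammar comes out something like this sketch (hypothetical code, not Lua’s; Accept/Expect, Exp(), Args() and the token kind names are assumed helpers):

void PExp() {
  // pexp2 = '(' exp ')' | Name: always consumes a token before any recursion
  if (Accept(TokenKind.LeftParen)) {
    Exp();
    Expect(TokenKind.RightParen);
  } else {
    Expect(TokenKind.Name);
  }
  // { pexp3 }: every repetition also begins by consuming a token
  for (;;) {
    if (Accept(TokenKind.Dot)) Expect(TokenKind.Name);
    else if (Accept(TokenKind.LeftBracket)) { Exp(); Expect(TokenKind.RightBracket); }
    else if (Accept(TokenKind.Colon)) { Expect(TokenKind.Name); Args(); }
    else if (Accept(TokenKind.Name)) Args();
    else break;
  }
}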

A clear-cut case of declarative shenanigans (sometimes imperative is better). Even if more declarative constructs are expressive, powerful and productive tools, they can be difficult to understand.

printf and type inference

A fair number of bugs and security related issues can be traced to misaligned printf format strings. An #AltDevBlog post by Jaewon Jung caught my eye; he took a type-safe library approach to the problem. I think it’s a sound approach, though I was initially looking for something simpler. After all, it is a 995-line header file, and I’m also forced to target a considerably older C++ toolchain.

What did I do then?

While we can solve for the general case, there’s really only a handful of types that we are interested in. These are:

int, unsigned int
long, unsigned long
long long, unsigned long long
double
const char*
const wchar_t*
const void*

I’ve done away with smaller integral types (e.g. short) and decided that if you really want to print a single char you do so by using a string containing only that char.

So with this limited set of types we create a union type Value. Like this:

union Value
{
	int                m_Int;
	unsigned int       m_Uint;
	long               m_Long;
	unsigned long      m_Ulong;
	long long          m_LongLong;
	unsigned long long m_UlongLong;
	double             m_Double;
	const char*        m_Str;
	const wchar_t*     m_Wstr;
	const void*        m_Ptr;
};

We then introduce the wrapper class Arg.

class Arg
{
public:
	inline Arg(int value)
		: m_Type(kType_Int)
	{
		m_Value.m_Int = value;
	}
	// ... one constructor per supported type ...
	Value m_Value;
	char m_Type;
};

A bunch of varargs equivalents.

void Format(const char* format, const Arg& arg0)
{
    char ts[1] = { arg0.m_Type };
    Value vs[1] = { arg0.m_Value };
    Format(format, ts, vs);
}

void Format(const char*, const Arg&);
void Format(const char*, const Arg&, const Arg&);
void Format(const char*, const Arg&, const Arg&, const Arg&);

This is obviously the most verbose step but we almost never need more than a few varargs per format call. I’ve settled for five in my implementation.

And the final format function that does the work:

template <int Length>
void Format(const char* format, const char(&ts)[Length], const Value(&vs)[Length]);

The way we now implement Format is just a tad bit different. Traditional printf format strings use the percent % character followed by a type specifier to tell us where to format the result but with this change the type specifier is redundant.

We still have to scan the string for percent % characters but we simply forward what’s between the percent % character and the type specifier and substitute the type specifier for our own.

A Format call like this…

Format("%s", 42)

Would internally be resolved as…

Format("%i", 42)

Which is safe.

Internally we just call snprintf for each placeholder with the correct type specifier and value to render the string. We simply infer the type specifier based on the value. While this is not considered type safe, what cannot happen (assuming a correct implementation) is that a format string results in undefined behavior (bugs/exploits).

Remember, the general idea is to impose reasonable limits and write fewer bugs, not to solve the general case. You can always argue that you should test for these bugs (or optionally use good tools for this) but honestly, how many of you really test the fail condition of every external API call?

if (!CreateFile(...))
{
    Log("%s", GetLastError()); // Oops!
    return FALSE;
}

I’m not saying this isn’t something you should be testing, but in my experience this is where the format bug will take down your program or, worse, be exploited and wreak havoc.

As to the performance implications of this I consider them negligible (there’s going to be copy overhead) but I’d like to compare the assembly of this approach with other type-safe variants and run some tests to get a better picture. The good bits are that we’re relying on the standard library for most of the heavy lifting. This results in a smaller code base while offering the exact same functionality.

Format("%.3s", 1 / 2.0)
         ^ So, in this particular instance we'd keep the `.3` between the `%` and `%s` but change `%s` to `%f`.

These are a bit more subtle:

Format("%d", 1L)    // d -> li
Format("%d", 1LL)   // d -> lli
Format("%d", 1U)    // d -> u
Format("%d", 1UL)   // d -> lu
Format("%d", 1ULL)  // d -> llu

In summary, we don’t have to re-implement the formatting stuff, just fix up the mishaps (when the type specifier is not compatible with the value).

Just run the numbers!

I decided to run some tests. I have two versions of the DJB2 hash function. It’s a crafty small piece of code.

C# version:

public static int Djb2Hash(byte[] bytes)
{
    int hash = 5381;
    for (int i = 0, length = bytes.Length; i < length; i++)
    {
        hash = (hash * 33) + bytes[i];
    }
    return hash;
}

public static int Djb2HashFaster(byte[] bytes)
{
    int hash = 5381;
    for (int i = 0, length = bytes.Length; i < length; i++)
    {
        int temp = hash;
        hash = ((temp << 5) + temp) + bytes[i];
    }
    return hash;
}

C++ version:

int Djb2Hash(const char* bytes)
{
    int hash = 5381;
    for (int val = *bytes; val; val = *++bytes)
    {
        hash = (hash * 33) + val;
    }
    return hash;
}

int Djb2HashFaster(const char* bytes)
{
    int hash = 5381;
    for (int val = *bytes; val; val = *++bytes)
    {
        const int temp = hash;
        hash = ((temp << 5) + temp) + val;
    }
    return hash;
}

I ran the tests for 10 seconds. The code ran as many times as it could manage and I took the number of iterations to compute a throughput measurement.
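
The harness itself is nothing fancy; here’s a hypothetical sketch of the idea (not the exact code I used): run the function for a fixed wall-clock budget, count iterations and derive a throughput figure.

using System;
using System.Diagnostics;

static class Throughput {
  // Returns throughput in 1000ops/µs, the unit used in the table below.
  public static double Measure(Func<byte[], int> hash, byte[] input, double seconds = 10) {
    var sw = Stopwatch.StartNew();
    long iterations = 0;
    while (sw.Elapsed.TotalSeconds < seconds) {
      hash(input);
      iterations++;
    }
    double microseconds = sw.Elapsed.TotalMilliseconds * 1000.0;
    return (iterations / 1000.0) / microseconds;
  }
}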

C#                          C++
----                        ----
> test
7c9e6865  5.906 1000ops/µs  7c9e6865 0.015047 1000ops/µs
7c9e6865  4.342 1000ops/µs  7c9e6865 0.014373 1000ops/µs
1.360x                      1.046909x
> The quick brown fox jumps over the lazy dog
34cc38de 45.780 1000ops/µs  34cc38de 0.014252 1000ops/µs
34cc38de 36.901 1000ops/µs  34cc38de 0.014036 1000ops/µs
1.241x                      1.015400x

Some interesting observations. Optimized C# code ran faster than unoptimized C++ code, but C++ code compiled with compiler optimizations enabled ran between 300x-3000x faster than anything C# could muster. This test does favour native code due to the heavy array index overhead in managed code. It’s sad though, because you could easily invent abstractions that allow you to express a computation over a range of values in a type-safe manner.

I had a teacher once who told me that what you do with the code doesn’t matter much, as long as the algorithm is good. That is not bad advice, but it completely disregards what a big difference the choice of platform can make. A factor of 1000x isn’t a negligible gain.

I will say this though: micro benchmarks are freaky, they just don’t converge on anything unless stressed hard enough. So when you measure this stuff, run for a longer period of time and construct a throughput measurement; use that.

LevelDB does this in its benchmark suite. Andrei Alexandrescu also talked about something similar in his Going Native 2013 talk Writing Quick Code in C++, Quickly; you should watch it.

Game Development

I don’t know how I ended up reading through most of a blog post from early 2011 about game development, but I did, and I think reddit was somehow involved. Anyway, I thought it would be an intriguing read. What followed was a hefty comment about how absurdly out of place that blog post was/is.

It wasn’t until the next day that I discovered the post was two years old. I thought I’d publish some of my own observations on what might be different at a well managed development studio, as opposed to your average software company, and why that is.

First, there are people who know the industry and there are those who don’t. The people I’m talking about have been around a while and are among the most respected and insightful people I’ve had the opportunity to learn from or know of. These are the people that give great presentations at conferences such as GDC and help the industry move forward as a whole.

With this in mind (and mind you, I’m not one of these people) I’d like to share some of the things that I’ve found to come into stark contrast with my existing beliefs as a more traditional software engineer.

Game code is disposable, a means to an end. You’re not going to maintain that code for that long; the platform will change, the design will change. Having great people around that can write great code is much more valuable to a game studio than having the right process or discipline. This does not mean that process does not exist, it’s a different beast altogether, but what would make sense in your run-of-the-mill company doesn’t necessarily translate so well if you have to ship your game on a specific date.

Console development, specifically PS3 development, is rather absurd from a performance perspective. The code that runs well on a PS3 is very different from your general purpose software. These people know what the hardware is and they design all their work around that. You have to understand that the more you can get out of a console, the more you can spend on things that will make your game great, and this will give you a competitive edge. This is also the reason why game code will shock some people: it was not written to be perfectly readable, it was written that way because it got the job done and it was fast.

My last point is a bit harder for me to argue objectively about, but game development grew from a culture that wasn’t motivated by greed (simply making a game for the purpose of making money). These people wanted to make a game because it was part of what they believed they should do. A lot of the traditional churn that plagues software development does not exist when your reason for doing what you do is so well bespoken. These people are self-motivated talented people and they get to do what they love while many of us don’t. These people make rational decisions about what they need to do next to move forward, they don’t waste a lot of time in meetings. If they did, they’d be out of business, because they need to ship the next game.

If you want to know more about the process of game development I urge you to read some of the publications available from EA DICE or Insomniac Games (they talk about this stuff because it makes a lot of sense in the context of what they do!) I can personally recommend #GDC12 Ron Pieket: Developing Imperfect Software: The Movie.

That said, game development is not traditional software development, if you think that, you will fail at it. As to whether they are ahead or not, I’m pretty sure they’ve been ahead the rest of us from the start. You just need to know where to look.

I was wrong...

I’d like to think that, more often than not, I’m right. I’d like to think that. It doesn’t mean that I am, but I can’t help but think that it caters to some primordial need within my human mind. The need to be right, or of strong conviction, is within us, and our reason to be what we are, to protect and defend ourselves from whatever is foreign, be it an idea or something more concrete, is instinct.

This is one thing we are. Resistant to change, as if to appear, immutable.

I don’t believe, I know people can change, and people change all the time but they do not do it in the open. Some people spend countless hours in debate, arguing this and that and to my recollection, I’ve never been persuaded right there and then that the beliefs I held were wrong (or of no significant value).

Yet, it still happens and whether we like it or not we are affected by it and we can only resist change to a point. Beyond that point, resistance is futile (to quote the Borg) and it will only serve to make us unhappy.

The value of a good idea will always outweigh the bad, and no amount of legislation or bureaucracy can prevent a good idea from spreading. Nothing may change but the idea is out there, and that’s a start.

So, I’m a Software Engineer and of course, I write from that point of view.

Actions speak louder than words, talk is cheap, show me the code. Unfortunately, I don’t yet know how to code human beings, but I do know code, and lately I’ve been less worried about being wrong and more focused on what moves us forward. I’d recommend anyone reading this to check out Chet Faliszek’s presentation at Eurogamer Expo 2012, because that’s when it dawned on me that my worries about being wrong had paralyzed me to the point where it was inhibiting my productivity. I still get things done, it’s just that I’m not that adamant about getting the design right the first time; I’m wrong more often but we move at a faster pace.

To tie this together with another favoured quote, from Gabe Newell during his talk at The LBJ School of Public Affairs:

By talent – which is a word I hate – I mean the ability to be productive.

I’m rationalizing a lot of my behavior around the idea that it’s better overall if I can be productive, even though there is a larger risk of making a bad design decision somewhere. What I try to do is make sure that these things happen in isolation, so as not to have too much of an avalanche effect.

And with that I’d like to summarize with a wonderful web comic which I think also resonates very well with what I’ve been trying to say here. But before you click the next link, read the next paragraph.

Scientists (not to generalize the entire community) tend to simplify their problem, for good reason, but to a point where it becomes difficult to fit real-world problems to solutions that make sense in theory.

Feel free to follow the link now.

It’s refreshing to see someone bash on scientists for all of their conundrums. And hopefully I do not misrepresent the comic when I say that theory is nice but it’s not helpful in general, and generalizations in general aren’t helpful either. And that when we approach a problem there’s more value in being prepared to do whatever it takes, to be able to twist and turn to put one foot in front of the other.

And in doing so, we will have to concede on some points but what we realize is, that what we thought was essential, wasn’t. We can’t know this from the start.


ParC is a recursive descent parser compiler; it’s not finished but it’s currently what I’m working on.

I was really excited when Microsoft announced Oslo. I thought: finally! A generic tool I can use to play with text. But they never got any further than the CTP.

M was a nice language and the whole experience created with IntelliPad was actually nice to work with; you’d play interactively with your DSL and grammar. This was great fun. Now, I find myself needing tools like this again because writing the lexer, parser and syntax tree yourself just blows. It’s just a lot of grunt work.

So, there I was writing that lexer, AGAIN, and I thought: maybe I can solve this problem (at least to my own satisfaction) once and for all. Thus ParC was born.

What ParC will be able to do, once finished, is to take a grammar (similar to how you’d write grammar in M) and generate the lexer, parser and syntax tree (with a visitor template) in C++. You are then free to do what you will with the syntax tree.

Most tools I know of have taken the path of mixing the grammar and code and introduce what is sometimes referred to as semantic actions. ParC does not do this, it simply focuses on building that syntax tree. The benefit of going down this road is that ParC is really language agnostic. While the run-time and compiler will be in C++ it’s possible to target any language as long as you have a generator for that language (and there will be some plug-in architecture for this).

I personally like the idea of thinking of the parser compiler as a function that takes an input in the form of text and returns a structured data set. This is what I wish to do with ParC and every now and then I find myself wishing I had such a tool. To quickly be able to do something meaningful with just text.

SOLID Software Engineering Principles

Back in 2009, in StackOverflow podcast #39, Joel Spolsky started openly bashing the SOLID principles and TDD (Test Driven Development), claiming they were put together by someone who doesn’t really write a lot of code.

This made me a bit upset.

I listened through the Hanselminutes podcast in question and what I heard was quite different from what I think Joel heard, because Joel went on about the adverse effects of writing unit tests, that it’s probably time better spent doing something else, and the absurdity of turning everything into an interface.

I think this story is really interesting, and Joel actually invites Robert Martin (Uncle Bob) to the show, after he published an open letter to Joel and Jeff, which became the topic of podcast #41. And Joel starts the podcast by apologizing.

Experienced software engineers realize when they are wrong.

But it’s important to point out that Joel is also right, in that he creates a situation where you have a programmer that’s at 99% code coverage and is then presented with three choices:

  1. Get to 100% code coverage
  2. Deliver a key feature to a key customer
  3. Dramatically improve usability

Only one of the above choices is wrong.

At this point I’m 50 minutes in to the follow up podcast and I’m quite confident that they won’t actually be addressing the core issues. Joel has been going on and on about GUI testing and how unfruitful that is. And I’m tired of examples in which he appears to be trying to test layout quirks rather than stuff that matters.

It feels as if the whole discussion was geared towards using SOLID and TDD where it’s not appropriate and it’s unfortunate because they effectively fail to highlight the importance of SOLID software engineering principles.

The Pragmatic Programmer is a great read, but I think the key takeaway for any such book, is that as a programmer you should be pragmatic.

To answer the question, why SOLID and TDD? You only need take a look at the real world and make a simple observation.

Things change.

What SOLID is all about is understanding the underpinnings of SOLID software engineering principles: why modular code responds better to change and how you avoid friction while adhering to principled object-oriented design practices. It’s a lot of fancy wording to say that you can design code to not be tightly coupled and still deliver on schedule, all while quickly adapting to change in the real world.

I did say object-oriented, but actually SOLID has very little to do with object-oriented programming, it fits very well into any paradigm, because it’s good advice and you’re not a SOLID zombie, just because you care about system wide interactions.

Now, I tend to forget that the minds of people all work differently and don’t typically share the same history. I often argue my point without realizing that people might not understand where I’m coming from. So, I’d like to drive this home with, why I value, the SOLID principles.

Why SOLID designs matter

This all started with me trying to get into the habit of writing tests. I slowly realized that I couldn’t test my code effectively because the implementation was in most cases tightly coupled. There was simply no way for me to pull out the important parts and isolate them in a controlled manner, and then run meaningful tests. At the time I couldn’t motivate why I had to put up with the additional work.

To enable the above scenario I had to learn new principles and through practice and hardship I eventually got to understand the SOLID principles. With that came a new understanding and in hindsight I was already using tests to test my code just not in a very structured way.

The point of doing SOLID is that you’re gearing up to be testable by design. But to accept SOLID you don’t have to accept the entirety of TDD, you simply have to acknowledge that there’s value in being able to do automatic testing, something which you cannot do if you don’t design for it.

Testing is a way to drive SOLID designs, which is one way to tackle the constant change we have to deal with every day.

This is me

John Leidegren

Click here for a copy of my resume. There's also tonnes of information on this blog and in relation to my profile over at StackOverflow if you wanna find out even more about what I do.

Also, please keep in mind; the opinions expressed here are my own, not necessarily those of my employer.