Grammar Parser Control Flow

These blog posts are getting larger all the time. What I’m essentially doing here is a brain dump of my thought process. I blog about it because anyone attempting the same would surely appreciate learning from what someone else did before them. Very rarely do I find people blogging about their work as it progresses. Think of this as my current journal; it’s very much a snapshot of what happens as the thing unfolds in my mind.

I’ve found it very helpful to approach the problem from both ends. I constantly switch between what it would look like in code and what representation would be most suitable to solve this other problem over here. The design that I want needs to make the problem simpler to deal with; otherwise it’s a useless design.

It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.

That’s a quote from Albert Einstein, commonly paraphrased as: everything should be as simple as it can be, but not simpler. I believe it is important to emphasize the not simpler part, since this, along with Occam’s razor, is often popularized as “the simplest solution is usually the right one”, which is just absurd.

What follows here is me spending my evening trying to work out a data representation, along with an algorithm, to reject ambiguous grammars but more importantly to disambiguate production rules that only appear ambiguous but are in fact not.

a b
a (b c)+
a b? c d*
a (b | c | d)

The above four simple rules are arbitrary ones that I made up to work this out in my mind. They probably don’t make much practical sense but they will do for now.

The first observation to make is that there’s an implied logical AND operation occurring in between these rules.

a & b
a & (b & c)+
a & b? & c & d*
a & (b | c | d)

This logical AND operation is what we also refer to as a short-circuit evaluation operator. The implied meaning is that if the expression on the left side is not true we do not evaluate the right side.

The second observation to make is that the production rules are Boolean expressions. This observation is only possible if we consider the original simple expression grammar and note that the recursive descent parser implementation only has two types of function calls: Accept and production evaluation (where Accept is simply a production rule for a single token).

For this reason, we use a Boolean return value for production rules and do the syntax tree construction elsewhere (on a private stack). Given what I’ve learned here, I’d say this is the better way to build a recursive descent parser because the grammar then has the most straightforward representation in code.
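
Roughly, a production rule under this scheme ends up looking something like the sketch below. This is not the actual parc code; Accept and Reduce are the helpers described in the C# example further down, and the token and node names are made up.

bool Expression() {
  // The production only answers yes/no; the syntax tree is assembled on a
  // private stack by Accept (pushing tokens) and Reduce (grouping the topmost
  // nodes under a new parent node).
  if (!PrimaryExpression()) {
    return false;
  }
  if (Accept(TokenType.Operator)) {
    if (!Expression()) {
      throw new InvalidOperationException("expected an expression after the operator");
    }
    Reduce(3, "BinaryExpression");
  }
  return true;
}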

The third observation is about control flow and ambiguity.

a & b
a & (b & c)+
a & b? & c & d*
a & (b | c | d)

Let’s say we were to implement this as a recursive descent parser (this is where nothing makes sense any longer). Well, the first is easy.

bool Test() {
  if (a()) {
    // ...
  }
  return false;
}

OK, that’s a. For simplicity we just use single letters like these to represent the production rules. Let’s see if we can fit b into this.

bool Test() {
  if (a()) {
    if (b()) {
      Reduce(2, "1");
      return true;
    }
  }
  return false;
}

OK, that took care of the first production a & b. Since this is the first production we label it "1" in the call Reduce(2, "1"). Now what? This is going to spiral out of control.

bool Test() {
  if (a()) {
    if (b()) {
      if (c()) {
        int n = 3;
        while (b()) {
          if (!c()) {
            throw SyntaxError();
          }
          n += 2;
        }
        Reduce(n, "2");
        return true;
      }
      Reduce(2, "1");
      return true;
    }
  }
  return false;
}

That was really messy, and there’s more: take a look at the third and fourth rules. They are ambiguous.

Given the input “a b” the following two rules are matches:

a & b
a & (b | c | d)

Given the input “a b c” the following two rules are matches (since d is optional):

a & (b & c)+
a & b? & c & d*

Given the input “a c” the following two rules are matches (since b and d are optional):

a & b? & c & d*
a & (b | c | d)

The issue here isn’t so much to understand that this can happen (that it won’t work if we do it like this) but that we have to have a representation that allows us to quickly see that this is going to be an issue, and that isn’t something that is inherently obvious right now.

OK, the only thing I can conclude from the above is that it is nonsensical (thus it represents a grammar that we’d have to reject), moving on.

How do we figure out if we are to reject a grammar then?

Again, this is Boolean algebra in some form or another. If we start out with the simple rule a and it is the only rule, then we can say that the production is sound. It has a single unambiguous interpretation.

Now, let’s extend the original example with a second rule a & b. We now have two rules.

a & b

If we work our way from left to right we can see that applying a results in an accepting state and the choice of applying b. We need to try b before we can return, but it’s easy to see how we do this.

a & b
a & c

Still clear what to do. This only adds another choice to the first accepting state that follows a.

a & b
a & c
a & (b | c)

Now what? How do we identify that this last construction is impossible (as is)?

This is what we have.

a & b
a & c

Get rid of the a (for reference, the way we get rid of the a is of course the same way we get rid of every other term).

(b | c)

We now have b and c and a candidate rule (b | c).

If the current candidate rule (b | c) matches any of the other rules that we’ve already accepted we must reject the candidate rule if it results in an accepting state.

OK, that’s easy with a pair of simple expressions. What about when things get more involved?

Note: at this point, I’m fairly sure that I’m about to invent my own algebra. I apologize in advance for the headaches that this will inevitably create.

I need to change up the notation a bit. We need to define a few things before we continue. When we talk about applying a production rule we’re talking about a function call in a recursive descent parser. This function call can call other functions that represent other production rules or it may accept (conditionally read) a token. These are the only two ways we can advance the parsing.

To continue this reasoning, we can generalize the accept operation as a production rule: a production rule that can accept only a single kind of token. And this is important because a production rule (as with a function) will hide some behavior underneath that may be inherently complex. The generalization here is that a production rule will only be successfully applied on a given set of token kinds. Let’s take this idea and apply it to our small expression grammar.

number = ("0" - "9") + ;
operator = "+" | "-" | "*" | "/" ;
lparen = "(" ;
rparen = ")" ;

Expression = PrimaryExpression
  | PrimaryExpression operator Expression
PrimaryExpression = number
  | lparen Expression rparen

Let’s take the two production rules Expression and PrimaryExpression. The pre-condition for them to be successfully applied is the set of tokens that they accept at the top level (currently we only examine the top level). Expression just applies PrimaryExpression, so we need to perform dependency resolution and then start with PrimaryExpression. PrimaryExpression accepts either number or lparen. The set of tokens that gate PrimaryExpression is thus number and lparen.

Let’s get back to the definition part. Any production rule has an associated set of the tokens that it accepts at the top-level. These tokens form the top-level guard set. A lonely accept operation is an implicit production rule with a single guard token.

Now, let’s get back to our example.

As we build our grammar we have the following two production rules:

A [X, Y]
B [Z]

We now introduce a third production rule that we imaginatively call C [Y]. A and B may both be infinitely complicated, but we can tell from our guard sets that A and C share the guard token Y. If C would result in a new accepting state (a state that is a complete match for a production rule) this grammar would be ambiguous.

This process is applicable regardless of optional or repeated constructs since they don’t change the contents of the guard sets.
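
A sketch of what that check might look like in code (a hypothetical fragment, not parc itself; I’m using single characters for the guard tokens and a plain dictionary for the accepted rules):

using System;
using System.Collections.Generic;
using System.Linq;

// The guard sets for the rules we've accepted so far: A [X, Y] and B [Z].
var guards = new Dictionary<string, HashSet<char>> {
  { "A", new HashSet<char> { 'X', 'Y' } },
  { "B", new HashSet<char> { 'Z' } },
};

// Rules whose guard set overlaps the candidate's. An overlap alone isn't fatal;
// the grammar is ambiguous only if the candidate can also reach an accepting
// state from the shared token.
IEnumerable<string> OverlappingRules(HashSet<char> candidate) {
  return guards.Where(rule => rule.Value.Overlaps(candidate)).Select(rule => rule.Key);
}

// Introducing C [Y] flags A for closer inspection, since A's guard set contains Y.
foreach (var name in OverlappingRules(new HashSet<char> { 'Y' })) {
  Console.WriteLine(name); // "A"
}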

There’s another reason why we care about this as well and that’s unification. By this we mean the ability to realize that two otherwise distinct production rules share some progression through a shared token. Here’s an example of what unification may look like.

Here we have two production rules A and B that share the same top-level guard set but are otherwise not ambiguous since accepting x does not result in an accepting state.

A = x y w ;
B = x z w ;
C = A | B ;

From the grammatical construction we have two productions A and B that disambiguate by their second term. If we initially thought of them as disjoint we’d probably not notice this, which is why unification is necessary. The name that we associate with a production rule is first and foremost a reference.

Let’s introduce a different notation that’s probably more representative of what we aim to achieve here. Note that C = A | B is just an alias for A or B (it’s not important).

x y w => A;
  z w => B;

This way we work our way from the far left to the far right and when we reach the accepting state => we apply our projection. This projection is as of right now inferred from the production rule but can be used to build any syntax tree of choice (abstract or otherwise).

We could very well end up with a grammar that’s like this.

x y w => A;
  z w => B;
    w => C;

Where w as a single token is a valid production somewhere. The way this would work is that we would have a control flow where a production is simply an entry point into the directed control flow graph. For each directed edge in the graph there is an associated guard set.

(x | y) z => A
x       w => B
y       w => C

The above three rules exemplify the way guard sets associated with edges in the control flow graph can change as we build up the grammar control flow graph.

Initially, we have a root node with an edge going towards z with the associated guard set [x, y]. We then introduce B, and since each path must lead to a unique production we need to split the guard set associated with the existing edge. The existing edge remains but its guard set no longer contains x. A new edge is introduced with the guard set [x] that goes to a new node with edges for z and w. Only a node that is an accepting state results in the application of a production.

Remember that the guard set is the accept (or apply) actions that we need to do in order to advance the parser. If we ever end up in a node where no edge can be taken, we have a syntax error. As soon as we reach an accept state we can yield a syntax tree (but only if there are no other choices to be made, our algorithm is a greedy one).

(figure: the grammar control flow graph, with guard sets labeling the edges and the accepting states marked)

You can see this in the drawing here. The accepting states are labeled as such and the intermediate states are just labeled sequentially (those don’t matter, other than if we get stuck there we have a syntax error). Note how the existing edges have to be copied when we split a guard set in two. When s1 is introduced we have to duplicate the [z] edge going back to the production A. And hey, what do you know, no new algebra necessary. All we needed was some good old graph theory. This also happens to be one of the first instances where I’ve found labeled edges in a directed graph to be of use.
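
A rough sketch of the data structure I have in mind (the names are mine, nothing final): labeled edges carry guard sets, and an accepting node knows which production it completes.

using System.Collections.Generic;

// A directed control flow graph where each edge carries a guard set.
class GraphEdge {
  public HashSet<string> Guard = new HashSet<string>(); // tokens that allow this transition
  public GraphNode Target;
}

class GraphNode {
  public string Accepts;                                // production name if this is an accepting state, otherwise null
  public List<GraphEdge> Edges = new List<GraphEdge>();
}

Splitting a guard set then amounts to removing the token from the existing edge’s guard, adding a new edge for that token, and copying the outgoing edges of the old target onto the new node, which is exactly the duplication of the [z] edge mentioned above.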

The goal here is not to be able to support superfluous grammars. People are not going to write grammars that run into these gnarly corner cases all the time but when they do we have a singular interpretation of what to do. The theory is used to define constraints on the solution to successfully arrive at the answer. And honestly, I cannot say that I’d arrive at the correct answer without it.

Syntax Tree Construction

This update will be about the syntax tree construction. In this case I’m trying to map out how to do it without relying on return values. I put up a quick working example in C#. Take a look.

I’m introducing a new concept which I call reduction, or just reduce, where I pop a number of syntax tree nodes from a specific stack and then add them as child nodes to a new syntax tree node that replaces them on the stack. The twist is that if we unwind the stack as-is, we’ll end up with the syntax tree nodes in reverse order, so we use a list with random-access capabilities to reverse the reversal in place. Take a look at the Reduce method in the C# example.
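
The whole reduction boils down to something like this sketch (not the exact code from the example; _stack is the List<SyntaxNode> used as the evaluation stack and SyntaxNode is a plain node type with a child list):

using System.Collections.Generic;

class SyntaxNode {
  public string Name;
  public List<SyntaxNode> Children = new List<SyntaxNode>();
  public SyntaxNode(string name) { Name = name; }
}

class Parser {
  private readonly List<SyntaxNode> _stack = new List<SyntaxNode>();

  public void Reduce(int count, string name) {
    int start = _stack.Count - count;
    var parent = new SyntaxNode(name);
    for (int i = start; i < _stack.Count; i++) {
      parent.Children.Add(_stack[i]); // random access keeps the children in left-to-right order
    }
    _stack.RemoveRange(start, count); // pop the children...
    _stack.Add(parent);               // ...and push the node that replaces them
  }
}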

In past efforts, I’ve used the return value to propagate syntax tree nodes. Optional (or conditional) production rules have then been implemented like this:

var result = SomeProduction();
if (result != null) {
  // combine with some other production rule
}

If the return value is not null we can proceed with additional production rules that expect some other production rule to precede them. Something which isn’t yet clear to me is how to treat the ambient stack when a production rule does not end up modifying it. As an example, we could have reductions that end up not changing the stack for the casual observer. A way around that is to use a version scheme: any stack modification increments the version, and comparing versions gives us what return value propagation would otherwise tell us.
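
As a sketch (again hypothetical, as additional members of the parser class above; Action here is System.Action), the version idea could be as simple as this:

private int _version;

private void Push(SyntaxNode node) { _stack.Add(node); _version++; } // every mutation, including Reduce, bumps _version

private bool Applied(Action production) {
  int before = _version;
  production();
  return before != _version; // the stack version moved, so the production did something
}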

The whole point behind this is to eliminate stack frames but this is not going to be possible if we need to support optional or repeated production rules. These would modify the stack in ways that are not predictable. Lists are a primary example, where we need to count each production that’s applied in a loop. Lists appear in some variations but generally follow this pattern:

for (;;) {
  var result = SomeProduction();
  if (result == null) {
    break;
  }
}

Looking at the Read/Accept pair in the C# example there are a couple of things I’d like to point out.

private Token _token;

private void Read() {
  if (_inputStream.MoveNext()) {
    _token = _inputStream.Current;
  } else {
    _token = default(Token);
  }
}

protected bool Accept(TokenType type) {
  if (_token.Type == type) {
    _stack.Add(new TokenNode(_token)); // push
    return true;
  }
  return false;
}

The tokens get pushed onto the syntax tree evaluation stack as we accept the tokens. Traditionally, I’ve written these kinds of recursive descent parsers like this:

var token1 = token_;
if (Accept(Token.SomeToken)) {
  var token2 = token_;
  if (Accept(Token.SomeOtherToken)) {
    // ...
  }
}

That’s rather annoying because you have to introduce a lot of local variables to keep track of each token within the respective production rule. Since the parc execution environment does not have support for local variables we’ll maintain a separate evaluation stack for syntax tree construction.

Here’s the primary rule from the same C# example. It’s very clean in that the way we reduce the syntax tree stack is to group together a bunch of syntax tree nodes based on a specific state in the parser. This works really well in practice (and we don’t depend on return value propagation). The trick is of course how to determine where to put the reduce calls.

public void Primary() {
  if (Accept(TokenType.Number)) {
    Reduce(1, "NumberLiteral");
  }
  if (Accept(TokenType.LeftParen)) {
    Expression(); // the expression between the parentheses
    if (!Accept(TokenType.RightParen)) {
      throw new InvalidOperationException();
    }
    Reduce(3, "EvalExpression");
  }
}

As things stand right now the instruction set (or byte code) we use is composed of the following instructions:

  • accept <token> push 1 if token is read otherwise push 0.
  • beq <target> if top of evaluation stack is non-zero, transfer control to target
  • jmp <target> transfer control to target
  • call <target> push current address as return address, transfer control to target
  • ret pop return address, transfer control to return address
  • err raise syntax error

I’m going to add a reduce instruction which does what the C# example does, and I’m going to add additional side effects to accept (those that modify the syntax tree evaluation stack).

I’m also introducing a fixed-size stack frame that changes with respect to the call and ret instructions. As things stand, there will be a need for some local state to keep track of production rules that read a variable number of tokens.
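
To give an idea of how little machinery this takes, here’s a hypothetical sketch of a dispatch loop over that instruction set (this is not the parc VM; the instruction encoding is made up, and the accept callback stands in for the token reader):

using System;
using System.Collections.Generic;

enum OpCode { Accept, Beq, Jmp, Call, Ret, Err }

struct Instr { public OpCode Op; public int Token; public int Target; }

class Machine {
  private readonly Func<int, bool> _accept; // conditionally reads a token of the given kind

  public Machine(Func<int, bool> accept) { _accept = accept; }

  public void Run(Instr[] code) {
    var eval = new Stack<int>();  // the 0/1 results pushed by accept
    var calls = new Stack<int>(); // return addresses pushed by call
    int pc = 0;
    while (pc >= 0 && pc < code.Length) {
      var i = code[pc];
      switch (i.Op) {
        case OpCode.Accept: eval.Push(_accept(i.Token) ? 1 : 0); pc++; break;
        case OpCode.Beq:    pc = eval.Pop() != 0 ? i.Target : pc + 1; break;
        case OpCode.Jmp:    pc = i.Target; break;
        case OpCode.Call:   calls.Push(pc + 1); pc = i.Target; break;
        case OpCode.Ret:    if (calls.Count == 0) return; pc = calls.Pop(); break;
        case OpCode.Err:    throw new InvalidOperationException("syntax error");
      }
    }
  }
}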

Control Flow And Code Generation

I’ve been here before.

What do you do when you want to write code to execute more code? How do you (in code) represent code that you can execute inside your other code and what does that look like? Essentially what we’re talking about is this. How does a compiler take a text file and produce something that the microprocessor can execute and can we write that (as well as the microprocessor) in code?

Another way to pose this question is, how do you run a syntax tree?

I wrote this grammar:

void Expression() {
  PrimaryExpression();
  if (Accept(kTokenOperator)) Expression();
}

void PrimaryExpression() {
  if (Accept(kTokenNumber)) return;
  if (Accept(kTokenLeftParenthesis)) {
    Expression();
    if (!Accept(kTokenRightParenthesis)) throw SyntaxError();
  }
}

Drew this control flow graph:

(figure: the control flow graph for the Expression and PrimaryExpression productions)

And packed everything in a data structure:

enum { kOpCodeAccept, kOpCodeApply, kOpCodeReturn };

enum { kTokenNumber, kTokenOperator, kTokenLeftParenthesis, kTokenRightParenthesis };

struct ControlFlowNode {
  int op_code_;
  int token_;
  ControlFlowNode* next_;
  ControlFlowNode* tran_;
};

And modeled the data as such:

auto primary = new ControlFlowNode;
auto lparen = new ControlFlowNode;
auto eval = new ControlFlowNode;
auto rparen = new ControlFlowNode;
auto expr = new ControlFlowNode;
auto binary = new ControlFlowNode;
auto expr2 = new ControlFlowNode;

rparen->op_code_ = kOpCodeAccept;
rparen->token_ = kTokenRightParenthesis;
rparen->tran_ = // new kOpCodeReturn node...;

eval->op_code_ = kOpCodeApply;
eval->tran_ = expr;
eval->next_ = rparen;

lparen->op_code_ = kOpCodeAccept;
lparen->token_ = kTokenLeftParenthesis;
lparen->tran_ = eval;

primary->op_code_ = kOpCodeAccept; 
primary->token_ = kTokenNumber;
primary->tran_ = // new kOpCodeReturn node...
primary->next_ = lparen;

expr2->op_code_ = kOpCodeApply;
expr2->tran_ = expr;
expr2->next_ = // new kOpCodeReturn node...

binary->op_code_ = kOpCodeAccept;
binary->token_ = kTokenOperator;
binary->tran_ = expr2;
binary->next_ = // new kOpCodeReturn node...

expr->op_code_ = kOpCodeApply;
expr->tran_ = primary;
expr->next_ = binary;

And eventually had a eureka moment.

What if I simply followed the edges (the tran_ pointers) in the control flow graph and emitted byte code until all nodes were visited? I would follow branches (kOpCodeAccept) or function calls (kOpCodeApply) first, then see if there was a next_ pointer.

bool Bind(node, map, va) {
  if (map->TryGet(node, va)) {
    return false;
  }
  map->Add(node, *va = map->GetSize());
  return true;
}

void Emit(node, map) {
  int node_va;
  Bind(node, map, &node_va);
  int tran_va;
  if (node->tran_ && Bind(node->tran_, map, &tran_va)) {
    Emit(node->tran_, map);
  }
  switch (node->op_code_) {
    case kOpCodeAccept: {
      // emit: accept node->token_, then a beq to the tran_ target
      break;
    }
    case kOpCodeApply: {
      // emit: call the tran_ target
      break;
    }
    case kOpCodeReturn: {
      // emit: ret
      break;
    }
  }
  if (node->next_) {
    int next_va;
    Bind(node->next_, map, &next_va);
    if (node_va + 1 != next_va) {
      // next_ is not laid out right after this node: emit a jmp to it
    }
    Emit(node->next_, map);
  }
}

I implemented this algorithm to generate the byte code (ran it on paper), then took the design and wrote a stack machine for it in roughly 100 lines of code. This made me fairly confident that the direction this is going in is actually going to pan out.

The result was something like this:

  accept   kTokenNumber
  beq      3
  jmp      4
  call     Expression
  jmp      6
  accept   kTokenRightParenthesis
  beq      7
  jmp      8
  accept   kTokenLeftParenthesis
  beq      5
  jmp      9
  call     PrimaryExpression
  jmp      10
  call     Expression
  jmp      12
  accept   kTokenOperator
  beq      11
  jmp      13

A complete mess, I know, but the microprocessor doesn’t care. What’s interesting is that there’s no explicit function definition. A function simply happens where it’s used and is then referred to from everywhere else. What’s even more interesting is that this byte code (or intermediate representation) can be rewritten in the form of the LLVM instruction set. If you do that, we can ask LLVM to compile (and optimize) the byte code for whatever target is needed.

A second pass over this byte code to reorganize and remove the unnecessary labels might be prudent but it’s completely optional (this will work).

DFA Construction

A couple of things before I get started.

parc is my parser compiler framework. It’s a virtual execution environment for parc grammar files. parc takes a grammar file and transforms text input into a syntax tree. You can read more about it here.

I’ve done this sort of thing in the past by hand and I find myself going back to this problem every now and then. This time, I’m attempting a more formal approach using deterministic finite automaton (DFA) construction for the lexical analysis and deterministic pushdown automaton (DPA) for recursive descent parsing.

Note: for the unfamiliar reader, lexical analysis is the step in which a sequence of characters gets grouped together as tokens. The actual parsing is then done based on a stream of tokens. Any sequence of characters that does not conform to a token is illegal.

Much to my frustration I find that there’s a gap between the formal mathematical definition and the thing that I can implement in code. I’m not sure why that is, but to me a lot of math is this ambiguous mess (crazy, right?). Part of it has to do with the fact that I never enjoyed math on its own. Many mathematical constructions are just nonsensical to me but the code isn’t; I’m much more comfortable reverse engineering an idea from code than from math. Anyway, the formal definitions for these state machines are pretty much the same (small variations of the same concept).

The formal definition of a state machine is essentially five things:

  • A finite set of states
  • A finite alphabet
  • A transition function
  • An initial state
  • A set of accept states

Math vs computation. All things math run in O(1), constant time, and disregard the fact that practical computation is based on a finite amount of bandwidth and/or memory. The theory of computation is in itself a mathematical model, but the bigger the divide between the formal definition and the practical computation, the harder it is to utilize.

Let’s start with an example regarding the definition of the alphabet. An alphabet is a convenient way of saying: we have a finite set of acceptable characters, we know which they are, and we label each individually. OK, great. Let’s define a finite universe as the Unicode code space (is this reasonable? most programming languages use ASCII only). As long as the number of characters we use is finite we’re good, but as soon as we say “accept anything but this character” we invite the entire Unicode code space into our transition table (that is, if we assume that we have a transition table that we can use with direct addressing). We’re talking megabytes of overhead per state in that table. The math disregards this problem entirely (I’m assuming that that’s exactly the point from a mathematician’s standpoint, but I’d love to see more examples of math and practical computation working together).

Here’s the first example using parc grammar (you can look up the parc grammar grammar file for reference):

token1 = "A" - "Z" ;

Which is a single character token using the label token1 that accepts any upper case ASCII letter. Now, this is easy because we know that our language only accepts these characters, but what about your typical delimited string?

string = "\"" ( ^ "\"" ) * "\"" ;

This second example poses a very real practical problem. It specifies the allowed character set (or alphabet) as the inverse of the delimiter, effectively making the whole Unicode code space part of our alphabet. That’s a lot of symbols.

If we choose to represent our DFAs as transition tables we need to have the input alphabet in one dimension and each state in the other. In this case that’s 1 MiB (of basically worthless state information) for each state (we ought to end up with very sparse transition tables).

Math is easy because it allows you to simplify at any point. Running speed and memory constraints are of little concern when you just dictate the rules.

Now, we can choose an arbitrary lambda function as the transition function and simply “abstract” the problem (which is what math does), but if we do so we end up with a linear search for the transition function and we lose the ability to analyze the finite state nature of our lexical analysis.

I guess we need to talk a bit about the actual computation. The computation models we use simply work better on a nice continuous (linear) memory space and this is why transition tables are appealing (fast lookup). However, they fail to represent negation expressions efficiently.

We can implement the lexer ourselves and write the code; it’s just a lot of boilerplate. If we do, we can work backwards from that goal to get a basic idea of the representation we may want.

if (ch == '\"') {
  while ((ch = next()) != '\"') {
  }
  ch = next();
}

The above code is a direct translation of the regular expression string. In total, we have three branches so it’s reasonable to think that we have to come up with a minimal DFA with at least three states in it.

if (('A' <= ch) & (ch <= 'Z')) {
  // ...
}

Then there’s this, the equivalent token1 code. How do we combine them?

Intuitively, we can establish that there’s no overlap between the A-Z range and the quote character, but we need a representation that allows us to easily compute set operations over character classes (borrowing from regular expression vocabulary, a character class is simply a range of characters).
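
One way to get those set operations cheaply is to treat a character class as a handful of closed intervals rather than as a table row per code point. A minimal sketch of the idea (the names and layout are mine):

using System;

struct Interval {
  public readonly int Lo, Hi; // inclusive bounds, e.g. 'A'..'Z'
  public Interval(int lo, int hi) { Lo = lo; Hi = hi; }
  public bool Contains(int ch) { return Lo <= ch && ch <= Hi; }
  public bool Overlaps(Interval other) { return Lo <= other.Hi && other.Lo <= Hi; }
}

// e.g. new Interval('A', 'Z').Overlaps(new Interval('"', '"')) is false, so token1
// and the string delimiter never compete for the same character.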

This is where we either construct artificial edge cases or think of a better example to test our assumptions with. I believe the latter is of greater value.

Let’s take the ECMAScript 5 NumericLiteral as an example (here in parc grammar).

numericLiteral = decimalLiteral
  | hexIntegerLiteral ;
decimalLiteral = decimalIntegerLiteral "." decimalDigits* exponentPart?
  | "." decimalDigits* exponentPart?
  | decimalIntegerLiteral exponentPart? ;
decimalIntegerLiteral = "0"
  | nonZeroDigit decimalDigits* ;
decimalDigits = "0" - "9" ;

The actual definition of the Numeric Literals (section 7.8.3) is a bit verbose (I’ve left out some parts). Interestingly, decimal numbers cannot have leading zeroes but integers can (tested in Chrome).

The hexIntegerLiteral definition is also a nonsensical recursive mess (it’s either that or supposed to signify repetition).

hexIntegerLiteral = "0x" hexDigit
  | "0X" hexDigit
  | hexIntegerLiteral hexDigit ;

Yeah, OK. Whatever. That’s just downright confusing.

Note (added 2015-05-15): I just learned that this is called left recursion. Left recursive grammars are more difficult to manage (for example, they cannot be parsed by just a recursive descent parser) but in my experience there is no need for left recursion. You simply rewrite your grammar in a right-recursive form which has a straightforward implementation.

You’d never do that in a parc grammar. You’d write:

hexIntegerLiteral = "0x" hexDigit+
  | "0X" hexDigit+ ;

or simply:

hexIntegerLiteral = "0" ("x" | "X") hexDigit+ ;

It’s the same thing.

Let’s list some examples that need to match the above numeric literal grammar.


What we want to be able to do is to look at the first character of any rule and check that it represents an unambiguous choice.

0 . -> decimalDigits
  x -> hexDigit
.   -> decimalDigits

So far so good, we can start with either a zero or period and then proceed with digits, period or x. Which is a small decision tree.

The thing is that this decision tree only happens if we combine the token rules. Initially we have many distinct token rules (even implicit ones that you don’t define).

Let’s simplify some.

integer = ("0" - "9")+ ;
decimal = ("0" - "9")+ "." ("0" - "9")+ ;
hex     = "0x" ("0" - "9" | "A" - "F")+ ;

What we have here are three distinct state machines that (when combined) represents the tokenization rules for our regular language (of numbers).

  = ("0" - "9")+ => integer
  | ("0" - "9")+ "." ("0" - "9")+ => decimal
  | "0x" ("0" - "9" | "A" - "F")+ => hex

Product machine construction is crazy. We get N-dimensional tuples as lookup keys and exponential growth of the number of states (just saying that the state transition table representation is bad for combining state machines).

The regular graph builder algorithm.

Initially the graph is empty.

First we add the integer token. It has one transition and one accept state.

a :: ['0', '9'] => integer
a -> a

Then we add the hex token. This requires us to split the closed interval 0-9 and add two transitions.

b :: '0' => integer
b -> a
b -> c
a :: ['0', '9'] => integer
a -> a
c :: 'x'
c -> ['0', '9'], ['A', 'F'] => hex
c -> c

This all maps very well to code but it does not describe how to dynamically execute this. We need to render a state transition table to be able to run this.

I guess we need an artificial example to actually get past this.

  = "\"" a : (^ "\"")* "\"" => string

The above expression cannot be translated into a transition table because it would blow up size-wise. However, on closer inspection the expression only has two outcomes: either the character is acceptable or it isn’t. In this case that rule is inequality. In truth, it only defines two alphabet characters, acceptable and non-acceptable (our graph or tree representations can represent this exactly).

Given s0 and "\"" we enter s1, from here we apply a predicate to determine if next character is acceptable or not. If it is, we consume that character otherwise it’s an error. If the character is "\"" we transition to s2 which is our accept state.
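
Running that three-state machine directly is only a few lines of code. Here’s a hypothetical sketch (not parc output), where the predicate ch != '"' stands in for the “anything but the delimiter” alphabet:

static class StringToken {
  public static bool Match(string input, ref int pos) {
    if (pos >= input.Length || input[pos] != '"') return false; // s0 -> s1 on the opening quote
    int i = pos + 1;
    while (i < input.Length && input[i] != '"') i++;            // s1 loops while the predicate holds
    if (i >= input.Length) return false;                        // ran out of input: syntax error
    pos = i + 1;                                                // s1 -> s2 (accept) on the closing quote
    return true;
  }
}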

Somewhere around here I ran into issues with the DFA construction. Eventually I switched to NFA construction, which was more straightforward, but the only thing that happened computationally was that I created two new problems that needed solving. The formal nature of this problem is a nice property but I can’t roll with it effectively.

I’ve already decided that parc is going to be deterministic. Thus any tokenization rule that is non-deterministic is illegal. And from experience writing lexers I know that it’s mostly about non-overlapping intervals. parc tokenization rules map very well to a non-contiguous set of closed intervals. Sometimes the interval is the complement of some closed interval (negation) but otherwise this representation is computationally sound.

Instead of doing NFA/DFA construction and minimization we can construct a minimal decision tree (and turn the lexical analysis into a search problem). This operation is pretty much a linear ordeal and can then be used as a basis for further optimization.

Running the lexical analysis from this representation is slow but straightforward. Computationally the constraints are very easy to reason about.



Reflection cache -- faster reflection with caching

I’ve been working on extending the .NET type system to support customization of data contracts. I ended up doing something like this.

class Metadata : Attribute {
    // metadata goes here...
}

abstract class Base<T> {
    public static readonly Metadata M;
    static Base() {
        M = (Metadata)Attribute.GetCustomAttribute(typeof(T), typeof(Metadata), false);
    }
}

class Concrete : Base<Concrete> {
}

You can then write code like this to access the metadata:

var m = Concrete.M;

While this all happens at run-time, it does allow you to attach validation logic to the reflection part of getting at the metadata. The validation will occur when someone tries to use the Concrete type in code (not before that).

While I used a metadata attribute to illustrate the point, it can be used with any kind of reflection. Conveniently, the strongly typed model is then used to reflect based upon some type.

You can, for example, create a generic helper method that relies on this information to do things:

public static void Do<T>()
    where T : Base<T>
{
    var m = Base<T>.M; // access metadata
    // ...
}

This ought to be faster than raw reflection since the cost of the reflection is paid once, when the type initializer for Base<T> is run, which happens once for each distinct type parameter T.

You can’t do this with interfaces (it would defeat the purpose entirely) because using an interface implies that you have an instance of a type to work with. Inheritance isn’t required but it does allow inherited base members to be accessed in the context of the derived class, which has some usefulness (depending on what you’re doing).

Lua grammar shenanigans

I’ve been working on some Visual Studio tooling for Lua over the week and just ran into my pet peeve of language design. The first time I ran into this problem was back at the university while writing my first parser. It took me a long time to understand why this didn’t work…

The Lua 5.2 grammar in question is this

var  ::= Name | pexp '[' exp ']' | pexp '.' Name
pexp ::= var | fcall | '(' exp ')'

From the Lua 5.2 Reference Manual. I’ve changed prefixexp to pexp and functioncall to fcall for brevity.

What’s so nasty about this? pexp is defined in terms of var, and var in turn is defined in terms of pexp, all without consuming input. To a human this may make sense; to a computer this is an infinite loop (or stack overflow) bug waiting to happen. Moreover, since both fcall and '(' exp ')' start with a left parenthesis, there is an ambiguity. This is a bad grammar to implement a parser from.

When you design your grammar, keep in mind that you want to be able to determine the next state of the parser by simply looking at the next symbol; grammars that allow this are commonly referred to as LL(1) grammars.

Now, var and pexp may be reused throughout the grammar, but rewriting them isn’t necessarily impossible and I suppose this is what any decent parser generator will figure out to ensure that the parser isn’t ambiguous or never ending (like here).

This sort of recursive grammar is really dubious. What’s more is that it’s meant to convey that you can reapply it to allow syntax such as a.b["c"].d() to be accepted. Is that easy to understand? What we need here is a non-left-recursive grammar.

Taking a closer look at the Lua C source you’ll find this grammar (and not the one in the reference manual)

pexp = pexp2 {pexp3}
pexp2 = '(' exp ')' | Name
pexp3 = '.' Name | '[' exp ']' | ':' Name args | Name args

Notice how each alternative here begins by consuming an input token. In this form there can be no infinite loops (unless you introduce infinite input). Notice the order of evaluation. I couldn’t tell how Lua resolved the perceived ambiguity between a function call and eval until I realized that a function call must be preceded by a name, which isn’t at all clear from the BNF grammar.
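
In recursive descent form the rewritten grammar comes out something like this sketch (hypothetical code, not Lua’s; Accept/Expect, Exp(), Args() and the token kind names are assumed helpers):

void PExp() {
  // pexp2 = '(' exp ')' | Name: always consumes a token before any recursion
  if (Accept(TokenKind.LeftParen)) {
    Exp();
    Expect(TokenKind.RightParen);
  } else {
    Expect(TokenKind.Name);
  }
  // { pexp3 }: every repetition also begins by consuming a token
  for (;;) {
    if (Accept(TokenKind.Dot)) Expect(TokenKind.Name);
    else if (Accept(TokenKind.LeftBracket)) { Exp(); Expect(TokenKind.RightBracket); }
    else if (Accept(TokenKind.Colon)) { Expect(TokenKind.Name); Args(); }
    else if (Accept(TokenKind.Name)) Args();
    else break;
  }
}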

A clear-cut case of declarative shenanigans (sometimes imperative is better). Even if more declarative constructs are expressive, powerful and productive tools, they can be difficult to understand.

printf and type inference

A fair number of bugs and security related issues can be traced to misaligned printf format strings. An #AltDevBlog post by Jaewon Jung caught my eye; he took a type-safe library approach to the problem. I think it’s a sound approach, though I was initially looking for something simpler. After all, it is a 995-line header file, and I’m also forced to target a considerably older C++ toolchain.

What did I do then?

While we can solve for the general case, there’s really only a handful of types that we are interested in. These are:

int, unsigned int
long, unsigned long
long long, unsigned long long
double
const char*
const wchar_t*
const void*

I’ve done away with smaller integral types (e.g. short) and decided that if you really want to print a single char you do so by using a string containing only that char.

So with this limited set of types we create a union type Value. Like this:

union Value
{
	int                m_Int;
	unsigned int       m_Uint;
	long               m_Long;
	unsigned long      m_Ulong;
	long long          m_LongLong;
	unsigned long long m_UlongLong;
	double             m_Double;
	const char*        m_Str;
	const wchar_t*     m_Wstr;
	const void*        m_Ptr;
};

We then introduce the wrapper class Arg.

class Arg
{
public:
	inline Arg(int value)
		: m_Type(kType_Int)
	{
		m_Value.m_Int = value;
	}
	// ... one constructor per supported type ...
	Value m_Value;
	char m_Type;
};

A bunch of varargs equivalents.

void Format(const char* format, const Arg& arg0)
{
    char ts[1] = { arg0.m_Type };
    Value vs[1] = { arg0.m_Value };
    Format(format, ts, vs);
}

void Format(const char*, const Arg&);
void Format(const char*, const Arg&, const Arg&);
void Format(const char*, const Arg&, const Arg&, const Arg&);

This is obviously the most verbose step but we almost never need more than a few varargs per format call. I’ve settled for five in my implementation.

And the final format function that does the work:

template <int Length>
void Format(const char* format, const char(&ts)[Length], const Value(&vs)[Length]);

The way we now implement Format is just a tad bit different. Traditional printf format strings use the percent % character followed by a type specifier to tell us where to format the result but with this change the type specifier is redundant.

We still have to scan the string for percent % characters but we simply forward what’s between the percent % character and the type specifier and substitute the type specifier for our own.

A Format call like this…

Format("%s", 42)

Would internally be resolved as…

Format("%i", 42)

Which is safe.

Internally we just call snprintf for each placeholder with the correct type specifier and value to render the string. We simply infer the type specifier based on the value. While this is not considered type safe, what cannot happen (assuming a correct implementation) is that a format string results in undefined behavior (bugs/exploits).

Remember, the general idea is to impose reasonable limits and write fewer bugs, not to solve the general case. You can always argue that you should test for these bugs (or optionally use good tools for this) but honestly, how many of you really test the fail condition of every external API call?

if (!CreateFile(...))
{
    Log("%s", GetLastError()); // Oops!
    return FALSE;
}

I’m not saying this isn’t something you should be testing, but in my experience this is where the format bug will take down your program or, worse, be exploited and wreak havoc.

As to the performance implications of this I consider them negligible (there’s going to be copy overhead) but I’d like to compare the assembly of this approach with other type-safe variants and run some tests to get a better picture. The good bits are that we’re relying on the standard library for most of the heavy lifting. This results in a smaller code base while offering the exact same functionality.

Format("%.3s", 1 / 2.0)
         ^ So, in this particular instance we'd keep the `.3` between the `%` and `%s` but change `%s` to `%f`.

These are a bit more subtle:

Format("%d", 1L)    // d -> li
Format("%d", 1LL)   // d -> lli
Format("%d", 1U)    // d -> u
Format("%d", 1UL)   // d -> lu
Format("%d", 1ULL)  // d -> llu

In summary, we don’t have to re-implement the formatting stuff, just fix up the mishaps (when the type specifier is not compatible with the value).

Just run the numbers!

I decided to run some tests. I have two versions of the DJB2 hash function. It’s a crafty small piece of code.

C# version:

public static int Djb2Hash(byte[] bytes)
{
    int hash = 5381;
    for (int i = 0, length = bytes.Length; i < length; i++)
    {
        hash = (hash * 33) + bytes[i];
    }
    return hash;
}

public static int Djb2HashFaster(byte[] bytes)
{
    int hash = 5381;
    for (int i = 0, length = bytes.Length; i < length; i++)
    {
        int temp = hash;
        hash = ((temp << 5) + temp) + bytes[i];
    }
    return hash;
}

C++ version:

int Djb2Hash(const char* bytes)
{
    int hash = 5381;
    for (int val = *bytes; val; val = *++bytes)
    {
        hash = (hash * 33) + val;
    }
    return hash;
}

int Djb2HashFaster(const char* bytes)
{
    int hash = 5381;
    for (int val = *bytes; val; val = *++bytes)
    {
        const int temp = hash;
        hash = ((temp << 5) + temp) + val;
    }
    return hash;
}

I ran the tests for 10 seconds. The code ran as many times as it could manage and I took the number of iterations to compute a throughput measurement.
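
The harness itself is nothing fancy; here’s a hypothetical sketch of the idea (not the exact code I used): run the function for a fixed wall-clock budget, count iterations and derive a throughput figure.

using System;
using System.Diagnostics;

static class Throughput {
  // Returns throughput in 1000ops/µs, the unit used in the table below.
  public static double Measure(Func<byte[], int> hash, byte[] input, double seconds = 10) {
    var sw = Stopwatch.StartNew();
    long iterations = 0;
    while (sw.Elapsed.TotalSeconds < seconds) {
      hash(input);
      iterations++;
    }
    double microseconds = sw.Elapsed.TotalMilliseconds * 1000.0;
    return (iterations / 1000.0) / microseconds;
  }
}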

C#                          C++
----                        ----
> test
7c9e6865  5.906 1000ops/µs  7c9e6865 0.015047 1000ops/µs
7c9e6865  4.342 1000ops/µs  7c9e6865 0.014373 1000ops/µs
1.360x                      1.046909x
> The quick brown fox jumps over the lazy dog
34cc38de 45.780 1000ops/µs  34cc38de 0.014252 1000ops/µs
34cc38de 36.901 1000ops/µs  34cc38de 0.014036 1000ops/µs
1.241x                      1.015400x

Some interesting observations. Optimized C# code ran faster than unoptimized C++ code, but C++ code compiled with compiler optimizations enabled ran between 300x-3000x faster than anything C# could muster. This test does favour native code due to the heavy array index overhead in managed code. It’s sad though, because you could easily invent abstractions that allow you to express a computation over a range of values in a type-safe manner.

I had a teacher once who told me that what you do with the code doesn’t matter much, as long as the algorithm is good. That is not bad advice, but it completely disregards what a big difference the choice of platform can make. A factor of 1000x isn’t a negligible gain.

I will say this though: micro benchmarks are freaky, they just don’t converge on anything unless stressed hard enough. So when you measure this stuff, run for a longer period of time and construct a throughput measurement; use that.

LevelDB does this in its benchmark suite. Andrei Alexandrescu also talked about something similar in his Going Native 2013 talk Writing Quick Code in C++, Quickly; you should watch it.

Game Development

I don’t know how I ended up reading through most of a blog post from early 2011 about game development, but I did, and I think reddit was somehow involved. Anyway, I thought it would be an intriguing read. What followed was a hefty comment about how absurdly out of place that blog post was/is.

It wasn’t until the next day that I discovered the post was two years old. I thought I’d publish some of my own observations on what might be different at a well managed development studio, as opposed to your average software company, and why that is.

First, there are people who know the industry and there are those who don’t. The people I’m talking about have been around a while and are among the most respected and insightful people I’ve had the opportunity to learn from or know of. These are the people that give great presentations at conferences such as GDC and help the industry move forward as a whole.

With this in mind (and mind you, I’m not one of these people) I’d like to share some of the things that I’ve found to come into stark contrast with my existing beliefs as a more traditional software engineer.

Game code is disposable, a means to an end. You’re not going to maintain that code for that long; the platform will change, the design will change. Having great people around that can write great code is much more valuable to a game studio than having the right process or discipline. This does not mean that process does not exist, it’s a different beast altogether, but what would make sense in your run-of-the-mill company doesn’t necessarily translate so well if you have to ship your game on a specific date.

Console development, specifically PS3 development, is rather absurd from a performance perspective. The code that runs well on a PS3 is very different from your general purpose software. These people know what the hardware is and they design all their work around that. You have to understand that the more you can get out of a console, the more you can spend on things that will make your game great, and this will give you a competitive edge. This is also the reason why game code will shock some people: it was not written to be perfectly readable, it was written that way because it got the job done and it was fast.

My last point is a bit harder for me to argue objectively about, but game development grew from a culture that wasn’t motivated by greed (simply making a game for the purpose of making money). These people wanted to make a game because it was part of what they believed they should do. A lot of the traditional churn that plagues software development does not exist when your reason for doing what you do is so well bespoken. These people are self-motivated talented people and they get to do what they love while many of us don’t. These people make rational decisions about what they need to do next to move forward, they don’t waste a lot of time in meetings. If they did, they’d be out of business, because they need to ship the next game.

If you want to know more about the process of game development I urge you to read some of the publications available from EA DICE or Insomniac Games (they talk about this stuff because it makes a lot of sense in the context of what they do!) I can personally recommend #GDC12 Ron Pieket: Developing Imperfect Software: The Movie.

That said, game development is not traditional software development, if you think that, you will fail at it. As to whether they are ahead or not, I’m pretty sure they’ve been ahead the rest of us from the start. You just need to know where to look.

I was wrong...

I’d like to think that, more often than not, I’m right. I’d like to think that. It doesn’t mean that I am, but I can’t help but think that it caters to some primordial need within my human mind. The need to be right, or of strong conviction, is within us, and our reason to be what we are, to protect and defend ourselves from whatever is foreign, be it an idea or something more concrete, is instinct.

This is one thing we are. Resistant to change, as if to appear, immutable.

I don’t believe, I know people can change, and people change all the time but they do not do it in the open. Some people spend countless hours in debate, arguing this and that and to my recollection, I’ve never been persuaded right there and then that the beliefs I held were wrong (or of no significant value).

Yet, it still happens and whether we like it or not we are affected by it and we can only resist change to a point. Beyond that point, resistance is futile (to quote the Borg) and it will only serve to make us unhappy.

The value of a good idea will always outweigh the bad, and no amount of legislation or bureaucracy can prevent a good idea from spreading. Nothing may change but the idea is out there, and that’s a start.

So, I’m a Software Engineer and of course, I write from that point of view.

Actions speak louder than words, talk is cheap, show me the code. Unfortunately, I don’t yet know how to code human beings, but I do know code, and lately I’ve been less worried about being wrong and more focused on what moves us forward. I’d recommend anyone reading this to check out Chet Faliszek’s presentation at Eurogamer Expo 2012, because that’s when it dawned on me that my worries about being wrong had paralyzed me to the point where it was inhibiting my productivity. I still get things done, it’s just that I’m not that adamant about getting the design right the first time; I’m wrong more often but we move at a faster pace.

To tie this together with another favoured quote, from Gabe Newell during his talk at The LBJ School of Public Affairs:

By talent – which is a word I hate – I mean the ability to be productive.

I’m rationalizing a lot of my behavior around the idea that it’s better overall if I can be productive, even though there is a larger risk of making a bad design decision somewhere. What I try to do is make sure that these things happen in isolation, so as not to have too much of an avalanche effect.

And with that I’d like to summarize with a wonderful web comic which I think also resonates very well with what I’ve been trying to say here. But before you click the next link, read the next paragraph.

Scientists (not to generalize the entire community) tend to simplify their problem, for good reason, but to a point where it becomes difficult to fit real-world problems to solutions that make sense in theory.

Feel free to follow the link now.

It’s refreshing to see someone bash on scientists for all of their conundrums. And hopefully I do not misrepresent the comic when I say that theory is nice but it’s not helpful in general, and generalizations in general aren’t helpful either. And that when we approach a problem there’s more value in being prepared to do whatever it takes, to be able to twist and turn to put one foot in front of the other.

And in doing so, we will have to concede on some points but what we realize is, that what we thought was essential, wasn’t. We can’t know this from the start.


ParC is a recursive descent parser compiler; it’s not finished but it’s currently what I’m working on.

I was really excited when Microsoft announced Oslo. I thought: finally! A generic tool I can use to play with text. But they never got any further than the CTP.

M was a nice language and the whole experience created with IntelliPad was actually nice to work with; you’d play interactively with your DSL and grammar. This was great fun. Now, I find myself needing tools like this again because writing the lexer, parser and syntax tree yourself just blows. It’s just a lot of grunt work.

So, there I was writing that lexer, AGAIN, and I thought: maybe I can solve this problem (at least to my own satisfaction) once and for all. Thus ParC was born.

What ParC will be able to do, once finished, is to take a grammar (similar to how you’d write grammar in M) and generate the lexer, parser and syntax tree (with a visitor template) in C++. You are then free to do what you will with the syntax tree.

Most tools I know of have taken the path of mixing the grammar and code and introduce what is sometimes referred to as semantic actions. ParC does not do this, it simply focuses on building that syntax tree. The benefit of going down this road is that ParC is really language agnostic. While the run-time and compiler will be in C++ it’s possible to target any language as long as you have a generator for that language (and there will be some plug-in architecture for this).

I personally like the idea of thinking of the parser compiler as a function that takes an input in the form of text and returns a structured data set. This is what I wish to do with ParC and every now and then I find myself wishing I had such a tool. To quickly be able to do something meaningful with just text.

SOLID Software Engineering Principles

Back in 2009, in StackOverflow podcast #39, Joel Spolsky started openly bashing the SOLID principles and TDD (Test Driven Development), claiming they were put together by someone who doesn’t really write a lot of code.

This made me a bit upset.

I listened through the Hanselminutes podcast in question and what I heard was quite different from what I think Joel heard, because Joel went on about the adverse effects of writing unit tests, that it’s probably time better spent doing something else, and the absurdity of turning everything into an interface.

I think this story is really interesting, and Joel actually invites Robert Martin (Uncle Bob) to the show, after he published an open letter to Joel and Jeff, which became the topic of podcast #41. And Joel starts the podcast by apologizing.

Experienced software engineers realize when they are wrong.

But it’s important to point out that Joel is also right, in that he creates a situation where you have a programmer that’s at 99% code coverage and is then presented with three choices:

  1. Get to 100% code coverage
  2. Deliver a key feature to a key customer
  3. Dramatically improve usability

Only one of the above choices is wrong.

At this point I’m 50 minutes in to the follow up podcast and I’m quite confident that they won’t actually be addressing the core issues. Joel has been going on and on about GUI testing and how unfruitful that is. And I’m tired of examples in which he appears to be trying to test layout quirks rather than stuff that matters.

It feels as if the whole discussion was geared towards using SOLID and TDD where it’s not appropriate and it’s unfortunate because they effectively fail to highlight the importance of SOLID software engineering principles.

The Pragmatic Programmer is a great read, but I think the key takeaway for any such book, is that as a programmer you should be pragmatic.

To answer the question, why SOLID and TDD? You only need take a look at the real world and make a simple observation.

Things change.

What SOLID is all about is understanding the underpinnings of SOLID software engineering principles: why modular code responds better to change and how you avoid friction while adhering to principled object-oriented design practices. It’s a lot of fancy wording to say that you can design code to not be tightly coupled and still deliver on schedule, all while quickly adapting to change in the real world.

I did say object-oriented, but actually SOLID has very little to do with object-oriented programming, it fits very well into any paradigm, because it’s good advice and you’re not a SOLID zombie, just because you care about system wide interactions.

Now, I tend to forget that the minds of people all work differently and don’t typically share the same history. I often argue my point without realizing that people might not understand where I’m coming from. So, I’d like to drive this home with, why I value, the SOLID principles.

Why SOLID designs matter

This all started with me trying to get into the habit of writing tests. I slowly realized that I couldn’t test my code effectively because the implementation was in most cases tightly coupled. There was simply no way for me to pull out the important parts and isolate them in a controlled manner, and then run meaningful tests. At the time I couldn’t motivate why I had to put up with the additional work.

To enable the above scenario I had to learn new principles and through practice and hardship I eventually got to understand the SOLID principles. With that came a new understanding and in hindsight I was already using tests to test my code just not in a very structured way.

The point of doing SOLID is that you’re gearing up to be testable by design. But to accept SOLID you don’t have to accept the entirety of TDD, you simply have to acknowledge that there’s value in being able to do automatic testing, something which you cannot do if you don’t design for it.

Testing is a way to drive SOLID designs, which is one way to tackle the constant change we have to deal with every day.

This is me

John Leidegren

Click here for a copy of my resume. There's also tonnes of information on this blog and in relation to my profile over at StackOverflow if you wanna find out even more about what I do.

Also, please keep in mind; the opinions expressed here are my own, not necessarily those of my employer.