¶ Function-based State Machines

AKA "Monadic Pushdown Automata"

Writing inflate for js-git posed a difficult problem: how to write an asynchronous byte-wise parser that is A) fast and B) doesn't blow out the stack on large contiguous chunks of bytes. Writing your parser to prefer synchronous calls exposes you to the danger of running out of stack space, though it'll be relatively faster than the alternative -- clearing the stack between each loop iteration using setImmediate, process.nextTick or setZeroTimeout. And as we've seen, function calls themselves aren't free approach) is much slower than synchronous approaches, but prevents your parser from running out of stack space; while prefering a synchronous approach by keeping track of the stack using a counter is tricky at best and usually still error-prone.

In the process of writing inflate, I stumbled onto a pattern which creationix later refined (as he is so talented at doing!) which I've taken to calling the "function based state machine" pattern. It's great at handling asynchronous, stateful parsing of binary input while fulfilling the constraints of being A) fast and B) not blowing out the stack.

¶ Backing up

Before I go into what a FBSM looks like, I'd like to illustrate the progression of the problem. First, here's a very traditional, completely synchronous tokenizer:

var STATE_READY = 0
  , STATE_OPERATOR = 1
  , STATE_STRING = 2
  , STATE_NUMBER = 3
  // and so on...

function tokenize(input) {
  var state = STATE_READY
    , output = []

  var escaped = false
    , current = []

  for(var i = 0, ch = input[i]; i < input.length; ++i) {
    switch(state) {
      case STATE_STRING:
        if(ch === '\\') escaped = true
        if(ch === '"' && !escaped) {
          output.push(current.join(''))
          current = []
          state = STATE_READY
        } else {
          current.push(ch)
        }
      break
      case STATE_READY:
        --i
        state = determine(ch)   // switch to an appropriate state
      break
    }
  }

  return output
}

States are represented as constants; there's a loop that iterates over every byte of the input, and performs behavior based on the current state. A list of tokens is produced and returned. This is sufficient when the input size is small enough that it can fit entirely into memory. However, requiring the entire input to produce output is inefficient and potentially dangerous:

The parser could be doing work on the CPU while whatever input source is buffering up enough data to produce the next chunk.
Maliciously sized input could cause the parser to run out of memory.

¶ A Tiny FBSM

Function-based state machines are far more composable. To start, let's define a basic machine for parsing hexadecimal input:

function hex(expect_len, emit) {
  var num = 0

  return take

  function take(byte) {
    var base = 0

    if(byte >= 48 && byte <= 57) {  // 0-9
      byte -= 48
    } else {
      base = 10
      if(byte >= 65 && byte <= 70) {  // A-F
        byte -= 65
      } else if(byte >= 97 && byte <= 102) { // a-f
        byte -= 97
      } else {
        throw new Error("unexpected " + String.fromCharCode(byte))
      }
    }

    // multiply num by 16, add base + byte. 
    num = (num << 4) + byte + base

    if(!--expectLen) {
      return emit(num)
    }

    return take
  }
}

This is a state machine that can take a pluggable path when it exits. It both delivers the parsed value via emit as well as sends the next state to the runner. A simple runner would look something like this:

var current = hex(4, function(value) {
  console.log(value)
  return hex
})

while(input.length) {
  current = current(input.shift())
}

Which, assuming input is an array of bytes representing hexadecimal characters, would print out the integer value of each 4 bytes. Even better, as part of a streaming system, we have control over when we send bytes to the hex machine: if we get oddly sized chunks of byte data, the hex parser doesn't have to care about that!

Here's an example of embedding our hex state machine into a larger state machine:

function string(until_byte, emit) {
  var accum = []

  return chars

  function chars(byte) {
    // `until_byte` will be one of `"` or `'`
    if(byte === until_byte) {
      return emit(accum.join(''))
    }

    if(byte === 92) { // \
      return escaped
    }
  }

  // Most of the time, we'll simply be storing bytes so long as they're not our terminal
  // byte, or the backslash escape. When we do encounter a backslash:
  function escaped(byte) {

    // "\\" or "\'"
    if(byte === 92 || byte === until_byte) {
      accum.push(byte)

      return chars
    }

    if(byte === 117) { // u
      return hex(4, _on_codepoint)
    }

    if(byte === 120) { // x
      return hex(2, _on_codepoint)
    }

    if(byte === 110) { // n
      accum.push('\n')

      return chars
    }

    // handle \t, \r, and friends
    // ...

    throw new Error("unexpected "+byte)
  }

  // We switch on the value of the next character and store the appropriate item --
  // or we enter the `hex` machine with a custom emit value! Once `hex` emits, it'll
  // call `_on_codepoint`, and we'll simply store the new value and continue on.
  function _on_codepoint(value) {
    accum.push(String.fromCharCode(value))

    return chars
  }
}

¶ A More Intense Application

Often, byte-oriented parsers will define a number of bytes or bits to take before returning to a new state.