¶ Function-based State Machines
AKA "Monadic Pushdown Automata"
Writing inflate for js-git posed a difficult problem: how to
write an asynchronous byte-wise parser that is A) fast and B) doesn't
blow out the stack on large contiguous chunks of bytes. Writing your
parser to prefer synchronous calls exposes you to the danger of running
out of stack space, though it'll be relatively faster than the alternative --
clearing the stack between each loop iteration using
setImmediate
, process.nextTick
or setZeroTimeout
.
And as we've seen, function calls themselves aren't free
approach) is much slower than synchronous approaches, but prevents your
parser from running out of stack space; while prefering a synchronous approach
by keeping track of the stack using a counter is tricky at best and usually
still error-prone.
In the process of writing inflate, I stumbled onto a pattern which creationix later refined (as he is so talented at doing!) which I've taken to calling the "function based state machine" pattern. It's great at handling asynchronous, stateful parsing of binary input while fulfilling the constraints of being A) fast and B) not blowing out the stack.
¶ Backing up
Before I go into what a FBSM looks like, I'd like to illustrate the progression of the problem. First, here's a very traditional, completely synchronous tokenizer:
var STATE_READY = 0
, STATE_OPERATOR = 1
, STATE_STRING = 2
, STATE_NUMBER = 3
// and so on...
function tokenize(input) {
var state = STATE_READY
, output = []
var escaped = false
, current = []
for(var i = 0, ch = input[i]; i < input.length; ++i) {
switch(state) {
case STATE_STRING:
if(ch === '\\') escaped = true
if(ch === '"' && !escaped) {
output.push(current.join(''))
current = []
state = STATE_READY
} else {
current.push(ch)
}
break
case STATE_READY:
--i
state = determine(ch) // switch to an appropriate state
break
}
}
return output
}
States are represented as constants; there's a loop that iterates over every byte of the input, and performs behavior based on the current state. A list of tokens is produced and returned. This is sufficient when the input size is small enough that it can fit entirely into memory. However, requiring the entire input to produce output is inefficient and potentially dangerous:
- The parser could be doing work on the CPU while whatever input source is buffering up enough data to produce the next chunk.
- Maliciously sized input could cause the parser to run out of memory.
¶ A Tiny FBSM
Function-based state machines are far more composable. To start, let's define a basic machine for parsing hexadecimal input:
function hex(expect_len, emit) {
var num = 0
return take
function take(byte) {
var base = 0
if(byte >= 48 && byte <= 57) { // 0-9
byte -= 48
} else {
base = 10
if(byte >= 65 && byte <= 70) { // A-F
byte -= 65
} else if(byte >= 97 && byte <= 102) { // a-f
byte -= 97
} else {
throw new Error("unexpected " + String.fromCharCode(byte))
}
}
// multiply num by 16, add base + byte.
num = (num << 4) + byte + base
if(!--expectLen) {
return emit(num)
}
return take
}
}
This is a state machine that can take a pluggable path when it exits.
It both delivers the parsed value via emit
as well as sends the
next state to the runner. A simple runner would look something like this:
var current = hex(4, function(value) {
console.log(value)
return hex
})
while(input.length) {
current = current(input.shift())
}
Which, assuming input
is an array of bytes representing hexadecimal characters,
would print out the integer value of each 4 bytes. Even better, as part of a
streaming system, we have control over when we send bytes to the hex machine:
if we get oddly sized chunks of byte data, the hex parser doesn't have to care
about that!
Here's an example of embedding our hex
state machine into a larger state machine:
function string(until_byte, emit) {
var accum = []
return chars
function chars(byte) {
// `until_byte` will be one of `"` or `'`
if(byte === until_byte) {
return emit(accum.join(''))
}
if(byte === 92) { // \
return escaped
}
}
// Most of the time, we'll simply be storing bytes so long as they're not our terminal
// byte, or the backslash escape. When we do encounter a backslash:
function escaped(byte) {
// "\\" or "\'"
if(byte === 92 || byte === until_byte) {
accum.push(byte)
return chars
}
if(byte === 117) { // u
return hex(4, _on_codepoint)
}
if(byte === 120) { // x
return hex(2, _on_codepoint)
}
if(byte === 110) { // n
accum.push('\n')
return chars
}
// handle \t, \r, and friends
// ...
throw new Error("unexpected "+byte)
}
// We switch on the value of the next character and store the appropriate item --
// or we enter the `hex` machine with a custom emit value! Once `hex` emits, it'll
// call `_on_codepoint`, and we'll simply store the new value and continue on.
function _on_codepoint(value) {
accum.push(String.fromCharCode(value))
return chars
}
}
¶ A More Intense Application
Often, byte-oriented parsers will define a number of bytes or bits to take before returning to a new state.