¶ On the Performance of Closures in V8
trevnorris recently wrote a great post titled "Long live the callbacks" in which, among other subjects, he compares the performance of two hypothetical Point classes, each with a distance method, that exhibit drastically different performance characteristics. Paraphrasing, the recommendation from this portion of his post was to avoid creating functions that rely on variables from an enclosing scope in performance-critical paths.
When I first read trevnorris's conclusions about the Point benchmark, I was skeptical. Could the performance difference really be explained satisfactorily by the mere presence of variable lookups into another scope? And if so, why? I resolved to find a root cause (and hopefully, in so doing, reveal that closures were not horrifyingly bad for performance). This post roughly chronicles my progress in finding such an explanation.
¶ The Benchmark
// ./closure.js
// the closure-utilizing version
module.exports = Point

function Point(x0, y0, x1, y1) {
  this.distance = function dist() {
    var x = x1 - x0
    var y = y1 - y0
    return Math.sqrt(x * x + y * y)
  }
}
// ./bench.js
var Point = require('./' + (process.env.WHICH || 'closure'))
var iter = 1000000
var rand = Math.random

for (var i = 0; i < iter; ++i) {
  var p = new Point(rand(), rand(), rand(), rand())
  for (var j = 0; j < 1000; ++j) {
    p.distance()
  }
}
The first thing that jumped out at me from reading trevnorris's post was the following observation:
... so it took 33 seconds and used right around 50MB memory. Let's make a slight adjustment to the implementation:
(What follows is a slightly modified version of his original refinement.)
// ./proto.js
// the prototype-utilizing version
module.exports = Point

function Point(x0, y0, x1, y1) {
  this.x0 = x0
  this.y0 = y0
  this.x1 = x1
  this.y1 = y1
}

Point.prototype.distance = function() {
  var x = this.x1 - this.x0
    , y = this.y1 - this.y0
  return Math.sqrt(x * x + y * y)
}
... Now it runs in under 2 seconds and only uses 15MB memory.
My first theory was that the primary difference between the two classes was that the original, closure'd version would create a function instance on every instantiation, in addition to the Point instance itself. In my experience, each function instance that V8 encounters is treated as a totally distinct function by the optimizer. In other words, optimizations apply to function instances, not function definitions; every thousand times through this loop we would create a new Point instance with a new p.distance function instance, and have to re-warm the JIT against this new function. Given the relatively small inner loop count compared to the outer loop count, this seemed like a potentially viable explanation for the slowdown.
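That part of the theory is easy to sanity-check directly: each instantiation of the closure version mints a fresh .distance function object, while the prototype version shares a single one across all instances. A minimal sketch (the class names ClosurePoint and ProtoPoint are mine; the bodies mirror closure.js and proto.js above):

```javascript
// Does each instantiation create a new function object?
function ClosurePoint(x0, y0, x1, y1) {
  this.distance = function dist() {
    var x = x1 - x0
    var y = y1 - y0
    return Math.sqrt(x * x + y * y)
  }
}

function ProtoPoint(x0, y0, x1, y1) {
  this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1
}
ProtoPoint.prototype.distance = function() {
  var x = this.x1 - this.x0
  var y = this.y1 - this.y0
  return Math.sqrt(x * x + y * y)
}

var c1 = new ClosurePoint(0, 0, 3, 4)
var c2 = new ClosurePoint(0, 0, 3, 4)
var p1 = new ProtoPoint(0, 0, 3, 4)
var p2 = new ProtoPoint(0, 0, 3, 4)

console.log(c1.distance === c2.distance) // false: a new function per instance
console.log(p1.distance === p2.distance) // true: one shared function
console.log(c1.distance(), p1.distance()) // 5 5
```

So the closure benchmark exercises one million distinct function instances, while the prototype benchmark exercises exactly one.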
But what about the memory? Given that we were creating one million Point instances in our benchmark, I reasoned that the overhead of creating an extra function instance (for .distance) per iteration could eventually account for that sort of memory behavior, especially if it was eagerly materializing .distance's .prototype object. This felt tenuous, but it was testable separately from the first explanation: we'd simply compare --trace_gc output as the two versions ran, and see if one collected (and thus generated) more garbage than the other.
¶ My Hypothesis
If the object instantiation from the outer loop is removed so that the benchmark tests only the .distance function, then the performance gap between the two approaches should shrink considerably.
To test this hypothesis, I modified ./bench.js so that it only ever creates one Point instance: that way it's only testing the speed of the .distance function itself. My assumption was that by doing so, I would see the performance gap shrink considerably and the memory usage go down drastically. I used process.hrtime for the former, and looked at GC behavior for the latter (since I couldn't replicate the memory usage locally using process.memoryUsage).
// ./bench.js
var Point = require('./' + (process.env.WHICH || 'closure'))
var iter = 1000000 * 1000
var rand = Math.random

var hrtime = process.hrtime()
var p = new Point(rand(), rand(), rand(), rand())
for (var i = 0; i < iter; ++i) {
  p.distance()
}
hrtime = process.hrtime(hrtime) // get the difference since original measurement
console.log(
    '%sms'
  , hrtime[0] * 1e3 + hrtime[1] / 1e6
)
I got the following results:
$ time WHICH=proto node bench.js
883.941066ms
real 0m0.943s
user 0m0.930s
sys 0m0.016s
$ time WHICH=closure node bench.js
12233.704858ms
real 0m12.291s
user 0m12.295s
sys 0m0.059s
¶ Hypothesis: Disproven
It's not the allocation!
The closure was still really slow. I checked the GC behavior next. Having removed all explicit allocation from the benchmark, I fully expected the garbage collector to run infrequently for both closure and proto. How wrong I was:
$ time WHICH=proto node --trace_gc bench
[34339] 34 ms: Scavenge 2.0 (35.0) -> 1.6 (36.0) MB, 0 ms [Runtime::PerformGC].
875.469205ms
real 0m0.935s
user 0m0.922s
sys 0m0.016s
$ time WHICH=closure node --trace_gc bench.js
[34421] 32 ms: Scavenge 2.0 (35.0) -> 1.6 (36.0) MB, 0 ms [Runtime::PerformGC].
[34421] 34 ms: Scavenge 2.3 (36.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 35 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 35 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 36 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 37 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
# ... snip snip snip ...
[34421] 12076 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 12077 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 12078 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 12079 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 12080 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 12080 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
[34421] 12081 ms: Scavenge 2.6 (37.0) -> 1.6 (37.0) MB, 0 ms [Runtime::PerformGC].
real 0m12.116s
user 0m12.128s
sys 0m0.067s
The closure was really putting the GC through its paces, while the prototype version barely generated any garbage, which seems impossible at first glance. After all, the only things left to generate garbage are the function calls themselves (creating stack frames, the arguments object, and so on).
This is about the point where I slapped my forehead in realization: the only way to eliminate that allocation is to eliminate the function call. V8 was inlining .distance in the prototype version, but not in the closure version.
$ time WHICH=closure node --trace_inlining bench
Inlining builtin 0x13c86dd1c901 <JS Function sqrt>
Did not inline require called from (target requires context change).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline Point called from (target contains unsupported syntax [early]).
Did not inline dist called from (target requires context change).
12064.476165ms
real 0m12.122s
user 0m12.122s
sys 0m0.061s
$ time WHICH=proto node --trace_inlining bench
Inlining builtin 0x3e56e111c901 <JS Function sqrt>
Did not inline require called from (target requires context change).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Inlined Point called from .
Inlining builtin 0x3e56e111c901 <JS Function sqrt>
Inlined Point.distance called from .
889.616314ms
real 0m0.947s
user 0m0.935s
sys 0m0.016s
Sure enough, V8 was able to inline the prototype version of .distance, but not the closure version. A quick grep through the V8 codebase led me to the culprit:
// Target must be able to use caller's context.
CompilationInfo* outer_info = current_info();
if (target->context() != outer_info->closure()->context() ||
    outer_info->scope()->contains_with() ||
    outer_info->scope()->num_heap_slots() > 0) {
  TraceInline(target, caller, "target requires context change");
  return false;
}
In particular, we're concerned with the first check: the comparison of the context of the function-to-be-inlined against the context of the function-to-accept-the-inline. This check prevents any function that closes over an outer value from being inlined into a loop that does not also have access to those variables at that scope.
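In JavaScript terms, the distinction the check draws looks roughly like this (a minimal sketch; the names makeClosureDist and argDist are mine, not from V8 or the benchmark):

```javascript
// Two ways to compute the same value. The first reads x0..y1 from an
// enclosing scope, so it carries a context of its own: per the check
// above, V8 can't inline it into a caller with a different context.
function makeClosureDist(x0, y0, x1, y1) {
  return function () {
    var x = x1 - x0, y = y1 - y0
    return Math.sqrt(x * x + y * y)
  }
}

// The second reads only from its arguments, which the caller supplies
// directly, so no context change is required to inline it.
function argDist(x0, y0, x1, y1) {
  var x = x1 - x0, y = y1 - y0
  return Math.sqrt(x * x + y * y)
}

var closed = makeClosureDist(0, 0, 3, 4)
console.log(closed())            // 5, but tied to its creating context
console.log(argDist(0, 0, 3, 4)) // 5, an inlining candidate
```

Both return the same answer; only where they look the values up differs, and that is exactly what the inliner cares about.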
Armed with this newfound understanding, I set out to write a hybrid approach that would use the "instance method" approach from closure with the "this as context" approach from proto:
// ./hybrid.js
module.exports = Point

function Point(x0, y0, x1, y1) {
  this.x0 = x0
  this.x1 = x1
  this.y0 = y0
  this.y1 = y1
  // an instance method that looks up the
  // properties it needs from `this` instead
  // of from the enclosing scope.
  this.distance = function distance() {
    var x = this.x1 - this.x0
      , y = this.y1 - this.y0
    return Math.sqrt(x * x + y * y)
  }
}
$ WHICH=hybrid node --trace_inlining bench.js
Inlining builtin 0x27153e11c901 <JS Function sqrt>
Did not inline require called from (target requires context change).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline random called from (target not inlineable).
Did not inline Point called from (target contains unsupported syntax [early]).
Inlining builtin 0x27153e11c901 <JS Function sqrt>
Inlined distance called from .
888.523968ms
The hybrid distance is inlined! This confused me for a second: distance clearly has x0, y0, x1, and y1 in scope, but somehow those don't fail the context comparison in V8.
A while back, I had the experience of helping a co-worker debug a frontend JS issue with the Chrome debugger. They had set a breakpoint in a function and were trying to reference a variable that hadn't been used in that scope, but they kept getting ReferenceErrors. It turned out that since the variable wasn't used in the function's scope, it had been omitted from that function's context, and any attempt (from the debugger) to reach the value stored in that variable would trigger an error. That was a fun lesson, and I thought back to it when pondering why the function might be inlineable despite the presence of an obviously different scope chain.
It seems the only context the function-to-be-inlined is allowed to keep is its arguments, which are directly provided by the calling function-to-accept-the-inline. Interestingly, this doesn't count against the function's promotion up the scope chain because it's passed as an argument internally (which makes sense, as it's determined at the call site).
¶ Lessons learned
And an admonition
The admonition: microbenchmarks are a poor indicator of your app's overall performance. Benchmark your app, find the bottlenecks, find a more appropriate algorithm — and if one doesn't exist, then start worrying about the inlining behavior of your code.
If you're writing performance-critical code, be keenly aware of the inlining behavior of your code. Run your code with --trace_inlining to see what's getting inlined, and what isn't. If something isn't getting inlined, check that it doesn't close over any outer variables that the loop doesn't have access to. Barring that, don't be afraid to clone the V8 codebase and search for the appropriate "reason" string you obtain from --trace_inlining.