Benchmarking interpreters written in OCaml
Doing pattern-matching up front saves time.
An interpreter written in OCaml can be about one-third faster if the evaluation function is partially applied so it runs in two stages – first pattern-match on the expression constructors and then perform the computation.
Using the functional programming language OCaml, we’ve recently implemented a domain-specific language for specifying analyses of website traffic. This involved the classic computer-science task of parsing an abstract syntax tree (AST) from a text file and evaluating that AST in varying environments. The AST will be parsed once and evaluated in millions of different environments.
Performance is good, but we wanted to understand if we could easily make it better. To investigate this without being distracted by the details of our particular domain-specific language, we boiled it down to the problem of interpreting the following toy language of expressions with one free variable.
```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr
```
The `VarX` constructor corresponds to the one free variable `x`. For example, the expression `1 + x` is represented as `Binop (Add, Const 1, VarX)`. A more realistic language would of course support multiple variables.
The standard way to interpret this language is through a function `eval : expr -> int -> int`, where `eval e x` evaluates the expression `e` in the environment `x`, meaning that the one free variable has the value `x`. Since we’ll be interpreting random expressions in random environments, we define a toy division operator that doesn’t raise exceptions when dividing by zero.
```ocaml
let (/?) = fun x y -> if y = 0 then x else x / y

let eval_binop = function
  | Add -> (+)
  | Sub -> (-)
  | Mul -> ( * )
  | Div -> (/?)

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) ->
    (eval_binop binop) (eval e1 x) (eval e2 x)
```
No computation happens in the partial application `let f = eval e`, so if we invoke `f x` and `f y` then both invocations result in the AST being pattern-matched. Alternatively, we can define the function `partial` with the same type signature as `eval` but different performance characteristics – pattern-matching on `e` happens at the partial application `let f = partial e`, and subsequent invocations of `f x` and `f y` do not inspect the AST.
```ocaml
let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
    let f_binop = eval_binop binop in
    let f1 = partial e1 in
    let f2 = partial e2 in
    fun x -> f_binop (f1 x) (f2 x)
```
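As a quick sanity check (our own snippet, not code from the original post), we can verify that the two evaluators agree on the earlier example `1 + x`. The definitions from above are repeated so the snippet compiles on its own:

```ocaml
(* Definitions repeated from above so this snippet is self-contained. *)
type binop = Add | Sub | Mul | Div
type expr = Const of int | VarX | Neg of expr | Binop of binop * expr * expr

let (/?) = fun x y -> if y = 0 then x else x / y

let eval_binop = function
  | Add -> (+) | Sub -> (-) | Mul -> ( * ) | Div -> (/?)

let rec eval e x = match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - (eval e x)
  | Binop (b, e1, e2) -> (eval_binop b) (eval e1 x) (eval e2 x)

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - (f x)
  | Binop (b, e1, e2) ->
    let fb = eval_binop b in
    let f1 = partial e1 in
    let f2 = partial e2 in
    fun x -> fb (f1 x) (f2 x)

let () =
  let e = Binop (Add, Const 1, VarX) in  (* 1 + x *)
  let f = partial e in                   (* pattern matching happens here... *)
  assert (f 41 = 42);                    (* ...and not here *)
  assert (eval e 41 = f 41)
```

Both stages are ordinary OCaml functions, so `partial e` can be stored, passed around, and applied millions of times without re-inspecting the AST.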
To assess the performance of each approach, we’ve created a benchmark that evaluates the following randomly-generated expression with values of `x` from 0 to 9,999.
```
(((2040 /? (248 - ((990 /? x) * ((x /? x) - (x + x))))) + (((x + (x - (((x + 262) + (910 /? x)) * x))) + ((870 + (x + 226000)) + ((x /? (325 + x)) /? ((x * 437) - 988)))) * 23)) /? ((-(((710 - (x /? (-34176))) /? (247 + x)) - ((-x) * (-((-(1156 - (((x - x) * 33) /? 651))) + x))))) * (((-(x - x)) * ((-((x /? (-((x + (x /? 23)) * 557))) /? (643 - (((-x) + 109) /? 588)))) /? (((150 * (80 /? (x + x))) * 476) * (x * 334)))) /? 262)))
```
We benchmarked the `eval` and `partial` approaches and two other variations:

- Fully applying `partial` for every environment, meaning that we call it as `partial e x` in the inner loop. This is called “partial, fully applied” in the table below, and it lets us understand the overhead of `partial` when there are only a few values of `x`.
- Pretty-printing the expression to OCaml syntax, compiling it and running the optimized assembly executable. This is called “compiled” in the results table below, and it lets us understand the interpreter overhead compared with optimized assembly.
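A timing loop for such a comparison might look roughly like the following sketch. This is not the actual benchmark harness (that code is linked at the end of the post); the `time` helper and the small stand-in expression are our own:

```ocaml
(* A minimal timing-loop sketch (not the post's actual harness). *)
type binop = Add | Sub | Mul | Div
type expr = Const of int | VarX | Neg of expr | Binop of binop * expr * expr

let (/?) = fun x y -> if y = 0 then x else x / y
let eval_binop = function Add -> (+) | Sub -> (-) | Mul -> ( * ) | Div -> (/?)

let rec eval e x = match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - (eval e x)
  | Binop (b, e1, e2) -> (eval_binop b) (eval e1 x) (eval e2 x)

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - (f x)
  | Binop (b, e1, e2) ->
    let fb = eval_binop b and f1 = partial e1 and f2 = partial e2 in
    fun x -> fb (f1 x) (f2 x)

(* Run [f] once and report elapsed CPU time in milliseconds. *)
let time label f =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%-10s %.2f ms\n" label ((Sys.time () -. t0) *. 1000.);
  r

let () =
  (* A small stand-in expression; the post uses a large random one. *)
  let e = Binop (Div, Binop (Add, Const 1, VarX), Neg VarX) in
  let sum_eval =
    time "eval" (fun () ->
      let s = ref 0 in
      for x = 0 to 9_999 do s := !s + eval e x done; !s)
  in
  let sum_partial =
    time "partial" (fun () ->
      let f = partial e in  (* AST inspected once, outside the loop *)
      let s = ref 0 in
      for x = 0 to 9_999 do s := !s + f x done; !s)
  in
  assert (sum_eval = sum_partial)
```

Note that `partial e` is hoisted out of the inner loop; moving it inside the loop gives the “partial, fully applied” variation.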
We timed the executables produced by both the native-code OCaml compiler (`ocamlopt`) and the bytecode OCaml compiler (`ocamlc`). The results are as follows:
| Approach | Time (native code) | Time (bytecode) |
|---|---|---|
| compiled | 1.25 ms | 5.00 ms |
| partial | 4.33 ms | 20.77 ms |
| eval | 5.93 ms | 45.29 ms |
| partial, fully applied | 10.45 ms | 62.52 ms |
From these timings, we conclude the following:

- Using `partial` is around 37% faster than using `eval` (5.93 / 4.33 = 1.37).
- An interpreter written in OCaml runs quite fast. It runs only 3.5x slower than natively-compiled OCaml code (4.33 / 1.25 = 3.5).
- Our own interpreter is slightly faster than OCaml bytecode (4.33 < 5.00).
- Pattern matching in OCaml bytecode has a much higher penalty than it does in native code (45.29 / 20.77 > 5.93 / 4.33).
- If each expression should be evaluated in only one or two environments, it’s faster to use `eval` than `partial` (5.93 < 10.45).
Apart from performance, there are several differences between `eval` and `partial` worth discussing:

- The `partial` function is harder to read and maintain than `eval`; the `Binop` case is four lines instead of one. In the implementation, it’s easy to forget that all pattern matching should happen outside the `fun x -> ...` closure. If you forget, the compiler won’t warn you. In the caller, it’s easy to forget that the function should be applied in two stages.
- The first stage of `partial` is a natural place to perform sanity checks on the AST and raise exceptions as appropriate. In our case, this allows us to catch many errors on the developer’s laptop instead of only seeing them in the production system when the function is fully applied with realistic data.
- The `partial` function can perform optimizations that are not expressible as transformations on the AST. In this toy example, it could try to statically determine that the divisor is not zero, then use a division function without `if y = 0` in that case.
- Sometimes code can be easier to read and maintain when it is written naively and without worrying about performance. The `partial` approach allows us to do that for all the code that runs in the first stage.
The heap layout of the closure `eval (Binop (Add, Const 1, VarX))` is sketched in the following illustration, where `cl` is the closure tag, and `"eval"` is a code pointer to a function that invokes `eval`.

```
------------------
| cl | "eval" | * |
---------------|--
               V
-------------------------
| Binop | Add | * | VarX |
---------------|---------
               V
-------------
| Const | 1 |
-------------
```
Compare that with `partial (Binop (Add, Const 1, VarX))`, sketched below, where quoted words are again code pointers. There are more heap objects to traverse, but the traversal is only a matter of calling through code pointers.

```
----------------------------
| cl | "binop" | * | * | * |
----------------|---|---|---
                V   |   V
  --------------    |   -------------------
  | cl | "add" |    |   | cl | "identity" |
  --------------    V   -------------------
       ---------------------
       | cl | "const" | 1 |
       ---------------------
```
The complete benchmark code is available on GitHub.
With an execution time of 1.25 ms per 10,000 iterations, each iteration takes only 125 nanoseconds, which is fast enough to be sensitive to tiny changes in compilation strategy and calling convention. In particular, compiling with `-inline` values greater than 27 brings the execution time of “compiled” down from 1.25 ms to 825 µs. Therefore, take these results with a grain of salt and benchmark your own code in your own environment.