Benchmarking interpreters written in OCaml
Doing pattern-matching up front saves time.
An interpreter written in OCaml can be about one-third faster if the evaluation function is partially applied so it runs in two stages -- first pattern-match on the expression constructors and then perform the computation.
Using the functional programming language OCaml, we've recently implemented a domain-specific language for specifying analyses of website traffic. This involved the classic computer-science task of parsing an abstract syntax tree (AST) from a text file and evaluating that AST in varying environments. The AST will be parsed once and evaluated in millions of different environments.
Performance is good, but we wanted to understand if we could easily make it better. To investigate this without being distracted by the details of our particular domain-specific language, we boiled it down to the problem of interpreting the following toy language of expressions with one free variable.

    type binop = Add | Sub | Mul | Div

    type expr =
      | Const of int
      | VarX
      | Neg of expr
      | Binop of binop * expr * expr

The `VarX` constructor corresponds to the one free variable `x`. For example, `1 + x` is represented as `Binop (Add, Const 1, VarX)`. A more realistic language would of course support multiple variables.
The standard way to interpret this language is through a function `eval: expr -> int -> int`, where `eval e x` evaluates the expression `e` in the environment `x`, meaning that the one free variable has value `x`. Since we'll be interpreting random expressions in random environments, we define a toy division operator that doesn't raise exceptions when dividing by zero; it returns the dividend instead.

    let ( /? ) x y = if y = 0 then x else x / y

    let eval_binop = function
      | Add -> ( + )
      | Sub -> ( - )
      | Mul -> ( * )
      | Div -> ( /? )

    let rec eval e x =
      match e with
      | Const n -> n
      | VarX -> x
      | Neg e -> - eval e x
      | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

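As a quick check of how `eval` behaves, here is a minimal, self-contained sketch that repeats the definitions above and evaluates two small expressions. The names `one_plus_x` and `ten_div_x` are made up for this illustration and do not come from the benchmark.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

(* Toy division: returns the dividend instead of raising on a zero divisor. *)
let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

(* Hypothetical example expressions: 1 + x and 10 /? x. *)
let one_plus_x = Binop (Add, Const 1, VarX)
let ten_div_x = Binop (Div, Const 10, VarX)

let () =
  Printf.printf "%d %d %d\n"
    (eval one_plus_x 41)  (* 1 + 41 = 42 *)
    (eval ten_div_x 5)    (* 10 / 5 = 2 *)
    (eval ten_div_x 0)    (* zero divisor: the dividend 10 is returned *)
```

The last call shows why the toy operator matters for random inputs: with the ordinary `/`, evaluating at `x = 0` would raise `Division_by_zero`.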
No computation happens in the partial application `let f = eval e`, so if we call `f x` and `f y` then both invocations result in the AST being pattern-matched. Alternatively, we can define the function `partial` with the same type signature as `eval` but different performance characteristics: pattern-matching on `e` happens at the partial application `let f = partial e`, and subsequent invocations of `f x` and `f y` do not inspect the AST.

    let rec partial = function
      | Const n -> fun _ -> n
      | VarX -> fun x -> x
      | Neg e ->
        let f = partial e in
        fun x -> - f x
      | Binop (binop, e1, e2) ->
        let f_binop = eval_binop binop in
        let f1 = partial e1 in
        let f2 = partial e2 in
        fun x -> f_binop (f1 x) (f2 x)

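To make the difference between the two functions visible, the sketch below adds a counter that is incremented every time an AST node is pattern-matched. The counter `matches` and the helper `count_matches` are instrumentation invented for this illustration; they are not part of the benchmark code.

```ocaml
type binop = Add | Sub | Mul | Div
type expr = Const of int | VarX | Neg of expr | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y
let eval_binop = function
  | Add -> ( + ) | Sub -> ( - ) | Mul -> ( * ) | Div -> ( /? )

(* Hypothetical instrumentation: number of AST nodes inspected so far. *)
let matches = ref 0

let rec eval e x =
  incr matches;
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

let rec partial e =
  incr matches;
  match e with
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
    let f_binop = eval_binop binop in
    let f1 = partial e1 in
    let f2 = partial e2 in
    fun x -> f_binop (f1 x) (f2 x)

(* Builds f once via [stage], applies it to x = 0 .. n-1, and returns the
   sum of the results together with the number of nodes inspected. *)
let count_matches stage n =
  matches := 0;
  let f = stage () in
  let sum = ref 0 in
  for x = 0 to n - 1 do sum := !sum + f x done;
  (!sum, !matches)

(* 1 + (-x): four AST nodes (Binop, Const, Neg, VarX). *)
let e = Binop (Add, Const 1, Neg VarX)

let () =
  let sum_eval, m_eval = count_matches (fun () -> eval e) 100 in
  let sum_partial, m_partial = count_matches (fun () -> partial e) 100 in
  assert (sum_eval = sum_partial);
  (* prints "eval: 400 matches, partial: 4 matches" *)
  Printf.printf "eval: %d matches, partial: %d matches\n" m_eval m_partial
```

Both versions compute the same results, but `eval` inspects all four nodes on each of the 100 calls, while `partial` inspects them once when `partial e` is applied.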
To assess the performance of each approach, we've created a benchmark that evaluates the following randomly-generated expression with `x` from 0 to 9,999.
(((2040 /? (248 - ((990 /? x) * ((x /? x) - (x + x))))) + (((x + (x - (((x + 262) + (910 /? x)) * x))) + ((870 + (x + 226000)) + ((x /? (325 + x)) /? ((x 437) - 988)))) * 23)) /? ((-(((710 - (x /? (-34176))) /? (247 + x)) - ((-x) * (-((-(1156 - (((x - x) * 33) /? 651))) + x))))) * (((-(x - x)) * ((-((x /? (-((x + (x /? 23)) * 557))) /? (643 - (((-x) + 109) /? 588)))) /? (((150 * (80 /? (x + x))) * 476) * (x * 334)))) /? 262)))
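The timing loop for such a benchmark might look roughly like the sketch below. It is not the harness from the benchmark: the helper `time_ms`, the `run_*` functions, and the short expression `small` (standing in for the long random expression) are all assumptions made for illustration, and `Sys.time` measures processor time with limited resolution.

```ocaml
type binop = Add | Sub | Mul | Div
type expr = Const of int | VarX | Neg of expr | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y
let eval_binop = function
  | Add -> ( + ) | Sub -> ( - ) | Mul -> ( * ) | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
    let f_binop = eval_binop binop in
    let f1 = partial e1 in
    let f2 = partial e2 in
    fun x -> f_binop (f1 x) (f2 x)

(* Each run evaluates e for x = 0 .. 9,999 and returns the sum of results. *)
let run_eval e () =
  let sum = ref 0 in
  for x = 0 to 9_999 do sum := !sum + eval e x done;
  !sum

let run_partial e () =
  let f = partial e in  (* first stage: once per run, outside the loop *)
  let sum = ref 0 in
  for x = 0 to 9_999 do sum := !sum + f x done;
  !sum

let run_partial_fully_applied e () =
  let sum = ref 0 in
  for x = 0 to 9_999 do sum := !sum + partial e x done;
  !sum

(* Hypothetical helper: average CPU time per run, in milliseconds. *)
let time_ms ~repeats run =
  let start = Sys.time () in
  for _i = 1 to repeats do ignore (run ()) done;
  (Sys.time () -. start) *. 1000. /. float_of_int repeats

(* Hypothetical stand-in expression: (x + 262) /? (x - 988). *)
let small =
  Binop (Div, Binop (Add, VarX, Const 262), Binop (Sub, VarX, Const 988))

let () =
  List.iter
    (fun (name, run) ->
       Printf.printf "%-24s %.3f ms\n" name (time_ms ~repeats:100 run))
    [ "eval", run_eval small;
      "partial", run_partial small;
      "partial, fully applied", run_partial_fully_applied small ]
```

The absolute numbers printed by this sketch depend on the machine and compiler flags, so they are not comparable to the timings reported in this post.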
We benchmarked the `eval` and `partial` approaches and two other variations:
- Fully applying `partial` for every environment, meaning that we call it as `partial e x` in the inner loop. This is called "partial, fully applied" in the table below, and it lets us understand the overhead of `partial` when there are only a few values of `x`.
- Pretty-printing the expression to OCaml syntax, compiling it and running the optimized assembly executable. This is called "compiled" in the results table below, and it lets us understand the interpreter overhead compared with optimized assembly.
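The pretty-printer used for the "compiled" variation is not shown here, so the following is only a sketch of what such a printer could look like. The function names `string_of_binop`, `to_ocaml` and `to_ocaml_function`, and the generated name `compiled`, are assumptions.

```ocaml
type binop = Add | Sub | Mul | Div
type expr = Const of int | VarX | Neg of expr | Binop of binop * expr * expr

let string_of_binop = function
  | Add -> "+"
  | Sub -> "-"
  | Mul -> "*"
  | Div -> "/?"

(* Fully parenthesised output, so operator precedence never matters.
   Negative constants are wrapped so that they parse as one literal. *)
let rec to_ocaml = function
  | Const n -> if n < 0 then Printf.sprintf "(%d)" n else string_of_int n
  | VarX -> "x"
  | Neg e -> Printf.sprintf "(-%s)" (to_ocaml e)
  | Binop (op, e1, e2) ->
    Printf.sprintf "(%s %s %s)" (to_ocaml e1) (string_of_binop op) (to_ocaml e2)

(* Wraps the expression in a one-argument function definition. The
   generated source still needs a definition of ( /? ) in scope. *)
let to_ocaml_function e =
  Printf.sprintf "let compiled x = %s" (to_ocaml e)

let () =
  (* prints "let compiled x = ((-x) /? (1 + x))" *)
  print_endline
    (to_ocaml_function (Binop (Div, Neg VarX, Binop (Add, Const 1, VarX))))
```

The generated text can then be written to a file and compiled like any other OCaml source, which removes the AST from the inner loop entirely.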
We timed the executables produced by both the native-code OCaml compiler (`ocamlopt`) and the bytecode OCaml compiler (`ocamlc`). The results are as follows:
| Approach | Time (native code) | Time (bytecode) |
|---|---|---|
| compiled | 1.25 ms | 5.00 ms |
| partial | 4.33 ms | 20.77 ms |
| eval | 5.93 ms | 45.29 ms |
| partial, fully applied | 10.45 ms | 62.52 ms |
From these timings, we conclude the following:
- Using `partial` is around 37% faster than using `eval` (5.93 / 4.33 = 1.37).
- An interpreter written in OCaml runs quite fast. It runs only 3.5x slower than natively-compiled OCaml code (4.33 / 1.25 = 3.5).
- Our own interpreter is slightly faster than OCaml bytecode (4.33 < 5.00).
- Pattern matching in OCaml bytecode has a much higher penalty than it does in native code (45.29 / 20.77 > 5.93 / 4.33).
- If each expression should be evaluated in only one or two environments, it's faster to use `eval` than `partial` (5.93 < 10.45).
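The ratios quoted in this list can be recomputed from the timings table. The sketch below does only that arithmetic; the list names `native` and `bytecode` and the helper `ratio` are made up for illustration.

```ocaml
(* Timings in milliseconds, copied from the results table. *)
let native =
  [ "compiled", 1.25; "partial", 4.33; "eval", 5.93;
    "partial, fully applied", 10.45 ]

let bytecode =
  [ "compiled", 5.00; "partial", 20.77; "eval", 45.29;
    "partial, fully applied", 62.52 ]

(* How many times longer approach [a] takes than approach [b]. *)
let ratio table a b = List.assoc a table /. List.assoc b table

let () =
  (* about 1.37 *)
  Printf.printf "eval / partial (native):     %.2f\n" (ratio native "eval" "partial");
  (* about 3.46, rounded to 3.5 in the text *)
  Printf.printf "partial / compiled (native): %.2f\n" (ratio native "partial" "compiled");
  (* about 2.18 *)
  Printf.printf "eval / partial (bytecode):   %.2f\n" (ratio bytecode "eval" "partial")
```

The last two lines make the bytecode observation concrete: repeated pattern matching costs roughly a factor of 2.18 in bytecode, compared with 1.37 in native code.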
Apart from performance, there are several differences between `eval` and `partial` worth discussing:
- The `partial` function is harder to read and maintain than `eval`. The `Binop` case is four lines instead of one. In the implementation, it's easy to forget that all pattern matching should happen outside the `fun x -> ...` closure. If you forget, the compiler won't warn you. In the caller, it's easy to forget that the function should be applied in two stages.
- The first stage of `partial` is a natural place to perform sanity checks on the AST and raise exceptions as appropriate. In our case, this allows us to catch many errors on the developer's laptop instead of only seeing them in the production system when the function is fully applied with realistic data.
- The `partial` function can perform optimizations that are not expressible as transformations on the AST. In this toy example, it could try to statically determine that the divisor is not zero, then use a division function without `if y = 0` in that case.
- Sometimes code can be easier to read and maintain when it is written naively and without worrying about performance. The `partial` approach allows us to do that for all the code that runs in the first stage.
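Two of the points above can be sketched in code. The first function, `partial_late`, shows the staging mistake: it type-checks and returns correct results, but the recursive calls sit inside the closure, so the AST is traversed again on every call. The second, `partial_opt`, shows one possible first-stage optimization, handling only the simplest case where the divisor is a non-zero constant. Both function names are hypothetical.

```ocaml
type binop = Add | Sub | Mul | Div
type expr = Const of int | VarX | Neg of expr | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y
let eval_binop = function
  | Add -> ( + ) | Sub -> ( - ) | Mul -> ( * ) | Div -> ( /? )

(* Reference interpreter, used below to check the two variants. *)
let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

(* Pitfall: the recursion happens inside [fun x -> ...], so nothing is
   gained from the first stage. The compiler accepts this silently. *)
let rec partial_late = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> fun x -> - partial_late e x
  | Binop (binop, e1, e2) ->
    fun x -> eval_binop binop (partial_late e1 x) (partial_late e2 x)

(* Sketch of a first-stage optimization: when the divisor is a constant
   known to be non-zero, the closure uses plain division. *)
let rec partial_opt = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial_opt e in fun x -> - f x
  | Binop (Div, e1, Const n) when n <> 0 ->
    let f1 = partial_opt e1 in
    fun x -> f1 x / n
  | Binop (binop, e1, e2) ->
    let f_binop = eval_binop binop in
    let f1 = partial_opt e1 in
    let f2 = partial_opt e2 in
    fun x -> f_binop (f1 x) (f2 x)

let () =
  (* (x * 7) /? 3 *)
  let e = Binop (Div, Binop (Mul, VarX, Const 7), Const 3) in
  let f_late = partial_late e and f_opt = partial_opt e in
  for x = -5 to 5 do
    assert (f_late x = eval e x);
    assert (f_opt x = eval e x)
  done
```

Because both variants agree with `eval` on every input, a test suite that only checks results would not catch the staging mistake in `partial_late`; only a benchmark or a code review would.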
The heap layout of the closure `eval (Binop (Add, Const 1, VarX))` is sketched in the following illustration, where `cl` is the closure tag, and `"eval"` is a code pointer to a function that invokes `eval`.

    -------------------
    | cl | "eval" | * |
    ----------------|--
                    V
    --------------------------
    | Binop | Add | * | VarX |
    ----------------|---------
                    V
              -------------
              | Const | 1 |
              -------------

Compare that with `partial (Binop (Add, Const 1, VarX))`, sketched below, where quoted words are again code pointers. There are more heap objects to traverse, but the traversal is only a matter of calling through code pointers.

    ---------------------------------------
    | cl | "binop" | *        | *     | * |
    -----------------|----------|-------|--
                     V          |       V
            --------------      |  -------------------
            | cl | "add" |      |  | cl | "identity" |
            --------------      V  -------------------
                      --------------------
                      | cl | "const" | 1 |
                      --------------------

The complete benchmark code is available on GitHub.
With an execution time of 1.25 ms per 10,000 iterations, each iteration takes only 125 nanoseconds, which is fast enough to be sensitive to tiny changes in compilation strategy and calling convention. In particular, compiling with `-inline` values greater than 27 brings the execution time of "compiled" down from 1.25 ms to 825 µs. Therefore, take these results with a grain of salt and benchmark your own code in your own environment.