# Benchmarking interpreters written in OCaml

# Doing pattern-matching up front saves time.

An interpreter written in OCaml can be about one-third faster if the evaluation function is partially applied so it runs in two stages -- first pattern-match on the expression constructors and then perform the computation.

Using the functional programming language OCaml, we've recently implemented a domain-specific language for specifying analyses of website traffic. This involved the classic computer-science task of parsing an abstract syntax tree (AST) from a text file and evaluating that AST in varying environments. The AST will be parsed once and evaluated in millions of different environments.

Performance is good, but we wanted to understand if we could easily make it better. To investigate this without being distracted by the details of our particular domain-specific language, we boiled it down to the problem of interpreting the following toy language of expressions with one free variable.

```
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr
```

The `VarX` constructor corresponds to the one free variable `x`. For example, the expression `1 + x` is represented as `Binop (Add, Const 1, VarX)`. A more realistic language would of course support multiple variables.

The standard way to interpret this language is through a function `eval: expr -> int -> int`, where `eval e x` evaluates the expression `e` in the environment `x`, meaning that the one free variable has value `x`. Since we'll be interpreting random expressions in random environments, we define a toy division operator that doesn't raise exceptions when dividing by zero.

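```
(* Toy division: returns the dividend instead of raising Division_by_zero. *)
let ( /? ) = fun x y -> if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

(* Every call re-inspects the AST, even when [eval e] is partially applied. *)
let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) ->
    (eval_binop binop) (eval e1 x) (eval e2 x)
```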
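
As a quick check of these definitions, the following sketch evaluates `1 + x` in one environment and exercises the toy division operator with a zero divisor. The definitions are repeated so that the snippet runs on its own; the names `one_plus_x` and `seven_over_x` are only for illustration and do not appear in the benchmark.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (op, e1, e2) -> (eval_binop op) (eval e1 x) (eval e2 x)

(* Illustrative expressions: 1 + x and 7 /? x. *)
let one_plus_x = Binop (Add, Const 1, VarX)
let seven_over_x = Binop (Div, Const 7, VarX)

let () =
  (* 1 + 41 *)
  Printf.printf "%d\n" (eval one_plus_x 41);
  (* Dividing by zero returns the dividend instead of raising. *)
  Printf.printf "%d\n" (eval seven_over_x 0)
```

This prints `42` and then `7`.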

No computation happens in the partial application `let f = eval e`, so if we invoke `f x` and `f y` then both invocations result in the AST being pattern-matched. Alternatively, we can define the function `partial` with the same type signature as `eval` but different performance characteristics -- pattern-matching on `e` happens at the partial application `let f = partial e`, and subsequent invocations of `f x` and `f y` do not inspect the AST.

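```
let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
    (* First stage: all pattern matching happens here, once per AST node. *)
    let f_binop = eval_binop binop in
    let f1 = partial e1 in
    let f2 = partial e2 in
    (* Second stage: the returned closure never inspects the AST. *)
    fun x -> f_binop (f1 x) (f2 x)
```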
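
One way to observe the two stages is to count how often an AST node is inspected. The sketch below adds a hypothetical counter `matches` to both functions (this instrumentation is not part of the original code) and applies each function to the same expression in three environments. The helper `run` and the input list are likewise assumptions made for the demonstration.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

(* Hypothetical instrumentation: number of AST nodes pattern-matched so far. *)
let matches = ref 0

let rec eval e x =
  incr matches;
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (op, e1, e2) -> (eval_binop op) (eval e1 x) (eval e2 x)

let rec partial e =
  incr matches;
  match e with
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (op, e1, e2) ->
    let f_binop = eval_binop op in
    let f1 = partial e1 in
    let f2 = partial e2 in
    fun x -> f_binop (f1 x) (f2 x)

(* 1 + x, which has three AST nodes. *)
let e = Binop (Add, Const 1, VarX)
let inputs = [ 10; 20; 30 ]

(* Applies [stage1 e] once, then applies the result to every input.
   Returns the results and the number of pattern matches performed. *)
let run stage1 =
  matches := 0;
  let f = stage1 e in
  let results = List.map f inputs in
  (results, !matches)

let eval_results, eval_matches = run eval
let partial_results, partial_matches = run partial

let () =
  assert (eval_results = partial_results);
  Printf.printf "eval: %d matches, partial: %d matches\n"
    eval_matches partial_matches
```

With three nodes and three environments, `eval` inspects the AST nine times, while `partial` inspects it three times, all of them before the first environment is supplied.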

To assess the performance of each approach, we've created a benchmark that evaluates the following randomly-generated expression with values of `x` from 0 to 9,999.

```
(((2040 /? (248 - ((990 /? x) * ((x /? x) - (x + x))))) + (((x + (x - (((x + 262) + (910 /? x)) * x))) + ((870 + (x + 226000)) + ((x /? (325 + x)) /? ((x 437) - 988)))) * 23)) /? ((-(((710 - (x /? (-34176))) /? (247 + x)) - ((-x) * (-((-(1156 - (((x - x) * 33) /? 651))) + x))))) * (((-(x - x)) * ((-((x /? (-((x + (x /? 23)) * 557))) /? (643 - (((-x) + 109) /? 588)))) /? (((150 * (80 /? (x + x))) * 476) * (x * 334)))) /? 262)))
```
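
A minimal sketch of such a measurement loop follows. It is not the benchmark code from the repository: the helpers `sum_over_inputs` and `time_ms`, the repetition count, the small stand-in expression, and the choice of `Sys.time` (processor time from the standard library) are all assumptions made to keep the example short. Absolute numbers from a loop like this will differ from the ones reported here.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (op, e1, e2) -> (eval_binop op) (eval e1 x) (eval e2 x)

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (op, e1, e2) ->
    let f_binop = eval_binop op in
    let f1 = partial e1 in
    let f2 = partial e2 in
    fun x -> f_binop (f1 x) (f2 x)

(* Stand-in expression (x * 3) /? (x - 5); an assumption of this sketch. *)
let e = Binop (Div, Binop (Mul, VarX, Const 3), Binop (Sub, VarX, Const 5))

(* Sums [f x] for x from 0 to 9,999 so that every result is used. *)
let sum_over_inputs f =
  let acc = ref 0 in
  for x = 0 to 9_999 do
    acc := !acc + f x
  done;
  !acc

(* Hypothetical helper: average processor time of one [thunk ()] call,
   in milliseconds, measured over [repeat] calls. *)
let time_ms ~repeat thunk =
  let start = Sys.time () in
  for _i = 1 to repeat do
    ignore (thunk ())
  done;
  (Sys.time () -. start) *. 1000. /. float_of_int repeat

let () =
  (* The first stage of [partial] runs once, outside the timed loop. *)
  let f = partial e in
  Printf.printf "eval:    %.3f ms\n"
    (time_ms ~repeat:100 (fun () -> sum_over_inputs (eval e)));
  Printf.printf "partial: %.3f ms\n"
    (time_ms ~repeat:100 (fun () -> sum_over_inputs f))
```

Summing the results is one simple way to make sure the evaluated values are actually used by the program.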

We benchmarked the `eval` and `partial` approaches and two other variations:

- Fully applying `partial` for every environment, meaning that we call it as `partial e x` in the inner loop. This is called "partial, fully applied" in the table below, and it lets us understand the overhead of `partial` when there are only a few values of `x`.
- Pretty-printing the expression to OCaml syntax, compiling it and running the optimized assembly executable. This is called "compiled" in the results table below, and it lets us understand the interpreter overhead compared with optimized assembly.
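
The "compiled" variation needs the expression as OCaml source text. A minimal sketch of such a pretty-printer is shown below. The function names `to_ocaml` and `to_ocaml_function`, the generated function name `compiled`, and the choice to parenthesize every subexpression are assumptions of this sketch, not the code used in the benchmark.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let string_of_binop = function
  | Add -> "+"
  | Sub -> "-"
  | Mul -> "*"
  | Div -> "/?"

(* Parenthesizes every compound subexpression and every negative constant,
   so operator precedence in the generated source never matters. *)
let rec to_ocaml = function
  | Const n -> if n < 0 then Printf.sprintf "(%d)" n else string_of_int n
  | VarX -> "x"
  | Neg e -> Printf.sprintf "(-%s)" (to_ocaml e)
  | Binop (op, e1, e2) ->
    Printf.sprintf "(%s %s %s)" (to_ocaml e1) (string_of_binop op) (to_ocaml e2)

(* Wraps the expression in a definition that a generated .ml file could
   contain. The toy division operator is emitted too, because the generated
   file has to be self-contained. *)
let to_ocaml_function e =
  Printf.sprintf
    "let ( /? ) x y = if y = 0 then x else x / y\nlet compiled x = %s\n"
    (to_ocaml e)

let () = print_string (to_ocaml_function (Binop (Add, Const 1, VarX)))
```

For `1 + x` the second generated line is `let compiled x = (1 + x)`. Writing that text to a file and compiling it with `ocamlopt` is left out of the sketch.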

We timed the executables produced by both the native-code OCaml compiler (`ocamlopt`) and the bytecode OCaml compiler (`ocamlc`). The results are as follows:

| Approach | Time (native code) | Time (bytecode) |
|---|---|---|
| compiled | 1.25 ms | 5.00 ms |
| partial | 4.33 ms | 20.77 ms |
| eval | 5.93 ms | 45.29 ms |
| partial, fully applied | 10.45 ms | 62.52 ms |

From these timings, we conclude the following:

- Using `partial` is around 37% faster than using `eval` (5.93 / 4.33 ≈ 1.37).
- An interpreter written in OCaml runs quite fast. It runs only about 3.5x slower than natively-compiled OCaml code (4.33 / 1.25 ≈ 3.5).
- Our own interpreter, compiled to native code, is slightly faster than the same expression compiled to OCaml bytecode (4.33 < 5.00).
- Pattern matching in OCaml bytecode has a much higher penalty than it does in native code (45.29 / 20.77 > 5.93 / 4.33).
- If each expression should be evaluated in only one or two environments, it's faster to use `eval` than `partial` (5.93 < 10.45).

Apart from performance, there are several differences between `eval` and `partial` worth discussing:

- The `partial` function is harder to read and maintain than `eval`; the `Binop` case is four lines instead of one. In the implementation, it's easy to forget that all pattern matching should happen outside the `fun x -> ...` closure. If you forget, the compiler won't warn you. In the caller, it's easy to forget that the function should be applied in two stages.
- The first stage of `partial` is a natural place to perform sanity checks on the AST and raise exceptions as appropriate. In our case, this allows us to catch many errors on the developer's laptop instead of only seeing them in the production system when the function is fully applied with realistic data.
- The `partial` function can perform optimizations that are not expressible as transformations on the AST. In this toy example, it could try to statically determine that the divisor is not zero, then use a division function without `if y = 0` in that case.
- Sometimes code can be easier to read and maintain when it is written naively and without worrying about performance. The `partial` approach allows us to do that for all the code that runs in the first stage.

The heap layout of the closure `eval (Binop (Add, Const 1, VarX))` is sketched in the following illustration, where `cl` is the closure tag, and `"eval"` is a code pointer to a function that invokes `eval`.

     -----------------
    | cl | "eval" | * |
     ---------------|-
                    V
     ------------------------
    | Binop | Add | * | VarX |
     ---------------|--------
                    V
               -----------
              | Const | 1 |
               -----------

Compare that with `partial (Binop (Add, Const 1, VarX))`, sketched below, where quoted words are again code pointers. There are more heap objects to traverse, but the traversal is only a matter of calling through code pointers.

     --------------------------------------
    | cl | "binop" | * |        * |     *  |
     ----------------|----------|-------|--
                     V          |       V
              ------------      |  -----------------
             | cl | "add" |     | | cl | "identity" |
              ------------      V  -----------------
                        ------------------
                       | cl | "const" | 1 |
                        ------------------
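
Two of the differences listed earlier, first-stage sanity checks and first-stage optimizations, can be sketched in code. The variant below, hypothetically named `partial_checked`, rejects an AST that divides by the constant zero and uses plain `/` when the divisor is a non-zero constant. Both the validation rule and the choice of `Invalid_argument` are assumptions made for illustration; the checks in our actual domain-specific language are not shown in this article.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec partial_checked = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial_checked e in fun x -> - f x
  | Binop (Div, _, Const 0) ->
    (* Assumed sanity check: fail in the first stage, before any
       environment is supplied. *)
    invalid_arg "division by constant zero"
  | Binop (Div, e1, Const n) ->
    (* The divisor is known to be non-zero here, so the closure can use
       plain division and skip the [if y = 0] test. *)
    let f1 = partial_checked e1 in
    fun x -> f1 x / n
  | Binop (op, e1, e2) ->
    let f_binop = eval_binop op in
    let f1 = partial_checked e1 in
    let f2 = partial_checked e2 in
    fun x -> f_binop (f1 x) (f2 x)

(* x / 2, specialized in the first stage. *)
let half = partial_checked (Binop (Div, VarX, Const 2))

let () =
  Printf.printf "%d %d\n" (half 10) (half 7);
  (* The bad AST is rejected without ever being applied to an environment. *)
  match partial_checked (Binop (Add, VarX, Binop (Div, Const 1, Const 0))) with
  | _ -> print_endline "accepted"
  | exception Invalid_argument msg -> Printf.printf "rejected: %s\n" msg
```

The first line of output is `5 3`, and the second reports the rejection. Note that rejecting `x /? 0` changes the meaning of the toy language, where that expression evaluates to `x`; a real check would be chosen to match the language's intended semantics.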

The complete benchmark code is available on GitHub.

With an execution time of 1.25 ms per 10,000 iterations, each iteration takes only 125 nanoseconds, which is fast enough to be sensitive to tiny changes in compilation strategy and calling convention. In particular, compiling with `-inline` values greater than 27 brings the execution time of "compiled" down from 1.25 ms to 825 µs. Therefore, take these results with a grain of salt and benchmark your own code in your own environment.