Benchmarking interpreters written in OCaml

An interpreter written in OCaml can be about one-third faster if the evaluation function is partially applied so it runs in two stages – first pattern-match on the expression constructors and then perform the computation.

Using the functional programming language OCaml, we’ve recently implemented a domain-specific language for specifying analyses of website traffic. This involved the classic computer-science task of parsing an abstract syntax tree (AST) from a text file and evaluating that AST in varying environments. The AST will be parsed once and evaluated in millions of different environments.

Performance is good, but we wanted to understand if we could easily make it better. To investigate this without being distracted by the details of our particular domain-specific language, we boiled it down to the problem of interpreting the following toy language of expressions with one free variable.

type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

The VarX constructor corresponds to the one free variable x. For example, the expression 1 + x is represented as Binop (Add, Const 1, VarX). A more realistic language would of course support multiple variables.

The standard way to interpret this language is through a function eval: expr -> int -> int, where eval e x evaluates the expression e in the environment x, meaning that the one free variable has value x. Since we’ll be interpreting random expressions in random environments, we define a toy division operator that doesn’t raise exceptions when dividing by zero.

let (/?) = fun x y -> if y = 0 then x else x / y

let eval_binop = function
  | Add -> (+)
  | Sub -> (-)
  | Mul -> ( * )
  | Div -> (/?)

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) ->
      (eval_binop binop) (eval e1 x) (eval e2 x)
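
As a quick check of these definitions, the sketch below repeats them so it runs on its own and evaluates two small expressions. The name one_plus_x and the second expression x_div are illustrative additions, not part of the original benchmark.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

(* Total division: returns the dividend when the divisor is zero. *)
let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

(* 1 + x, the example from the text *)
let one_plus_x = Binop (Add, Const 1, VarX)

(* Illustrative only: x /? (x - 3), whose divisor is zero at x = 3 *)
let x_div = Binop (Div, VarX, Binop (Sub, VarX, Const 3))

let () =
  Printf.printf "%d\n" (eval one_plus_x 41); (* prints 42 *)
  Printf.printf "%d\n" (eval x_div 3);       (* prints 3: zero divisor, so x is returned *)
  Printf.printf "%d\n" (eval x_div 7)        (* prints 1: 7 / 4 in integer division *)
```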

No computation happens in the partial application let f = eval e, so if we invoke f x and f y then both invocations result in the AST being pattern-matched. Alternatively, we can define the function partial with the same type signature as eval but different performance characteristics – pattern-matching on e happens at the partial application let f = partial e, and subsequent invocations of f x and f y do not inspect the AST.

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
      let f_binop = eval_binop binop in
      let f1 = partial e1 in
      let f2 = partial e2 in
      fun x -> f_binop (f1 x) (f2 x)
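
The intended calling pattern can be sketched as follows: apply partial once outside the loop, then call the resulting closure for each environment. The definitions are repeated so the block is self-contained; the test expression is the 1 + x example from earlier, and the loop bounds are an arbitrary choice.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
      let f_binop = eval_binop binop in
      let f1 = partial e1 in
      let f2 = partial e2 in
      fun x -> f_binop (f1 x) (f2 x)

let () =
  let e = Binop (Add, Const 1, VarX) in
  (* Stage 1: the AST is pattern-matched here, once. *)
  let f = partial e in
  (* Stage 2: each call only runs the closures built in stage 1. *)
  for x = 0 to 9_999 do
    assert (f x = eval e x)
  done;
  print_endline "partial agrees with eval"
```

Writing partial e x inside the loop would still type-check and give the same answers, but it would rebuild the closures on every iteration.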

To assess the performance of each approach, we created a benchmark that evaluates the following randomly generated expression for every integer value of x from 0 to 9,999.

(((2040 /? (248 - ((990 /? x) * ((x /? x) - (x + x))))) + (((x + (x - (((x + 262) + (910 /? x)) * x))) + ((870 + (x + 226000)) + ((x /? (325 + x)) /? ((x 437) - 988)))) * 23)) /? ((-(((710 - (x /? (-34176))) /? (247 + x)) - ((-x) * (-((-(1156 - (((x - x) * 33) /? 651))) + x))))) * (((-(x - x)) * ((-((x /? (-((x + (x /? 23)) * 557))) /? (643 - (((-x) + 109) /? 588)))) /? (((150 * (80 /? (x + x))) * 476) * (x * 334)))) /? 262)))
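
The notation above is fully parenthesized, with negative constants and negations wrapped in their own parentheses. A minimal printer that produces this notation might look like the sketch below. It is an assumption about the format, not necessarily the printer used for the benchmark, and the names string_of_binop and to_string are illustrative.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let string_of_binop = function
  | Add -> "+"
  | Sub -> "-"
  | Mul -> "*"
  | Div -> "/?"

(* Every binary operation and negation gets its own parentheses, so the
   output does not rely on operator precedence. *)
let rec to_string = function
  | Const n when n < 0 -> Printf.sprintf "(%d)" n
  | Const n -> string_of_int n
  | VarX -> "x"
  | Neg e -> Printf.sprintf "(-%s)" (to_string e)
  | Binop (op, e1, e2) ->
      Printf.sprintf "(%s %s %s)"
        (to_string e1) (string_of_binop op) (to_string e2)

let () =
  print_endline (to_string (Binop (Add, Const 1, VarX)));
  (* prints (1 + x) *)
  print_endline (to_string (Binop (Div, VarX, Const (-34176))));
  (* prints (x /? (-34176)) *)
  print_endline (to_string (Neg (Binop (Sub, VarX, VarX))))
  (* prints (-(x - x)) *)
```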

We benchmarked the eval and partial approaches and two other variations:

Fully applying partial for every environment, meaning that we call it as partial e x in the inner loop. This is called “partial, fully applied” in the table below, and it lets us understand the overhead of partial when there are only a few values of x.
Pretty-printing the expression to OCaml syntax, compiling it and running the optimized assembly executable. This is called “compiled” in the results table below, and it lets us understand the interpreter overhead compared with optimized assembly.
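
A rough sketch of how the three interpreted variants could be timed is shown below. It is not the benchmark harness from the repository: the short expression e, the helper names run and time_ms, the repetition count, and the use of Sys.time (processor time from the standard library) are all assumptions made for illustration. The "compiled" variant is left out because it requires generating and compiling a separate source file.

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

let rec partial = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial e in fun x -> - f x
  | Binop (binop, e1, e2) ->
      let f_binop = eval_binop binop in
      let f1 = partial e1 in
      let f2 = partial e2 in
      fun x -> f_binop (f1 x) (f2 x)

(* Hypothetical stand-in for the benchmark expression:
   (2040 /? (248 - x)) + ((-x) * 23) *)
let e =
  Binop (Add,
         Binop (Div, Const 2040, Binop (Sub, Const 248, VarX)),
         Binop (Mul, Neg VarX, Const 23))

(* Apply f to x = 0 .. 9_999 and sum the results so the work is observable. *)
let run f =
  let sum = ref 0 in
  for x = 0 to 9_999 do
    sum := !sum + f x
  done;
  !sum

(* Run the 10,000-environment loop [reps] times and print the mean time of
   one loop in milliseconds. Returns the checksum of the last loop. *)
let time_ms label reps make_f =
  let t0 = Sys.time () in
  let checksum = ref 0 in
  for _i = 1 to reps do
    checksum := run (make_f ())
  done;
  let ms = (Sys.time () -. t0) *. 1000. /. float_of_int reps in
  Printf.printf "%-24s %8.3f ms\n" label ms;
  !checksum

let () =
  let reps = 200 in
  let c_eval = time_ms "eval" reps (fun () -> eval e) in
  (* Stage 1 runs once per loop, outside the 10,000 calls. *)
  let c_partial = time_ms "partial" reps (fun () -> partial e) in
  (* Stage 1 runs again on every one of the 10,000 calls. *)
  let c_full =
    time_ms "partial, fully applied" reps (fun () -> fun x -> partial e x) in
  assert (c_eval = c_partial && c_partial = c_full)
```

The checksum comparison at the end only confirms that the three variants compute the same values; the timings themselves will vary by machine and compiler settings.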

We timed the executables produced by both the native-code OCaml compiler (ocamlopt) and the bytecode OCaml compiler (ocamlc). The results are as follows:

Approach	Time (native code)	Time (bytecode)
compiled	1.25 ms	5.00 ms
partial	4.33 ms	20.77 ms
eval	5.93 ms	45.29 ms
partial, fully applied	10.45 ms	62.52 ms

From these timings, we conclude the following:

Using partial is around 37% faster than using eval in native code (5.93 / 4.33 = 1.37); put another way, partial takes about 27% less time.
An interpreter written in OCaml runs quite fast. It runs only 3.5x slower than natively-compiled OCaml code (4.33 / 1.25 = 3.5).
Our own interpreter, compiled to native code, is slightly faster than the same expression compiled to OCaml bytecode (4.33 < 5.00).
Pattern matching in OCaml bytecode has a much higher penalty than it does in native code (45.29 / 20.77 = 2.18, compared with 5.93 / 4.33 = 1.37).
If each expression should be evaluated in only one or two environments, it’s faster to use eval than partial (5.93 < 10.45).

Apart from performance, there are several differences between eval and partial worth discussing:

The partial function is harder to read and maintain than eval; the Binop case is four lines instead of one. In the implementation, it’s easy to forget that all pattern matching should happen outside the fun x -> ... closure. If you forget, the compiler won’t warn you. In the caller, it’s easy to forget that the function should be applied in two stages.
The first stage of partial is a natural place to perform sanity checks on the AST and raise exceptions as appropriate. In our case, this allows us to catch many errors on the developer’s laptop instead of only seeing them in the production system when the function is fully applied with realistic data.
The partial function can perform optimizations that are not expressible as transformations on the AST. In this toy example, it could try to statically determine that the divisor is not zero, then use a division function without if y = 0 in that case.
Sometimes code can be easier to read and maintain when it is written naively and without worrying about performance. The partial approach allows us to do that for all the code that runs in the first stage.
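
Some of the points above can be made concrete with a sketch. The three functions below are illustrations rather than code from the benchmark: partial_slow shows the mistake of matching inside the closure, partial_opt shows one way to specialize division when the divisor is a non-zero constant, and check is a hypothetical first-stage sanity check that rejects division by a literal zero (a policy choice, since the toy language itself defines that case).

```ocaml
type binop = Add | Sub | Mul | Div

type expr =
  | Const of int
  | VarX
  | Neg of expr
  | Binop of binop * expr * expr

let ( /? ) x y = if y = 0 then x else x / y

let eval_binop = function
  | Add -> ( + )
  | Sub -> ( - )
  | Mul -> ( * )
  | Div -> ( /? )

let rec eval e x =
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - eval e x
  | Binop (binop, e1, e2) -> (eval_binop binop) (eval e1 x) (eval e2 x)

(* The mistake: the match sits inside the closure, so the AST is inspected on
   every call. This compiles without warnings and returns correct results. *)
let rec partial_slow e = fun x ->
  match e with
  | Const n -> n
  | VarX -> x
  | Neg e -> - partial_slow e x
  | Binop (binop, e1, e2) ->
      (eval_binop binop) (partial_slow e1 x) (partial_slow e2 x)

(* Staged evaluator with one extra case: a constant non-zero divisor. *)
let rec partial_opt = function
  | Const n -> fun _ -> n
  | VarX -> fun x -> x
  | Neg e -> let f = partial_opt e in fun x -> - f x
  | Binop (Div, e1, Const n) when n <> 0 ->
      (* The divisor is statically non-zero, so the y = 0 test is skipped. *)
      let f1 = partial_opt e1 in
      fun x -> f1 x / n
  | Binop (binop, e1, e2) ->
      let f_binop = eval_binop binop in
      let f1 = partial_opt e1 in
      let f2 = partial_opt e2 in
      fun x -> f_binop (f1 x) (f2 x)

(* Hypothetical sanity check, run once before any environment is seen. *)
let rec check = function
  | Const _ | VarX -> ()
  | Neg e -> check e
  | Binop (Div, _, Const 0) -> invalid_arg "division by literal zero"
  | Binop (_, e1, e2) -> check e1; check e2

let partial_checked e = check e; partial_opt e

let () =
  let e = Binop (Div, Binop (Add, VarX, Const 10), Const 3) in
  let f = partial_checked e in
  for x = -50 to 50 do
    assert (f x = eval e x);
    assert (partial_slow e x = eval e x)
  done
```

With partial_checked, an expression such as Binop (Div, VarX, Const 0) is rejected as soon as the first stage runs, before the closure is ever applied to data.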

The heap layout of the closure eval (Binop (Add, Const 1, VarX)) is sketched in the following illustration, where cl is the closure tag, and "eval" is a code pointer to a function that invokes eval.

   -----------------
  | cl | "eval" | * |
   ---------------|-
                  V
           ------------------------
          | Binop | Add | * | VarX |
           ---------------|--------
                          V
                   -----------
                  | Const | 1 |
                   -----------

Compare that with partial (Binop (Add, Const 1, VarX)), sketched below, where quoted words are again code pointers. There are more heap objects to traverse, but the traversal is only a matter of calling through code pointers.

   --------------------------------------
  | cl | "binop" | *     |    *   |   *  |
   ----------------|----------|-------|--
                   V          |       V
               ------------   |   -----------------
              | cl | "add" |  |  | cl | "identity" |
               ------------   V   -----------------
                          ------------------
                         | cl | "const" | 1 |
                          ------------------

The complete benchmark code is available on GitHub.

With an execution time of 1.25 ms per 10,000 iterations for the “compiled” variant, each iteration takes only 125 nanoseconds, which is short enough to be sensitive to tiny changes in compilation strategy and calling convention. In particular, compiling with -inline values greater than 27 brings the execution time of “compiled” down from 1.25 ms to 825 µs. Therefore, take these results with a grain of salt and benchmark your own code in your own environment.
