At issuu, we use OCaml to process reader-behavior events from web browsers and mobile apps, digesting these into high-level reports for business analysts. We process around 5,000 events per second on average. On a daily basis, this amounts to roughly 200 GB of uncompressed JSON data.

We have built a domain-specific language (DSL) that allows our business analysts to write concise descriptions of how to aggregate and group the events to produce reports that are stored in an SQL database. Examples of report specifications are:

visitors by device

session_length by country where new_visitor

Each report aggregates a metric (visitors, session length and so on), grouping or filtering by any number of dimensions (device, country and so on).

The DSL has evolved from simple specifications like those above to also contain arithmetic expressions, boolean predicates and subreports. We have built a full-scale system in OCaml, consisting of roughly 10,000 lines of code, which is running in production and used daily by our business analysts.

This experience report describes how we have used Generalized Algebraic Data Types (GADTs) to represent the heterogeneous data involved in these reports, gaining both safety and performance.

Naïve implementation

We model the report engine as a function from a report specification and a list of visitors to a list of report rows. Assume a suitable definition of type visitor and assume the following vastly simplified definitions of metrics and dimensions:

type metric = Visitors | Sessions
type dim = Country | Logged_in

The report specified as sessions by country by logged_in in our DSL is represented as a metric * dim list as follows:

let spec = (Sessions, [Country; Logged_in])

With this specification and some sample data, ideally we would like the report engine to return a report like the following, saying that there were two logged-in sessions from Denmark and one not-logged-in session from Denmark.

[ ([ "DK"; true  ], 2); ([ "DK"; false ], 1); ]

Of course, as ML languages do not allow heterogeneous lists, such a value can never exist. We will have to wrap the list elements in an algebraic data type; let's call it dim_value:

type dim_value = String of string | Bool of bool
val make_report: metric * dim list -> visitor list ->
                 (dim_value list * int) list

Invoking the make_report function with our sample specification and data would yield the following rows:

[([String "DK"; Bool true], 2); ([String "DK"; Bool false], 1)]

This leaves us with no more type safety than an untyped programming language would provide. Code to extract information from such a report becomes awkward, as it must handle the error cases of unexpected dim_value constructors or list lengths. Performance also suffers, at least in OCaml, because each value is wrapped in a constructor. Lacking type-system guarantees, we would have to unit-test this by adding cases for every value of dim, of which there are currently 24 in our production system.
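To make the awkwardness concrete, here is a minimal sketch of consumer code for the naïve representation. The function name country_and_login is hypothetical and not taken from the production system; the point is only that the catch-all error case cannot be ruled out by the type checker.

```ocaml
(* Wrapped values, as in the naive implementation. *)
type dim_value = String of string | Bool of bool

(* Hypothetical consumer: decode one row of a "by country by logged_in"
   report. Nothing in the types guarantees the list has this shape, so
   the final case is needed and can only fail at run time. *)
let country_and_login (row : dim_value list * int) : string * bool * int =
  match row with
  | [String country; Bool logged_in], count -> (country, logged_in, count)
  | _ -> invalid_arg "country_and_login: unexpected row shape"

let rows = [([String "DK"; Bool true], 2); ([String "DK"; Bool false], 1)]

let decoded = List.map country_and_login rows
```

A row with the wrong constructors or the wrong length type-checks just as well as a correct one and is only rejected when the function runs.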

OCaml GADT implementation

We can address all of the problems mentioned above by encoding dim as a GADT, a type-system feature that is typically confined to purely academic programming languages but has been in OCaml since version 4.00.

In this encoding, dimensions are parametrized by types 'a that replace the constructors of dim_value. The run-time representation of a dim remains the same, but we are now tracking its value type at compile time. We can no longer use the standard list type to represent lists of dimensions, since each dimension in the list can have a different type parameter. Therefore we introduce a new type dims to represent a dimension list. The type parameters are collected in a right-leaning tuple.

type 'a dim = Country: string dim | Logged_in: bool dim

type 'p dims = By: 'a dim * 'p dims -> ('a * 'p) dims
             | End: unit dims

For example, the dimensions by country by logged_in would be represented as By (Country, By (Logged_in, End)) of type:

(string * (bool * unit)) dims

The make_report function now has a strong type, where each row in the returned report has the same shape, and that shape is dictated by the choice of dimensions:

val make_report: metric * 'p dims -> visitor list ->
                 ('p * int) list

Invoking the revised make_report function with our sample specification and data would yield the following rows:

[ (("DK", (true, ())), 2); (("DK", (false, ())), 1); ]

Since OCaml's run-time representation of a right-leaning tuple is the same as that of a list, our report rows are represented like those in the naïve implementation but without the indirection through constructors of dim_value.
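One way a make_report with this signature could be written is sketched below. This is an illustration under stated assumptions, not the production engine: the visitor record, the helpers get_dim, key_of and measure, and the use of a Hashtbl for grouping are all hypothetical. It assumes OCaml 4.08 or later for Option.value.

```ocaml
type metric = Visitors | Sessions
type 'a dim = Country : string dim | Logged_in : bool dim
type 'p dims = By : 'a dim * 'p dims -> ('a * 'p) dims | End : unit dims

(* Hypothetical visitor record; the real definition is not shown. *)
type visitor = { country : string; logged_in : bool; sessions : int }

(* Read one dimension off a visitor; the result type follows the dim. *)
let get_dim : type a. a dim -> visitor -> a = function
  | Country -> fun v -> v.country
  | Logged_in -> fun v -> v.logged_in

(* Build the right-leaning tuple that serves as the grouping key. *)
let rec key_of : type p. p dims -> visitor -> p = function
  | By (dim, dims) ->
      let get = get_dim dim in
      let rest = key_of dims in
      fun v -> (get v, rest v)
  | End -> fun _ -> ()

(* Assumed semantics: each visitor counts once, or by its session count. *)
let measure : metric -> visitor -> int = function
  | Visitors -> fun _ -> 1
  | Sessions -> fun v -> v.sessions

let make_report : type p. metric * p dims -> visitor list -> (p * int) list =
 fun (metric, dims) visitors ->
  let key = key_of dims in
  let value = measure metric in
  let table : (p, int) Hashtbl.t = Hashtbl.create 16 in
  List.iter
    (fun v ->
      let k = key v in
      let old = Option.value (Hashtbl.find_opt table k) ~default:0 in
      Hashtbl.replace table k (old + value v))
    visitors;
  (* Row order is unspecified because it follows the hash table. *)
  Hashtbl.fold (fun k n acc -> (k, n) :: acc) table []

let sample =
  [ { country = "DK"; logged_in = true; sessions = 2 };
    { country = "DK"; logged_in = false; sessions = 1 } ]

let report = make_report (Sessions, By (Country, By (Logged_in, End))) sample
```

Here report has type ((string * (bool * unit)) * int) list, so a consumer can destructure each row with an ordinary tuple pattern and needs no error case.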

As an example of how to traverse the result, consider the following pretty-printer:

let show_dim: type a. a dim -> a -> string = function
  | Country -> identity
  | Logged_in -> string_of_bool

let rec show_dims: type p. p dims -> p -> string = function
  | By (dim, dims) ->
      let show_x = show_dim dim in
      let show_xs = show_dims dims in
      fun (x, xs) -> show_x x ^ " " ^ show_xs xs
  | End -> fun () -> ""

The show_dims function shows the essence of the pattern in use here: our traversal of the p-typed result is directed by inspecting the p dims, and each case tells us more about the shape of p. Notice that the type parameter p changes with each recursive call to show_dims; this is known as polymorphic recursion and requires explicit type annotations in OCaml because it is outside the subset of the language for which type inference is complete. Notice also that the functions can be applied in two stages: all pattern matching takes place after the first argument is received, and millions of result rows can then be printed without the overhead of repeating the same pattern matches for every row.
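The two-stage application can be sketched as follows. The block repeats the definitions so that it is self-contained; Fun.id (OCaml 4.08 or later) stands in for identity, and the names show_row, rows and lines are illustrative.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim
type 'p dims = By : 'a dim * 'p dims -> ('a * 'p) dims | End : unit dims

let show_dim : type a. a dim -> a -> string = function
  | Country -> Fun.id
  | Logged_in -> string_of_bool

let rec show_dims : type p. p dims -> p -> string = function
  | By (dim, dims) ->
      let show_x = show_dim dim in
      let show_xs = show_dims dims in
      fun (x, xs) -> show_x x ^ " " ^ show_xs xs
  | End -> fun () -> ""

(* Stage 1: all matching on the dims happens here, exactly once. *)
let show_row = show_dims (By (Country, By (Logged_in, End)))

let rows = [ (("DK", (true, ())), 2); (("DK", (false, ())), 1) ]

(* Stage 2: the resulting closure is reused for every row. *)
let lines = List.map (fun (key, n) -> show_row key ^ string_of_int n) rows
```

Because show_dims returns a closure after receiving only the dims, show_row can be built once outside the loop over rows.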

Existential types

Using type-parametrized dimension values solves the problems we encountered in the naïve implementation, but it introduces a new problem: How do we parse a report specification from text form into a 'p dims value? How do we convert ["country"; "logged_in"] to By (Country, By (Logged_in, End))? A first attempt might look like this:

val parse_dims: string list -> 'p dims

Unfortunately, the function cannot have that signature because it puts the choice of 'p under control of the caller, but the caller is parsing an unknown string list and does not know 'p in advance.

The solution is to let parse_dims return a value of existential type. We encode this in OCaml by defining wrapper GADTs edim and edims with no type parameters. Here we just show edims and parse_dims:

type edims = Edims: 'p dims -> edims

let rec parse_dims: string list -> edims = function
  | [] -> Edims End
  | d :: ds ->
      let Edim dim = parse_dim d in
      let Edims dims = parse_dims ds in
      Edims (By (dim, dims))

The caller of parse_dims must unwrap the returned value from its Edims constructor, just as the recursive call in parse_dims itself does. This unwrapping introduces a fresh type 'p that must not escape the scope of the unpacking let expression.
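For completeness, here is a sketch that fills in the parts not shown above and adds a caller. The body of parse_dim, its behavior on unknown names, and the helpers dim_name, names and round_trip are assumptions made for illustration.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim
type 'p dims = By : 'a dim * 'p dims -> ('a * 'p) dims | End : unit dims

type edim = Edim : 'a dim -> edim
type edims = Edims : 'p dims -> edims

(* Assumed behavior: unknown names are rejected with an exception. *)
let parse_dim : string -> edim = function
  | "country" -> Edim Country
  | "logged_in" -> Edim Logged_in
  | other -> invalid_arg ("unknown dimension: " ^ other)

let rec parse_dims : string list -> edims = function
  | [] -> Edims End
  | d :: ds ->
      let (Edim dim) = parse_dim d in
      let (Edims dims) = parse_dims ds in
      Edims (By (dim, dims))

let dim_name : type a. a dim -> string = function
  | Country -> "country"
  | Logged_in -> "logged_in"

(* Polymorphic in p, so it accepts whatever type the unpacking yields. *)
let rec names : type p. p dims -> string list = function
  | By (dim, rest) -> dim_name dim :: names rest
  | End -> []

(* The caller unpacks inside a let ... in expression and returns a
   result whose type does not mention the fresh type. *)
let round_trip (strs : string list) : string list =
  let (Edims dims) = parse_dims strs in
  names dims
```

The unpacking works in a local let ... in expression; binding Edims dims at the top level of a module would be rejected by the compiler, since the fresh type would then have no enclosing scope.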

If we wrote specifications directly in OCaml source code instead of parsing them from text, we would not need existential types in this case. The language would then be an embedded DSL, with the disadvantage that adding new specifications requires recompilation.

Discussion

A semantic alternative

An alternative to the syntactic representations of dimensions discussed so far would be to represent 'a dim semantically as the set of operations that can be performed on it:

type 'a dimops = { show: 'a -> string; ... }

let country: string dimops = { show = identity; ... }
let logged_in: bool dimops = { show = string_of_bool; ... }

This representation would be natural in an object-oriented language, where we would say that show is a method on the dimops class. Our GADT code could be rewritten into this style by replacing each pattern match on a dim value with a call to a new operation on dimops. There are currently eight such pattern matches in our codebase. Unfortunately, the return types of those pattern matches are often internal to the modules where they occur, so such a code rewrite would break the encapsulation of those modules.
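As a rough sketch of what the semantic style could look like end to end, the following models only the show operation and leaves out every other field of the record. The type dimops_list and the function show_all are hypothetical names; Fun.id (OCaml 4.08 or later) stands in for identity.

```ocaml
(* Only the show operation is modeled here. *)
type 'a dimops = { show : 'a -> string }

let country : string dimops = { show = Fun.id }
let logged_in : bool dimops = { show = string_of_bool }

(* The dimension list carries records of operations instead of tags. *)
type 'p dimops_list =
  | By : 'a dimops * 'p dimops_list -> ('a * 'p) dimops_list
  | End : unit dimops_list

(* No pattern match on individual dimensions: we call the operation. *)
let rec show_all : type p. p dimops_list -> p -> string = function
  | By (ops, rest) ->
      let show_xs = show_all rest in
      fun (x, xs) -> ops.show x ^ " " ^ show_xs xs
  | End -> fun () -> ""

let line = show_all (By (country, By (logged_in, End))) ("DK", (true, ()))
```

In this style, every new module-specific operation would have to become a field of dimops, which is the encapsulation cost described above.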

To get the best of both worlds, we use a hybrid approach where module implementations match on an 'a dim to obtain a semantic value like 'a dimops but with only the operations that are relevant to this module. In most cases this is just a single function, so there is no need to wrap it in a record. Look again at the show_dim function and notice what happens: It interprets an 'a dim into a semantic value of type 'a -> string. In this hybrid approach, the pattern match on dim is the natural place to catch certain errors and to run certain preprocessing code; thus, we catch errors early and avoid repeating expensive calculations for every data point.
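A sketch of the hybrid approach with preprocessing and early error checking might look like this. The operation compile_filter, its case-insensitive treatment of country codes, and its error handling are invented for illustration and are not taken from the production code.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim

(* Hypothetical operation: compile an equality filter from a dimension
   and the text of a literal. Parsing and validation of the literal
   happen once, when the dim is matched, not once per data point. *)
let compile_filter : type a. a dim -> string -> (a -> bool) =
 fun dim literal ->
  match dim with
  | Country ->
      (* Preprocessing done once: normalize the wanted country code. *)
      let wanted = String.uppercase_ascii literal in
      fun c -> String.equal c wanted
  | Logged_in -> (
      (* A malformed literal is reported here, before any row is seen. *)
      match bool_of_string_opt literal with
      | Some wanted -> fun b -> b = wanted
      | None -> invalid_arg ("not a boolean: " ^ literal))

(* The compiled closure is then applied to every data point. *)
let is_dk = compile_filter Country "dk"

let danish = List.filter is_dk [ "DK"; "SE"; "DK" ]
```

As with show_dim, the match interprets an 'a dim into a plain function, here of type 'a -> bool, so no record of operations is needed.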

GADT disadvantages

You might ask why we need to check for errors after pattern matching on a GADT; it should be possible to engineer the GADT so that illegal values cannot occur at all. This is true, but in our case it would require adding one or two more type parameters to dim and dims. Every type parameter is visible in every module that uses those GADTs, and they need to be explicitly named in many places because OCaml requires explicit type annotations for pattern matches on GADTs. The values of the GADT as seen by every module would also be cluttered with type-relation witness values that are only used by one module.

Another disadvantage of GADTs is that they immediately set us on a slippery slope of needing increasingly advanced type-system features. Just to recover basic functionality like parsing and pretty-printing, we needed existential types and polymorphic recursion. In other places in our codebase, we need advanced design patterns such as type-equality witnesses and various flavors of heterogeneous lists.
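For readers unfamiliar with the term, a type-equality witness can be sketched as follows. The functions dim_eq and same_value are hypothetical examples of the pattern and are not taken from the codebase.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim

(* A value of type (a, b) eq is evidence that a and b are the same type. *)
type (_, _) eq = Refl : ('a, 'a) eq

(* Compare two dimensions; on success, also return the type equality. *)
let dim_eq : type a b. a dim -> b dim -> (a, b) eq option =
 fun d1 d2 ->
  match (d1, d2) with
  | Country, Country -> Some Refl
  | Logged_in, Logged_in -> Some Refl
  | _ -> None

(* Compare two values that may belong to different dimensions. *)
let same_value : type a b. a dim -> a -> b dim -> b -> bool =
 fun d1 x d2 y ->
  match dim_eq d1 d2 with
  | Some Refl -> x = y (* matching on Refl tells the checker that a = b *)
  | None -> false
```

Without the witness, the comparison x = y would not type-check, because a and b are unrelated as far as the signature of same_value is concerned.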

Emulating GADTs

Very few industrial-strength languages have GADTs, but there is a great deal of literature and folklore on emulating fragments of their functionality. This employs features like rank-2 polymorphism and higher-kinded type variables, both sometimes available through first-class modules. Where these features are not available, much GADT functionality can be emulated with phantom types and unsafe typecasts such as OCaml's Obj.magic.

While these tricks are possible, the increased amount of boilerplate and decreased readability would most likely tip the scales so we would stick with the naïve implementation instead. In our production code, the GADTs for dim, dims and metric are all mutually recursive to support subqueries. Had we used an emulation of GADTs, adding this mutual recursion could have required a complete rewrite of all the code that touches those types; with compiler-supported GADTs, it was simply a matter of replacing some let-declarations with and.