Practicalities and trade-offs in programming with GADTs
A look at issuu's examples of OCaml GADTs in the wild.
At issuu, we use OCaml to process reader-behavior events from web browsers and mobile apps, digesting these into high-level reports for business analysts. We process around 5,000 events per second on average. On a daily basis, this amounts to roughly 200 GB of uncompressed JSON data.
We have built a domain-specific language (DSL) that allows our business analysts to write concise descriptions of how to aggregate and group the events to produce reports that are stored in an SQL database. Examples of report specifications are:
visitors by device
session_length by country where new_visitor
Each report aggregates a metric (visitors, session length and so on), grouping or filtering by any number of dimensions (device, country and so on).
The DSL has evolved from simple specifications like those above to also contain arithmetic expressions, boolean predicates and subreports. We have built a full-scale system in OCaml, consisting of roughly 10,000 lines of code, which is running in production and used daily by our business analysts.
This experience report describes how we have used generalized algebraic data types (GADTs) to represent the heterogeneous data involved in these reports, gaining both safety and performance.
Naïve implementation
We model the report engine as a function from a report specification
and a list of visitors to a list of report rows. Assume a suitable
definition of type visitor and assume the following vastly simplified
definitions of metrics and dimensions:
type metric = Visitors | Sessions
type dim = Country | Logged_in
The report specified as sessions by country by logged_in in our DSL
is represented as a metric * dim list as follows:
let spec = (Sessions, [Country; Logged_in])
With this specification and some sample data, ideally we would like the report engine to return a report like the following, saying that there were two logged-in sessions from Denmark and one not-logged-in session from Denmark.
[ ([ "DK"; true ], 2); ([ "DK"; false ], 1); ]
Of course, as ML languages do not allow heterogeneous lists, such a
value can never exist. We will have to wrap the list elements in an
algebraic data type; let’s call it dim_value:
type dim_value = String of string | Bool of bool
val make_report: metric * dim list -> visitor list ->
  (dim_value list * int) list
Invoking the make_report function with our sample specification and
data would yield the following rows:
[([String "DK"; Bool true], 2); ([String "DK"; Bool false], 1)]
This leaves us with no more type safety than an untyped programming
language would provide. Code to extract information from such a
report becomes awkward, as it must handle the error cases of
unexpected dim_value constructors or list lengths. Performance also
suffers, at least in OCaml, because each value is wrapped in a
constructor. Lacking type-system guarantees, we would have to
unit-test this by adding cases for every value of dim, of which there
are currently 24 in our production system.
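To make the awkwardness concrete, here is a minimal sketch of what the naïve engine and one of its consumers could look like. The visitor record, its field names, and the helpers value_of_dim, metric_value and show_row are assumptions made for illustration only; the article does not show the production definitions.

```ocaml
(* Hypothetical visitor record; the article leaves this type unspecified. *)
type visitor = { country : string; logged_in : bool; sessions : int }

type metric = Visitors | Sessions
type dim = Country | Logged_in
type dim_value = String of string | Bool of bool

(* Extract one dimension of a visitor, wrapped in a dim_value constructor. *)
let value_of_dim (v : visitor) : dim -> dim_value = function
  | Country -> String v.country
  | Logged_in -> Bool v.logged_in

(* Contribution of one visitor to the chosen metric. *)
let metric_value (v : visitor) : metric -> int = function
  | Visitors -> 1
  | Sessions -> v.sessions

(* Group visitors by their dimension values and sum the metric per group.
   The order of the returned rows is unspecified. *)
let make_report ((metric, dims) : metric * dim list) (visitors : visitor list)
    : (dim_value list * int) list =
  let table = Hashtbl.create 16 in
  List.iter
    (fun v ->
      let key = List.map (value_of_dim v) dims in
      let old = Option.value (Hashtbl.find_opt table key) ~default:0 in
      Hashtbl.replace table key (old + metric_value v metric))
    visitors;
  Hashtbl.fold (fun key total rows -> (key, total) :: rows) table []

(* A consumer has to handle row shapes that the types cannot rule out. *)
let show_row : dim_value list * int -> string = function
  | [ String country; Bool logged_in ], n ->
      Printf.sprintf "%s %b %d" country logged_in n
  | _ -> failwith "unexpected row shape"
```

The catch-all case in show_row is exactly the kind of run-time error path that the typed encoding in the next section removes.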
OCaml GADT implementation
We can address all of the problems mentioned above by encoding dim as
a GADT, a type-system feature that is typically confined to purely
academic programming languages but has been available in OCaml since
version 4.00.
In this encoding, dimensions are parametrized by types 'a that
replace the constructors of dim_value. The run-time representation of
a dim remains the same, but we are now tracking its value type at
compile time. We can no longer use the standard list type to
represent lists of dimensions, since each dimension in the list can
have a different type parameter. Therefore we introduce a new type
dims to represent a dimension list. The type parameters are collected
in a right-leaning tuple.
type 'a dim = Country: string dim | Logged_in: bool dim
type 'p dims = By: 'a dim * 'p dims -> ('a * 'p) dims
             | End: unit dims
For example, the dimensions by country by logged_in would be
represented as By (Country, By (Logged_in, End)) of type:
(string * (bool * unit)) dims
The make_report function now has a strong type, where each row in the
returned report has the same shape, and that shape is dictated by the
choice of dimensions:
val make_report: metric * 'p dims -> visitor list ->
  ('p * int) list
Invoking the revised make_report function with our sample
specification and data would yield the following rows:
[ (("DK", (true, ())), 2); (("DK", (false, ())), 1); ]
Since OCaml’s run-time representation of a right-leaning tuple is the
same as that of a list (each pair, like each cons cell, is a
two-field block, and () and [] are both the immediate value 0), our
report rows are represented like those in the naïve implementation
but without the indirection through constructors of dim_value.
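As a rough illustration of how a function with the strongly typed make_report signature might be written, consider the sketch below. It is not the production implementation: the visitor record and the helpers dim_value, key_of and metric_value are hypothetical names introduced here, and grouping through a Hashtbl is just one simple choice.

```ocaml
(* Hypothetical visitor record; the article leaves this type unspecified. *)
type visitor = { country : string; logged_in : bool; sessions : int }
type metric = Visitors | Sessions

type 'a dim = Country : string dim | Logged_in : bool dim
type 'p dims =
  | By : 'a dim * 'p dims -> ('a * 'p) dims
  | End : unit dims

(* Interpret a dimension as an accessor returning an unwrapped value. *)
let dim_value : type a. a dim -> visitor -> a = function
  | Country -> fun v -> v.country
  | Logged_in -> fun v -> v.logged_in

(* Build the right-leaning tuple key for a visitor. *)
let rec key_of : type p. p dims -> visitor -> p = function
  | By (dim, dims) ->
      let head = dim_value dim in
      let tail = key_of dims in
      fun v -> (head v, tail v)
  | End -> fun _ -> ()

let metric_value : metric -> visitor -> int = function
  | Visitors -> fun _ -> 1
  | Sessions -> fun v -> v.sessions

(* Group by key and sum the metric; row order is unspecified. *)
let make_report : type p. metric * p dims -> visitor list -> (p * int) list =
 fun (metric, dims) visitors ->
  let key = key_of dims in
  let value = metric_value metric in
  let table : (p, int) Hashtbl.t = Hashtbl.create 16 in
  List.iter
    (fun v ->
      let k = key v in
      let old = Option.value (Hashtbl.find_opt table k) ~default:0 in
      Hashtbl.replace table k (old + value v))
    visitors;
  Hashtbl.fold (fun k total rows -> (k, total) :: rows) table []
```

With this sketch, a caller passing By (Country, By (Logged_in, End)) gets rows of type (string * (bool * unit)) * int and can destructure them with an ordinary tuple pattern, with no error case to handle.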
As an example of how to traverse the result, consider the following pretty-printer:
let show_dim: type a. a dim -> a -> string = function
  | Country -> Fun.id  (* standard-library identity, OCaml 4.08+ *)
  | Logged_in -> string_of_bool
let rec show_dims: type p. p dims -> p -> string = function
  | By (dim, dims) ->
    let show_x = show_dim dim in
    let show_xs = show_dims dims in
    fun (x, xs) -> show_x x ^ " " ^ show_xs xs
  | End -> fun () -> ""
The show_dims function shows the essence of the patterns in use here:
our traversal of the p-typed result is directed by inspecting the
p dims, where in each case we learn about the shape of p. Notice that
the type parameter p changes with each recursive call to show_dims;
this is known as polymorphic recursion and requires explicit type
annotations in OCaml because it is outside the subset of the language
for which type inference is complete. Notice also that the functions
can be applied in two stages: all pattern matching takes place after
the first argument is received, and millions of result rows can then
be printed without the overhead of doing the same pattern matches
again for every row.
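The two-stage application can be observed with a small experiment. The sketch below repeats the pretty-printer with a counter, matches, added purely for illustration; the counter and the names show_row and lines are assumptions of this example, not part of the original code.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim
type 'p dims =
  | By : 'a dim * 'p dims -> ('a * 'p) dims
  | End : unit dims

let show_dim : type a. a dim -> a -> string = function
  | Country -> Fun.id
  | Logged_in -> string_of_bool

(* Counts how many times show_dims inspects a dims constructor. *)
let matches = ref 0

let rec show_dims : type p. p dims -> p -> string =
 fun dims ->
  incr matches;
  match dims with
  | By (dim, rest) ->
      let show_x = show_dim dim in
      let show_xs = show_dims rest in
      fun (x, xs) -> show_x x ^ " " ^ show_xs xs
  | End -> fun () -> ""

(* Stage 1: all pattern matching on the specification happens here. *)
let show_row = show_dims (By (Country, By (Logged_in, End)))

(* Stage 2: applying the resulting closure performs no further matches. *)
let lines = List.map show_row [ ("DK", (true, ())); ("DK", (false, ())) ]
```

After both stages, matches holds 3 (two By nodes and one End), regardless of how many rows were printed.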
Existential types
Using type-parametrized dimension values solves the problems we
encountered in the naïve implementation, but it introduces a new
problem: how do we parse a report specification from text form into a
'p dims value? How do we convert ["country"; "logged_in"] to
By (Country, By (Logged_in, End))?
A first attempt might look like this:
val parse_dims: string list -> 'p dims
Unfortunately, the function cannot have that signature because it
puts the choice of 'p under control of the caller, but the caller is
parsing an unknown string list and does not know 'p in advance.
The solution is to let parse_dims return a value of existential type.
We encode this in OCaml by defining wrapper GADTs edim and edims with
no type parameters. Here we just show edims and parse_dims:
type edims = Edims: 'p dims -> edims
let rec parse_dims: string list -> edims = function
  | [] -> Edims End
  | d :: ds ->
    let Edim dim = parse_dim d in
    let Edims dims = parse_dims ds in
    Edims (By (dim, dims))
The caller of parse_dims must unwrap the returned value from its
Edims constructor just like in the recursive call in parse_dims
itself. This unwrapping introduces a fresh type 'p that must not
escape the scope of the unpacking let expression.
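For completeness, here is a self-contained sketch that fills in the parts not shown above. The definitions of edim and parse_dim, the error handling with invalid_arg, and the helpers dim_name, dim_names and round_trip are assumptions made for this example; only edims and parse_dims follow the code in the text.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim
type 'p dims =
  | By : 'a dim * 'p dims -> ('a * 'p) dims
  | End : unit dims

(* Existential wrappers: the type parameter is hidden inside the constructor. *)
type edim = Edim : 'a dim -> edim
type edims = Edims : 'p dims -> edims

(* Assumed parser for a single dimension name. *)
let parse_dim : string -> edim = function
  | "country" -> Edim Country
  | "logged_in" -> Edim Logged_in
  | name -> invalid_arg ("unknown dimension: " ^ name)

let rec parse_dims : string list -> edims = function
  | [] -> Edims End
  | d :: ds ->
      let (Edim dim) = parse_dim d in
      let (Edims dims) = parse_dims ds in
      Edims (By (dim, dims))

let dim_name : type a. a dim -> string = function
  | Country -> "country"
  | Logged_in -> "logged_in"

(* Polymorphic in p, so it accepts whatever shape the parser produced. *)
let rec dim_names : type p. p dims -> string list = function
  | By (dim, rest) -> dim_name dim :: dim_names rest
  | End -> []

(* The fresh type introduced by unpacking stays inside this function:
   only a string list, which does not mention it, is returned. *)
let round_trip (names : string list) : string list =
  let (Edims dims) = parse_dims names in
  dim_names dims
```

Any consumer of the unpacked dims must, like dim_names here, be polymorphic in the hidden type and return something that does not mention it.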
If we wrote specifications directly in OCaml source code instead of parsing them from text, we would not need existential types in this case. The language would be an Embedded DSL and would have the disadvantage that adding new specifications requires recompilation.
Discussion
A semantic alternative
An alternative to the syntactic representations of dimensions
discussed so far would be to represent 'a dim semantically as the set
of operations that can be performed on it:
type 'a dimops = { show: 'a -> string; ... }
let country: string dimops = { show = Fun.id; ... }
let logged_in: bool dimops = { show = string_of_bool; ... }
This representation would be natural in an object-oriented language,
where we would say that show is a method on the dimops class. Our
GADT code could be rewritten into this style by replacing each
pattern match on a dim value with a call to a new operation on
dimops. There are currently eight such pattern matches in our
codebase. Unfortunately, the return types of those pattern matches
are often internal to the modules where they occur, so such a code
rewrite would break the encapsulation of those modules.
To get the best of both worlds, we use a hybrid approach where module
implementations match on an 'a dim to obtain a semantic value like
'a dimops but with only the operations that are relevant to this
module. In most cases this is just a single function, so there is no
need to wrap it in a record. Look again at the show_dim function and
notice what happens: it interprets an 'a dim into a semantic value of
type 'a -> string. In this hybrid approach, the pattern match on dim
is the natural place to catch certain errors and to run certain
preprocessing code; thus, we catch errors early and avoid repeating
expensive calculations for every data point.
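As a hedged illustration of the hybrid approach, the sketch below interprets a dimension into a module-local operation. The equals function, its two-letter country check and its error messages are hypothetical and do not come from the production system; they only show where validation and preprocessing can sit.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim

(* Hypothetical module-local operation: turn a dimension and a textual
   literal into a predicate on values of that dimension. The literal is
   validated and preprocessed once, when the dimension is matched. *)
let equals : type a. a dim -> string -> a -> bool =
 fun dim literal ->
  match dim with
  | Country ->
      let wanted = String.uppercase_ascii literal in
      if String.length wanted <> 2 then
        invalid_arg "country code must have two letters";
      fun value -> String.equal value wanted
  | Logged_in -> (
      match bool_of_string_opt literal with
      | Some wanted -> fun value -> Bool.equal value wanted
      | None -> invalid_arg "logged_in expects true or false")

(* The literal is checked and normalized here, once. *)
let is_dk = equals Country "dk"

(* Applying the predicate to data points repeats none of that work. *)
let hits = List.filter is_dk [ "DK"; "SE"; "DK" ]
```

A malformed literal is rejected when the predicate is built, before any data point has been processed.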
GADT disadvantages
You might ask why we need to check for errors after pattern matching
on a GADT; it should be possible to engineer the GADT so that illegal
values cannot occur at all. This is true, but in our case it would
require adding one or two more type parameters to dim and dims. Every
type parameter is visible in every module that uses those GADTs, and
they need to be explicitly named in many places because OCaml cannot
fully infer types for GADT pattern matching and relies on explicit
annotations. The values of the GADT as seen by every module would
also be cluttered with type-relation witness values that are only
used by one module.
Another disadvantage of GADTs is that they immediately set us on a slippery slope of needing increasingly advanced type-system features. Just to recover basic functionality like parsing and pretty-printing, we needed existential types and polymorphic recursion. In other places in our codebase, we need advanced design patterns such as type-equality witnesses and various flavors of heterogeneous lists.
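For readers unfamiliar with type-equality witnesses, the following generic sketch shows the idea on the simplified dim type from this article. It is a textbook-style illustration rather than code from our codebase; the names eq, Refl, same_dim and equal_values are chosen for this example.

```ocaml
type 'a dim = Country : string dim | Logged_in : bool dim

(* A value of type (a, b) eq can only be built when a and b are the same type. *)
type (_, _) eq = Refl : ('a, 'a) eq

(* Compare two dimensions; on success, return evidence that their
   value types coincide. *)
let same_dim : type a b. a dim -> b dim -> (a, b) eq option =
 fun x y ->
  match x, y with
  | Country, Country -> Some Refl
  | Logged_in, Logged_in -> Some Refl
  | _ -> None

(* Matching on Refl lets the type checker treat a and b as equal,
   so the two values can be compared directly. *)
let equal_values : type a b. a dim -> a -> b dim -> b -> bool =
 fun dx x dy y ->
  match same_dim dx dy with
  | Some Refl -> x = y
  | None -> false
```

Without the witness, the comparison x = y would be rejected, since the values have unrelated types a and b.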
Emulating GADTs
Very few industrial-strength languages have GADTs, but there is a
great deal of literature and folklore on emulating fragments of their
functionality. This employs features like rank-2 polymorphism and
higher-kinded type variables, both sometimes available through
first-class modules. Where these features are not available, much
GADT functionality can be emulated with phantom types and unsafe
typecasts such as OCaml’s Obj.magic.
While these tricks are possible, the increased amount of boilerplate
and decreased readability would most likely tip the scales so we
would stick with the naïve implementation instead. In our production
code, the GADTs for dim, dims and metric are all mutually recursive
to support subqueries. Had we used an emulation of GADTs, adding this
mutual recursion could have required a complete rewrite of all the
code that touches those types; with compiler-supported GADTs, it was
simply a matter of replacing some let-declarations with and.