Awesome

OCaml Towards Clarity and Grace

This is a guide towards writing better OCaml code. We all like to write OCaml code that is correct, maintainable, efficient, and maybe even beautiful. The sad truth is that this cannot be achieved simply by following some rules. Just like it is not enough for a book to be composed of correct sentences and coherent paragraphs. However, a book violating these principles stands little chance of finding many readers. Likewise, this guide tries to help with the small structures in programming OCaml upon which we can hope to build bigger ones.

I am happy to accept more material and to include critical discussion of the recommendations already in place. Uncovered Topics is a place to record desirable topics.

<details> <summary>Table of Contents</summary>

OCaml Towards Clarity and Grace

Created by gh-md-toc

</details>

Other Resources

Official OCaml Guidelines - very detailed
Jane Street
Xen Project
Upenn's style guide for student projects
OCaml Manual
Real World OCaml Book
Hammer Lab

Below is a list of projects that I like for their clean code that could serve as a general inspiration.

Guiding Principles

This probably needs some discussion

Code is more often read than written - make the life of the reader easy
Don't fight the language, play to its strength
Less code is better, cryptic code is worse
Think ahead but avoid over-engineering
Strong cohesion, loose coupling
Make correctness as obvious as possible

Uncovered Topics

Functors
Threads
Locking for Concurrency
Type Identity Across Modules

Indentation, Line Length

Indentation must always reflect the logic of a program.
In existing code, adopt the existing style for indentation: tabs or spaces.
Line length should not exceed 100 characters. Break up long lines in new code in particular. The OCaml Guidelines demand lines not exceeding 80 characters.
For new code, prefer spaces over tabs and use ocp-indent to maintain consistent indentation. Note that ocp-indent does not break up long lines.
OCamlformat can be used to automatically re-format a project beyond just indentation. Some teams are moving towards using it to format all code to avoid any discussions about formating.
Most OCaml code bases use 2 spaces per indentation level.
When making changes ensure that indentation is still correct after your change, re-indenting as necessary, but not excessively.

Rationale:

Overly long lines create problems when printing source code and when resolving merge conflicts in tools by forcing a lot of sideways scrolling. Lines of 80 characters and shorter support viewing code side by side.

General typography recommends the line length not to exceed 2.5 to 3 alphabets, i.e. 65 to 78 characters, for supporting the eye moving from the end of a line to the beginning of the next (cf. The Elements of Typographic Style).
Spaces are unambiguous in meaning whereas tabs depend on the environment.
ocp-indent is a proven way to re-establish consistent indentation while still leaving room for other aspects of code formatting.
When adding an if expression it might be necessary to re-indent the whole body of a function, but avoid re-indenting the whole file, which causes merge conflicts with people working on other branches, it is best to plan ahead and do such large scale changes as separate commits.

Comments

Comments generally go before the code they are referencing. The possible exception are declarations in interfaces (mli files, signatures) and types where they can go after the declaration.

Syntactically there are two kinds of comments:

General comments, enclosed in (* and *)
Special comments, enclosed in (** and *)

Special comments are associated with type and values in a program and treated specially by the compiler. They become available in automatically generated documentation. For the association to work, there must be no empty line between a special comment and the element they are associated with. See the section about ocamldoc for details.

Code should always be as clear as possible but that clarity cannot always convey the reason behind a design. Comments have the role to provide it: the why.

What to Comment

The purpose of a module or functor
The purpose of a value or function in a signature (interface)
The purpose of a type declaration or its components, if not obvious
The purpose of record components and variants in types, if not obvious
Unusual Algorithms and their complexity
Invariants when not expressed as assertions
Error handling
Basic examples on how to use the library
Known limitations
Short introduction to the technology covered by the library, if not obvious

What not to Comment

Purpose of a local let binding - the name should tell it
Every line in a function

Examples

OCaml List The documentation of the standard list module. Each function is commented below its signature.
Container Hash Set
Mtime with implementation and rendered documentation
Uunf with implementation and rendered documentation
Introduction to the domain covered by the library, e.g. introduction to unicode in Uucp
Documentation stored together with code: squeezed design and diagrams

Caveats

Code duplication is bad but copying comments is worse. Be very careful when copying comments to make sure they are not becoming misleading in a new context.

Viewing rendered documentation

You can use odig odoc && odig doc to view the documentation of all installed packages. If your package uses jbuilder then you can also view the documentation of the package you are working on with jbuilder doc.

Naming and Declarations

OCaml has a few conventions for names. In particular, capitalisation is significant.

Types: lower_case.
Type variables: 'lower_case
Values and Functions: lower_case
Constructors: UpperCase or Upper_Case
Record Fields: lower_case
Modules: UpperCase
Signatures: UpperCase
Module Types: ALLCAPS

General considerations:

Local names can be short, type variables very short. In general, the length of a name should be proportional to the size of its scope.
Prefer short, but self-describing names in public interfaces.
Use scoping (let, struct) to keep the number of names in a scope small.
Avoid encoding the type into a name: x_int or x_opt is usually not better than x.
Avoid repeating the module name in names for types and values: QMP.connection is a better name for a type than QMP.qmp_connection.
In a functional language like OCaml, using get as part of a name is often redundant unless it involves obtaining a value from a database or file.
You may see auto-generated code use the style .mIX_case where .MIX_case was meant, but record fields cannot start with capital letters. Avoid this style.
You can use longer names for type variables if it improves clarity, e.g. for phantom types.

Order of declarations: in a module, typically the following order is maintained unless dependencies force a different order or mixing declarations:

Exceptions
Types
Modules
Values

Special cases:

A file abc.ml is mapped to module Abc even though the filename is not capitalised. The corresponding interface is abc.mli.
When a module defines a central (often abstract) data type, it is typically named t:

  module Tree: sig
    type 'a t
    val empty: 'a t
  end = struct
    type 'a t = Empty | Tree of 'a t * 'a * 'a t
    let empty = Empty
  end

This ties in with the recommendation above to not use tree for the type declaration in module Tree.

Don't use ;;

There is no place for the double semicolon (;;) in OCaml source code. It is used only interactively in the OCaml toplevel (REPL).

Scoping

Use the module system to group values. It's quite common to define simple values that belong together and to indicate this in their names:

let option_debug     = false
let option_verbosity = High
let option_log       = stdout

It is better to let the module system do the work:

module Option = struct
  let debug     = false
  let verbosity = High
  let log       = stdout
end

A value can now be accessed like in Option.debug. A module simply used for grouping doesn't require an interface.

Constants

Avoid using magic constants as literals like 86400 for the number of seconds in a day. Constants should be let-bound to a name:

  let ( ** ) x y    = Int64.mul (Int64.of_int x) y
  let sec           = 1L
  let sec_per_min   = 60 ** sec
  let sec_per_hour  = 60 ** sec_per_min
  let sec_per_day   = 24 ** sec_per_hour

For literals, picking the correct base can make them less magical: compare 255 (decimal) with 0xff (hexadecimal) or 0b1111_1111 (binary).
Structure long literal numbers with underscores for readability as in 0b1111_1111 above. It works in all bases, like for example in 1_000_000.

More details are in the OCaml Manual.

When defining strings that need to contain lots of escaped quotes or backslashes (such as regular expressions), prefer to use the {|...|} form instead of "..." (see 8.18 Quoted Strings): e.g. use {|["\]|} instead of "[\"\\]"

Introduce and Document Interfaces

Module interfaces are the best way to document and control an implementation - employ them widely.

A module encapsulates code serving a specific purpose. An interface should hide implementation details and document the API.

In OCaml, every file abc.ml is a module and should come with an interface abc.mli. In addition, a module can be part of a (file) module using struct .. end. This is the only way to define functors.

module ID : sig
  type t
  val make: unit -> t
  (** [make] creates a unique token *)
  val equal: t -> t -> bool
  (** [equal] is true, if and only if two tokens are the same *)
end = struct
  type t = unit ref
  let make () = ref ()
  let equal x y = x == y (* use pointer equality *)
end

Add list of examples here

Interfaces not only help document code, but can also prevent needless recompilation (at least when you are using bytecode). As an exception to this rule, if your module only defines signatures then prefer using a .ml file for this, otherwise either the .mli would just be a duplicate of the .ml file, or you'd have to use .mli-only modules which don't have good tooling support.

Polymorphic vs. Regular Variants

OCaml has two kinds of variants: regular and polymorphic variants:

type colour
  = Red   of float
  | Blue  of float
  | Green of float    (* regular      *)

let blue = Blue 1.0

let blue' = `blue 0.1 (* polymorphic, no declaration requited *)

Use regular variants by default. Polymorphic don't require a type declaration, which makes them flexible but also difficult to debug and they lead to complicated inferred types. They have their use case in specific use cases but they should not be used simply to avoid a type declaration.

A recent discussion of polymorphic and regular variants is Do I need polymorphic variants?. The subject is encoding a finite state machine where the author would like the type system to prohibit certain state transitions.

Avoid Opening Modules Globally

For convenience, many developers open modules globally in order to gain access to functions without having to use qualified access. Below, Printf, List, and Sys are opened in this way:

open Printf
open List
open Sys

let rec join = function
  | []      -> ""
  | [x]     -> x
  | [x;y]   -> x ^ " and " ^ y
  | x::xs   -> x ^ ", "    ^ join xs

let main () =
  let argv = Array.to_list argv in
  let args = tl argv in
  match args with
  | []        -> printf "Hello, world!\n"
  | names     -> printf "Hello, %s!\n" (join names)

let () = main ()

Do This Instead

Between opening a module globally and not at all, several options exist.

Always use fully qualified names (see the code below). This is the best solution if we need few values from a module and only sporadically.

  let main () =
    let argv = Array.to_list Sys.argv in
    let args = List.tl argv in
    match args with
    | []        -> Printf.printf "Hello, world!\n"
    | names     -> Printf.printf "Hello, %s!\n" (join names)

Introduce aliases to shorten long module and function names. This is almost always a good solution.

  module L = List

  let printf  = Printf.printf
  let sprintf = Printf.sprintf

Open a module locally with open

Opening a module locally limits the scope for the open module:

  let main () =
    let open Printf in
    let argv = Array.to_list Sys.argv in
    let args = List.tl argv in
    match args with
    | []        -> printf "Hello, world!\n"
    | names     -> printf "Hello, %s!\n" (join names)

Open a module locally with Module.(...):

  Int64.(add (of_int x) 1_000_000L)

The code above is another way of writing the code below. The module Int64 is open inside the parentheses.

  Int64.add (Int64.of_int x) 1_000_000L

This is especially effective to access constructors that are defined inside another module.

Define a sub-module that is meant to be opened (locally), to be used sparingly:

  module M = struct
     type +'a t
     module Infix = struct
       val (>>=) : 'a t -> ('a -> 'b t) -> 'b t
     end
     ...
  end
  ....
  let foo =
     let open M.Infix in
     f x
     >>= fun () ->
     ...

This avoid bringing in all the names from M itself in scope, it only brings in scope the very small number of operators defined by M.Infix

Rationale

Opening a module introduces all its values and constructors into the local module. Since the definitions of these values and constructors are not visible, it becomes very hard to understand the code without tool support. While a developer might argue that tool support is available, it is not during reviews on GitHub or when looking at a printout. The problem is exaggerated when several modules are opened.

Avoid using references

Most code does not require references. Introducing them should be well justified. In particular, using references to implement loops and similar local control flow is probably avoidable.

Equality: `!=` and `==` vs `<>` and `=`

Using != and == for equality is probably wrong and you should use <> and = instead.

OCaml has two kinds of equality:

Structural equality, tested with = and <>. This compares the shape of two values and is typically the correct choice.
Physical equality (pointer equality), tested with == and !=. This compares the address in memory of two values. This is typically used in performance-oriented code. Pointer equality implies structural equality but not vice versa.

if vs. match - use of semicolon

Be aware that statement sequences require begin/end in if expressions. OCaml has some inconsistencies how it handles statement sequences (containing ;). Compare:

match true with
  | true -> ()
  | false -> print_endline "1"; print_endline "2"
(* prints nothing *)

versus:

if false then print_endline "1"; print_endline "2"
(* prints 2 *)

A statement sequence inside an if or else branch must be grouped by begin/end or parentheses to be governed by the guard.

if false then begin
  print_endline "1";
  print_endline "2"
end else begin
  print_endline "3";
  print_endline "4"
end

The danger of not realising what code is governed by an if-expression is exasperated by incorrect indentation and by incremental changes to existing code.

Error Messages and Error Handling

For handling errors programmatically, appropriate error values should be defined to avoid matching against strings (The standard library is violating this principle.)
Log messages are typically created from strings. OCaml defines the following strings to report the location of an error: __LOC__, __FILE__, __LINE__, __MODULE__ and a few more -- see OCaml Pervasives
Constructing log and error messages can be made simpler by defining a function that behaves like Printf.sprintf:
Two basic strategies for error handling exist: using exceptions, or the Result.result type. The latter has the advantage of making error handling explicit in the types of functions whereas exceptions are not tracked by the type system. However, there is no strategy that is clearly superior. A recent discussion of this topic can be found at Specific reason for not embracing the use of exceptions for error propagation?.

type 'a t = Ok of 'a | Error of string
let error fmt = Printf.kprintf (fun msg -> Error msg) fmt

Messages going into a log still must make sense in the context of many other lines before and after in the log. Therefore they must contain enough detail.
Don't split log messages because in the actual log they be no longer next to each other.

Split imperative and functional code

Purely functional code is easiest to test. Therefore code should be as functional as possible and imperative code minimised.

In order of preference the interface of a module should expose:

immutable data structures and operations on them

Note that the implementation can use mutation if this makes the implementation of the algorithm more natural, as long as it doesn't "leak" the mutated variable by returning it or storing it outside local variables
idempotent API calls

If the nature of the API requires mutation (e.g. a database) make it idempotent. The reason is that network/RPC calls may get interrupted before getting an answer, and the caller may not know whether the call succeeded or not, so it can just retry. If you make the retry a no-op it simplifies the logic on both sides.

Functions - Argument Order

In a function definition, the most common arguments should come first.

While a function can work with any order of formal arguments, putting common arguments make it more usable. For example:

let sum     = List.fold_left (+) 0
let default x = function
  | None   -> x
  | Some y -> y

let sum' xs  = xs |> List.map (default 0) |> sum

Function sum: int list -> int computes the sum from a list of integers. It's definition is so concise because List.fold_left takes its list argument last. Likewise, default and List.map combine well because of the order of their arguments. Usually operations on data structures should always take the data structure last to allow for the above concise data-flow usage.

The best argument order is not always clear - so this should be just a general consideration when writing code.

Partial application: arguments more likely to be static usually appear before other arguments in order to facilitate partial application.
Most usual order: where an operation represents a well-known mathematical function on more than one data structure, the arguments are chosen to match the most usual argument order for the function.

Functions - Named Parameters

Parameters in a function call are usually identified by their position but may be identified by name. The name becomes part of the type.

(** [prefix ~pre s] returns [true] iff [pre] is a prefix of [s]. *)
let prefix ~pre s =
  let len = String.length pre in
  if len > String.length s then false
  else begin
    let rec check i =
      if i=len then true
      else if String.get s i <> String.get pre i then false
      else check (i+1)
    in
    check 0
  end

Consider using named parameters if a function takes several arguments of the same type that could be confused. Without a named parameter, the above function would have type string -> string -> bool and is it not obvious, which roles the two strings have. As it is, the function has the following type:

val prefix : pre:string -> string -> bool

This function may be called as prefix "foobar" ~pre:"foo" but it can't be called without a named parameter -- which adds to clarity.

Functions - Pattern Matching

Pattern matching is the preferred way to define functions. Patterns are easier to read and extend than if-then-else expressions. Consider matching multiple arguments at the same time to arrive at a short and un-nested function definition.

The example below defines a binary tree and a function sum that sums up the nodes along a path in the tree. It uses pattern matching effectively by matching over two values (the current node, and the current position in the path) at the same time.

type 'a tree
  = Empty
  | Node of 'a tree * 'a * 'a tree

type branch = Left | Right
type path   = branch list

let sum tree path =
  let rec loop acc path tree = match path, tree with
    | []          , _           -> acc
    | _::_        , Empty       -> failwith "path too long"
    | Left::path  , Node(l,n,r) -> loop (acc+n) path l
    | Right::path , Node(l,n,r) -> loop (acc+n) path r
  in
    loop 0 path tree

Functions - Data Flow

The readability of code can often be improved by using the pipe operator |> because it allows functions to be written such that data flows from top to bottom and left to right.

Consider the following function numbers and numbers' that compute a string from a list of optional numbers:

let xs = [Some 1; None; Some 2; None; None; Some 3; Some 4; Some 5] in

let (++) x xs = match x with
  | Some x -> x :: xs
  | None   -> xs

let numbers xs =
  String.concat "," (List.map string_of_int (List.fold_right (++) xs [])

let numbers' xs =
  List.fold_right (++) xs []
  |> List.map string_of_int
  |> String.concat ", "

The definition of numbers is more traditional and numbers' uses the pipe operator. In the latter the data from the argument xs flows through the function definition from top to bottom and left to right. In the more traditional definition, data flows from right to left.

Functions - Avoid Deep Nesting

Code that is deeply nested is hard to read and hard to test. This suggests that it should be restructured - probably by introducing new functions. Pattern matching is a another proven way to reduce the nesting of code.

Functions - Tail Recursion

Recursive functions operating on large data structures need to be tail recursive as otherwise the runtime stack may overflow.

While tail recursiveness is generally desirable, it is often not required because data is not large. For performance it is usually better to pay attention to allocation patterns than to tail recursion.

Be aware that tail recursion can be inhibited by exceptions: the loop below is not tail recursive!

let read_lines inc =
   let rec loop acc =
     try
       let l = input_line inc in
       loop (l :: acc)
     with End_of_file -> List.rev acc
   in
   loop []

To make it tail recursive, handling of values and exceptions need to be combined into one match expression (available since OCaml 4.02):

let read_lines io =
  let rec loop acc =
      match input_line io with
      | l -> loop (l :: acc)
      | exception End_of_file -> List.rev acc
  in
  loop []

Function calls can be annotated to receive a compiler warning if a call is not tail recursive as expected:

let read_lines io =
  let rec loop acc =
    try
      let l = input_line io in
      (loop [@tailcall]) (l :: acc)
    with End_of_file -> List.rev acc
  in
  loop []

The problem with this is that one has to be aware of the problem in the first place to write such an annotation.

Be aware that some functions from the standard library are not tail recursive (e.g. List.map).

Resources and Exceptions: use finally

Ensure that resources are de-allocated in the presence of exceptions. This is typically done with a function like finally:

let finally (f: unit -> 'a) (free: unit -> 'b) =
  let res = try f () with exn -> free (); raise exn in
  free ();
  res

let with_file path (f: in_channel -> 'a) =
  let io = open_in path in
  finally
    (fun () -> f io)
    (fun () -> close_in io)

with_file "/etc/passwd" (fun io -> input_line io |> print_endline)

In the above code we make sure to close the file even if the function reading from it throws an exception. A function like finally should be already defined in existing projects so there is no need to re-implement it locally.

Resources that need to be managed are not just files but can be anything like database and network connections or timers.

A finer detail is that care should be taken not to destroy the backtrace of an exception if the code in finally (or functions called by it) raise/catch exceptions of its own.

Exceptions: try-with vs. match

OCaml 4.02 introduced the ability to match exceptions in patterns. This is often more elegant than using try .. with. However, a small difference is worth knowing:

let f x =
  try
    match x with
    | [] -> 0
    | _ :: _ -> failwith "not empty" (* caught below *)
  with _ -> 1

let g x =
  match x with
  | [] -> 0
  | _ :: _ -> failwith "not empty" (* escapes *)
  | exception _ -> 1

When using match, the evaluation of the right hand side is not governed by an exception handler, whereas it is when using match inside a try block.

Compare Functions

OCaml provides a generic compare function but it can give unexpected results when comparing complex values or not a desired order. This is the reason that the Map.Make functor requires to define a compare function. Here is a recipe to define a custom compare function:

module Time = struct
  type 'a t =
    { hour:     int
    ; minutes:  int
    ; seconds:  int
    ; info:     'a
    }

  let (<?>) a b =
    if a = 0 then b else a

  let compare a b =
        compare a.hour    b.hour
    <?> compare a.minutes b.minutes
    <?> compare a.seconds b.seconds
    <?> 0
end

This relies on the left-associativity of the infix <?> operator.

Type Compatibility and Functors

Below we have a functor and instantiate it twice:

module X (E: sig end): sig 
  type t 
  val t: t 
end = struct 
  type t = int 
  let t = 0 
end

module X1 = X (struct end)
module X2 = X (struct end)

The types X1.t and X2.t are distinct:

X1.t = X2.t;;
Error: This expression has type X2.t but an expression was expected of type
         X1.t

However, if we construct the two modules slightly different by applying an existing module twice, they are compatible:

module E = struct end
module X1' = X(E)
module X2' = X(E)

X1'.t = X2'.t;;
- : bool = true

Both behaviours can be useful - so be careful when instantiating a functor multiple times.

Objects

OCaml has an object system. It should not be used by default. Since it was introduced, the language gained a number of features that make using objects less convincing. In particular, OCaml supports modules as first-class values. Almost all libraries in the OCaml/Opam ecosystem use a functional style and don't provide a class hierarchy.

Git - Commit Messages

The subject should express what the commit achieves
The body should provide context for the commit
Subject and body should be clearly separated and working independently
Commit messages must not exceed a line length of 80 characters
Commits should include a signature (git commit -s)
Commit messages can use markdown if it helps clarity

Examples:

Use Pattern Matching for Value Destruction

The readability of code is improved by using pattern matching rather than (deeply nested) if-then-else control flow. Multiple patterns can be matched at the same time.

type color = Red | Yellow | Green

let wait = function (* good *)
  | Red     -> seconds 30
  | Yellow  -> seconds 10
  | Green   -> seconds 0

let wait color = (* bad *)
  if color = Red then seconds 30
  else if color = Yellow then seconds 10
  else seconds 0

Avoid introducing new dependencies

Introducing new dependencies on outside libraries needs to be well justified.

With Opam it is easy to install libraries and to use them. It is usually better to use a well-established and maintained library than to roll your own. This is especially true for well-established protocols, formats and problems in general. But any new library also brings along the responsibility to watch its development.