Monday, April 4, 2011

Dependent typing: reversing non-empty lists

This is the first post in a series on dependent typing. For those not familiar with dependent typing, it is the ability to describe data using types which say something about the particular value of the data itself – the type of the data depends on the value of the data. For example, the phrase “even integers” could describe a dependent type which is a subtype of integers. Whether a number belongs to that type depends on whether its value is evenly divisible by 2. (Note that any subtype can be viewed as a simple kind of dependent type.)

The obvious use of dependent typing is to describe invariants of your program, for example, “a non-NULL pointer”, “an array of length 5”, or “a non-empty list”. While full-blown dependent typing requires a full-blown theorem prover (such as Coq or ATS/LF), many mainstream languages support basic dependent types. Java can infer whether a value is not null, OCaml can talk about types which are made up of only certain constructors, Mozilla’s new language Rust can attach arbitrary predicates to types, and I would be remiss not to mention Mercury.

Mercury has limited but useful support for dependent typing via its mode system. The mode system appears complicated, so let’s start with a brief overview.

The mode system


Each predicate (or function) in Mercury does something to its arguments. The most common possibilities are to look at an argument and leave it be (this would be an input argument), or to attempt to bind it to a value (this would be an output argument). Another possibility is that it destroys its argument after looking at it (this happens with the IO state). These possible actions are called modes.

Modes are described simply by the expected state of the argument before the call, and the guaranteed state of the argument after the call using the syntax Before >> After. These states (called instantiatednesses or insts for short) may be one of free (the argument is unbound), ground or bound (the argument is bound to a value), clobbered (the argument is destroyed or overwritten), and unique (the argument is bound to a value which is not referenced anywhere else). The latter two are used mainly for IO state and arrays, so we will focus on free and bound.

The modes we are all familiar with, in and out, are defined in the module builtin as ground >> ground and free >> ground, respectively. In other words, an in argument is expected to be bound to a value before the call and is left that way after, and an out argument may be unbound before the call, but will be bound after.
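Written out, those two definitions from builtin use the same syntax we will use for our own modes:
:- mode in == ground >> ground.
:- mode out == free >> ground.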

The magic is that we may tell Mercury that we expect or guarantee an argument to be bound only to certain constructors! To do this, we must first define the instantiatedness in which we are interested, say non-empty lists:
:- inst non_empty_list == bound([ground | ground]).
This defines the non_empty_list inst as a variable bound to a list cons cell, the head and tail of which are both bound. We could similarly define the inst of an empty list as:
:- inst empty_list == bound([]).
In other words, an empty_list is a variable which is bound to the empty list constructor.

Note that there is also a shorthand which allows the above two insts to be written similarly to type definitions:
:- inst non_empty_list ---> [ground | ground].
:- inst empty_list ---> [].
We could then define the mode of an input argument which is expected to be a non-empty list as:
:- mode in_non_empty_list == non_empty_list >> non_empty_list.
In other words, an argument with this mode would be expected to be bound to a non-empty list, and would remain bound that way afterward. Similarly:
:- mode out_non_empty_list == free >> non_empty_list.
An argument with this mode may be unbound before the call, but is guaranteed to be bound to a non-empty list after. Mercury actually has the mode-functions in() and out() already defined, so we may write in(non_empty_list) or out(non_empty_list) to describe these same modes.
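As a quick sketch of why this matters (safe_head is a hypothetical function, not part of the example below): given the in(non_empty_list) mode, the compiler can see that the single clause covers every allowed input, so it accepts the function as det rather than semidet.
:- func safe_head(list(T)) = T.
:- mode safe_head(in(non_empty_list)) = out.
safe_head([X | _]) = X.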

Example


Note: due to bug 191, the following example won’t trigger the proper error messages from the compiler as of this post. Download and apply this patch against the compiler to fix the issue, or work around the bug by replacing the two out modes with out(list(ground)), which won’t change the meaning but will avoid triggering the bug. UPDATE: parts of the standard library won’t compile without the bug, so don’t go applying that patch just yet.

On a recent thread at Hacker News, a question was brought up about proving that [X | Xs] = reverse([Y | Ys]) is total (or in Mercury terms, det). Let’s see if we can do this in Mercury using our new tools. First let’s define reverse2 (named so as not to conflict with the built-in list.reverse) using a recursive auxiliary function with an accumulator:
:- import_module list.

:- func reverse2(list(A)) = list(A).
:- mode reverse2(in) = out.
reverse2(A) = reverse2_aux(A, []).

:- func reverse2_aux(list(A), list(A)) = list(A).
:- mode reverse2_aux(in, in) = out.
reverse2_aux([], Acc) = Acc.
reverse2_aux([A | As], Acc) = reverse2_aux(As, [A | Acc]).
(I have included the mode declarations as we will be adding to them later.) Testing this function in our main predicate is simple:
main(!IO) :-
    Y = 1, Ys = [2,3,4,5],  
    [X | Xs] = reverse2([Y | Ys]),
    print({X, Xs}, !IO), nl(!IO).
A sharp eye will spot the problem before we even compile: the deconstruction [X | Xs] = can fail, which would cause main to be semidet instead of det (that is, a partial function). The compiler spots this as well:
In `main'(di, uo):
  error: determinism declaration not satisfied.
  Declared `det', inferred `semidet'.
  Unification with `list.[X | Xs]' can fail.
Of course this isn’t true. What the compiler would love to see is a mode declaration stating that if reverse2 is passed a non-empty list, then it will return a non-empty list. Let’s write that:
:- mode reverse2(in(non_empty_list)) = out(non_empty_list).
Now we get a different error:
In clause for `reverse2(in((list.non_empty_list))) =
  out((list.non_empty_list))':
  mode error: argument 2 did not get sufficiently instantiated.
  Final instantiatedness of `HeadVar__2' was `ground',
  expected final instantiatedness was `bound(list.'[|]'(ground,
  ground))'.
(Note: if you didn’t/can’t apply the above compiler patch, as a workaround, replace out in the original mode declarations with out(list(ground)).)

Here, HeadVar__2 refers to the function result (what would be the second variable in the head were reverse2 a predicate). The compiler expected the function result to be non-empty (bound to list.[|]) but it was bound to an unknown value of type list. Of course this is because reverse2_aux doesn't guarantee a non-empty result on non-empty input, so let’s add a declaration stating so as well:
:- mode reverse2_aux(in(non_empty_list), in) = out(non_empty_list).
(Note that we don’t say anything about the accumulator in this declaration.) Unfortunately recompiling gives us a similar error, this time complaining about the second (the recursive) clause of reverse2_aux. Of course – if our initial list had only one item, the recursive call would pass reverse2_aux an empty list, triggering the first clause, and we haven’t made (and can’t make) any special guarantees about that case. Let's take a closer look at that clause:
reverse2_aux([], Acc) = Acc.
The only way this clause can return a non-empty list is if the accumulator, Acc, is non-empty. Fortunately, reverse2_aux is always called recursively with a non-empty accumulator (an element is prepended on every recursive call), so this should be enough to convince the compiler. Let’s add a mode declaration stating this fact:
:- mode reverse2_aux(in, in(non_empty_list)) = out(non_empty_list).
Success! Compilation is successful, meaning main has been proved to be det (a total function), and running the program gives the expected output:
{5, [4, 3, 2, 1]}
Using Mercury’s mode system, we were able to prove that reversing a non-empty list results in a non-empty list, by adding descriptive mode declarations in strategic locations.
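For reference, here is the complete set of declarations and clauses we ended up with:
:- func reverse2(list(A)) = list(A).
:- mode reverse2(in) = out.
:- mode reverse2(in(non_empty_list)) = out(non_empty_list).
reverse2(A) = reverse2_aux(A, []).

:- func reverse2_aux(list(A), list(A)) = list(A).
:- mode reverse2_aux(in, in) = out.
:- mode reverse2_aux(in(non_empty_list), in) = out(non_empty_list).
:- mode reverse2_aux(in, in(non_empty_list)) = out(non_empty_list).
reverse2_aux([], Acc) = Acc.
reverse2_aux([A | As], Acc) = reverse2_aux(As, [A | Acc]).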

As an exercise, add an inst describing a list of at least two elements and add modes to reverse2 and reverse2_aux which make guarantees about lists of at least two elements.
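As a starting hint, the inst itself might look like the following (the name at_least_two is my own; the corresponding mode declarations are left as the exercise):
:- inst at_least_two ---> [ground | non_empty_list].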

Download reverse.m

Saturday, April 2, 2011

Enabling float unboxing on 64-bit systems

In many high-level languages, it’s desirable that all data be represented with the same size word (usually a system word, 32 or 64 bits) so that polymorphic functions need not perform run-time dispatch, nor be type-specialized at compile time, to work with data of different sizes. This is usually done by “boxing” non-word-size data: storing it in the heap and passing around a word-sized pointer to the data.

Of course, for “small” data, boxing entails the performance overhead of an extra pointer indirection. (For “large” data, boxing is usually a performance win, since it reduces the amount of data which must be copied; this is also why structures in C are usually passed by reference.) For this reason, anything smaller than or equal in size to a word will usually be stored unboxed. On 32-bit systems, this means characters, integers, enumerations, and single-precision (32-bit) floating-point numbers, but not double-precision (64-bit) floating-point numbers. On 64-bit systems, all of the above, including doubles, can be stored unboxed.

Mercury follows these same conventions, which does not bode well for scientific computations on 32-bit systems, since all doubles will be boxed. (OCaml special-cases this situation, storing doubles unboxed inside float arrays and all-float records.)

Fortunately Mercury does store doubles unboxed on 64-bit systems, with one gotcha: Double-precision floats will only be stored unboxed if the compiler is built with an existing compiler, and not bootstrapped! This means that after you first install Mercury, you must make realclean, ensure that mmc and friends are in your PATH, and repeat the installation (including ./configure). You can check that configure enabled float unboxing with grep unboxed config.log; you will see messages indicating its status.

If you are doing any sort of scientific calculations on a 64-bit machine it’s important to ensure that this optimization is enabled; it can easily double the performance of any float-heavy computations.

Benchmarks Game: spectral-norm

The Computer Language Benchmarks Game measures the performance of programming language implementations over a set of algorithms. Mercury is known for being fast, so let’s put it to the test. We will compare it to four high-performing language implementations, C (gcc), Java (Sun), Haskell (GHC), and OCaml, on a 64-bit four-way processor (Intel Atom D510).

Problem description

Today we will develop a solution to the spectral-norm test, whose description reads:

Calculate an eigenvalue using the power method.

Eigenvectors are vectors which, when multiplied by a matrix, produce a multiple of themselves. The eigenvalue of this eigenvector is this multiple. The power method involves repeatedly multiplying a random vector by the matrix in question until the vector converges. This is then an eigenvector, and its associated eigenvalue is found easily.

To compute the spectral norm of a matrix A, one computes the square root of the maximum eigenvalue of AᵀA (which is conveniently the one found by the power method).
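In symbols, with B = AᵀA, the spectral norm is ‖A‖₂ = √(λmax(B)). The program below estimates it as the square root of a Rayleigh quotient, √((V·BV) / (V·V)), where V is the vector produced by the power iteration; the variables VBv and Vv in the code hold the numerator and denominator.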

Implementation

Let’s take the design of the single-threaded Java entry as a template. This way, we can ensure that we are using the same algorithm as the other implementations, which is a requirement of the Benchmarks Game.

The interface and main predicate are unremarkable:

:- module spectralnorm.                   

:- interface.
:- import_module io.
:- pred main(io::di, io::uo) is det.

:- implementation.

:- import_module float, int, list, math, std_util, string.  

main(!IO) :-
    command_line_arguments(Args, !IO),
    (if to_int(head(Args), N0) then N = N0 else N = 100),
    format("%.9f\n", [f(approximate(N))], !IO).

Note that we have imported list, for use in representing the vectors. One may think that using arrays would perform much better, but surprisingly the gain is negligible in this program (mostly because there are no random accesses), and it would come at the cost of more complicated code.

The approximate function works by iterating multiplication 20 times, and using that result and the second-to-last to compute the spectral norm. This function is easily written using higher-order functions; in particular, list.foldl and list.map_corresponding are used to compute the dot product of two vectors:

approximate(N) = sqrt(VBv / Vv) :-
    V = pow(multiplyAtAv(N), 19, unit(N)),
    U = multiplyAtAv(N, V),
    VBv = foldl((+), map_corresponding((*), U, V), 0.0),
    Vv = foldl((+), map_corresponding((*), V, V), 0.0).

The unit vector, used as the initial vector, is easily defined recursively:

unit(N) = (if N = 0 then [] else [1.0 | unit(N - 1)]).

Next we define a function a to represent the matrix we are using:

a(I, J) = 1.0 / float((I + J) * (I + J + 1) / 2 + I + 1).

Next we define multiplication of the matrix by a vector using two recursive functions. multiplyAvI multiplies each row I < N of the matrix by the vector V by calling multiplyAvIJ, which multiplies and sums the elements of a row; multiplyAv simply starts the process.

multiplyAv(N, V) = multiplyAvI(N, 0, V).

multiplyAvI(N, I, V) = YYs :-
    ( if I < N then
        Y = multiplyAvIJ(I, 0, V),
        Ys = multiplyAvI(N, I + 1, V),
        YYs = [Y | Ys]
    else
        YYs = []
    ).

multiplyAvIJ(_, _, []) = 0.0.
multiplyAvIJ(I, J, [X|V]) = a(I, J) * X + multiplyAvIJ(I, J + 1, V).

To multiply by the transpose of A, we similarly define multiplyAtv:

multiplyAtv(N, V) = multiplyAtvI(N, 0, V).

multiplyAtvI(N, I, V) = YYs :-
    ( if I < N then
        Y = multiplyAtvIJ(I, 0, V),
        Ys = multiplyAtvI(N, I + 1, V),
        YYs = [Y | Ys]
    else
        YYs = []
    ).

multiplyAtvIJ(_, _, []) = 0.0.
multiplyAtvIJ(I, J, [X|V]) = a(J, I) * X + multiplyAtvIJ(I, J + 1, V).
Finally, multiplyAtAv ties these together. Note we must declare its type, since we partially apply it in approximate.

:- func multiplyAtAv(int, list(float)) = list(float). 
multiplyAtAv(N, V) = multiplyAtv(N, multiplyAv(N, V)).
Compiling with mmc --infer-all -O6 --intermod-opt --rebuild spectralnorm, we obtain a runtime of about 38.1 seconds for N = 3000. Not too bad, considering that the highly optimized C implementation has a single-threaded runtime of about 26.5 seconds. But we can do better.

Optimization

Compiling with -p and running mprof after a run of our program tells us that the majority of our time (99%) is spent in multiplyAvIJ and multiplyAtvIJ, as expected. So let’s focus our efforts there, and on a, which has apparently been inlined into each of those functions, since it does not otherwise appear in the profile.

Unchecked arithmetic

The most obvious speedup is to replace the floating-point division in a with `unchecked_quotient`, which doesn't check if its right argument is zero. (We make sure to add parentheses as appropriate, since infix functions bind very tightly.) Recompiling, this reduces our runtime to 29.8 seconds! Similarly, replacing the integer division by two with `unchecked_right_shift` 1 reduces the runtime to 28.4 seconds.
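For concreteness, here is a sketch of how a might read after both replacements (the extra parentheses around the operands of the backquoted functions are required, since they bind more tightly than + and *):

a(I, J) = 1.0 `unchecked_quotient`
    float((((I + J) * (I + J + 1)) `unchecked_right_shift` 1) + I + 1).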

Tail-call optimization / accumulator introduction

Next, both multiplyAvIJ and multiplyAtvIJ are written in a fashion which precludes tail-call optimization: each performs an operation on the result of its recursive call, which prevents Mercury from turning it into a simple loop. Fortunately, we can direct mmc to transform these functions into accumulator-passing versions (which can be so optimized) by using the --introduce-accumulators flag and telling Mercury that floating-point addition is associative:

:- promise all [A, B, C, ABC]
    (A + B) + C = ABC: float <=> A + (B + C) = ABC: float.
Unfortunately, due to bug 193, mmc doesn't recognize this declaration due to the typecast (which is necessary to distinguish between integer- and floating-point addition), so we must instead introduce a dummy function fplus to disambiguate:

:- func fplus(float, float) = float.
A `fplus` B = A + B.

:- promise all [A, B, C, ABC]
    (A `fplus` B) `fplus` C = ABC <=> A `fplus` (B `fplus` C) = ABC.
After replacing our use of + with `fplus` in multiplyAvIJ and multiplyAtvIJ (and again, parenthesizing appropriately), we cut our runtime down to 25.2 seconds -- better than the C implementation!
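For reference, a sketch of multiplyAvIJ after the substitution (multiplyAtvIJ is analogous):

multiplyAvIJ(_, _, []) = 0.0.
multiplyAvIJ(I, J, [X | V]) =
    (a(I, J) * X) `fplus` multiplyAvIJ(I, J + 1, V).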

An optimization along similar lines is to enable --optimize-constructor-last-call, which can enable tail-call optimization in functions like unit, multiplyAvI, and multiplyAtvI, which construct a list after their recursive call. While this is a win for smaller values of N, it actually degrades performance for larger values, so we will not enable it.

Parallelization

Comparing our 25.2 seconds to the C implementation’s 26.5 seconds is not really fair, since the C implementation is parallelized, running in 6.9 seconds across four CPUs.

Fortunately, parallelization in Mercury is simple. The & operator solves two goals in parallel, expecting them both to succeed (like ,). Our job is to decide the best place to put this operator.

At the top level, operations in approximate and multiplyAtAv are necessarily sequential, since the result of each successive multiplication depends on the result of the previous.

Moving down a level, the two operations in multiplyAvI and multiplyAtvI seem parallelizable, as they do not depend on one another. Let’s introduce a parallel goal:

multiplyAvI(N, I, V) = YYs :-
    ( if I < N then
        ( Y = multiplyAvIJ(I, 0, V)
        & Ys = multiplyAvI(N, I + 1, V)
        ),
        YYs = [Y | Ys]
    else
        YYs = []
    ).
multiplyAtvI is modified similarly. Note the placement of the parentheses: & binds more loosely than ,. If we did not have parentheses, the list construction would be moved into the second parallel conjunct, making it dependent on the first, and causing (for reasons unknown to the author) massive amounts of memory to be used.

At the lowest level, we could introduce parallel goals in multiplyAvIJ and multiplyAtvIJ, but we should have sufficient parallelization already, and the operations at this level are so short that they would be dwarfed by parallelization overhead.

Compiling with mmc --parallel --infer-all -O6 --intermod-opt --introduce-accumulators --rebuild spectralnorm, we obtain a runtime of 7.2 seconds – less than 5% slower than the C counterpart. The difference between Mercury and C is even less pronounced at N = 5500, coming in at 23.8 and 23.7 seconds respectively.

Results

Results obtained using the Benchmarks Game’s bencher.py program for N = 5500:

   ×  Language          CPU (s)  Elapsed (s)  Memory (kB)  Code (B)  CPU load
 1.0  java                87.24        23.36        15617      1227      385%
 1.0  ocaml               88.94        23.73         4037      1065      391%
 1.0  gcc                 89.98        23.74          763      1992      399%
 1.0  mmc --parallel      90.85        23.83        54974       722      394%
 1.3  ghc                 95.45        29.27         2496      1346      346%
 3.1  mmc -H              70.06        70.09         7118       722      116%
 3.7  mmc                 84.96        84.98        40751       722      112%
 3.9  mmc --java          88.86        88.21        33007       722      113%

Curiously, compiling with the -H flag (which enables the high-level C “hlc” grade) generates much faster single-threaded code. Checking the generated C code in Mercury/cs/spectralnorm.c shows that tail calls are implemented as loops in this version, rather than as calls to a Mercury primitive as in the other versions. Unfortunately, the high-level C grade does not support the and-parallelism we are using.

Download spectralnorm.m