Scrap Your Boilerplate: Generic Programming in Haskell

by Lyle Kopnicky, presented at
Portland Functional Programming Study Group
January 2013

The problem

Haskell's statically typed data structures make it hard to write code that works against many data types.

The solution

Scrap Your Boilerplate is one solution to that problem. It provides general functions for traversing hierarchical data structures without having to know the specific types.

Confusing terminology alert

In Java and other object-oriented languages, generic, programming, or generics, refers to what we call "parametric polymorphism" in Haskell and other functional languages. That is, functions or methods that are parametrized over a type that participates in a homogeneous collection. The structure of the parametrized type is ignored.

Another (perhaps clearer) term for generic programming as described here is polytypic programming - the idea being one function can operate over many types, and pays attention to the structure of the type.

Interpreter example

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

data Program = Program Text [Module]
    deriving (Show)

data Module = Module Text [Procedure]
    deriving (Show)

data Procedure = Procedure Text [Var] [Statement]
    deriving (Show)

newtype Var = Var Text
    deriving (Show, Eq, Ord)

data Statement =
    VarDecl Var
    | Var := Expr
    | Input Var
    | Output Expr
    | Call Text [Expr]
    | If Expr Expr Expr
    deriving (Show)

data Expr =
    VarRef Var
    | Number Float
    | Expr :+ Expr
    | Expr :* Expr
    deriving (Show)

Suppose you want to find and print a list of variables. You have a special type for variables (Var), but you have to write boilerplate code to traverse all the structures:

import Data.Set (Set, empty, singleton, fromList, union, unions)

class GetVars a where
    getVars :: a -> Set Var

instance GetVars Program where
    getVars (Program _ modules) =
        unions $ map getVars modules

instance GetVars Module where
    getVars (Module _ procedures) =
        unions $ map getVars procedures

instance GetVars Procedure where
    getVars (Procedure _ vars statements) =
        fromList vars `union` (unions $ map getVars statements)

instance GetVars Statement where
    getVars (VarDecl var) = singleton var
    getVars (var := expr) =
        singleton var `union` getVars expr
    getVars (Input var) = singleton var
    getVars (Output expr) = varSet expr
    getVars (Call _ params) = unions $ map varSet params
    getVars (If expr1 expr2 expr3) =
        unions $ map getVars [expr1, expr2, expr3]

instance GetVars Expr where
    getVars (VarRef var) = singleton var
    getVars (Number _) = empty
    getVars (expr1 :+ expr2) =
        getVars expr1 `union` varSet expr2
    getVars (expr1 :* expr2) =
        getVars expr1 `union` varSet expr2

Whew! A lot of code, and pretty boring to write. All you are doing is telling the computer how to walk a data structure and find a particular type. Can't it do that itself?

In a language with dynamic typing, this would be easy. Take Scheme, for example. You can easily traverse sublists without paying attention to the tags. Our data structure might look like this:

'(program "Distance Calc"
    ((module "Main"
        ((procedure "main" () (
            (decl (var "A"))
            (decl (var "T"))
            (decl (var "D"))
            (input "A")
            (input "T")
            (assign "D" (mult
                (num 0.5) (mult
                    (var-ref "A") (mult
                        (var-ref "T") (var-ref "T")))))
            (output (var-ref "D"))))))))

Then we could traverse it with a single function:

(define list-vars
    (lambda (code)
        (if ((eq? (car code) 'var) (list code))
           (concatenate (map (list-vars (cdr code)))))))

Why was that so easy? (To be fair, it doesn't eliminate duplicates, but that would be an easy addition, if I only knew what Scheme library function to call.) Because there are no type checks on the construction of the data structure. The down side is that we can accidentally create a structure that doesn't make sense.

Scrap Your Boilerplate allows us to perform this same sort of generic traversal in Haskell. First of all, we need to list the syb package as a dependency in our Cabal file. This goes under the Build-depends section.

Secondly, we need to add a pragma to the top of our module:

{-# LANGUAGE DeriveDataTypeable #-}

Thirdly, we need to import the Data.Generics module:

import Data.Generics (Data, Typeable, mkQ, mkT, everything, everywhere)

Then, we need to add deriving clauses for Data and Typeable to each of our datatypes. (I'll repeat the previous code for context.)

data Program = Program Text [Module]
    deriving (Show, Data, Typeable)

data Module = Module Text [Procedure]
    deriving (Show, Data, Typeable)

data Procedure = Procedure Text [Var] [Statement]
    deriving (Show, Data, Typeable)

newtype Var = Var Text
    deriving (Show, Eq, Ord, Data, Typeable)

data Statement =
    VarDecl Var
    | Var := Expr
    | Input Var
    | Output Expr
    | Call Text [Expr]
    | If Expr Expr Expr
    deriving (Show, Data, Typeable)

data Expr =
    VarRef Var
    | Number Float
    | Expr :+ Expr
    | Expr :* Expr
    deriving (Show, Data, Typeable)

Maybe that seems like a bunch of boilerplate, but it's simple, and look what it does for us:

getVars' :: Data d => d -> Set Var
getVars' code =
    everything
        union
        (mkQ empty (\var@(Var _) -> singleton var))
        code

Let's break this down. The everything function says we want to do a query on the hierarchy. The union function combines the variables from two pieces of code. The mkQ function (Q for query) lets us make a generic query out of a specific one. The empty value is what we default to for non-Var terms. Finally, we supply a function that tells us what to return for a Var - a singleton set of the Var itself.

This is following the same rules that we previously had to mentally translate into operations on each datatype. But it's so much more succinct and expressive.

How does this magic work?

The Data and Typeable classes provide a consistent set of functions for accessing the representations of data. Support has been added in GHC for automatic derivation of class instances using DeriveDataTypeable. In addition, many datatype libraries for Haskell include instances for these classes.

Transformations

In addition to querying data, we can perform transformations. Suppose we would like to distribute all of the multiplications over the additions, to reduce the expressions to a normal form. E.g.,

(Number 3) :* ((VarRef (Var "A")) :+ (VarRef (Var "B")))

should become

((Number 3) :* (VarRef (Var "A"))) :+
    ((Num 3) :* (VarRef (Var "B")))

Instead of using everything, we use everywhere, and instead of mkQ, we use mkT (T for transformation):

distributeMultOverAdd :: Data d => d -> d
distributeMultOverAdd =
    everywhere $ mkT $ \x -> case x of
        (x :* (y1 :+ y2)) -> (x :* y1) :+ (x :* y2)
        _ -> x

No need to write code to find these patterns in that big hierarchy.

What are the down sides?

The datatype you are importing may not be an instance of Data or Typeable

I love Data.HashMap, but you can't use it out of the box with code above. Why? The type constructor takes two type parameters instead of one: the key type and the value type. Typeable works with datatypes that have a single type parameter. So they derive Typeable2 instead.

Well, my keys are usually strings that can be ignored in the traversal, so I want to pretend that a HashMap only takes one parameter - the value type. Then it can be an instance of Typeable. So I have to write the following... boilerplate:

instance (Data k, Data v, Eq k, Hashable k) => Data (HashMap k v) where
    gfoldl k z m   = z HashMap.fromList `k` HashMap.toList m
    toConstr _     = error "toConstr"
    gunfold _ _    = error "gunfold"
    dataTypeOf _   = mkNoRepType "Data.HashMap.Common.HashMap"
    dataCast2 f    = gcast2 f

That took a bit of web searching to figure out. Don't bother to try to understand it, just tack it on. Fortunately, the HashMap function comes with an instance of Data that is just waiting for you to supply an instance of Typeable.

The only problem is that, since you didn't define Data or HashMap in your file, you are creating what is called an orphan instance. You'll get a warning from GHC. To suppress this, you need to supply an option to GHC. You can do this with a pragma at the top of the module:

{-# OPTIONS_GHC -fno-warn-orphans #-}

Or perhaps you're using a third-party datatype for which an instance could easily be derived, but there is none. Then you can use a standalone deriving clause:

deriving instance Typeable SomeDataType
deriving instance Data SomeDataType

To use this, you need the StandaloneDeriving language pragma at the top of your module:

{-# LANGUAGE StandaloneDeriving #-}

Sometimes you use a third-party datatype that should be opaque. That is, we don't want the generic traversal to look inside of it. An example is a newtype-wrapped string that is supposed to abstractly represent a serial number or identifier, like our Var type.

If you just take the deriving clause off of the Var type, you'll get a type error when tring to traverse the Program structure. You can hide its internals by manually writing a Data instance:

newtype Var = Var Text
    deriving Typeable

instance Data Var where
    gfoldl _k z v = z v
    toConstr _    = error "toConstr"
    gunfold _ _   = error "gunfold"
    dataTypeOf _  = mkNoRepType "Var"

So... we're back to boilerplate. It's not perfect.

Another down side

SYB removes a bit of the type safety you started with. These generic functions can now be run on any instance of Data, so you have to be sure it makes sense.

Is it worth it?

You may be thinking, that's an awful lot of boilerplate I have to write, in order to scrap my boilerplate. Some of it unintelligible.

However, I found it to be worthwile in a case where I had a large number of datatypes to traverse. Adding the deriving clauses was tidy, and I had just a handful of exceptional cases. The actual call to everything is so simple and elegant, it seems to make it all worthwhile.

Source code

Code from this article

References

Scrap Your Boilerplate papers

Comparing Approaches to Generic Programming in Haskell