by Lyle Kopnicky, presented at
Portland Functional Programming Study Group
January 2013
Haskell's statically typed data structures make it hard to write code that works against many data types.
Scrap Your Boilerplate is one solution to that problem. It provides general functions for traversing hierarchical data structures without having to know the specific types.
In Java and other object-oriented languages, generic, programming, or generics, refers to what we call "parametric polymorphism" in Haskell and other functional languages. That is, functions or methods that are parametrized over a type that participates in a homogeneous collection. The structure of the parametrized type is ignored.
Another (perhaps clearer) term for generic programming as described here is polytypic programming - the idea being one function can operate over many types, and pays attention to the structure of the type.
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
data Program = Program Text [Module]
deriving (Show)
data Module = Module Text [Procedure]
deriving (Show)
data Procedure = Procedure Text [Var] [Statement]
deriving (Show)
newtype Var = Var Text
deriving (Show, Eq, Ord)
data Statement =
VarDecl Var
| Var := Expr
| Input Var
| Output Expr
| Call Text [Expr]
| If Expr Expr Expr
deriving (Show)
data Expr =
VarRef Var
| Number Float
| Expr :+ Expr
| Expr :* Expr
deriving (Show)
Suppose you want to find and print a list of variables. You have
a special type for variables (Var
), but you have to write
boilerplate code to traverse all the structures:
import Data.Set (Set, empty, singleton, fromList, union, unions)
class GetVars a where
getVars :: a -> Set Var
instance GetVars Program where
getVars (Program _ modules) =
unions $ map getVars modules
instance GetVars Module where
getVars (Module _ procedures) =
unions $ map getVars procedures
instance GetVars Procedure where
getVars (Procedure _ vars statements) =
fromList vars `union` (unions $ map getVars statements)
instance GetVars Statement where
getVars (VarDecl var) = singleton var
getVars (var := expr) =
singleton var `union` getVars expr
getVars (Input var) = singleton var
getVars (Output expr) = varSet expr
getVars (Call _ params) = unions $ map varSet params
getVars (If expr1 expr2 expr3) =
unions $ map getVars [expr1, expr2, expr3]
instance GetVars Expr where
getVars (VarRef var) = singleton var
getVars (Number _) = empty
getVars (expr1 :+ expr2) =
getVars expr1 `union` varSet expr2
getVars (expr1 :* expr2) =
getVars expr1 `union` varSet expr2
Whew! A lot of code, and pretty boring to write. All you are doing is telling the computer how to walk a data structure and find a particular type. Can't it do that itself?
In a language with dynamic typing, this would be easy. Take Scheme, for example. You can easily traverse sublists without paying attention to the tags. Our data structure might look like this:
'(program "Distance Calc"
((module "Main"
((procedure "main" () (
(decl (var "A"))
(decl (var "T"))
(decl (var "D"))
(input "A")
(input "T")
(assign "D" (mult
(num 0.5) (mult
(var-ref "A") (mult
(var-ref "T") (var-ref "T")))))
(output (var-ref "D"))))))))
Then we could traverse it with a single function:
(define list-vars
(lambda (code)
(if ((eq? (car code) 'var) (list code))
(concatenate (map (list-vars (cdr code)))))))
Why was that so easy? (To be fair, it doesn't eliminate duplicates, but that would be an easy addition, if I only knew what Scheme library function to call.) Because there are no type checks on the construction of the data structure. The down side is that we can accidentally create a structure that doesn't make sense.
Scrap Your Boilerplate allows us to perform this same sort of generic
traversal in Haskell. First of all, we need to list the syb
package as
a dependency in our Cabal file. This goes under the Build-depends
section.
Secondly, we need to add a pragma to the top of our module:
{-# LANGUAGE DeriveDataTypeable #-}
Thirdly, we need to import the Data.Generics
module:
import Data.Generics (Data, Typeable, mkQ, mkT, everything, everywhere)
Then, we need to add deriving
clauses for Data
and Typeable
to each of our datatypes. (I'll repeat the previous code for context.)
data Program = Program Text [Module]
deriving (Show, Data, Typeable)
data Module = Module Text [Procedure]
deriving (Show, Data, Typeable)
data Procedure = Procedure Text [Var] [Statement]
deriving (Show, Data, Typeable)
newtype Var = Var Text
deriving (Show, Eq, Ord, Data, Typeable)
data Statement =
VarDecl Var
| Var := Expr
| Input Var
| Output Expr
| Call Text [Expr]
| If Expr Expr Expr
deriving (Show, Data, Typeable)
data Expr =
VarRef Var
| Number Float
| Expr :+ Expr
| Expr :* Expr
deriving (Show, Data, Typeable)
Maybe that seems like a bunch of boilerplate, but it's simple, and look what it does for us:
getVars' :: Data d => d -> Set Var
getVars' code =
everything
union
(mkQ empty (\var@(Var _) -> singleton var))
code
Let's break this down. The everything
function says we want to do
a query on the hierarchy. The union
function combines the variables
from two pieces of code. The mkQ
function (Q for query) lets us make
a generic query out of a specific one. The empty
value is what we
default to for non-Var
terms. Finally, we supply a function that
tells us what to return for a Var
- a singleton set of the Var
itself.
This is following the same rules that we previously had to mentally translate into operations on each datatype. But it's so much more succinct and expressive.
The Data
and Typeable
classes provide a consistent set of
functions for accessing the representations of data. Support has been
added in GHC for automatic derivation of class instances using
DeriveDataTypeable
. In addition, many datatype libraries for Haskell
include instances for these classes.
In addition to querying data, we can perform transformations. Suppose we would like to distribute all of the multiplications over the additions, to reduce the expressions to a normal form. E.g.,
(Number 3) :* ((VarRef (Var "A")) :+ (VarRef (Var "B")))
should become
((Number 3) :* (VarRef (Var "A"))) :+
((Num 3) :* (VarRef (Var "B")))
Instead of using everything
, we use everywhere
, and instead of
mkQ
, we use mkT
(T for transformation):
distributeMultOverAdd :: Data d => d -> d
distributeMultOverAdd =
everywhere $ mkT $ \x -> case x of
(x :* (y1 :+ y2)) -> (x :* y1) :+ (x :* y2)
_ -> x
No need to write code to find these patterns in that big hierarchy.
Data
or Typeable
I love Data.HashMap
, but you can't use it out of the box with code
above. Why? The type constructor takes two type parameters instead of
one: the key type and the value type. Typeable
works with datatypes
that have a single type parameter. So they derive Typeable2
instead.
Well, my keys are usually strings that can be ignored in the
traversal, so I want to pretend that a HashMap
only takes one
parameter - the value type. Then it can be an instance of Typeable
.
So I have to write the following... boilerplate:
instance (Data k, Data v, Eq k, Hashable k) => Data (HashMap k v) where
gfoldl k z m = z HashMap.fromList `k` HashMap.toList m
toConstr _ = error "toConstr"
gunfold _ _ = error "gunfold"
dataTypeOf _ = mkNoRepType "Data.HashMap.Common.HashMap"
dataCast2 f = gcast2 f
That took a bit of web searching to figure out. Don't bother to try to
understand it, just tack it on. Fortunately, the HashMap
function
comes with an instance of Data
that is just waiting for you to
supply an instance of Typeable
.
The only problem is that, since you didn't define Data
or HashMap
in your file, you are creating what is called an orphan instance.
You'll get a warning from GHC. To suppress this, you need to supply an
option to GHC. You can do this with a pragma at the top of the module:
{-# OPTIONS_GHC -fno-warn-orphans #-}
Or perhaps you're using a third-party datatype for which an instance could easily be derived, but there is none. Then you can use a standalone deriving clause:
deriving instance Typeable SomeDataType
deriving instance Data SomeDataType
To use this, you need the StandaloneDeriving
language pragma at the
top of your module:
{-# LANGUAGE StandaloneDeriving #-}
Sometimes you use a third-party datatype that should be opaque. That
is, we don't want the generic traversal to look inside of it. An
example is a newtype-wrapped string that is supposed to abstractly
represent a serial number or identifier, like our Var
type.
If you just take the deriving
clause off of the Var
type, you'll
get a type error when tring to traverse the Program
structure. You can hide its internals by manually writing a Data
instance:
newtype Var = Var Text
deriving Typeable
instance Data Var where
gfoldl _k z v = z v
toConstr _ = error "toConstr"
gunfold _ _ = error "gunfold"
dataTypeOf _ = mkNoRepType "Var"
So... we're back to boilerplate. It's not perfect.
SYB removes a bit of the type safety you started with. These generic
functions can now be run on any instance of Data
, so you have to be
sure it makes sense.
You may be thinking, that's an awful lot of boilerplate I have to write, in order to scrap my boilerplate. Some of it unintelligible.
However, I found it to be worthwile in a case where I had a large
number of datatypes to traverse. Adding the deriving clauses was tidy,
and I had just a handful of exceptional cases. The actual call to
everything
is so simple and elegant, it seems to make it all
worthwhile.