https://wiki.haskell.org/api.php?action=feedcontributions&user=Sergv&feedformat=atomHaskellWiki - User contributions [en]2022-09-26T16:47:09ZUser contributionsMediaWiki 1.31.7https://wiki.haskell.org/index.php?title=Quasiquotation&diff=60859Quasiquotation2016-06-30T19:50:39Z<p>Sergv: Fix link to "Using QQ with Template Haskell"</p>
<hr />
<div>= Introduction =<br />
This is a tutorial for the quasiquoting facility described in<br />
[https://www.cs.drexel.edu/~mainland/publications/mainland07quasiquoting.pdf Why It's Nice to be Quoted: Quasiquoting for Haskell].<br />
<br />
Quasiquoting allows programmers to use custom, domain-specific syntax to construct fragments of their program. Along with Haskell's existing support for domain specific languages, you are now free to use new syntactic forms for your EDSLs.<br />
<br />
More information on GHC's support for quasiquoting may be found:<br />
<br />
* [http://downloads.haskell.org/~ghc/latest/docs/html/users_guide/glasgow_exts.html#template-haskell-quasi-quotation Using QQ with Template Haskell]<br />
* [https://www.cs.drexel.edu/~mainland/publications/mainland07quasiquoting.pdf Why It's Nice to be Quoted: Quasiquoting for Haskell] :: PDF<br />
<br />
A number of production examples where the result is a string:<br />
<br />
* [http://hackage.haskell.org/package/interpolatedstring-qq Quasiquoter for Ruby-style interpolated strings]<br />
* [http://hackage.haskell.org/package/interpolatedstring-perl6 interpolatedstring-perl6 library: QuasiQuoter for Perl6-style multi-line interpolated strings with "q"]<br />
* [http://hackage.haskell.org/package/here here]<br />
* [http://hackage.haskell.org/package/heredoc heredoc]<br />
* [http://hackage.haskell.org/package/raw-strings-qq raw-strings-qq]<br />
* [http://hackage.haskell.org/package/neat-interpolation neat-interpolation]<br />
* [http://hackage.haskell.org/package/Interpolation Interpolation]<br />
* [http://hackage.haskell.org/package/QuasiText A QuasiQuoter for interpolated Text literals, with support for embedding Haskell expressions]<br />
* [http://hackage.haskell.org/package/shakespeare-text shakespeare-text]<br />
* [http://hackage.haskell.org/package/dump dump]: Shows values for expressions like <hask>let a = 1 in [d| a, map (+a) [1..3] |]</hask> by turning them into <hask>"(a) = 1 (map (+a) [1..3]) = [2,3,4]"</hask> to ease debugging and development.<br />
<br />
Additional examples which producing other things:<br />
<br />
* The [[jmacro]] JavaScript generation library.<br />
* Parsing: [http://hackage.haskell.org/package/regexqq regexqq], [http://hackage.haskell.org/package/rex rex] or [http://tanakh.github.com/Peggy/ Peggy: A Quasiquoter for a packrat parser]<br />
* [http://hackage.haskell.org/package/lighttpd-conf-qq A QuasiQuoter for lighttpd configuration files]<br />
* XML: [http://hackage.haskell.org/package/text-xml-qq text-xml-qq], [http://hackage.haskell.org/package/xml-hamlet xml-hamlet]<br />
* [http://hackage.haskell.org/package/haskell-src-meta haskell-src-meta contains quite a few QuasiQuoters], though the toy examples are no longer exported and others have been moved to [http://hackage.haskell.org/package/applicative-quoters applicative-quoters]<br />
* [http://hackage.haskell.org/package/command-qq command-qq]<br />
* [http://hackage.haskell.org/package/process-qq process-qq]<br />
* [http://hackage.haskell.org/package/Rlang-QQ Rlang-QQ is a quasiquoter for inline R]<br />
* [http://hackage.haskell.org/package/language-c-quote a quasiquoter for C syntax trees]<br />
* [http://hackage.haskell.org/package/dbus-qq dbus-qq]<br />
<br />
Other QuasiQuotation/TemplateHaskell tutorials include:<br />
* [http://quasimal.com/posts/2012-05-25-quasitext-and-quasiquoting.html A look at QuasiQuotation]<br />
<br />
Note that the syntax for quasiquotation has changed since the paper was written: in GHC 7 one writes <hask>[expr|...|]</hask> instead of <hask>[:expr|...|]</hask>. GHC 6.12 uses <hask>[$expr|...|]</hask>. Quasiquotation appeared in GHC 6.9 and is enabled with the <code>QuasiQuotes</code> language option (<code>-XQuasiQuotes</code> on the command line or <hask>{-# LANGUAGE QuasiQuotes #-}</hask> in a source file).<br />
<br />
We show how to build a quasiquoter for a simple mathematical expression<br />
language. Although the example is small, it demonstrates all aspects of building<br />
a quasiquoter. We do not mean to suggest that one gains much from a quasiquoter<br />
for such a small language relative to using abstract syntax directly except from<br />
a pedagogical point of view---this is just a tutorial!<br />
<br />
The tutorial is runnable if its contents is placed in files as follows:<br />
<br />
Place the contents of the [[#Syntax]] and [[#Parsing]] sections in the file <code>Expr.hs</code><br />
with header<br />
<br />
<haskell><br />
{-# LANGUAGE DeriveDataTypeable #-}<br />
module Expr (Expr(..),<br />
BinOp(..),<br />
eval,<br />
parseExpr)<br />
where<br />
<br />
import Data.Generics<br />
import Text.ParserCombinators.Parsec<br />
</haskell><br />
<br />
Place the contents of the section [[#The Quasiquoter]] in a<br />
file <code>Expr/Quote.hs</code> with header<br />
<br />
<haskell><br />
module Expr.Quote (expr) where<br />
<br />
import Data.Generics<br />
import qualified Language.Haskell.TH as TH<br />
import Language.Haskell.TH.Quote<br />
<br />
import Expr<br />
</haskell><br />
<br />
= Syntax =<br />
<br />
Our simple expression language consists of integers, the standard operators<br />
+,x,*,/, and parenthesized expressions. We will write a single parser that takes<br />
concrete syntax for this language and transforms it to abstract syntax. Using<br />
the SYB approach to generic programming, we will then use this parser to produce<br />
expression and pattern quasiquoters. Our quasiquoter will allow us to write<br />
<hask>[expr|1 + 3|]</hask> directly in Haskell source code instead of the<br />
corresponding abstract syntax.<br />
<br />
An obvious datatype for the abstract syntax of this simple language is:<br />
<br />
<haskell><br />
data Expr = IntExpr Integer<br />
| BinopExpr (Integer -> Integer -> Integer) Expr Expr<br />
deriving(Show)<br />
</haskell><br />
<br />
Unfortunately, this won't do for our quasiquoter. First of all, the SYB<br />
technique we use cannot handle function types in a generic way, so the BinopExpr<br />
constructor must be modified. SYB also requires that we derive Typeable and<br />
Data, a trivial change. Finally, we want to support antiquoting for two<br />
syntactic categories, expressions and integers. With antiquoting support, we can<br />
write [expr|$x + $int:y|] where x and y are in-scope variables with types Expr<br />
and Integer, respectively. The final data types for our abstract syntax are:<br />
<br />
<haskell><br />
data Expr = IntExpr Integer<br />
| AntiIntExpr String<br />
| BinopExpr BinOp Expr Expr<br />
| AntiExpr String<br />
deriving(Show, Typeable, Data)<br />
<br />
data BinOp = AddOp<br />
| SubOp<br />
| MulOp<br />
| DivOp<br />
deriving(Show, Typeable, Data)<br />
</haskell><br />
<br />
An evaluator for our abstract syntax can be written as follows:<br />
<br />
<haskell><br />
eval :: Expr -> Integer<br />
eval (IntExpr n) = n<br />
eval (BinopExpr op x y) = (opToFun op) (eval x) (eval y)<br />
where<br />
opToFun AddOp = (+)<br />
opToFun SubOp = (-)<br />
opToFun MulOp = (*)<br />
opToFun DivOp = div<br />
</haskell><br />
<br />
= Parsing =<br />
<br />
We use Parsec to write a parser for our expression language. Note that we have<br />
(somewhat arbitrarily) chosen the syntax for antiquotaton to be as in the above<br />
example; a quasiquoter may choose whatever syntax she wishes.<br />
<br />
<haskell><br />
small = lower <|> char '_'<br />
large = upper<br />
idchar = small <|> large <|> digit <|> char '\''<br />
<br />
lexeme p = do{ x <- p; spaces; return x }<br />
symbol name = lexeme (string name)<br />
parens p = between (symbol "(") (symbol ")") p<br />
<br />
expr :: CharParser st Expr<br />
expr = term `chainl1` addop<br />
<br />
term :: CharParser st Expr<br />
term = factor `chainl1` mulop<br />
<br />
factor :: CharParser st Expr<br />
factor = parens expr <|> integer <|> try antiIntExpr <|> antiExpr<br />
<br />
mulop = do{ symbol "*"; return $ BinopExpr MulOp }<br />
<|> do{ symbol "/"; return $ BinopExpr DivOp }<br />
<br />
addop = do{ symbol "+"; return $ BinopExpr AddOp }<br />
<|> do{ symbol "-"; return $ BinopExpr SubOp }<br />
<br />
integer :: CharParser st Expr<br />
integer = lexeme $ do{ ds <- many1 digit ; return $ IntExpr (read ds) }<br />
<br />
ident :: CharParser s String<br />
ident = do{ c <- small; cs <- many idchar; return (c:cs) }<br />
<br />
antiIntExpr = lexeme $ do{ symbol "$int:"; id <- ident; return $ AntiIntExpr id }<br />
antiExpr = lexeme $ do{ symbol "$"; id <- ident; return $ AntiExpr id }<br />
</haskell><br />
<br />
The helper function parseExpr takes a source code position (consisting of a file<br />
name, line and column) and a string and returns a value of type Expr. This<br />
helper function also ensures that we can parse the whole string rather than just<br />
a prefix.<br />
<br />
<haskell><br />
parseExpr :: Monad m => (String, Int, Int) -> String -> m Expr<br />
parseExpr (file, line, col) s =<br />
case runParser p () "" s of<br />
Left err -> fail $ show err<br />
Right e -> return e<br />
where<br />
p = do pos <- getPosition<br />
setPosition $<br />
(flip setSourceName) file $<br />
(flip setSourceLine) line $<br />
(flip setSourceColumn) col $<br />
pos<br />
spaces<br />
e <- expr<br />
eof<br />
return e<br />
</haskell><br />
<br />
= The Quasiquoter =<br />
<br />
Remember, our quasiquoter allows us to write expression in our simple language,<br />
such as [expr|2 * 3|], directly in Haskell source code. This requires that the<br />
variable expr be in-scope when the quasiquote is encountered, and that it is<br />
bound to a value of type Language.Haskell.TH.Quote.QuasiQuoter, which contains<br />
an expression quoter and a pattern quoter. Note that expr must obey the same<br />
stage restrictions as Template Haskell; in particular, it may not be defined in<br />
the same module where it is used as a quasiquoter, but must be imported.<br />
<br />
Our expression and pattern quoters are quoteExprExp and quoteExprPat,<br />
respectively, so our quasiquoter expr is written as follows:<br />
<br />
<haskell><br />
quoteExprExp :: String -> TH.ExpQ<br />
quoteExprPat :: String -> TH.PatQ<br />
<br />
expr :: QuasiQuoter<br />
expr = QuasiQuoter { quoteExp = quoteExprExp,<br />
quotePat = quoteExprPat<br />
-- with ghc >= 7.4, you could also<br />
-- define quoteType and quoteDec for<br />
-- quasiquotes in those places too<br />
}<br />
</haskell><br />
<br />
Our quasiquoters re-use the parser we wrote in the previous section, parseExpr,<br />
and make use of the generic functions dataToExpQ and dataToPatQ (described in<br />
the Haskell Workshop paper). These functions, from the Language.Haskell.TH.Quote<br />
package, take a Haskell value and reflect it back into the language as Template<br />
Haskell abstract syntax. The catch is that we don't want to handle all values<br />
generically: antiquoted values must be handled specially. Consider the AntiExpr<br />
constructor; we don't want this constructor to be mapped to Template Haskell<br />
abstract syntax for the AntiExpr constructor, but to abstract syntax for the<br />
Haskell variable named by the constructor's argument. The extQ combinator allows<br />
us to do this nicely by defining a function antiExprExp that handles<br />
antiquotations.<br />
<br />
<haskell><br />
quoteExprExp s = do loc <- TH.location<br />
let pos = (TH.loc_filename loc,<br />
fst (TH.loc_start loc),<br />
snd (TH.loc_start loc))<br />
expr <- parseExpr pos s<br />
dataToExpQ (const Nothing `extQ` antiExprExp) expr<br />
<br />
antiExprExp :: Expr -> Maybe (TH.Q TH.Exp)<br />
antiExprExp (AntiIntExpr v) = Just $ TH.appE (TH.conE (TH.mkName "IntExpr"))<br />
(TH.varE (TH.mkName v))<br />
antiExprExp (AntiExpr v) = Just $ TH.varE (TH.mkName v)<br />
antiExprExp _ = Nothing<br />
</haskell><br />
<br />
The corresponding code for patterns is:<br />
<br />
<haskell><br />
quoteExprPat s = do loc <- TH.location<br />
let pos = (TH.loc_filename loc,<br />
fst (TH.loc_start loc),<br />
snd (TH.loc_start loc))<br />
expr <- parseExpr pos s<br />
dataToPatQ (const Nothing `extQ` antiExprPat) expr<br />
<br />
antiExprPat :: Expr -> Maybe (TH.Q TH.Pat)<br />
antiExprPat (AntiIntExpr v) = Just $ TH.conP (TH.mkName "IntExpr")<br />
[TH.varP (TH.mkName v)]<br />
antiExprPat (AntiExpr v) = Just $ TH.varP (TH.mkName v)<br />
antiExprPat _ = Nothing<br />
</haskell><br />
<br />
= Examples =<br />
<br />
We can now try out a few examples by invoking ghci as follows: <code>ghci -XQuasiQuotes Expr/Quote</code><br />
<br />
<pre><br />
> [expr|1 + 3 + 5|]<br />
BinopExpr AddOp (BinopExpr AddOp (IntExpr 1) (IntExpr 3)) (IntExpr 5) <br />
> eval [expr|1 + 3 + 5|]<br />
9 <br />
</pre><br />
<br />
Taking advantage of our quasiquoter, we can re-write our evaluator so it uses<br />
concrete syntax:<br />
<br />
<haskell><br />
eval' :: Expr -> Integer<br />
eval' [expr|$int:x|] = x<br />
eval' [expr|$x + $y|] = eval' x + eval' y<br />
eval' [expr|$x - $y|] = eval' x - eval' y<br />
eval' [expr|$x * $y|] = eval' x * eval' y<br />
eval' [expr|$x / $y|] = eval' x `div` eval' y<br />
</haskell><br />
<br />
Let's make sure it works as advertised:<br />
<br />
<pre><br />
> eval [expr|1 + 2 + 3|] == eval' [expr|1 + 2 + 3|]<br />
True <br />
> eval [expr|1 + 3 * 5|] == eval' [expr|1 + 3 * 5|]<br />
True <br />
</pre></div>Sergvhttps://wiki.haskell.org/index.php?title=GHC/Type_families&diff=57631GHC/Type families2014-02-28T22:02:20Z<p>Sergv: fix typos</p>
<hr />
<div>Indexed type families, or '''type families''' for short, are a Haskell extension supporting ad-hoc overloading of data types. Type families are parametric types that can be assigned specialized representations based on the type parameters they are instantiated with. They are the data type analogue of [[Type class|type classes]]: families are used to define overloaded ''data'' in the same way that classes are used to define overloaded ''functions''. Type families are useful for generic programming, for creating highly parameterised library interfaces, and for creating interfaces with enhanced static information, much like dependent types.<br />
<br />
Type families come in two flavors: ''data families'' and ''type synonym families''. Data families are the indexed form of data and newtype definitions. Type synonym families are the indexed form of type synonyms. Each of these flavors can be defined in a standalone manner or ''associated'' with a type class. Standalone definitions are more general, while associated types can more clearly express how a type is used and lead to better error messages.<br />
<br />
''NB: see also Simon's [http://hackage.haskell.org/trac/ghc/blog/LetGeneralisationInGhc7 blog entry on let generalisation] for a significant change in the policy for let generalisation, driven by the type family extension. In brief: a few programs will puzzlingly fail to compile with <tt>-XTypeFamilies</tt> even though the code is legal Haskell 98.''<br />
<br />
== What are type families? ==<br />
<br />
The concept of a type family comes from type theory. An indexed type family in type theory is a partial function at the type level. Applying the function to parameters (called ''type indices'') yields a type. Type families permit a program to compute what data constructors it will operate on, rather than having them fixed statically (as with simple type systems) or treated as opaque unknowns (as with parametrically polymorphic types).<br />
<br />
Type families are to vanilla data types what type class methods are to regular functions. Vanilla polymorphic data types and functions have a single definition, which is used at all type instances. Classes and type families, on the other hand, have an interface definition and any number of instance definitions. A type family's interface definition declares its [[kind]] and its arity, or the number of type indices it takes. Instance definitions define the type family over some part of the domain.<br />
<br />
As a simple example of how type families differ from ordinary parametric data types, consider a strict list type. We can represent a list of <hask>Char</hask> in the usual way, with cons cells. We can do the same thing to represent a list of <hask>()</hask>, but since a strict <hask>()</hask> value carries no useful information, it would be more efficient to just store the length of the list. This can't be done with an ordinary parametric data type, because the data constructors used to represent the list would depend on the list's type parameter: if it's <hask>Char</hask> then the list consists of cons cells; if it's <hask>()</hask>, then the list consists of a single integer. We basically want to select between two different data types based on a type parameter. Using type families, this list type could be declared as follows:<br />
<br />
<haskell><br />
-- Declare a list-like data family<br />
data family XList a<br />
<br />
-- Declare a list-like instance for Char<br />
data instance XList Char = XCons !Char !(XList Char) | XNil<br />
<br />
-- Declare a number-like instance for ()<br />
data instance XList () = XListUnit !Int<br />
</haskell><br />
<br />
The right-hand sides of the two <code>data instance</code> declarations are exactly ordinary data definitions. In fact, a <code>data instance</code> declaration is nothing more than a shorthand for a <code>data</code> declaration followed by a <code>type instance</code> (see below) declaration. However, they define two instances of the same parametric data type, <hask>XList Char</hask> and <hask>XList ()</hask>, whereas ordinary data declarations define completely unrelated types. A recent [[Simonpj/Talk:FunWithTypeFuns|tutorial paper]] has more in-depth examples of programming with type families. <br />
<br />
[[GADT]]s bear some similarity to type families, in the sense that they allow a parametric type's constructors to depend on the type's parameters. However, all GADT constructors must be defined in one place, whereas type families can be extended. Functional dependencies are similar to type families, and many type classes that use functional dependencies can be equivalently expressed with type families. Type families provide a more functional style of type-level programming than the relational style of functional dependencies.<br />
<br />
== What do I need to use type families? ==<br />
<br />
Type families are a GHC extension enabled with the <code>-XTypeFamilies</code> flag or the <code>{-# LANGUAGE TypeFamilies #-}</code> pragma. The first stable release of GHC that properly supports type families is 6.10.1. (The 6.8 release included an early partial implementation, but its use is deprecated.) Please [http://hackage.haskell.org/trac/ghc/query?status=new&status=assigned&status=reopened&group=priority&type=bug&order=id&desc=1 report bugs] via the GHC bug tracker, ideally accompanied by a small example program that demonstrates the problem. Use the [mailto:glasgow-haskell-users@haskell.org GHC mailing list] for questions or for a discussion of this language extension and its description on this wiki page.<br />
<br />
== An associated data type example ==<br />
<br />
As an example, consider Ralf Hinze's [http://www.cs.ox.ac.uk/ralf.hinze/publications/GGTries.ps.gz generalised tries], a form of generic finite maps. <br />
<br />
=== The class declaration ===<br />
<br />
We define a type class whose instances are the types that we can use as keys in our generic maps:<br />
<haskell><br />
class GMapKey k where<br />
data GMap k :: * -> *<br />
empty :: GMap k v<br />
lookup :: k -> GMap k v -> Maybe v<br />
insert :: k -> v -> GMap k v -> GMap k v<br />
</haskell><br />
The interesting part is the ''associated data family'' declaration of the class. It gives a [http://www.haskell.org/ghc/docs/latest/html/users_guide/type-families.html#data-family-declarations ''kind signature''] (here <hask>* -> *</hask>) for the associated data type <hask>GMap k</hask> - analogous to how methods receive a type signature in a class declaration.<br />
<br />
What it is important to notice is that the first parameter of the associated type <hask>GMap</hask> coincides with the class parameter of <hask>GMapKey</hask>. This indicates that also in all instances of the class, the instances of the associated data type need to have their first argument match up with the instance type. In general, the type arguments of an associated type can be a subset of the class parameters (in a multi-parameter type class) and they can appear in any order, possibly in an order other than in the class head. The latter can be useful if the associated data type is partially applied in some contexts.<br />
<br />
The second important point is that as <hask>GMap k</hask> has kind <hask>* -> *</hask> and <hask>k</hask> (implicitly) has kind <hask>*</hask>, the type constructor <hask>GMap</hask> (without an argument) has kind <hask>* -> * -> *</hask>. Consequently, we see that <hask>GMap</hask> is applied to two arguments in the signatures of the methods <hask>empty</hask>, <hask>lookup</hask>, and <hask>insert</hask>.<br />
<br />
=== An Int instance ===<br />
<br />
To use Ints as keys into generic maps, we declare an instance that simply uses <hask>Data.IntMap</hask>, thusly:<br />
<haskell><br />
instance GMapKey Int where<br />
data GMap Int v = GMapInt (Data.IntMap.IntMap v)<br />
empty = GMapInt Data.IntMap.empty<br />
lookup k (GMapInt m) = Data.IntMap.lookup k m<br />
insert k v (GMapInt m) = GMapInt (Data.IntMap.insert k v m)<br />
</haskell><br />
The <hask>Int</hask> instance of the associated data type <hask>GMap</hask> needs to have both of its parameters, but as only the first one corresponds to a class parameter, the second needs to be a type variable (here <hask>v</hask>). As mentioned before, any associated type parameter that corresponds to a class parameter must be identical to the class parameter in each instance. The right hand side of the associated data declaration is like that of any other data type.<br />
<br />
NB: At the moment, GADT syntax is not allowed for associated data types (or other indexed types). This is not a fundamental limitation, but just a shortcoming of the current implementation, which we expect to lift in the future.<br />
<br />
As an exercise, implement an instance for <hask>Char</hask> that maps back to the <hask>Int</hask> instance using the conversion functions <hask>Char.ord</hask> and <hask>Char.chr</hask>.<br />
<br />
=== A unit instance ===<br />
<br />
Generic definitions, apart from elementary types, typically cover units, products, and sums. We start here with the unit instance for <hask>GMap</hask>:<br />
<haskell><br />
instance GMapKey () where<br />
data GMap () v = GMapUnit (Maybe v)<br />
empty = GMapUnit Nothing<br />
lookup () (GMapUnit v) = v<br />
insert () v (GMapUnit _) = GMapUnit $ Just v<br />
</haskell><br />
For unit, the map is just a <hask>Maybe</hask> value.<br />
<br />
=== Product and sum instances ===<br />
<br />
Next, let us define the instances for pairs and sums (i.e., <hask>Either</hask>):<br />
<haskell><br />
instance (GMapKey a, GMapKey b) => GMapKey (a, b) where<br />
data GMap (a, b) v = GMapPair (GMap a (GMap b v))<br />
empty = GMapPair empty<br />
lookup (a, b) (GMapPair gm) = lookup a gm >>= lookup b <br />
insert (a, b) v (GMapPair gm) = GMapPair $ case lookup a gm of<br />
Nothing -> insert a (insert b v empty) gm<br />
Just gm2 -> insert a (insert b v gm2 ) gm<br />
<br />
instance (GMapKey a, GMapKey b) => GMapKey (Either a b) where<br />
data GMap (Either a b) v = GMapEither (GMap a v) (GMap b v)<br />
empty = GMapEither empty empty<br />
lookup (Left a) (GMapEither gm1 _gm2) = lookup a gm1<br />
lookup (Right b) (GMapEither _gm1 gm2 ) = lookup b gm2<br />
insert (Left a) v (GMapEither gm1 gm2) = GMapEither (insert a v gm1) gm2<br />
insert (Right b) v (GMapEither gm1 gm2) = GMapEither gm1 (insert b v gm2)<br />
</haskell><br />
If you find this code algorithmically surprising, I'd suggest to have a look at [http://www.cs.ox.ac.uk/ralf.hinze/publications/index.html#J4 Ralf Hinze's paper]. The only novelty concerning associated types, in these two instances, is that the instances have a context <hask>(GMapKey a, GMapKey b)</hask>. Consequently, the right hand sides of the associated type declarations can use <hask>GMap</hask> recursively at the key types <hask>a</hask> and <hask>b</hask> - not unlike the method definitions use the class methods recursively at the types for which the class is given in the instance context.<br />
<br />
=== Using a generic map ===<br />
<br />
Finally, some code building and querying a generic map:<br />
<haskell><br />
myGMap :: GMap (Int, Either Char ()) String<br />
myGMap = insert (5, Left 'c') "(5, Left 'c')" $<br />
insert (4, Right ()) "(4, Right ())" $<br />
insert (5, Right ()) "This is the one!" $<br />
insert (5, Right ()) "This is the two!" $<br />
insert (6, Right ()) "(6, Right ())" $<br />
insert (5, Left 'a') "(5, Left 'a')" $<br />
empty<br />
main = putStrLn $ maybe "Couldn't find key!" id $ lookup (5, Right ()) myGMap<br />
</haskell><br />
<br />
=== Download the code ===<br />
<br />
If you want to play with this example without copying it off the wiki, just download the source code[http://darcs.haskell.org/testsuite/tests/ghc-regress/indexed-types/should_run/GMapAssoc.hs] for <hask>GMap</hask> from GHC's test suite.<br />
<br />
== Detailed definition of data families ==<br />
<br />
Data families appear in two flavours: (1) they can be defined on the toplevel or (2) they can appear inside type classes (in which case they are known as associated types). The former is the more general variant, as it lacks the requirement for the type-indices to coincide with the class parameters. However, the latter can lead to more clearly structured code and compiler warnings if some type instances were - possibly accidentally - omitted. In the following, we always discuss the general toplevel form first and then cover the additional constraints placed on associated types.<br />
<br />
=== Family declarations ===<br />
<br />
Indexed data families are introduced by a signature, such as <br />
<haskell><br />
data family GMap k :: * -> *<br />
</haskell><br />
The special <hask>family</hask> distinguishes family from standard data declarations. The result kind annotation is optional and, as usual, defaults to <hask>*</hask> if omitted. An example is<br />
<haskell><br />
data family Array e<br />
</haskell><br />
Named arguments can also be given explicit kind signatures if needed. Just as with [http://www.haskell.org/ghc/docs/latest/html/users_guide/gadt.html GADT declarations] named arguments are entirely optional, so that we can declare <hask>Array</hask> alternatively with<br />
<haskell><br />
data family Array :: * -> *<br />
</haskell><br />
<br />
==== Associated family declarations ====<br />
<br />
When a data family is declared as part of a type class, we drop the <hask>family</hask> keyword. The <hask>GMap</hask> declaration takes the following form<br />
<haskell><br />
class GMapKey k where<br />
data GMap k :: * -> *<br />
...<br />
</haskell><br />
In contrast to toplevel declarations, named arguments must be used for all type parameters that are to be used as type-indices. Moreover, the argument names must be class parameters. Each class parameter may only be used at most once per associated type, but some may be omitted and they may be in an order other than in the class head. In other words: '''the named type parameters of the data declaration must be a permutation of a subset of the class variables'''. <br />
<br />
Example is admissible:<br />
<haskell><br />
class C a b c where { data T c a :: * } -- OK<br />
class C a b c where { data T a a :: * } -- Bad: repeated variable<br />
class D a where { data T a x :: * } -- Bad: x is not a class variable<br />
class D a where { data T a :: * -> * } -- OK<br />
</haskell><br />
<br />
=== Instance declarations ===<br />
<br />
Instance declarations of data and newtype families are very similar to standard data and newtype declarations. The only two differences are that the keyword <hask>data</hask> or <hask>newtype</hask> is followed by <hask>instance</hask> and that some or all of the type arguments can be non-variable types, but may not contain forall types or type synonym families. However, data families are generally allowed in type parameters, and type synonyms are allowed as long as they are fully applied and expand to a type that is itself admissible - exactly as this is required for occurrences of type synonyms in class instance parameters. For example, the <hask>Either</hask> instance for <hask>GMap</hask> is<br />
<haskell><br />
data instance GMap (Either a b) v = GMapEither (GMap a v) (GMap b v)<br />
</haskell><br />
In this example, the declaration has only one variant. In general, it can be any number.<br />
<br />
Data and newtype instance declarations are only legit when an appropriate family declaration is in scope - just like class instances require the class declaration to be visible. Moreover, each instance declaration has to conform to the kind determined by its family declaration. This implies that the number of parameters of an instance declaration matches the arity determined by the kind of the family. Although all data families are declared with the <hask>data</hask> keyword, instances can be either <hask>data</hask> or <hask>newtype</hask>s, or a mix of both.<br />
<br />
Even if type families are defined as toplevel declarations, functions that perform different computations for different family instances still need to be defined as methods of type classes. In particular, the following is not possible:<br />
<haskell><br />
data family T a<br />
data instance T Int = A<br />
data instance T Char = B<br />
nonsense :: T a -> Int<br />
nonsense A = 1 -- WRONG: These two equations together...<br />
nonsense B = 2 -- ...will produce a type error.<br />
</haskell><br />
Given the functionality provided by GADTs (Generalised Algebraic Data Types), it might seem as if a definition, such as the above, should be feasible. However, type families - in contrast to GADTs - are ''open''; i.e., new instances can always be added, possibly in other modules. Supporting pattern matching across different data instances would require a form of extensible case construct.<br />
<br />
==== Associated type instances ====<br />
<br />
When an associated family instance is declared within a type class instance, we drop the <hask>instance</hask> keyword in the family instance. So, the <hask>Either</hask> instance for <hask>GMap</hask> becomes:<br />
<haskell><br />
instance (GMapKey a, GMapKey b) => GMapKey (Either a b) where<br />
data GMap (Either a b) v = GMapEither (GMap a v) (GMap b v)<br />
...<br />
</haskell><br />
The most important point about associated family instances is that the type indices corresponding to class parameters must be identical to the type given in the instance head; here this is the first argument of <hask>GMap</hask>, namely <hask>Either a b</hask>, which coincides with the only class parameter. Any parameters to the family constructor that do not correspond to class parameters, need to be variables in every instance; here this is the variable <hask>v</hask>.<br />
<br />
Instances for an associated family can only appear as part of instance declarations of the class in which the family was declared - just as with the equations of the methods of a class. Also in correspondence to how methods are handled, declarations of associated types can be omitted in class instances. If an associated family instance is omitted, the corresponding instance type is not inhabited; i.e., only diverging expressions, such as <hask>undefined</hask>, can assume the type.<br />
<br />
==== Scoping of class parameters ====<br />
<br />
In the case of multi-parameter type classes, the visibility of class parameters in the right-hand side of associated family instances depends ''solely'' on the parameters of the data family. As an example, consider the simple class declaration<br />
<haskell><br />
class C a b where<br />
data T a<br />
</haskell><br />
Only one of the two class parameters is a parameter to the data family. Hence, the following instance declaration is invalid:<br />
<haskell><br />
instance C [c] d where<br />
data T [c] = MkT (c, d) -- WRONG!! 'd' is not in scope<br />
</haskell><br />
Here, the right-hand side of the data instance mentions the type variable <hask>d</hask> that does not occur in its left-hand side. We cannot admit such data instances as they would compromise type safety.<br />
<br />
==== Type class instances of family instances ====<br />
<br />
Type class instances of instances of data families can be defined as usual, and in particular data instance declarations can have <hask>deriving</hask> clauses. For example, we can write<br />
<haskell><br />
data GMap () v = GMapUnit (Maybe v)<br />
deriving Show<br />
</haskell><br />
which implicitly defines an instance of the form<br />
<haskell><br />
instance Show v => Show (GMap () v) where ...<br />
</haskell><br />
<br />
Note that class instances are always for particular ''instances'' of a data family and never for an entire family as a whole. This is for essentially the same reasons that we cannot define a toplevel function that performs pattern matching on the data constructors of ''different'' instances of a single type family. It would require a form of extensible case construct.<br />
<br />
==== Overlap ====<br />
<br />
The instance declarations of a data family used in a single program may not overlap at all, independent of whether they are associated or not. In contrast to type class instances, this is not only a matter of consistency, but one of type safety.<br />
<br />
=== Import and export ===<br />
<br />
The association of data constructors with type families is more dynamic than that is the case with standard data and newtype declarations. In the standard case, the notation <hask>T(..)</hask> in an import or export list denotes the type constructor and all the data constructors introduced in its declaration. However, a family declaration never introduces any data constructors; instead, data constructors are introduced by family instances. As a result, which data constructors are associated with a type family depends on the currently visible instance declarations for that family. Consequently, an import or export item of the form <hask>T(..)</hask> denotes the family constructor and all currently visible data constructors - in the case of an export item, these may be either imported or defined in the current module. The treatment of import and export items that explicitly list data constructors, such as <hask>GMap(GMapEither)</hask>, is analogous.<br />
<br />
==== Associated families ====<br />
<br />
As expected, an import or export item of the form <hask>C(..)</hask> denotes all of the class' methods and associated types. However, when associated types are explicitly listed as subitems of a class, we need some new syntax, as uppercase identifiers as subitems are usually data constructors, not type constructors. To clarify that we denote types here, each associated type name needs to be prefixed by the keyword <hask>type</hask>. So for example, when explicitly listing the components of the <hask>GMapKey</hask> class, we write <hask>GMapKey(type GMap, empty, lookup, insert)</hask>.<br />
<br />
==== Examples ====<br />
<br />
Assuming our running <hask>GMapKey</hask> class example, let us look at some export lists and their meaning:<br />
<br />
* <hask>module GMap (GMapKey) where...</hask>: Exports just the class name.<br />
* <hask>module GMap (GMapKey(..)) where...</hask>: Exports the class, the associated type <hask>GMap</hask> and the member functions <hask>empty</hask>, <hask>lookup</hask>, and <hask>insert</hask>. None of the data constructors is exported.<br />
* <hask>module GMap (GMapKey(..), GMap(..)) where...</hask>: As before, but also exports all the data constructors <hask>GMapInt</hask>, <hask>GMapChar</hask>, <hask>GMapUnit</hask>, <hask>GMapPair</hask>, and <hask>GMapEither</hask>.<br />
* <hask>module GMap (GMapKey(empty, lookup, insert), GMap(..)) where...</hask>: As before.<br />
* <hask>module GMap (GMapKey, empty, lookup, insert, GMap(..)) where...</hask>: As before.<br />
<br />
Finally, you can write <hask>GMapKey(type GMap)</hask> to denote both the class <hask>GMapKey</hask> as well as its associated type <hask>GMap</hask>. However, you cannot write <hask>GMapKey(type GMap(..))</hask> &mdash; i.e., sub-component specifications cannot be nested. To specify <hask>GMap</hask>'s data constructors, you have to list it separately.<br />
<br />
==== Instances ====<br />
<br />
Family instances are implicitly exported, just like class instances. However, this applies only to the heads of instances, not to the data constructors an instance defines.<br />
<br />
== An associated type synonym example ==<br />
<br />
Type synonym families are an alternative to functional dependencies, which makes functional dependency examples well suited to introduce type synonym families. In fact, type families are a more functional way to express the same as functional dependencies (despite the name!), as they replace the relational notation of functional dependencies by an expression-oriented notation; i.e., functions on types are really represented by functions and not relations.<br />
<br />
=== The <hask>class</hask> declaration ===<br />
<br />
Here's an example from Mark Jones' seminal paper on functional dependencies:<br />
<haskell><br />
class Collects e ce | ce -> e where<br />
empty :: ce<br />
insert :: e -> ce -> ce<br />
member :: e -> ce -> Bool<br />
toList :: ce -> [e]<br />
</haskell><br />
<br />
With associated type synonyms we can write this as<br />
<haskell><br />
class Collects ce where<br />
type Elem ce<br />
empty :: ce<br />
insert :: Elem ce -> ce -> ce<br />
member :: Elem ce -> ce -> Bool<br />
toList :: ce -> [Elem ce]<br />
</haskell><br />
Instead of the multi-parameter type class, we use a single parameter class, and the parameter <hask>e</hask><br />
turned into an associated type synonym <hask>Elem ce</hask>.<br />
<br />
=== An <hask>instance</hask>===<br />
<br />
Instances change correspondingly. An instance of the two-parameter class<br />
<haskell><br />
instance Eq e => Collects e [e] where<br />
empty = []<br />
insert e l = (e:l)<br />
member e [] = False<br />
member e (x:xs) <br />
| e == x = True<br />
| otherwise = member e xs<br />
toList l = l<br />
</haskell><br />
becomes an instance of a single-parameter class, where the dependent type parameter turns into an associated type instance declaration:<br />
<haskell><br />
instance Eq e => Collects [e] where<br />
type Elem [e] = e<br />
empty = []<br />
insert e l = (e:l)<br />
member e [] = False<br />
member e (x:xs) <br />
| e == x = True<br />
| otherwise = member e xs<br />
toList l = l<br />
</haskell><br />
<br />
=== Using generic collections ===<br />
<br />
With Functional Dependencies the code would be:<br />
<haskell><br />
sumCollects :: (Collects e c1, Collects e c2) => c1 -> c2 -> c2<br />
sumCollects c1 c2 = foldr insert c2 (toList c1)<br />
</haskell><br />
<br />
In contrast, with associated type synonyms, we get:<br />
<haskell><br />
sumCollects :: (Collects c1, Collects c2, Elem c1 ~ Elem c2) => c1 -> c2 -> c2<br />
sumCollects c1 c2 = foldr insert c2 (toList c1)<br />
</haskell><br />
<br />
== Detailed definition of type synonym families ==<br />
<br />
Type families appear in two flavours: (1) they can be defined on the toplevel or (2) they can appear inside type classes (in which case they are known as associated type synonyms). The former is the more general variant, as it lacks the requirement for the type-indices to coincide with the class parameters. However, the latter can lead to more clearly structured code and compiler warnings if some type instances were - possibly accidentally - omitted. In the following, we always discuss the general toplevel form first and then cover the additional constraints placed on associated types.<br />
<br />
=== Family declarations ===<br />
<br />
Indexed type families are introduced by a signature, such as <br />
<haskell><br />
type family Elem c :: *<br />
</haskell><br />
The special <hask>family</hask> distinguishes family from standard type declarations. The result kind annotation is optional and, as usual, defaults to <hask>*</hask> if omitted. An example is<br />
<haskell><br />
type family Elem c<br />
</haskell><br />
Parameters can also be given explicit kind signatures if needed. We call the number of parameters in a type family declaration, the family's arity, and all applications of a type family must be fully saturated w.r.t. to that arity. This requirement is unlike ordinary type synonyms and it implies that the kind of a type family is not sufficient to determine a family's arity, and hence in general, also insufficient to determine whether a type family application is well formed. As an example, consider the following declaration:<br />
<haskell><br />
type family F a b :: * -> * -- F's arity is 2, <br />
-- although its overall kind is * -> * -> * -> *<br />
</haskell><br />
Given this declaration the following are examples of well-formed and malformed types:<br />
<haskell><br />
F Char [Int] -- OK! Kind: * -> *<br />
F Char [Int] Bool -- OK! Kind: *<br />
F IO Bool -- WRONG: kind mismatch in the first argument<br />
F Bool -- WRONG: unsaturated application<br />
</haskell><br />
<br />
A top-level type family can be declared as open or closed. (Associated type<br />
families are always open.) A closed type family has all of its equations<br />
defined in one place and cannot be extended, whereas an open family can have<br />
instances spread across modules. The advantage of a closed family is that<br />
its equations are tried in order, similar to a term-level function definition:<br />
<haskell><br />
type family G a where<br />
G Int = Bool<br />
G a = Char<br />
</haskell><br />
With this definition, the type <hask>G Int</hask> becomes <hask>Bool</hask><br />
and, say, <hask>G Double</hask> becomes <hask>Char</hask>. See also [http://ghc.haskell.org/trac/ghc/wiki/NewAxioms here] for more information about closed type families.<br />
<br />
==== Associated family declarations ====<br />
<br />
When a type family is declared as part of a type class, we drop the <hask>family</hask> special. The <hask>Elem</hask> declaration takes the following form<br />
<haskell><br />
class Collects ce where<br />
type Elem ce :: *<br />
...<br />
</haskell><br />
Exactly as in the case of an associated data declaration, '''the named type parameters must be a permutation of a subset of the class parameters'''. Examples<br />
<haskell><br />
class C a b c where { type T c a :: * } -- OK<br />
class D a where { type T a x :: * } -- No: x is not a class parameter<br />
class D a where { type T a :: * -> * } -- OK<br />
</haskell><br />
<br />
=== Type instance declarations ===<br />
<br />
Instance declarations of open type families are very similar to standard type synonym declarations. The only two differences are that the keyword <hask>type</hask> is followed by <hask>instance</hask> and that some or all of the type arguments can be non-variable types, but may not contain forall types or type synonym families. However, data families are generally allowed, and type synonyms are allowed as long as they are fully applied and expand to a type that is admissible - these are the exact same requirements as for data instances. For example, the <hask>[e]</hask> instance for <hask>Elem</hask> is<br />
<haskell><br />
type instance Elem [e] = e<br />
</haskell><br />
<br />
A type family instance declaration must satisfy the following rules:<br />
* An appropriate family declaration is in scope - just like class instances require the class declaration to be visible. <br />
* The instance declaration conforms to the kind determined by its family declaration<br />
* The number of type parameters in an instance declaration matches the number of type parameters in the family declaration.<br />
* The right-hand side of a type instance must be a monotype (i.e., it may not include foralls) and after the expansion of all saturated vanilla type synonyms, no synonyms, except family synonyms may remain.<br />
<br />
Here are some examples of admissible and illegal type instances and closed families:<br />
<haskell><br />
type family F a :: *<br />
type instance F [Int] = Int -- OK!<br />
type instance F String = Char -- OK!<br />
type instance F (F a) = a -- WRONG: type parameter mentions a type family<br />
type instance F (forall a. (a, b)) = b -- WRONG: a forall type appears in a type parameter<br />
type instance F Float = forall a.a -- WRONG: right-hand side may not be a forall type<br />
<br />
type family F2 a where -- OK!<br />
F (Maybe Int) = Int<br />
F (Maybe Bool) = Bool<br />
F (Maybe a) = String<br />
<br />
type family G a b :: * -> *<br />
type instance G Int = (,) -- WRONG: must be two type parameters<br />
type instance G Int Char Float = Double -- WRONG: must be two type parameters<br />
</haskell><br />
<br />
==== Closed family simplification ====<br />
<br />
Currently in recent versions of ghc 7.7 and planed to be included in 7.8.1.<br />
<br />
When dealing with closed families, simplifying the type is harder than just finding a left-hand side that matches and replacing that with a right-hand side. GHC will select an equation to use in a given type family application (the "target") if and only if the following 2 conditions hold:<br />
<br />
# There is a substitution from the variables in the equation's LHS that makes the left-hand side of the branch coincide with the target.<br />
# For each previous equation in the family: either the LHS of that equation is ''apart'' from the type family application, '''or''' the equation is ''compatible'' with the chosen equation.<br />
<br />
Now, we define ''apart'' and ''compatible'':<br />
# Two types are ''apart'' when one cannot simplify to the other, even after arbitrary type-family simplifications<br />
# Two equations are ''compatible'' if, either, their LHSs are apart or their LHSs unify and their RHSs are the same under the substitution induced by the unification.<br />
<br />
Some examples are in order:<br />
<haskell><br />
type family F a where<br />
F Int = Bool<br />
F Bool = Char<br />
F a = Bool<br />
<br />
type family And (a :: Bool) (b :: Bool) :: Bool where<br />
And False c = False<br />
And True d = d<br />
And e False = False<br />
And f True = f<br />
And g g = g<br />
</haskell><br />
<br />
In <hask>F</hask>, all pairs of equations are compatible except the second and third. The first two are compatible because their LHSs are apart. The first and third are compatible because the unifying substitution leads the RHSs to be the same. But, the second and third are not compatible because neither of these conditions holds. As a result, GHC will not use the third equation to simplify a target unless that target is apart from <hask>Bool</hask>.<br />
<br />
In <hask>And</hask>, ''every'' pair of equations is compatible, meaning GHC never has to make the extra apartness check during simplification.<br />
<br />
Why do all of this? It's a matter of type safety. Consider this example:<br />
<br />
<haskell><br />
type family J a b where<br />
J a a = Int<br />
J a b = Bool<br />
</haskell><br />
<br />
Say GHC selected the second branch just because the first doesn't apply at the moment, because two type variables are distinct. The problem is that those variables might later be instantiated at the same value, and then the first branch would have applied. You can convince this sort of inconsistency to produce <hask>unsafeCoerce</hask>.<br />
<br />
It gets worse. GHC has no internal notion of inequality, so it can't use previous, failed term-level GADT pattern matches to refine its type assumptions. For example:<br />
<br />
<haskell><br />
data G :: * -> * where<br />
GInt :: G Int<br />
GBool :: G Bool<br />
<br />
type family Foo (a :: *) :: * where<br />
Foo Int = Char<br />
Foo a = Double<br />
<br />
bar :: G a -> Foo a<br />
bar GInt = 'x'<br />
bar _ = 3.14<br />
</haskell><br />
<br />
The last line will fail to typecheck, because GHC doesn't know that the type variable <hask>a</hask> can't be <hask>Int</hask> here, even though it's obvious. The only general way to fix this is to have inequality evidence introduced into GHC, and that's a big deal and we don't know if we have the motivation for such a change yet.<br />
<br />
==== Associated type instances ====<br />
<br />
When an associated family instance is declared within a type class instance, we drop the <hask>instance</hask> keyword in the family instance. So, the <hask>[e]</hask> instance for <hask>Elem</hask> becomes:<br />
<haskell><br />
instance (Eq (Elem [e])) => Collects ([e]) where<br />
type Elem [e] = e<br />
...<br />
</haskell><br />
The most important point about associated family instances is that the type indexes corresponding to class parameters must be identical to the type given in the instance head; here this is <hask>[e]</hask>, which coincides with the only class parameter.<br />
<br />
Instances for an associated family can only appear as part of instance declarations of the class in which the family was declared - just as with the equations of the methods of a class. Also in correspondence to how methods are handled, declarations of associated types can be omitted in class instances. If an associated family instance is omitted, the corresponding instance type is not inhabited; i.e., only diverging expressions, such as <hask>undefined</hask>, can assume the type.<br />
<br />
==== Overlap ====<br />
<br />
The instance declarations of an open type family used in a single program must be compatible, in the form defined above. This condition is independent of whether the type family is associated or not, and it is not only a matter of consistency, but one of type safety. <br />
<br />
Here are two examples to illustrate the condition under which overlap is permitted.<br />
<haskell><br />
type instance F (a, Int) = [a]<br />
type instance F (Int, b) = [b] -- overlap permitted<br />
<br />
type instance G (a, Int) = [a]<br />
type instance G (Char, a) = [a] -- ILLEGAL overlap, as [Char] /= [Int]<br />
</haskell><br />
<br />
==== Decidability ====<br />
<br />
In order to guarantee that type inference in the presence of type families is decidable, we need to place a number of additional restrictions on the formation of type instance declarations (c.f., Definition 5 (Relaxed Conditions) of [http://www.cse.unsw.edu.au/~chak/papers/SPCS08.html Type Checking with Open Type Functions]). Instance declarations have the general form<br />
<haskell><br />
type instance F t1 .. tn = t<br />
</haskell><br />
where we require that for every type family application <hask>(G s1 .. sm)</hask> in <hask>t</hask>, <br />
# <hask>s1 .. sm</hask> do not contain any type family constructors,<br />
# the total number of symbols (data type constructors and type variables) in <hask>s1 .. sm</hask> is strictly smaller than in <hask>t1 .. tn</hask>, and<br />
# for every type variable <hask>a</hask>, <hask>a</hask> occurs in <hask>s1 .. sm</hask> at most as often as in <hask>t1 .. tn</hask>.<br />
These restrictions are easily verified and ensure termination of type inference. However, they are not sufficient to guarantee completeness of type inference in the presence of, so called, ''loopy equalities'', such as <hask>a ~ [F a]</hask>, where a recursive occurrence of a type variable is underneath a family application and data constructor application - see the above mentioned paper for details. <br />
<br />
If the option <tt>-XUndecidableInstances</tt> is passed to the compiler, the above restrictions are not enforced and it is on the programmer to ensure termination of the normalisation of type families during type inference.<br />
<br />
=== Equality constraints ===<br />
<br />
Type context can include equality constraints of the form <hask>t1 ~ t2</hask>, which denote that the types <hask>t1</hask> and <hask>t2</hask> need to be the same. In the presence of type families, whether two types are equal cannot generally be decided locally. Hence, the contexts of function signatures may include equality constraints, as in the following example:<br />
<haskell><br />
sumCollects :: (Collects c1, Collects c2, Elem c1 ~ Elem c2) => c1 -> c2 -> c2<br />
</haskell><br />
where we require that the element type of <hask>c1</hask> and <hask>c2</hask> are the same. In general, the types <hask>t1</hask> and <hask>t2</hask> of an equality constraint may be arbitrary monotypes; i.e., they may not contain any quantifiers, independent of whether higher-rank types are otherwise enabled.<br />
<br />
Equality constraints can also appear in class and instance contexts. The former enable a simple translation of programs using functional dependencies into programs using family synonyms instead. The general idea is to rewrite a class declaration of the form<br />
<haskell><br />
class C a b | a -> b<br />
</haskell><br />
to<br />
<haskell><br />
class (F a ~ b) => C a b where<br />
type F a<br />
</haskell><br />
That is, we represent every functional dependency (FD) <hask>a1 .. an -> b</hask> by an FD type family <hask>F a1 .. an</hask> and a superclass context equality <hask>F a1 .. an ~ b</hask>, essentially giving a name to the functional dependency. In class instances, we define the type instances of FD families in accordance with the class head. Method signatures are not affected by that process.<br />
<br />
== Frequently asked questions ==<br />
<br />
=== Comparing type families and functional dependencies ===<br />
<br />
Functional dependencies cover some of the same territory as type families. How do the two compare?<br />
There are some articles about this question:<br />
<br />
* Experiences in converting functional dependencies to type families: "[[Functional dependencies vs. type families]]"<br />
* [http://hackage.haskell.org/trac/ghc/wiki/TFvsFD GHC trac] on a comparison of functional dependencies and type families<br />
<br />
=== Injectivity, type inference, and ambiguity ===<br />
<br />
A common problem is this<br />
<haskell><br />
type family F a<br />
<br />
f :: F a -> F a<br />
f = undefined<br />
<br />
g :: F Int -> F Int<br />
g x = f x<br />
</haskell><br />
The compiler complains about the definition of <tt>g</tt> saying<br />
<haskell><br />
Couldn't match expected type `F Int' against inferred type `F a1'<br />
</haskell><br />
In type-checking <tt>g</tt>'s right hand side GHC discovers (by instantiating <tt>f</tt>'s type with a fresh type variable) that it has type <tt>F a1 -> F a1</tt> for some as-yet-unknown type <tt>a1</tt>. Now it tries to make the inferred type match <tt>g</tt>'s type signature. Well, you say, just make <tt>a1</tt> equal to <tt>Int</tt> and you are done. True, but what if there were these instances<br />
<haskell><br />
type instance F Int = Bool<br />
type instance F Char = Bool<br />
</haskell><br />
Then making <tt>a1</tt> equal to <tt>Char</tt> would ''also'' make the two types equal. Because there is (potentially) more than one choice, the program is rejected.<br />
<br />
However (and confusingly) if you omit the type signature on <tt>g</tt> altogether, thus<br />
<haskell><br />
f :: F a -> F a<br />
f = undefined<br />
<br />
g x = f x<br />
</haskell><br />
GHC will happily infer the type <tt>g :: F a -> F a</tt>. But you can't ''write'' that type signature or, indeed, the more specific one above. (Arguably this behaviour, where GHC ''infers'' a type it can't ''check'', is very confusing. I suppose we could make GHC reject both programs, with and without type signatures.)<br />
<br />
'''What is the problem?''' The nub of the issue is this: knowing that <tt>F t1</tt>=<tt>F t2</tt> does ''not'' imply that <tt>t1</tt> = <tt>t2</tt>.<br />
The difficulty is that the type function <tt>F</tt> need not be ''injective''; it can map two distinct types to the same type. For an injective type constructor like <tt>Maybe</tt>, if we know that <tt>Maybe t1</tt> = <tt>Maybe t2</tt>, then we know that <tt>t1</tt> = <tt>t2</tt>. But not so for non-injective type functions.<br />
<br />
The problem starts with <tt>f</tt>. Its type is ''ambiguous''; even if I know the argument and result types for <tt>f</tt>, I cannot use that to find the type at which <tt>a</tt> should be instantiated. (So arguably, <tt>f</tt> should be rejected as having an ambiguous type, and probably will be in future.) The situation is well known in type classes: <br />
<haskell><br />
bad :: (Read a, Show a) => String -> String<br />
bad x = show (read x)<br />
</haskell><br />
At a call of <tt>bad</tt> one cannot tell at what type <tt>a</tt> should be instantiated.<br />
<br />
The only solution is to avoid ambiguous types. In the type signature of a function, <br />
* Ensure that every type variable occurs in the part after the "<tt>=></tt>"<br />
* Ensure that every type variable appears at least once outside a type function call.<br />
Alternatively, you can use data families, which create new types and are therefore injective. The following code works:<br />
<br />
<haskell>data family F a<br />
<br />
f :: F a -> F a<br />
f = undefined<br />
<br />
g :: F Int -> F Int<br />
g x = f x</haskell><br />
<br />
== References ==<br />
<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKPM05.html Associated Types with Class.] Manuel M. T. Chakravarty, Gabriele Keller, Simon Peyton Jones, and Simon Marlow. In ''Proceedings of The 32nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL'05)'', pages 1-13, ACM Press, 2005.<br />
* [http://www.cse.unsw.edu.au/~chak/papers/CKP05.html Associated Type Synonyms.] Manuel M. T. Chakravarty, Gabriele Keller, and Simon Peyton Jones. In ''Proceedings of The Tenth ACM SIGPLAN International Conference on Functional Programming'', ACM Press, pages 241-253, 2005.<br />
* [http://www.cse.unsw.edu.au/~chak/papers/SCPD07.html System F with Type Equality Coercions.] Martin Sulzmann, Manuel M. T. Chakravarty, Simon Peyton Jones, and Kevin Donnelly. In ''Proceedings of The Third ACM SIGPLAN Workshop on Types in Language Design and Implementation'', ACM Press, 2007.<br />
* [http://www.cse.unsw.edu.au/~chak/papers/SPCS08.html Type Checking With Open Type Functions.] Tom Schrijvers, Simon Peyton-Jones, Manuel M. T. Chakravarty, Martin Sulzmann. In ''Proceedings of The 13th ACM SIGPLAN International Conference on Functional Programming'', ACM Press, pages 51-62, 2008.<br />
* [[Simonpj/Talk:FunWithTypeFuns | Fun with Type Functions]] Oleg Kiselyov, Simon Peyton Jones, Chung-chieh Shan (the source for this paper can be found at http://patch-tag.com/r/schoenfinkel/typefunctions/wiki ) <br />
<br />
[[Category:Type-level programming]]<br />
[[Category:Language extensions]]<br />
[[Category:GHC|Indexed types]]</div>Sergvhttps://wiki.haskell.org/index.php?title=Research_papers/Functional_pearls&diff=57118Research papers/Functional pearls2013-11-21T07:07:17Z<p>Sergv: add "A Play on Regular Expressions"</p>
<hr />
<div>Functional pearls are elegant, instructive examples of functional programming. They are supposed to be fun, and they teach important programming techniques and fundamental design principles. They traditionally appear in [http://journals.cambridge.org/action/displayJournal?jid=JFP The Journal of Functional Programming], and at [http://www.icfpconference.org/index.html ICFP] and affiliated workshops.<br />
<br />
The history of functional pearls is covered by:<br />
<br />
;[http://icfp06.cs.uchicago.edu/bird-talk.pdf How to Write a Functional Pearl]<br />
:Richard Bird. ICFP 2006.<br />
<br />
;[http://portal.acm.org/ft_gateway.cfm?id=1159832&type=pdf&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618 Fifteen years of functional pearls]<br />
:Richard Bird. ICFP 2006.<br />
<br />
;[http://spivey.oriel.ox.ac.uk/mike/firstpearl.pdf Strachey's functional pearl, forty years on]<br />
:Mike Spivey, 2006.<br />
<br />
;[http://www.brics.dk/RS/07/14/index.html On Barron and Strachey's Cartesian Product Function]<br />
: Olivier Danvy and Michael Spivey, July 2007<br />
<br />
There have been many functional pearls in JFP, and some others at<br />
ICFP and the Haskell Workshop. There is also a collection of them in<br />
[http://web2.comlab.ox.ac.uk/oucl/publications/books/fop/ The Fun of Programming].<br />
<br />
The pearls tend to concentrate on:<br />
<br />
* Examples of program calculation and proof<br />
* Neat presentations of new or old data structures<br />
* Interesting applications and techniques<br />
<br />
Some advice on writing pearls for JFP is available in this [http://www.comlab.ox.ac.uk/people/Jeremy.Gibbons/pearls/ editorial].<br />
<br />
=== Online ===<br />
<br />
Functional pearls online:<br />
<br />
;[http://web.cecs.pdx.edu/~mpj/snakecube/ Solving the Snake Cube Puzzle in Haskell]<br />
:Mark P. Jones. 2013<br />
<br />
;[http://spivey.oriel.ox.ac.uk/wiki/images/0/06/Maybe.pdf When Maybe is not good enough]<br />
:Michael Spivey. 2012<br />
<br />
;[http://www.cis.upenn.edu/~byorgey/pub/monoid-pearl.pdf Monoids: Theme and Variations]<br />
:Brent Yorgey. 2012.<br />
<br />
;[http://wwwhome.ewi.utwente.nl/~fokkinga/mmf2005d-houghtrf-JFPpre.pdf The Hough transform]<br />
:Maarten Fokkinga. 2011.<br />
<br />
;[http://www.staff.science.uu.nl/~swier004/Publications/DutchNationalFlag.pdf Sorted. Verifying the Problem of the Dutch National Flag in Agda]<br />
:Wouter Swierstra. 2011.<br />
<br />
;[http://www.cs.ox.ac.uk/people/ralf.hinze/publications/Quote.pdf Typed Quote/Antiquote - Or: Compile-time Parsing]<br />
:Ralf Hinze. 2011.<br />
<br />
;[http://research.microsoft.com/en-us/people/dimitris/every-bit-counts.pdf Every Bit Counts]<br />
:Dimitrios Vytiniotis, Andrew Kennedy. 2010.<br />
<br />
;[http://sebfisch.github.io/haskell-regexp/regexp-play.pdf A Play on Regular Expressions]<br />
:Sebastian Fischer. Frank Huch, Thomas Wilke. 2010.<br />
<br />
;[http://wwwtcs.inf.tu-dresden.de/~voigt/icfp09.pdf Free Theorems Involving Type Constructor Classes]<br />
:Janis VoigtlÃ¤nder. 2009.<br />
<br />
;[http://www.cs.kent.ac.uk/pubs/2009/2847/content.pdf Linear, Bounded, Functional Pretty-Printing]<br />
:S. Doaitse Swierstra, Olaf Chitil. 2009.<br />
<br />
;[http://wwwtcs.inf.tu-dresden.de/~voigt/popl09-2.pdf Bidirectionalization for Free!]<br />
:Janis VoigtlÃ¤nder. 2009.<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/ICFP08.pdf Functional Pearl: Streams and Unique Fixed Points]<br />
:Ralf Hinze, ICFP 2008<br />
<br />
;[ftp://ftp.diku.dk/diku/semantics/papers/D-582.pdf Generic Discrimination: Sorting and Partitioning Unshared Data in Linear Time]<br />
:Fritz Henglein. 2008<br />
<br />
;[http://wwwtcs.inf.tu-dresden.de/~voigt/popl202-voigtlaender.pdf Much Ado about Two: A Pearl on Parallel Prefix Computation]<br />
:Janis VoigtlÃ¤nder. 2008.<br />
<br />
;[http://strictlypositive.org/CJ.pdf Clowns to the Left of me, Jokers to the Right: Dissecting Data Structures]: Conor McBride. 2008.<br />
<br />
;[http://www.ccs.neu.edu/home/dherman/research/papers/icfp07-great-escape.pdf Functional Pearl: The Great Escape: Or how to jump the border without getting caught]<br />
:David Herman. 2007.<br />
<br />
;[https://www.cs.indiana.edu/~adamsmd/papers/scrap_your_zippers/ScrapYourZippers-2007.pdf Scrap Your Zippers]<br />
:Michael Adams. 2007. Superseded by the [https://www.cs.indiana.edu/~adamsmd/papers/scrap_your_zippers/ScrapYourZippers-2010.pdf WGP 2010 version].<br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.4086 A type-correct, stack-safe, provably correct expression compiler in Epigram]<br />
:James McKinna and Joel Wright. 2006.<br />
<br />
;[http://www.soi.city.ac.uk/~ross/papers/Applicative.pdf Applicative Programming with Effects]<br />
:Conor McBride and Ross Paterson. 2006.<br />
<br />
;[http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/publications/rationals.pdf Enumerating the rationals]<br />
:Jeremy Gibbons, David Lester and Richard Bird. 2006.<br />
<br />
;[http://www.cs.tufts.edu/~nr/comp150fp/archive/richard-bird/sudoku.pdf A program to solve Sudoku]<br />
:Richard Bird. 2006. (slides [http://icfp06.cs.uchicago.edu/bird-talk.pdf appear here], a Literate Haskell implementation by Graham Hutton based on this can be found [http://www.cs.nott.ac.uk/~gmh/sudoku.lhs here]).<br />
<br />
;[http://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf Probabilistic functional programming in Haskell]<br />
:Martin Erwig and Steve Kollmansberger. 2006. <br />
<br />
;[http://cms.brookes.ac.uk/staff/SharonCurtis/publications/marbles.ps.gz Marble mingling]<br />
:Sharon Curtis. 2006.<br />
<br />
;[http://wiki.di.uminho.pt/twiki/pub/Personal/Xana/WebHome/report.pdf Strong Types for Relational Databases]<br />
:Alexandra Silva, Joost Visser. 2006. (Haskell Workshop)<br />
<br />
;[http://okmij.org/ftp/papers/LogicT.pdf Backtracking, interleaving, and terminating monad transformers]<br />
:Oleg Kiselyov, Chung-chieh Shan, Daniel P. Friedman, Amr Sabry. 2005. <br />
<br />
;[http://homepages.inf.ed.ac.uk/jcheney/publications/cheney05icfp.pdf Scrap your Nameplate]<br />
:James Cheney. 2005.<br />
<br />
;[http://research.microsoft.com/~akenn/fun/picklercombinators.pdf Pickler Combinators]<br />
:Andrew Kennedy. 2004.<br />
<br />
;[http://web.cecs.pdx.edu/~mpj/pubs/composing-fractals.pdf Composing fractals]<br />
:Mark P. Jones. 2004.<br />
<br />
;[http://www.cs.dartmouth.edu/~doug/nfa.ps.gz Enumerating the strings of regular languages]<br />
:M. Douglas McIlroy. 2004.<br />
<br />
;[http://www.kestrel.edu/home/people/meertens/publications/papers/Calculating_the_Sieve_of_Eratosthenes.pdf Calculating the Sieve of Eratosthenes]<br />
:Lambert Meertens. Journal of Functional Programming, 14(6):759-763, 2004. ([http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/wg21/meeting57/meertens-sieve.pdf slides])<br />
<br />
;[http://pauillac.inria.fr/~maranget/enum/pearl.ps Functional satisfaction]<br />
:Luc Maranget. 2004. ([http://pauillac.inria.fr/~maranget/enum/index.html More info]).<br />
<br />
;[http://portal.acm.org/ft_gateway.cfm?id=1017481&type=pdf&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618 Implicit configurations--or, type classes reflect the values of types]<br />
:Oleg Kiselyov, Chung-chieh Shan. 2004. ([http://www.cs.rutgers.edu/~ccshan/prepose/p1214-kiselyov.pdf also here])<br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.145&rep=rep1&type=ps Global variables in Haskell]<br />
:John Hughes. 2004. ([http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=241773 JFP])<br />
<br />
;[http://portal.acm.org/ft_gateway.cfm?id=1017477&type=pdf&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618 I am not a number -- I am a free variable]<br />
:Conor McBride, James McKinna. 2004. <br />
<br />
;[http://www.informatik.uni-bonn.de/~ralf/hw2001/2.ps.gz Inverting the Burrows Wheeler transform]<br />
:Richard Bird and Shin-Cheng Mu. 2004. ([http://web.comlab.ox.ac.uk/people/Richard.Bird/online/BirdMu2004Inverting.pdf Also here]).<br />
<br />
;[http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/wg21/meeting56/loeh-paper.pdf Parsing permutation phrases]<br />
:Arthur Baars, Andres Loh and S. Doaitse Swierstra. 2004.<br />
<br />
;[http://web.cecs.pdx.edu/~antoy/homepage/publications/inject/paper.pdf Concurrent distinct choices]<br />
:Sergio Antoy and Michael Hanus. 2004. <br />
<br />
;[http://www.ling.gu.se/~peb/pubs/Ljunglof-2004b.pdf Functional chart parsing of context-free grammars]<br />
:Peter Ljunglf. 2004.<br />
<br />
;[http://www.seas.upenn.edu/~sweirich/papers/cast/cast.pdf Type-Safe Cast]<br />
:Stephanie Weirich. 2004.<br />
<br />
;[http://research.microsoft.com/~akenn/fun/picklercombinators.pdf Pickler Combinators]<br />
:Andrew J. Kennedy. 2004.<br />
<br />
;[www.cse.chalmers.se/edu/course/afp/Papers/parser-claessen.pdf Parallel Parsing Processes]<br />
:Koen Claessen. 2004.<br />
<br />
;[http://www.informatik.uni-bonn.de/~ralf/hw2001/1.pdf Derivation of a logarithmic time carry lookahead addition circuit]<br />
:John O'Donnell and Gudula Runger. 2004. ([http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=1030343 ACM]) ([http://journals.cambridge.org/action/displayAbstract?aid=254717 JFP]) ([http://www.informatik.uni-bonn.de/~ralf/hw2001/1.html Homepage]). <br />
<br />
;[http://pages.cs.brandeis.edu/~mairson/Papers/jfp02.pdf Linear lambda calculus and PTIME-completeness]<br />
:Harry G. Mairson. 2004. <br />
<br />
;[http://www.lri.fr/~filliatr/ftp/publis/kr-fp.ps.gz Producing all ideals of a forest, functionally]<br />
:Jean-Christophe Filliatre and Francois Pottier. 2003.<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/HW03.pdf Trouble shared is trouble halved]<br />
:Richard Bird, Ralf Hinze. 2003<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/Format.ps.gz Formatting: a class act]<br />
:Ralf Hinze. 2003.<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/SearchTree.ps.gz A fresh look at binary search trees]<br />
:Ralf Hinze. 2002.<br />
<br />
;[http://www.cs.nott.ac.uk/~gmh/countdown.pdf The countdown problem]<br />
:Graham Hutton. 2002. <br />
<br />
;[http://pdos.csail.mit.edu/papers/packrat-parsing:icfp02.pdf Packrat parsing: simple, powerful, lazy, linear time, functional pearl]<br />
:Bryan Ford. 2002.<br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.3014 Monads for Incremental Computing]<br />
:Magnus Carlsson. 2002.<br />
<br />
;[http://www.cs.nott.ac.uk/~gmh/wgp01/hinze-paper.pdf Haskell does it with class: Functorial unparsing]<br />
:Ralf Hinze. 2001.<br />
<br />
<br />
;[http://eprints.ouls.ox.ac.uk/archive/00000863/01/bird_2001_11_3.pdf Unfolding pointer algorithms]<br />
:Richard Bird. 2001.<br />
<br />
;[http://eprints.ouls.ox.ac.uk/archive/00000862/01/bird_2001.pdf Maximum marking problems]<br />
:Richard Bird. 2001. <br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/TheWeb.ps.gz Weaving a web]<br />
:Ralf Hinze and Johan Jeuring. 2001. <br />
<br />
;[http://www.brics.dk/RS/01/16/BRICS-RS-01-16.pdf Normalization by evaluation with typed abstract syntax]<br />
:Olivier Danvy, Morten Rhiger and Kristoffer H. Rose. 2001. <br />
<br />
;[http://www.brics.dk/RS/01/10/BRICS-RS-01-10.ps.gz Do we need dependent types?]<br />
:Daniel Fridlender and Mia Indrika. 2001.<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/BitReversal.ps.gz Perfect trees and bit-reversal permutations], a revision of [http://www.cs.ox.ac.uk/ralf.hinze/publications/IAI-TR-99-4.ps.gz Technical Report IAI-TR-99-4]<br />
:Ralf Hinze. 2000.<br />
<br />
;[http://spivey.oriel.ox.ac.uk/mike/bfs/bfs.ps Combinators for breadth-first search]<br />
:Michael Spivey. 2000 <br />
<br />
;[http://www.eecs.usma.edu/webs/people/okasaki/icfp00.ps Breadth-First Numbering: Lessons from a Small Exercise in Algorithm Design]<br />
:Chris Okasaki. 2000.<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/ICFP00.ps.gz Deriving Backtracking Monad Transformers]<br />
:Ralf Hinze. 2000.<br />
<br />
;[http://www.cis.upenn.edu/~bcpierce/papers/rsr.ps Recursive subtyping revealed]<br />
:Vladimir Gapeyev, Michael Y. Levin, Benjamin C. Pierce. 2000.<br />
<br />
;[http://research.microsoft.com/Users/simonpj/Papers/financial-contracts/contracts-icfp.ps.gz Composing contracts: an adventure in financial engineering]<br />
:Simon Peyton Jones, Jean-Marc Eber, Julian Seward. 2000.<br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.8039 A poor man's concurrency monad]<br />
:Koen Claessen. 1999. <br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/BinomialHeaps/index.html Explaining binomial heaps]<br />
:Ralf Hinze. 1999. <br />
<br />
;[http://www.cs.dartmouth.edu/~doug/pearl.ps.gz Power series, power serious]<br />
:M. Douglas McIlroy. 1999. <br />
<br />
;[http://www.eecs.usma.edu/webs/people/okasaki/jfp99.ps Red-black trees in a functional setting]<br />
:Chris Okasaki. 1999. <br />
<br />
;[http://www.cs.cmu.edu/~rwh/papers/regexp/jfp.ps Proof-directed debugging]<br />
:Robert Harper. 1999. (see also [http://ropas.snu.ac.kr/~kwang/paper/06-jfp-yi.pdf Proof-directed debugging: revisited for a first-order version]).<br />
<br />
;[http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/publications/radix.ps.gz A pointless derivation of radix sort]<br />
:Jeremy Gibbons. 1999.<br />
<br />
;[http://www.cs.nott.ac.uk/~gmh/pearl.pdf Monadic parsing in Haskell]<br />
:Graham Hutton and Erik Meijer . 1998.<br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.6891 Polytypic unification]<br />
:Patrik Jansson and Johan Jeuring. 1998. <br />
<br />
;[http://web.engr.oregonstate.edu/~erwig/papers/Diet_JFP98.pdf Diets for fat sets]<br />
:Martin Erwig. 1998. <br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1673 Even higher-order functions for parsing or Why would anyone ever want to use a sixth-order function?]<br />
:Chris Okasaki. 1998.<br />
<br />
;[http://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced-fp/docs/huet-zipper.pdf The Zipper]<br />
:Gerard Huet. 1997. (See also [http://en.wikibooks.org/wiki/Haskell/Zippers The Haskell Wikibook]).<br />
<br />
;[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.7096 Lazy wheel sieves and spirals of primes]<br />
:Colin Runciman. 1997. <br />
<br />
;[http://www.eecs.usma.edu/webs/people/okasaki/jfp97.ps Three algorithms on Braun trees]<br />
:Chris Okasaki. 1997. <br />
<br />
;[http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/publications/thirdht.ps.gz The Third Homomorphism Theorem]<br />
:Jeremy Gibbons. 1996.<br />
<br />
;[http://www.cs.nott.ac.uk/~gmh/basics.pdf Back to Basics: Deriving Representation Changers Functionally].<br />
:Graham Hutton, Erik Meijer. 1996.<br />
<br />
;[http://research.microsoft.com/~akenn/fun/DrawingTrees.pdf Drawing Trees]<br />
:Andrew J. Kennedy. 1996.<br />
<br />
;[http://research.microsoft.com/~nick/newtrans.pdf Undoing Dynamic Typing]<br />
:Nick Benton. 2008.<br />
<br />
;[http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/publications/drawing.ps.gz Deriving Tidy Drawings of Trees]<br />
:Jeremy Gibbons. 1996. (See also [http://www.cs.auckland.ac.nz/CDMTCS//researchreports/003drawing.pdf the research report]).<br />
<br />
;[http://groups.csail.mit.edu/mac/users/adams/BB/ Efficient Sets - A Balancing Act]<br />
:Stephen Adams. 1993. ([http://hackage.haskell.org/packages/archive/containers/latest/doc/html/Data-Set.html Data.Set]).<br />
<br />
=== Potential Pearls ===<br />
<br />
Unpublished pearls.<br />
<br />
;[http://www.cs.ox.ac.uk/ralf.hinze/publications/Quote.pdf Typed Quote/AntiQuote]<br />
:Ralf Hinze. Unpublished work in progress.<br />
<br />
;[http://www.cs.nott.ac.uk/~txa/publ/alpha-draft.pdf &alpha;-conversion is easy]<br />
:Thorsten Altenkirch. Unpublished draft.<br />
<br />
=== Offline ===<br />
<br />
These appear not to be available online, unfortunately. If you know where they live, please link, and move into the 'online' section!<br />
<br />
;[http://www.cs.kent.ac.uk/pubs/2001/1293/index.html Red-black trees with types]<br />
:Stefan Kahrs. 2001. ([http://journals.cambridge.org/action/displayAbstract?aid=83905 JFP]) ([http://www.cs.kent.ac.uk/people/staff/smk/redblack/rb.html Code!])<br />
<br />
;On generating unique names<br />
:Lennart Augustsson, M Rittri, D Synek. 1994. ([http://www.cs.chalmers.se/~rittri/#publications Rittri's homepage])<br />
<br />
;A Symmetric Set of Efficient List Operations.<br />
:Rob R. Hoogerwoord, 1992. ([https://venus.tue.nl/ep-cgi/ep_publ.opl?taal=NL&fac_id=92&rn=19840694 Hoogerwoord's homepage]).<br />
<br />
;[http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=335124 Finding celebrities: A lesson in functional programming]<br />
:Richard Bird and Sharon Curtis. 2006. ([http://cms.brookes.ac.uk/staff/SharonCurtis/publications/index.html#celebrities See also]).<br />
<br />
;[http://journals.cambridge.org/action/displayAbstract?aid=254705 On tiling a chessboard]<br />
:Richard Bird. 2004. ([http://portal.acm.org/citation.cfm?id=1030333.1030336 ACM]) <br />
<br />
;[http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=44141 Meertens number]<br />
:Richard Bird. 1998.<br />
<br />
;[http://journals.cambridge.org/action/displayAbstract?aid=44091 On merging and selection]<br />
:Richard Bird. 1997. (See also [http://web.comlab.ox.ac.uk/people/Jeremy.Gibbons/publications/merging.ps.gz More on Merging and Selection]).<br />
<br />
;[http://journals.cambridge.org/action/displayAbstract?aid=44107 On building trees with minimum height]<br />
:Richard Bird. 1997. ([http://web.comlab.ox.ac.uk/people/Richard.Bird/publications-bib.html#Bird97:OnBuilding Bib]).<br />
<br />
;The Last Tail. <br />
:Richard Bird. 1993. ([http://web.comlab.ox.ac.uk/people/Richard.Bird/publications-bib.html#Bird93:Last Bib]).<br />
<br />
;Two Greedy Algorithms<br />
:Richard Bird, 1992. ([http://web.comlab.ox.ac.uk/oucl/work/richard.bird/publications-bib.html#Bird92:Two Bib]).<br />
<br />
;On Removing Duplicates<br />
:Richard Bird. 1991. ([http://web.comlab.ox.ac.uk/oucl/work/richard.bird/publications-bib.html#DBLP:journals/jfp/Bird91a Bib]).<br />
<br />
[[Category:Research]] [[Category:Tutorials]]</div>Sergvhttps://wiki.haskell.org/index.php?title=High-level_option_handling_with_GetOpt&diff=57103High-level option handling with GetOpt2013-11-17T19:22:52Z<p>Sergv: fix typo</p>
<hr />
<div>== Introduction ==<br />
<br />
I've used Haskell to create various command-line utitities for unix-like<br />
systems. In the process I developed a simple yet powerful and flexible<br />
technique for processing program options.<br />
<br />
Sven Panne's GetOpt module does a fine job at parsing command line options. It<br />
has a simple, easy to use interface, handles short and long options,<br />
abbreviated options and most needed option argument modes (no / optional /<br />
required argument). It can even produce a nice usage info from option<br />
descriptions. <br />
<br />
(Documentation and example code using this module can be found at<br />
http://www.haskell.org/ghc/docs/latest/html/libraries/base/System-Console-GetOpt.html)<br />
<br />
However, I checked over half a dozen serious programs written in Haskell and<br />
saw that in most of them the code responsible for option handling is quite<br />
ugly, repetitious and error prone (of course there were exceptions, see below).<br />
Saying 'option handling' I don't mean the work of analyzing the command line,<br />
but rather how supplied options influence program's behaviour, how error<br />
situations are handled, etc.<br />
<br />
(The exception is Wolfgang Thaller's VOP program. Wolfgang used a nice, but a<br />
little lengthy technique there. There are probably other ,,exceptional''<br />
programs out there).<br />
<br />
I think there are two major reasons of the current situation. The first reason<br />
is that option handling is rarely of primary concern to the programmer. Often<br />
there are not that much options to handle in the beginning and it seems that<br />
the simplest solution will do. Alas, if the program is evolving and its users<br />
ask for new functionality, new options appear, and the initial design starts to<br />
be an obstacle.<br />
<br />
The other reason is lack of good examples. The example delivered with GetOpt<br />
shows how to use the library, but it doesn't show that there are other, better<br />
ways to use it. I don't propose to introduce more advanced techniques to this<br />
example, because it would make it rather heavy. But it would be nice if there<br />
would be a pointer for interested users.<br />
<br />
== About typical use of GetOpt ==<br />
<br />
In a typical use of GetOpt module (like in the example) there is a definition<br />
of a sum [[type|datatype]] with data [[constructor]]s corresponding to individual command-line options. Options with no arguments are represented by nullary constructors, and options that must or can have arguments - by unary<br />
constructors. For example, below is a slightly modified example from GetOpt's<br />
documentation:<br />
<br />
<haskell><br />
data Flag = Verbose -- this option has no arguments<br />
| Version -- no arguments<br />
| Input (Maybe String) -- optional argument<br />
| Output String -- mandatory argument<br />
| LibDir String -- mandatory argument<br />
</haskell><br />
<br />
GetOpt uses 'ArgDescr a' datatype to specify argument modes of options:<br />
<br />
<haskell><br />
data ArgDescr a = NoArg a<br />
| ReqArg (String -> a) String<br />
| OptArg (Maybe String -> a) String<br />
</haskell><br />
<br />
Because every constructor of Flag datatype has one of types Flag, (String -><br />
Flag) or (Maybe String -> Flag), they can be given verbatim as first arguments<br />
to appropriate constructors of 'ArgDescr Flag' datatype.<br />
<br />
Used in this way, getOpt parses a list of strings to a list of Flag values.<br />
It also separates options from non option arguments, signals unrecognized,<br />
ambiguous or improperly used options. <br />
<br />
But after all, it is not that big step. There is still much processing of this<br />
Flag list to do. One has to check if specific options are there in the<br />
list, some options have to be combined, etc, etc.<br />
<br />
== Advertised technique ==<br />
===Imports for the full program===<br />
( Because I wanted it to be a fully functional Literate Haskell program, here come the required imports )<br />
<br />
<haskell><br />
module Main (main) where<br />
<br />
import System.Console.GetOpt<br />
import System<br />
import Control.Monad<br />
import IO<br />
import List<br />
import Char<br />
</haskell><br />
===Analysis of options typing===<br />
Let's look at program options not from the command-line side, but rather from<br />
program(mer)'s side. How to encode them? What would be the best way to present<br />
them to the programmer. How should they influence program's behaviour?<br />
<br />
'Verbose' could be an easily accessible Bool value, False by default.<br />
'Version', if supplied, could just print the program's version and exit.<br />
'Input' could just yield a String, either from stdin or from a specified file.<br />
'Output' could just take a String and do something with it.<br />
<br />
We need some kind of a fixed-size polytypic dictionary of option values - a<br />
perfect application for Haskell's [[record]]s. <br />
<br />
<haskell><br />
data Options = Options { optVerbose :: Bool<br />
, optInput :: IO String<br />
, optOutput :: String -> IO ()<br />
}<br />
</haskell><br />
<br />
Note that I didn't place optVersion in the record. We will handle this option<br />
differently.<br />
<br />
There is one special combination of options, the start (or default) options:<br />
<br />
<haskell><br />
startOptions :: Options<br />
startOptions = Options { optVerbose = False<br />
, optInput = getContents<br />
, optOutput = putStr<br />
}<br />
</haskell><br />
<br />
===Threading the options through the program===<br />
But in the end we would like to get the record reflecting the options given to<br />
the program. We do this by threading the Options record through functions<br />
processing single options. Each such function can change this record. Why not<br />
just put such functions in ArgDescr and OptDescr datatypes? Here we benefit<br />
from first-class citizenship of functions.<br />
<br />
I won't use a pure function with type (Options -> Options), but rather an<br />
effectful function in the IO Monad, because I want to easily perform side<br />
effects during option parsing (here only in 'version' and 'help' options, but I<br />
could also check validity of input and output files during option processing).<br />
You may prefer to use a pure function, a State+Error monad, a State+IO monad,<br />
or something different...<br />
<br />
<haskell><br />
options :: [ OptDescr (Options -> IO Options) ]<br />
options =<br />
[ Option "i" ["input"]<br />
(ReqArg<br />
(\arg opt -> return opt { optInput = readFile arg })<br />
"FILE")<br />
"Input file"<br />
<br />
, Option "o" ["output"]<br />
(ReqArg<br />
(\arg opt -> return opt { optOutput = writeFile arg })<br />
"FILE")<br />
"Output file"<br />
<br />
, Option "s" ["string"]<br />
(ReqArg<br />
(\arg opt -> return opt { optInput = return arg })<br />
"FILE")<br />
"Input string"<br />
<br />
, Option "v" ["verbose"]<br />
(NoArg<br />
(\opt -> return opt { optVerbose = True }))<br />
"Enable verbose messages"<br />
<br />
, Option "V" ["version"]<br />
(NoArg<br />
(\_ -> do<br />
hPutStrLn stderr "Version 0.01"<br />
exitWith ExitSuccess))<br />
"Print version"<br />
<br />
, Option "h" ["help"]<br />
(NoArg<br />
(\_ -> do<br />
prg <- getProgName<br />
hPutStrLn stderr (usageInfo prg options)<br />
exitWith ExitSuccess))<br />
"Show help"<br />
]<br />
</haskell><br />
<br />
===The <hask>main</hask> function===<br />
<br />
<haskell><br />
main = do<br />
args <- getArgs<br />
<br />
-- Parse options, getting a list of option actions<br />
let (actions, nonOptions, errors) = getOpt RequireOrder options args<br />
<br />
-- Here we thread startOptions through all supplied option actions<br />
opts <- foldl (>>=) (return startOptions) actions<br />
<br />
let Options { optVerbose = verbose<br />
, optInput = input<br />
, optOutput = output } = opts<br />
<br />
when verbose (hPutStrLn stderr "Hello!")<br />
<br />
input >>= output<br />
</haskell><br />
<br />
Voila! As you can see most of work is done in option descriptions.<br />
<br />
[[Category:Idioms]]<br />
<br />
== See also ==<br />
<br />
* [[Command line option parsers]]<br />
* [[GetOpt]]</div>Sergvhttps://wiki.haskell.org/index.php?title=The_Monad.Reader/Issue5/Practical_Graph_Handling&diff=42601The Monad.Reader/Issue5/Practical Graph Handling2011-10-27T18:18:09Z<p>Sergv: /* foldG */ sync explanation and code</p>
<hr />
<div>'''This article needs reformatting! Please help tidy it up.'''--[[User:WouterSwierstra|WouterSwierstra]] 14:26, 9 May 2008 (UTC)<br />
<br />
= A Practical Approach to Graph Manipulation =<br />
''by JeanPhilippeBernardy for The Monad.Reader Issue 5''<br />
[[BR]]<br />
''[[Date(2005-07-08T20:48:51Z)]]''<br />
<br />
'''Abstract.'''<br />
<br />
Tree-based data structures are easy to deal with in Haskell. <br />
However, working with graph-like structures in practice is much less obvious. <br />
In this article I present a solution that has worked for me in many cases.<br />
<br />
<br />
== Introduction ==<br />
<br />
I always found that dealing with graphs in Haskell is a tricky subject. <br />
Even something like a implementing a depth-first search, which is <br />
trivially achieved in an imperative language, deserves an article <br />
on its own for Haskell [[#dfs 4]].<br />
A PhD thesis has been written on the subject of graphs and functional programming [[#king-thesis 2]], and it seems that it still <br />
doesn't exhaust the <br />
design-space: radically different ideas have been proposed afterwards [[#induct 3]].<br />
<br />
In this article I'll present (a simplified version of) a solution that <br />
I think deserves more coverage [[#cycle-therapy 1]]. The idea is to abstract graph <br />
manipulation by anamorphisms and catamorphisms.<br />
This approach features "separation of concerns" and "composability", hence it<br />
can be readily applied to <br />
practical problems.<br />
<br />
* Section 2 shows how anamorphisms and catamorphisms can be generalised to graphs.<br />
* Section 3 details the data structures used to represent graphs<br />
* Section 4 discusses various problems where cata/anamorphisms can be applied<br />
* Section 5 gives a sample implementation for the catamorphism and anamorphism<br />
* Section 6 concludes<br />
<br />
=== Nota ===<br />
<br />
This article has been generated from a literate haskell source. <br />
So, although the text of this wiki page will not compile, all the examples are <br />
real and run. The source can be accessed here: attachment:PracticalGraphHandling.lhs<br />
<br />
We will assume you know the hierarchical libraries. Refer to http://haskell.org/ghc/docs/latest/html/libraries/index.html in case of doubt.<br />
<br />
== Origami with Graphs ==<br />
<br />
=== Fold & Unfold (the big deal) ===<br />
<br />
Most of you probably know what a "fold" (also known as catamorphism) <br />
is. For those who don't, intuitively, it's an higher-order operation <br />
that reduces a complex structure to a single value. It applies a <br />
function given as parameter to each node, propagating the results <br />
up to the root. This is a highly imprecise definition, for more <br />
details please read [[#bananas-lenses 5]].<br />
<br />
For example, the fold operation on lists can be typed as follows:<br />
<br />
<haskell><br />
foldr :: (a -> b -> b) -> -- ^ operation to apply<br />
b -> -- ^ initial value<br />
[a] -> -- ^ input list<br />
b -- ^ result<br />
</haskell><br />
<br />
Conversely, "unfold" builds a complex structure out of a building<br />
function, applying it iteratively.<br />
<br />
<haskell><br />
unfoldr :: (b -> Maybe (a, b)) -> -- ^ building function (Nothing => end of list)<br />
b -> -- ^ seed value<br />
[a] -- ^ result<br />
</haskell><br />
<br />
The second argument is the initial value from which the <br />
whole resulting list will be derived, by applying the 1st argument.<br />
In the following we'll refer to it as the "seed".<br />
<br />
<br />
The catamorphism/anamorphism abstractions have proven to be <br />
very useful in practise. They're ubiquitous to any haskell <br />
programming, either explicitly, or implicitly (hidden in <br />
higher-level operations). In this article I'll show how <br />
those abstractions can be generalised to graph structures, <br />
and argue that they are equally useful in this case.<br />
<br />
The rest of the article assumes the reader is fairly familiar <br />
with fold and unfold. Fortunately there are many articles on the<br />
subject. For example you can refer to [[#bananas-lenses 5]] if you ever feel uncomfortable.<br />
<br />
=== Generalisation ===<br />
<br />
Let's examine how fold/unfold can be generalized for graphs.<br />
Since we are working on graphs instead of lists, we must account for<br />
<br />
1. Any number of children for a node;<br />
2. "Backwards" arcs (cycles);<br />
3. Labelled edges.<br />
<br />
The most relevant point being 2, of course.<br />
<br />
==== unfoldG ====<br />
<br />
From the above, we can deduce that the type of unfoldG will be:<br />
<br />
<haskell><br />
unfoldG :: (Ord s) => (s -> (n, [(e, s)])) -> s -> (Vertex, LabGraph n e)<br />
unfoldG f r = (r', res)<br />
where ([r'], res) = unfoldGMany f [r]<br />
</haskell><br />
where <code>s</code> is the seed type, <code>n</code> is the node labels, <code>e</code> the edges labels.<br />
<br />
The <code>Ord s</code> constraint reflects point 2 above. <br />
It is needed because the unfoldG function must record every <br />
seed value encountered.<br />
Whenever a seed is seen a second time, <code>unfoldG</code> will recognize <br />
it and create a "backward arc".<br />
We use <code>Ord</code> instead of <code>Eq</code> because a mere equality test rules out using <code>Data.Map</code>.<br />
<br />
The attentive reader will note that we return an additional <br />
Vertex value. This is needed to identifty which node the root<br />
seed corresponds to.<br />
<br />
In order to get an intuitive feeling of how <code>unfoldG</code> works,<br />
let's examine a simple example.<br />
<br />
<haskell><br />
gr1 :: LabGraph Int Char<br />
(_,gr1) = unfoldG gen (0::Int) <br />
where gen x = (x,[('a',(x+1) `mod` 10), ('b', (x+2) `mod` 10)])<br />
</haskell><br />
<br />
<code>gr1</code> being defined as above, its structure is:<br />
<br />
attachment:gr1.png<br />
<br />
Because we might want to build a graph from a set of seeds <br />
instead of a single one, we will also need the following function:<br />
<haskell><br />
unfoldGMany :: (Ord s) => (s -> (n, [(e, s)])) -> [s] -> ([Vertex], LabGraph n e)<br />
unfoldGMany f roots = runST ( unfoldGManyST f roots ) -- detailed later<br />
</haskell> <br />
<br />
<code>unfoldG</code>, alone, is already very a practical tool, because it <br />
lets you reify a function (<code>a -> a</code>) graph. It then can be examined, <br />
processed, etc. whereas the function can only be evaluated.<br />
<br />
==== foldG ====<br />
<br />
On a graph, the catamorphism (fold) type will become:<br />
<haskell><br />
foldG :: (Eq r) => r -> (Vertex -> [(e, r)] -> r) -> Graph e -> Vertex -> r<br />
foldG i f g v = foldGAll i f g ! v<br />
</haskell><br />
<br />
As for <code>unfoldG</code>, the <code>foldG</code> <br />
function must include a special mechanism to handle cycles.<br />
The idea is to apply the operation iteratively until the result <br />
converges. It's the purpose of the first <br />
parameter is to "bootstrap" the process: <br />
it will be used as an initial value.<br />
<br />
Thus, <code>foldG i f g ! v</code> will iteratively <br />
apply <code>f</code> on nodes of graph <code>g</code>, <br />
using <code>i</code> as "bottom" value. It will return <br />
the value computed at vertex <code>v</code>. <br />
Of course, this will work only if <code>f</code> is well-behaved: <br />
it must converge at some point.<br />
I won't dwelve in to the theoretical details <br />
here, see [[#cycle-therapy 1]] for a<br />
formal explanation.<br />
<br />
Notice that <code>foldG</code> can work on a graph without node labels. <br />
If the parameter function needs to access node labels, it can <br />
do so without <code>foldG</code> needing to know.<br />
<br />
It's also worth noticing that, in our implementation, the <br />
information will be propagated in the reverse direction of arcs. <br />
<br />
It's very common to need the result value for each vertex, <br />
hence the need for :<br />
<br />
<haskell><br />
foldGAll :: (Eq r) => r -> (Vertex -> [(e, r)] -> r) -> Graph e -> Table r<br />
</haskell><br />
<br />
<br />
<br />
<br />
<br />
The implementation of these functions doesn't matter much. <br />
The point of the article is not how these can be implemented, <br />
but how they can be used for daily programming tasks. <br />
For completeness though, we'll provide a <br />
sample implemenation at the end of the article.<br />
<br />
== Data Structure & Accessors ==<br />
<br />
Without further ado, let's define the data structures we'll work on. <br />
<br />
<br />
<haskell><br />
type Vertex = Int<br />
type Table a = Array Vertex a<br />
type Graph e = Table [(e, Vertex)]<br />
type Bounds = (Vertex, Vertex)<br />
type Edge e = (Vertex, e, Vertex)<br />
</haskell><br />
A graph is a mere adjacency list table, tagged with edge labels.<br />
<br />
The above structure lacks labels for nodes.<br />
This is easily fixed by adding a labeling (or coloring) function.<br />
<haskell><br />
type Labeling a = Vertex -> a<br />
data LabGraph n e = LabGraph (Graph e) (Labeling n)<br />
<br />
vertices (LabGraph gr _) = indices gr<br />
<br />
labels (LabGraph gr l) = map l (indices gr)<br />
</haskell><br />
<br />
<br />
The above departs slightly from what's prescribed in [[#cycle-therapy 1]]. Instead of <br />
a ''true graph'' built by knot-tying, we chose to use an <code>Array</code><br />
with integers as explicit vertex references.<br />
This is closely follows <br />
Data.Graph in the hierarchical libraries, <br />
the only difference being that we have labelled edges. <br />
<br />
Not only this is simpler, but it has the advantage that we can reuse<br />
most of the algorithms from Data.Graph with only minor changes:<br />
<br />
<haskell><br />
-- | Build a graph from a list of edges.<br />
buildG :: Bounds -> [Edge e] -> Graph e<br />
buildG bounds0 edges0 = accumArray (flip (:)) [] bounds0 [(v, (l,w)) | (v,l,w) <- edges0]<br />
<br />
-- | The graph obtained by reversing all edges.<br />
transposeG :: Graph e -> Graph e<br />
transposeG g = buildG (bounds g) (reverseE g)<br />
<br />
reverseE :: Graph e -> [Edge e]<br />
reverseE g = [ (w, l, v) | (v, l, w) <- edges g ]<br />
</haskell><br />
<br />
<br />
However, as previously said, we'll try to abstract <br />
away from the details of the structure.<br />
This is not always possible, but in such cases, <br />
I believe the array representation to be<br />
a good choice, because it's easy to work with. <br />
If anything, one can readily use all the<br />
standard array functions.<br />
<br />
For example, here's the function to output a graph as a GraphViz file:<br />
<haskell><br />
showGraphViz (LabGraph gr lab) = <br />
"digraph name {\n" ++<br />
"rankdir=LR;\n" ++<br />
(concatMap showNode $ indices gr) ++<br />
(concatMap showEdge $ edges gr) ++<br />
"}\n"<br />
where showEdge (from, t, to) = show from ++ " -> " ++ show to ++<br />
" [label = \"" ++ show t ++ "\"];\n"<br />
showNode v = show v ++ " [label = " ++ (show $ lab v) ++ "];\n"<br />
<br />
edges :: Graph e -> [Edge e]<br />
edges g = [ (v, l, w) | v <- indices g, (l, w) <- g!v ]<br />
</haskell> <br />
<br />
<br />
== Applications ==<br />
<br />
I'll now enumerate a few problems where the "origami" approach can be applied successfully.<br />
<br />
=== Closure ===<br />
<br />
A simple application (special case) of "unfoldG" the <br />
computation of the transitive closure of a non-deterministic function.<br />
<br />
<haskell><br />
closure :: Ord a => (a -> [a]) -> (a -> [a])<br />
closure f i = labels $ snd $ unfoldG f' i <br />
where f' x = (x, [((), fx) | fx <- f x])<br />
</haskell><br />
<br />
In this context, "non deterministic" means that it yields many <br />
values, as a list. As noted before, this will work only when <br />
everything remains finite in size.<br />
<br />
<br />
For example, if we define<br />
<br />
<haskell><br />
interleave (x1:x2:xs) = (x1:x2:xs) : (map (x2:) (interleave (x1:xs)))<br />
interleave xs = [xs]<br />
<br />
interleave "abcd" ==> ["abcd","bacd","bcad","bcda"]<br />
</haskell><br />
<br />
a very bad way to compute the permutations of list can be<br />
<br />
<haskell><br />
permutations = closure interleave<br />
<br />
permutations "abcd" ==> ["abcd","bacd","acbd","cabd","abdc","badc", <br />
"adbc","dabc","dbac","bdac","dacb","adcb",<br />
"dcab","cdab","cadb","acdb","cdba","dcba",<br />
"cbda","bcda","bdca","dbca","bcad","cbad"]<br />
</haskell><br />
<br />
But sometimes the function to 'close' is more complicated than <code>interleave</code> and<br />
then <code>closure</code> becomes really useful.<br />
<br />
<br />
=== Shortest Path ===<br />
<br />
Let us now examine the toy problem of finding the distance <br />
to a given node from all the other nodes of the graph. <br />
Most readers probably know the Dijkstra's algorithm to <br />
compute the solution to the problem. We will not try <br />
to reproduce it here, instead we will define the computation in terms of <code>foldG</code>.<br />
<br />
Here it goes:<br />
<haskell><br />
-- | Compute the distance to v for every vertex of gr.<br />
distsTo :: Vertex -> Graph Float -> Table Float<br />
distsTo v gr = foldGAll infinite distance gr <br />
where infinite = 10000000 -- well, you get the idea<br />
distance v' neighbours <br />
| v == v' = 0<br />
| otherwise = minimum [distV+arcWeight | (distV, arcWeight) <- neighbours]<br />
</haskell><br />
<br />
So clear that it barely needs to be explained. :) <br />
Just notice how the minimize function assumes that the <br />
distance is already computed for all its neighbours. <br />
This works because <code>foldG</code> will iterate until it finds the fixed point.<br />
<br />
On this simple graph,<br />
<br />
<haskell><br />
grDist = buildG (1,5) [(1,5.0,2), (2,5.0,3), (2,7.0,4), (3,5.0,4), (4,5.0,5), (4,3.0,1)]<br />
</haskell><br />
<br />
attachment:grdist.png<br />
<br />
the result of <haskell><br />
dists = distsTo 5 grDist<br />
</haskell> is<br />
<br />
attachment:grdist2.png<br />
<br />
(labeling each node with the its result, ie. distance to vertex <code>5</code>)<br />
<br />
=== Finite Automaton ===<br />
<br />
Finite automatons are basically graphs, so let's see how we can apply the <br />
framework to their analysis.<br />
<br />
First, let's define an automaton. For our purposes, it is a graph <br />
of states/transitions, some of the states being marked as initial or final.<br />
<br />
<haskell><br />
type Automaton t = (Vertex, Graph t, Set Vertex) -- ^ Initial, transitions, finals<br />
</haskell><br />
<br />
For starters, here is how the <code>showGraphViz</code> function can be applied to automaton display:<br />
<br />
<haskell><br />
automatonToGraphviz (i, gr, fs) = showGraphViz (LabGraph gr lab)<br />
where lab :: Labeling String<br />
lab v = (if v == i then (">"++) else id) $ <br />
(if v `Set.member` fs then (++"|") else id) []<br />
</haskell><br />
<br />
Nothing ground breaking. We only label the nodes accordingly to<br />
their final or initial status.<br />
<br />
<haskell><br />
aut1 = (1, buildG (1,3) [(1,'a',2),(2,'a',2),(2,'b',2),(2,'c',3),(1,'a',3)], Set.fromList [3])<br />
</haskell><br />
<br />
attachment:aut1.png<br />
<br />
A more interesting example is how to transform a non-deterministic <br />
automaton to an equivalent deterministic one. The underlying idea <br />
is that non-deterministic execution of the automaton is equivalent <br />
to deterministic execution on all possible transitions at once. <br />
Refer to [[#hop&ull 6]] for details. This is relatively easily done using <code>unfoldG</code>.<br />
<haskell><br />
simpleGenerator f x = (x, f x)<br />
<br />
nfaToDfa :: Ord e => Automaton e -> Automaton e<br />
nfaToDfa (initial1, aut1, finals1) = (initial2, aut2, finals2)<br />
where (initial2, LabGraph aut2 mapping) = unfoldG (simpleGenerator build) seed<br />
seed = Set.singleton initial1<br />
build state = Map.toList $ Map.fromListWith Set.union $ map lift $<br />
concat $ map (aut1 !) $ Set.toList state<br />
lift (t,s) = (t, Set.singleton s)<br />
isFinal = setAny (`Set.member` finals1) . mapping<br />
finals2 = Set.fromList $ filter isFinal $ indices aut2<br />
setAny f = any f . Set.toList<br />
</haskell><br />
<br />
The 'build' function is the tricky part. Yet, it's not as complicated as it seems: all it does is <br />
#Find all reachable nodes from a set of nodes; <br />
#Classify them by transition label<br />
#Build target state-sets accordingly.<br />
<br />
<haskell><br />
aut2 = nfaToDfa aut1<br />
</haskell><br />
<br />
attachment:aut2.png<br />
<br />
Another thing we possibly wish to compute is the set of <br />
strings accepted by the automaton, (aka. the language it <br />
defines). Most of the time this will be infinite, so <br />
we will limit ourselves to strings of length <code>n</code> maximum.<br />
We need finiteness because otherwise <code>foldG</code> would not find<br />
a fixed point: string sets would keep growing idefinitely.<br />
<br />
<haskell><br />
accepted n (initial1, aut1, finals1) = Set.unions [resultTable ! v | v <- Set.toList finals1]<br />
-- gather what's accepted at all final states<br />
where resultTable = foldGAll Set.empty step (transposeG aut1)<br />
step v trans = Set.unions ((if v == initial1 then Set.singleton [] else Set.empty) : <br />
[Set.map ((++[t]) . take (n-1) ) s | (t,s) <- trans])<br />
</haskell><br />
<br />
Notice that we need to reverse the graph arcs, otherwise the information propagates in the wrong direction.<br />
<br />
With <br />
<haskell><br />
accAut1 = accepted 4 aut1<br />
accAut2 = accepted 4 aut2<br />
</haskell><br />
we have <br />
<haskell><br />
accAut1 == accAut2 == {"a","aaac","aabc","aac","abac","abbc","abc","ac"}<br />
</haskell><br />
<br />
=== LALR Automaton ===<br />
<br />
Another area where I applied graph (un)folding is LALR(1) parser generation. The detailed code<br />
depends on just too many things to fit in this paper, <br />
thus we will only sketch how pieces fit<br />
together. Also, since a course on parsing is clearly beyond the scope of this article, <br />
please refer to local copy of the dragon book [[#dragon 7]] for details on the method.<br />
<br />
In the process of generating tables for a LALR automaton, <br />
there are three steps amenable to implementation by <code>foldG</code> and <code>unfoldG</code>.<br />
<br />
1. Construction of the closure of a LR-items kernel. This one is very similar to the <code>closure</code> function described above, except that we don't discard the graph structure. It'll be of use for step 3.<br />
2. LR(0) automaton generation. Then again a use for <code>unfoldG</code>.<br />
3. Propagation of the lookahead. It is a fold over the whole graph of LR-items, basically using set union as coalescing operation. It is very similar to computation of acceptable strings above.<br />
<br />
<br />
== Implementation ==<br />
<br />
=== UnfoldG ===<br />
<br />
For the sake of completeness, here's how to implement the <code>unfoldG</code> function.<br />
<br />
The algorithm effectively a depth-first search, written in imperative style.<br />
The only difference is that the search graph is remembered and returned as result.<br />
<br />
<haskell><br />
<br />
unfoldGManyST :: (Ord a) => (a -> (c, [(b, a)]))<br />
-> [a] -> ST s ([Vertex], LabGraph c b)<br />
unfoldGManyST gen seeds =<br />
do mtab <- newSTRef (Map.empty)<br />
allNodes <- newSTRef []<br />
vertexRef <- newSTRef firstId<br />
let allocVertex = <br />
do vertex <- readSTRef vertexRef<br />
writeSTRef vertexRef (vertex + 1)<br />
return vertex<br />
let cyc src =<br />
do probe <- memTabFind mtab src<br />
case probe of<br />
Just result -> return result<br />
Nothing -> do<br />
v <- allocVertex<br />
memTabBind src v mtab <br />
let (lab, deps) = gen src<br />
ws <- mapM (cyc . snd) deps<br />
let res = (v, lab, [(fst d, w) | d <- deps | w <- ws])<br />
modifySTRef allNodes (res:)<br />
return v<br />
mapM_ cyc seeds<br />
list <- readSTRef allNodes<br />
seedsResult <- (return . map fromJust) =<< mapM (memTabFind mtab) seeds<br />
lastId <- readSTRef vertexRef<br />
let cycamore = array (firstId, lastId-1) [(i, k) | (i, a, k) <- list]<br />
let labels = array (firstId, lastId-1) [(i, a) | (i, a, k) <- list]<br />
return (seedsResult, LabGraph cycamore (labels !))<br />
where firstId = 0::Vertex<br />
memTabFind mt key = return . Map.lookup key =<< readSTRef mt<br />
memTabBind key val mt = modifySTRef mt (Map.insert key val)<br />
<br />
</haskell><br />
<br />
Notice how every time a seed is encountered, its corresponding vertex number stored. <br />
Whenever the seed is encountered again, the stored is just returned.<br />
<br />
<br />
=== FoldG ===<br />
<br />
<haskell><br />
foldGAllImplementation bot f gr = finalTbl<br />
where finalTbl = fixedPoint updateTbl initialTbl<br />
initialTbl = listArray bnds (replicate (rangeSize bnds) bot)<br />
<br />
fixedPoint f x = fp x<br />
where fp z = if z == z' then z else fp z'<br />
where z' = f z<br />
updateTbl tbl = listArray bnds $ map recompute $ indices gr<br />
where recompute v = f v [(b, tbl!k) | (b, k) <- gr!v]<br />
bnds = bounds gr<br />
</haskell><br />
<br />
<br />
The proposed implementation for foldG is rather bold.<br />
It just applies the coalescing <br />
function repeatedly till it converges.<br />
<br />
While this is not an ideal situation, it's perfectly suited for a first-trial <br />
implementation, or when performance is not crucial.<br />
<br />
If execution time becomes critical, then more specialized <br />
versions can be crafted.<br />
In the case of the shortest path algorithm, for example, <br />
it could take advantage<br />
of the nice properties of the coalescing function to use <br />
a priority queue and greedily<br />
find the fixed point. This would restore the optimal O(n * log n) complexity.<br />
<br />
<br />
== Conclusion ==<br />
<br />
The approach presented may not be excellent for controlling details of implementation <br />
and tuning run-time performance, but I think that's not the point <br />
of haskell programming anyway.<br />
On the other hand, it is very good for quick implementation <br />
of a large range of graph algorithms. The fact that it's mostly based on a<br />
generalisation on fold and unfold should appeal to haskell<br />
programmers.<br />
<br />
<br />
== References ==<br />
<br />
* [[Anchor(cycle-therapy)]] [1] ''Cycle Therapy: A Prescription for Fold and Unfold on Regular Trees'', F. Turbak and J.B. Wells, http://cs.wellesley.edu/~fturbak/pubs/ppdp01.pdf<br />
* [[Anchor(king-thesis)]] [2] ''Functional Programming and Graph Algorithms'', D. J. King, http://www.macs.hw.ac.uk/~gnik/publications <br />
* [[Anchor(induct)]] [3] ''Inductive Graphs and Functional Graph Algorithms'', Martin Erwig, http://web.engr.oregonstate.edu/~erwig/papers/abstracts.html<br />
* [[Anchor(dfs)]] [4] , D. J. King and John Launchbury, http://www.cse.ogi.edu/~jl/Papers/dfs.ps<br />
* [[Anchor(bananas-lenses)]] [5] ''Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire'', Erik Meijer, Maarten Fokkinga, Ross Paterson. http://citeseer.ist.psu.edu/meijer91functional.html<br />
* [[Anchor(hop&ull)]] [6] ''Introduction to Automata Theory, Languages, and Computation'', JE Hopcroft, and JD Ullman, http://www-db.stanford.edu/~ullman/ialc.html <br />
* [[Anchor(dragon)]] [7] ''Compilers: Principles, Techniques and Tools'', Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. (Addison-Wesley 1986; ISBN 0-201-10088-6)<br />
<br />
----<br />
CategoryArticle</div>Sergvhttps://wiki.haskell.org/index.php?title=The_Monad.Reader/Issue5/Practical_Graph_Handling&diff=42600The Monad.Reader/Issue5/Practical Graph Handling2011-10-27T18:13:47Z<p>Sergv: Fix a typo</p>
<hr />
<div>'''This article needs reformatting! Please help tidy it up.'''--[[User:WouterSwierstra|WouterSwierstra]] 14:26, 9 May 2008 (UTC)<br />
<br />
= A Practical Approach to Graph Manipulation =<br />
''by JeanPhilippeBernardy for The Monad.Reader Issue 5''<br />
[[BR]]<br />
''[[Date(2005-07-08T20:48:51Z)]]''<br />
<br />
'''Abstract.'''<br />
<br />
Tree-based data structures are easy to deal with in Haskell. <br />
However, working with graph-like structures in practice is much less obvious. <br />
In this article I present a solution that has worked for me in many cases.<br />
<br />
<br />
== Introduction ==<br />
<br />
I always found that dealing with graphs in Haskell is a tricky subject. <br />
Even something like a implementing a depth-first search, which is <br />
trivially achieved in an imperative language, deserves an article <br />
on its own for Haskell [[#dfs 4]].<br />
A PhD thesis has been written on the subject of graphs and functional programming [[#king-thesis 2]], and it seems that it still <br />
doesn't exhaust the <br />
design-space: radically different ideas have been proposed afterwards [[#induct 3]].<br />
<br />
In this article I'll present (a simplified version of) a solution that <br />
I think deserves more coverage [[#cycle-therapy 1]]. The idea is to abstract graph <br />
manipulation by anamorphisms and catamorphisms.<br />
This approach features "separation of concerns" and "composability", hence it<br />
can be readily applied to <br />
practical problems.<br />
<br />
* Section 2 shows how anamorphisms and catamorphisms can be generalised to graphs.<br />
* Section 3 details the data structures used to represent graphs<br />
* Section 4 discusses various problems where cata/anamorphisms can be applied<br />
* Section 5 gives a sample implementation for the catamorphism and anamorphism<br />
* Section 6 concludes<br />
<br />
=== Nota ===<br />
<br />
This article has been generated from a literate haskell source. <br />
So, although the text of this wiki page will not compile, all the examples are <br />
real and run. The source can be accessed here: attachment:PracticalGraphHandling.lhs<br />
<br />
We will assume you know the hierarchical libraries. Refer to http://haskell.org/ghc/docs/latest/html/libraries/index.html in case of doubt.<br />
<br />
== Origami with Graphs ==<br />
<br />
=== Fold & Unfold (the big deal) ===<br />
<br />
Most of you probably know what a "fold" (also known as catamorphism) <br />
is. For those who don't, intuitively, it's an higher-order operation <br />
that reduces a complex structure to a single value. It applies a <br />
function given as parameter to each node, propagating the results <br />
up to the root. This is a highly imprecise definition, for more <br />
details please read [[#bananas-lenses 5]].<br />
<br />
For example, the fold operation on lists can be typed as follows:<br />
<br />
<haskell><br />
foldr :: (a -> b -> b) -> -- ^ operation to apply<br />
b -> -- ^ initial value<br />
[a] -> -- ^ input list<br />
b -- ^ result<br />
</haskell><br />
<br />
Conversely, "unfold" builds a complex structure out of a building<br />
function, applying it iteratively.<br />
<br />
<haskell><br />
unfoldr :: (b -> Maybe (a, b)) -> -- ^ building function (Nothing => end of list)<br />
b -> -- ^ seed value<br />
[a] -- ^ result<br />
</haskell><br />
<br />
The second argument is the initial value from which the <br />
whole resulting list will be derived, by applying the 1st argument.<br />
In the following we'll refer to it as the "seed".<br />
<br />
<br />
The catamorphism/anamorphism abstractions have proven to be <br />
very useful in practise. They're ubiquitous to any haskell <br />
programming, either explicitly, or implicitly (hidden in <br />
higher-level operations). In this article I'll show how <br />
those abstractions can be generalised to graph structures, <br />
and argue that they are equally useful in this case.<br />
<br />
The rest of the article assumes the reader is fairly familiar <br />
with fold and unfold. Fortunately there are many articles on the<br />
subject. For example you can refer to [[#bananas-lenses 5]] if you ever feel uncomfortable.<br />
<br />
=== Generalisation ===<br />
<br />
Let's examine how fold/unfold can be generalized for graphs.<br />
Since we are working on graphs instead of lists, we must account for<br />
<br />
1. Any number of children for a node;<br />
2. "Backwards" arcs (cycles);<br />
3. Labelled edges.<br />
<br />
The most relevant point being 2, of course.<br />
<br />
==== unfoldG ====<br />
<br />
From the above, we can deduce that the type of unfoldG will be:<br />
<br />
<haskell><br />
unfoldG :: (Ord s) => (s -> (n, [(e, s)])) -> s -> (Vertex, LabGraph n e)<br />
unfoldG f r = (r', res)<br />
where ([r'], res) = unfoldGMany f [r]<br />
</haskell><br />
where <code>s</code> is the seed type, <code>n</code> is the node labels, <code>e</code> the edges labels.<br />
<br />
The <code>Ord s</code> constraint reflects point 2 above. <br />
It is needed because the unfoldG function must record every <br />
seed value encountered.<br />
Whenever a seed is seen a second time, <code>unfoldG</code> will recognize <br />
it and create a "backward arc".<br />
We use <code>Ord</code> instead of <code>Eq</code> because a mere equality test rules out using <code>Data.Map</code>.<br />
<br />
The attentive reader will note that we return an additional <br />
Vertex value. This is needed to identifty which node the root<br />
seed corresponds to.<br />
<br />
In order to get an intuitive feeling of how <code>unfoldG</code> works,<br />
let's examine a simple example.<br />
<br />
<haskell><br />
gr1 :: LabGraph Int Char<br />
(_,gr1) = unfoldG gen (0::Int) <br />
where gen x = (x,[('a',(x+1) `mod` 10), ('b', (x+2) `mod` 10)])<br />
</haskell><br />
<br />
<code>gr1</code> being defined as above, its structure is:<br />
<br />
attachment:gr1.png<br />
<br />
Because we might want to build a graph from a set of seeds <br />
instead of a single one, we will also need the following function:<br />
<haskell><br />
unfoldGMany :: (Ord s) => (s -> (n, [(e, s)])) -> [s] -> ([Vertex], LabGraph n e)<br />
unfoldGMany f roots = runST ( unfoldGManyST f roots ) -- detailed later<br />
</haskell> <br />
<br />
<code>unfoldG</code>, alone, is already very a practical tool, because it <br />
lets you reify a function (<code>a -> a</code>) graph. It then can be examined, <br />
processed, etc. whereas the function can only be evaluated.<br />
<br />
==== foldG ====<br />
<br />
On a graph, the catamorphism (fold) type will become:<br />
<haskell><br />
foldG :: (Eq r) => r -> (Vertex -> [(e, r)] -> r) -> Graph e -> Vertex -> r<br />
foldG i f g v = foldGAll i f g ! v<br />
</haskell><br />
<br />
As for <code>unfoldG</code>, the <code>foldG</code> <br />
function must include a special mechanism to handle cycles.<br />
The idea is to apply the operation iteratively until the result <br />
converges. It's the purpose of the first <br />
parameter is to "bootstrap" the process: <br />
it will be used as an initial value.<br />
<br />
Thus, <code>foldG i f g v</code> will iteratively <br />
apply <code>f</code> on nodes of graph <code>g</code>, <br />
using <code>i</code> as "bottom" value. It will return <br />
the value computed at vertex <code>v</code>. <br />
Of course, this will work only if <code>f</code> is well-behaved: <br />
it must converge at some point.<br />
I won't dwelve in to the theoretical details <br />
here, see [[#cycle-therapy 1]] for a<br />
formal explanation.<br />
<br />
Notice that <code>foldG</code> can work on a graph without node labels. <br />
If the parameter function needs to access node labels, it can <br />
do so without <code>foldG</code> needing to know.<br />
<br />
It's also worth noticing that, in our implementation, the <br />
information will be propagated in the reverse direction of arcs. <br />
<br />
It's very common to need the result value for each vertex, <br />
hence the need for :<br />
<br />
<haskell><br />
foldGAll :: (Eq r) => r -> (Vertex -> [(e, r)] -> r) -> Graph e -> Table r<br />
</haskell><br />
<br />
<br />
<br />
<br />
<br />
The implementation of these functions doesn't matter much. <br />
The point of the article is not how these can be implemented, <br />
but how they can be used for daily programming tasks. <br />
For completeness though, we'll provide a <br />
sample implemenation at the end of the article.<br />
<br />
== Data Structure & Accessors ==<br />
<br />
Without further ado, let's define the data structures we'll work on. <br />
<br />
<br />
<haskell><br />
type Vertex = Int<br />
type Table a = Array Vertex a<br />
type Graph e = Table [(e, Vertex)]<br />
type Bounds = (Vertex, Vertex)<br />
type Edge e = (Vertex, e, Vertex)<br />
</haskell><br />
A graph is a mere adjacency list table, tagged with edge labels.<br />
<br />
The above structure lacks labels for nodes.<br />
This is easily fixed by adding a labeling (or coloring) function.<br />
<haskell><br />
type Labeling a = Vertex -> a<br />
data LabGraph n e = LabGraph (Graph e) (Labeling n)<br />
<br />
vertices (LabGraph gr _) = indices gr<br />
<br />
labels (LabGraph gr l) = map l (indices gr)<br />
</haskell><br />
<br />
<br />
The above departs slightly from what's prescribed in [[#cycle-therapy 1]]. Instead of <br />
a ''true graph'' built by knot-tying, we chose to use an <code>Array</code><br />
with integers as explicit vertex references.<br />
This is closely follows <br />
Data.Graph in the hierarchical libraries, <br />
the only difference being that we have labelled edges. <br />
<br />
Not only this is simpler, but it has the advantage that we can reuse<br />
most of the algorithms from Data.Graph with only minor changes:<br />
<br />
<haskell><br />
-- | Build a graph from a list of edges.<br />
buildG :: Bounds -> [Edge e] -> Graph e<br />
buildG bounds0 edges0 = accumArray (flip (:)) [] bounds0 [(v, (l,w)) | (v,l,w) <- edges0]<br />
<br />
-- | The graph obtained by reversing all edges.<br />
transposeG :: Graph e -> Graph e<br />
transposeG g = buildG (bounds g) (reverseE g)<br />
<br />
reverseE :: Graph e -> [Edge e]<br />
reverseE g = [ (w, l, v) | (v, l, w) <- edges g ]<br />
</haskell><br />
<br />
<br />
However, as previously said, we'll try to abstract <br />
away from the details of the structure.<br />
This is not always possible, but in such cases, <br />
I believe the array representation to be<br />
a good choice, because it's easy to work with. <br />
If anything, one can readily use all the<br />
standard array functions.<br />
<br />
For example, here's the function to output a graph as a GraphViz file:<br />
<haskell><br />
showGraphViz (LabGraph gr lab) = <br />
"digraph name {\n" ++<br />
"rankdir=LR;\n" ++<br />
(concatMap showNode $ indices gr) ++<br />
(concatMap showEdge $ edges gr) ++<br />
"}\n"<br />
where showEdge (from, t, to) = show from ++ " -> " ++ show to ++<br />
" [label = \"" ++ show t ++ "\"];\n"<br />
showNode v = show v ++ " [label = " ++ (show $ lab v) ++ "];\n"<br />
<br />
edges :: Graph e -> [Edge e]<br />
edges g = [ (v, l, w) | v <- indices g, (l, w) <- g!v ]<br />
</haskell> <br />
<br />
<br />
== Applications ==<br />
<br />
I'll now enumerate a few problems where the "origami" approach can be applied successfully.<br />
<br />
=== Closure ===<br />
<br />
A simple application (special case) of "unfoldG" the <br />
computation of the transitive closure of a non-deterministic function.<br />
<br />
<haskell><br />
closure :: Ord a => (a -> [a]) -> (a -> [a])<br />
closure f i = labels $ snd $ unfoldG f' i <br />
where f' x = (x, [((), fx) | fx <- f x])<br />
</haskell><br />
<br />
In this context, "non deterministic" means that it yields many <br />
values, as a list. As noted before, this will work only when <br />
everything remains finite in size.<br />
<br />
<br />
For example, if we define<br />
<br />
<haskell><br />
interleave (x1:x2:xs) = (x1:x2:xs) : (map (x2:) (interleave (x1:xs)))<br />
interleave xs = [xs]<br />
<br />
interleave "abcd" ==> ["abcd","bacd","bcad","bcda"]<br />
</haskell><br />
<br />
a very bad way to compute the permutations of list can be<br />
<br />
<haskell><br />
permutations = closure interleave<br />
<br />
permutations "abcd" ==> ["abcd","bacd","acbd","cabd","abdc","badc", <br />
"adbc","dabc","dbac","bdac","dacb","adcb",<br />
"dcab","cdab","cadb","acdb","cdba","dcba",<br />
"cbda","bcda","bdca","dbca","bcad","cbad"]<br />
</haskell><br />
<br />
But sometimes the function to 'close' is more complicated than <code>interleave</code> and<br />
then <code>closure</code> becomes really useful.<br />
<br />
<br />
=== Shortest Path ===<br />
<br />
Let us now examine the toy problem of finding the distance <br />
to a given node from all the other nodes of the graph. <br />
Most readers probably know the Dijkstra's algorithm to <br />
compute the solution to the problem. We will not try <br />
to reproduce it here, instead we will define the computation in terms of <code>foldG</code>.<br />
<br />
Here it goes:<br />
<haskell><br />
-- | Compute the distance to v for every vertex of gr.<br />
distsTo :: Vertex -> Graph Float -> Table Float<br />
distsTo v gr = foldGAll infinite distance gr <br />
where infinite = 10000000 -- well, you get the idea<br />
distance v' neighbours <br />
| v == v' = 0<br />
| otherwise = minimum [distV+arcWeight | (distV, arcWeight) <- neighbours]<br />
</haskell><br />
<br />
So clear that it barely needs to be explained. :) <br />
Just notice how the minimize function assumes that the <br />
distance is already computed for all its neighbours. <br />
This works because <code>foldG</code> will iterate until it finds the fixed point.<br />
<br />
On this simple graph,<br />
<br />
<haskell><br />
grDist = buildG (1,5) [(1,5.0,2), (2,5.0,3), (2,7.0,4), (3,5.0,4), (4,5.0,5), (4,3.0,1)]<br />
</haskell><br />
<br />
attachment:grdist.png<br />
<br />
the result of <haskell><br />
dists = distsTo 5 grDist<br />
</haskell> is<br />
<br />
attachment:grdist2.png<br />
<br />
(labeling each node with the its result, ie. distance to vertex <code>5</code>)<br />
<br />
=== Finite Automaton ===<br />
<br />
Finite automatons are basically graphs, so let's see how we can apply the <br />
framework to their analysis.<br />
<br />
First, let's define an automaton. For our purposes, it is a graph <br />
of states/transitions, some of the states being marked as initial or final.<br />
<br />
<haskell><br />
type Automaton t = (Vertex, Graph t, Set Vertex) -- ^ Initial, transitions, finals<br />
</haskell><br />
<br />
For starters, here is how the <code>showGraphViz</code> function can be applied to automaton display:<br />
<br />
<haskell><br />
automatonToGraphviz (i, gr, fs) = showGraphViz (LabGraph gr lab)<br />
where lab :: Labeling String<br />
lab v = (if v == i then (">"++) else id) $ <br />
(if v `Set.member` fs then (++"|") else id) []<br />
</haskell><br />
<br />
Nothing ground breaking. We only label the nodes accordingly to<br />
their final or initial status.<br />
<br />
<haskell><br />
aut1 = (1, buildG (1,3) [(1,'a',2),(2,'a',2),(2,'b',2),(2,'c',3),(1,'a',3)], Set.fromList [3])<br />
</haskell><br />
<br />
attachment:aut1.png<br />
<br />
A more interesting example is how to transform a non-deterministic <br />
automaton to an equivalent deterministic one. The underlying idea <br />
is that non-deterministic execution of the automaton is equivalent <br />
to deterministic execution on all possible transitions at once. <br />
Refer to [[#hop&ull 6]] for details. This is relatively easily done using <code>unfoldG</code>.<br />
<haskell><br />
simpleGenerator f x = (x, f x)<br />
<br />
nfaToDfa :: Ord e => Automaton e -> Automaton e<br />
nfaToDfa (initial1, aut1, finals1) = (initial2, aut2, finals2)<br />
where (initial2, LabGraph aut2 mapping) = unfoldG (simpleGenerator build) seed<br />
seed = Set.singleton initial1<br />
build state = Map.toList $ Map.fromListWith Set.union $ map lift $<br />
concat $ map (aut1 !) $ Set.toList state<br />
lift (t,s) = (t, Set.singleton s)<br />
isFinal = setAny (`Set.member` finals1) . mapping<br />
finals2 = Set.fromList $ filter isFinal $ indices aut2<br />
setAny f = any f . Set.toList<br />
</haskell><br />
<br />
The 'build' function is the tricky part. Yet, it's not as complicated as it seems: all it does is <br />
#Find all reachable nodes from a set of nodes; <br />
#Classify them by transition label<br />
#Build target state-sets accordingly.<br />
<br />
<haskell><br />
aut2 = nfaToDfa aut1<br />
</haskell><br />
<br />
attachment:aut2.png<br />
<br />
Another thing we possibly wish to compute is the set of <br />
strings accepted by the automaton, (aka. the language it <br />
defines). Most of the time this will be infinite, so <br />
we will limit ourselves to strings of length <code>n</code> maximum.<br />
We need finiteness because otherwise <code>foldG</code> would not find<br />
a fixed point: string sets would keep growing idefinitely.<br />
<br />
<haskell><br />
accepted n (initial1, aut1, finals1) = Set.unions [resultTable ! v | v <- Set.toList finals1]<br />
-- gather what's accepted at all final states<br />
where resultTable = foldGAll Set.empty step (transposeG aut1)<br />
step v trans = Set.unions ((if v == initial1 then Set.singleton [] else Set.empty) : <br />
[Set.map ((++[t]) . take (n-1) ) s | (t,s) <- trans])<br />
</haskell><br />
<br />
Notice that we need to reverse the graph arcs, otherwise the information propagates in the wrong direction.<br />
<br />
With <br />
<haskell><br />
accAut1 = accepted 4 aut1<br />
accAut2 = accepted 4 aut2<br />
</haskell><br />
we have <br />
<haskell><br />
accAut1 == accAut2 == {"a","aaac","aabc","aac","abac","abbc","abc","ac"}<br />
</haskell><br />
<br />
=== LALR Automaton ===<br />
<br />
Another area where I applied graph (un)folding is LALR(1) parser generation. The detailed code<br />
depends on just too many things to fit in this paper, <br />
thus we will only sketch how pieces fit<br />
together. Also, since a course on parsing is clearly beyond the scope of this article, <br />
please refer to local copy of the dragon book [[#dragon 7]] for details on the method.<br />
<br />
In the process of generating tables for a LALR automaton, <br />
there are three steps amenable to implementation by <code>foldG</code> and <code>unfoldG</code>.<br />
<br />
1. Construction of the closure of a LR-items kernel. This one is very similar to the <code>closure</code> function described above, except that we don't discard the graph structure. It'll be of use for step 3.<br />
2. LR(0) automaton generation. Then again a use for <code>unfoldG</code>.<br />
3. Propagation of the lookahead. It is a fold over the whole graph of LR-items, basically using set union as coalescing operation. It is very similar to computation of acceptable strings above.<br />
<br />
<br />
== Implementation ==<br />
<br />
=== UnfoldG ===<br />
<br />
For the sake of completeness, here's how to implement the <code>unfoldG</code> function.<br />
<br />
The algorithm effectively a depth-first search, written in imperative style.<br />
The only difference is that the search graph is remembered and returned as result.<br />
<br />
<haskell><br />
<br />
unfoldGManyST :: (Ord a) => (a -> (c, [(b, a)]))<br />
-> [a] -> ST s ([Vertex], LabGraph c b)<br />
unfoldGManyST gen seeds =<br />
do mtab <- newSTRef (Map.empty)<br />
allNodes <- newSTRef []<br />
vertexRef <- newSTRef firstId<br />
let allocVertex = <br />
do vertex <- readSTRef vertexRef<br />
writeSTRef vertexRef (vertex + 1)<br />
return vertex<br />
let cyc src =<br />
do probe <- memTabFind mtab src<br />
case probe of<br />
Just result -> return result<br />
Nothing -> do<br />
v <- allocVertex<br />
memTabBind src v mtab <br />
let (lab, deps) = gen src<br />
ws <- mapM (cyc . snd) deps<br />
let res = (v, lab, [(fst d, w) | d <- deps | w <- ws])<br />
modifySTRef allNodes (res:)<br />
return v<br />
mapM_ cyc seeds<br />
list <- readSTRef allNodes<br />
seedsResult <- (return . map fromJust) =<< mapM (memTabFind mtab) seeds<br />
lastId <- readSTRef vertexRef<br />
let cycamore = array (firstId, lastId-1) [(i, k) | (i, a, k) <- list]<br />
let labels = array (firstId, lastId-1) [(i, a) | (i, a, k) <- list]<br />
return (seedsResult, LabGraph cycamore (labels !))<br />
where firstId = 0::Vertex<br />
memTabFind mt key = return . Map.lookup key =<< readSTRef mt<br />
memTabBind key val mt = modifySTRef mt (Map.insert key val)<br />
<br />
</haskell><br />
<br />
Notice how every time a seed is encountered, its corresponding vertex number stored. <br />
Whenever the seed is encountered again, the stored is just returned.<br />
<br />
<br />
=== FoldG ===<br />
<br />
<haskell><br />
foldGAllImplementation bot f gr = finalTbl<br />
where finalTbl = fixedPoint updateTbl initialTbl<br />
initialTbl = listArray bnds (replicate (rangeSize bnds) bot)<br />
<br />
fixedPoint f x = fp x<br />
where fp z = if z == z' then z else fp z'<br />
where z' = f z<br />
updateTbl tbl = listArray bnds $ map recompute $ indices gr<br />
where recompute v = f v [(b, tbl!k) | (b, k) <- gr!v]<br />
bnds = bounds gr<br />
</haskell><br />
<br />
<br />
The proposed implementation for foldG is rather bold.<br />
It just applies the coalescing <br />
function repeatedly till it converges.<br />
<br />
While this is not an ideal situation, it's perfectly suited for a first-trial <br />
implementation, or when performance is not crucial.<br />
<br />
If execution time becomes critical, then more specialized <br />
versions can be crafted.<br />
In the case of the shortest path algorithm, for example, <br />
it could take advantage<br />
of the nice properties of the coalescing function to use <br />
a priority queue and greedily<br />
find the fixed point. This would restore the optimal O(n * log n) complexity.<br />
<br />
<br />
== Conclusion ==<br />
<br />
The approach presented may not be excellent for controlling details of implementation <br />
and tuning run-time performance, but I think that's not the point <br />
of haskell programming anyway.<br />
On the other hand, it is very good for quick implementation <br />
of a large range of graph algorithms. The fact that it's mostly based on a<br />
generalisation on fold and unfold should appeal to haskell<br />
programmers.<br />
<br />
<br />
== References ==<br />
<br />
* [[Anchor(cycle-therapy)]] [1] ''Cycle Therapy: A Prescription for Fold and Unfold on Regular Trees'', F. Turbak and J.B. Wells, http://cs.wellesley.edu/~fturbak/pubs/ppdp01.pdf<br />
* [[Anchor(king-thesis)]] [2] ''Functional Programming and Graph Algorithms'', D. J. King, http://www.macs.hw.ac.uk/~gnik/publications <br />
* [[Anchor(induct)]] [3] ''Inductive Graphs and Functional Graph Algorithms'', Martin Erwig, http://web.engr.oregonstate.edu/~erwig/papers/abstracts.html<br />
* [[Anchor(dfs)]] [4] , D. J. King and John Launchbury, http://www.cse.ogi.edu/~jl/Papers/dfs.ps<br />
* [[Anchor(bananas-lenses)]] [5] ''Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire'', Erik Meijer, Maarten Fokkinga, Ross Paterson. http://citeseer.ist.psu.edu/meijer91functional.html<br />
* [[Anchor(hop&ull)]] [6] ''Introduction to Automata Theory, Languages, and Computation'', JE Hopcroft, and JD Ullman, http://www-db.stanford.edu/~ullman/ialc.html <br />
* [[Anchor(dragon)]] [7] ''Compilers: Principles, Techniques and Tools'', Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. (Addison-Wesley 1986; ISBN 0-201-10088-6)<br />
<br />
----<br />
CategoryArticle</div>Sergvhttps://wiki.haskell.org/index.php?title=The_Monad.Reader/Issue5/Practical_Graph_Handling&diff=42599The Monad.Reader/Issue5/Practical Graph Handling2011-10-27T17:55:30Z<p>Sergv: Fix numbering in list</p>
<hr />
<div>'''This article needs reformatting! Please help tidy it up.'''--[[User:WouterSwierstra|WouterSwierstra]] 14:26, 9 May 2008 (UTC)<br />
<br />
= A Practical Approach to Graph Manipulation =<br />
''by JeanPhilippeBernardy for The Monad.Reader Issue 5''<br />
[[BR]]<br />
''[[Date(2005-07-08T20:48:51Z)]]''<br />
<br />
'''Abstract.'''<br />
<br />
Tree-based data structures are easy to deal with in Haskell. <br />
However, working with graph-like structures in practice is much less obvious. <br />
In this article I present a solution that has worked for me in many cases.<br />
<br />
<br />
== Introduction ==<br />
<br />
I always found that dealing with graphs in Haskell is a tricky subject. <br />
Even something like a implementing a depth-first search, which is <br />
trivially achieved in an imperative language, deserves an article <br />
on its own for Haskell [[#dfs 4]].<br />
A PhD thesis has been written on the subject of graphs and functional programming [[#king-thesis 2]], and it seems that it still <br />
doesn't exhaust the <br />
design-space: radically different ideas have been proposed afterwards [[#induct 3]].<br />
<br />
In this article I'll present (a simplified version of) a solution that <br />
I think deserves more coverage [[#cycle-therapy 1]]. The idea is to abstract graph <br />
manipulation by anamorphisms and catamorphisms.<br />
This approach features "separation of concerns" and "composability", hence it<br />
can be readily applied to <br />
practical problems.<br />
<br />
* Section 2 shows how anamorphisms and catamorphisms can be generalised to graphs.<br />
* Section 3 details the data structures used to represent graphs<br />
* Section 4 discusses various problems where cata/anamorphisms can be applied<br />
* Section 5 gives a sample implementation for the catamorphism and anamorphism<br />
* Section 6 concludes<br />
<br />
=== Nota ===<br />
<br />
This article has been generated from a literate haskell source. <br />
So, although the text of this wiki page will not compile, all the examples are <br />
real and run. The source can be accessed here: attachment:PracticalGraphHandling.lhs<br />
<br />
We will assume you know the hierarchical libraries. Refer to http://haskell.org/ghc/docs/latest/html/libraries/index.html in case of doubt.<br />
<br />
== Origami with Graphs ==<br />
<br />
=== Fold & Unfold (the big deal) ===<br />
<br />
Most of you probably know what a "fold" (also known as catamorphism) <br />
is. For those who don't, intuitively, it's an higher-order operation <br />
that reduces a complex structure to a single value. It applies a <br />
function given as parameter to each node, propagating the results <br />
up to the root. This is a highly imprecise definition, for more <br />
details please read [[#bananas-lenses 5]].<br />
<br />
For example, the fold operation on lists can be typed as follows:<br />
<br />
<haskell><br />
foldr :: (a -> b -> b) -> -- ^ operation to apply<br />
b -> -- ^ initial value<br />
[a] -> -- ^ input list<br />
b -- ^ result<br />
</haskell><br />
<br />
Conversely, "unfold" builds a complex structure out of a building<br />
function, applying it iteratively.<br />
<br />
<haskell><br />
unfoldr :: (b -> Maybe (a, b)) -> -- ^ building function (Nothing => end of list)<br />
b -> -- ^ seed value<br />
[a] -- ^ result<br />
</haskell><br />
<br />
The second argument is the initial value from which the <br />
whole resulting list will be derived, by applying the 1st argument.<br />
In the following we'll refer to it as the "seed".<br />
<br />
<br />
The catamorphism/anamorphism abstractions have proven to be <br />
very useful in practise. They're ubiquitous to any haskell <br />
programming, either explicitly, or implicitly (hidden in <br />
higher-level operations). In this article I'll show how <br />
those abstractions can be generalised to graph structures, <br />
and argue that they are equally useful in this case.<br />
<br />
The rest of the article assumes the reader is fairly familiar <br />
with fold and unfold. Fortunately there are many articles on the<br />
subject. For example you can refer to [[#bananas-lenses 5]] if you ever feel uncomfortable.<br />
<br />
=== Generalisation ===<br />
<br />
Let's examine how fold/unfold can be generalized for graphs.<br />
Since we are working on graphs instead of lists, we must account for<br />
<br />
1. Any number of children for a node;<br />
2. "Backwards" arcs (cycles);<br />
3. Labelled edges.<br />
<br />
The most relevant point being 2, of course.<br />
<br />
==== unfoldG ====<br />
<br />
From the above, we can deduce that the type of unfoldG will be:<br />
<br />
<haskell><br />
unfoldG :: (Ord s) => (s -> (n, [(e, s)])) -> s -> (Vertex, LabGraph n e)<br />
unfoldG f r = (r', res)<br />
where ([r'], res) = unfoldGMany f [r]<br />
</haskell><br />
where <code>s</code> is the seed type, <code>n</code> is the node labels, <code>e</code> the edges labels.<br />
<br />
The <code>Ord s</code> constraint reflects point 2 above. <br />
It is needed because the unfoldG function must record every <br />
seed value encountered.<br />
Whenever a seed is seen a second time, <code>unfoldG</code> will recognize <br />
it and create a "backward arc".<br />
We use <code>Ord</code> instead of <code>Eq</code> because a mere equality test rules out using <code>Data.Map</code>.<br />
<br />
The attentive reader will note that we return an additional <br />
Vertex value. This is needed to identifty which node the root<br />
seed corresponds to.<br />
<br />
In order to get an intuitive feeling of how <code>unfoldG</code> works,<br />
let's examine a simple example.<br />
<br />
<haskell><br />
gr1 :: LabGraph Int Char<br />
(_,gr1) = unfoldG gen (0::Int) <br />
where gen x = (x,[('a',(x+1) `mod` 10), ('b', (x+2) `mod` 10)])<br />
</haskell><br />
<br />
<code>gr1</code> being defined as above, its structure is:<br />
<br />
attachment:gr1.png<br />
<br />
Because we might want to build a graph from a set of seeds <br />
instead of a single one, we will also need the following function:<br />
<haskell><br />
unfoldGMany :: (Ord s) => (s -> (n, [(e, s)])) -> [s] -> ([Vertex], LabGraph n e)<br />
unfoldGMany f roots = runST ( unfoldGManyST f roots ) -- detailed later<br />
</haskell> <br />
<br />
<code>unfoldG</code>, alone, is already very a practical tool, because it <br />
lets you reify a function (<code>a -> a</code>) graph. It then can be examined, <br />
processed, etc. whereas the function can only be evaluated.<br />
<br />
==== foldG ====<br />
<br />
On a graph, the catamorphism (fold) type will become:<br />
<haskell><br />
foldG :: (Eq r) => r -> (Vertex -> [(e, r)] -> r) -> Graph e -> Vertex -> r<br />
foldG i f g v = foldGAll i f g ! v<br />
</haskell><br />
<br />
As for <code>unfoldG</code>, the <code>foldG</code> <br />
function must include a special mechanism to handle cycles.<br />
The idea is to apply the operation iteratively until the result <br />
converges. It's the purpose of the first <br />
parameter is to "bootstrapp" the process: <br />
it will be used as an initial value.<br />
<br />
Thus, <code>foldG i f g v</code> will iteratively <br />
apply <code>f</code> on nodes of graph <code>g</code>, <br />
using <code>i</code> as "bottom" value. It will return <br />
the value computed at vertex <code>v</code>. <br />
Of course, this will work only if <code>f</code> is well-behaved: <br />
it must converge at some point.<br />
I won't dwelve in to the theoretical details <br />
here, see [[#cycle-therapy 1]] for a<br />
formal explanation.<br />
<br />
Notice that <code>foldG</code> can work on a graph without node labels. <br />
If the parameter function needs to access node labels, it can <br />
do so without <code>foldG</code> needing to know.<br />
<br />
It's also worth noticing that, in our implementation, the <br />
information will be propagated in the reverse direction of arcs. <br />
<br />
It's very common to need the result value for each vertex, <br />
hence the need for :<br />
<br />
<haskell><br />
foldGAll :: (Eq r) => r -> (Vertex -> [(e, r)] -> r) -> Graph e -> Table r<br />
</haskell><br />
<br />
<br />
<br />
<br />
<br />
The implementation of these functions doesn't matter much. <br />
The point of the article is not how these can be implemented, <br />
but how they can be used for daily programming tasks. <br />
For completeness though, we'll provide a <br />
sample implemenation at the end of the article.<br />
<br />
== Data Structure & Accessors ==<br />
<br />
Without further ado, let's define the data structures we'll work on. <br />
<br />
<br />
<haskell><br />
type Vertex = Int<br />
type Table a = Array Vertex a<br />
type Graph e = Table [(e, Vertex)]<br />
type Bounds = (Vertex, Vertex)<br />
type Edge e = (Vertex, e, Vertex)<br />
</haskell><br />
A graph is a mere adjacency list table, tagged with edge labels.<br />
<br />
The above structure lacks labels for nodes.<br />
This is easily fixed by adding a labeling (or coloring) function.<br />
<haskell><br />
type Labeling a = Vertex -> a<br />
data LabGraph n e = LabGraph (Graph e) (Labeling n)<br />
<br />
vertices (LabGraph gr _) = indices gr<br />
<br />
labels (LabGraph gr l) = map l (indices gr)<br />
</haskell><br />
<br />
<br />
The above departs slightly from what's prescribed in [[#cycle-therapy 1]]. Instead of <br />
a ''true graph'' built by knot-tying, we chose to use an <code>Array</code><br />
with integers as explicit vertex references.<br />
This is closely follows <br />
Data.Graph in the hierarchical libraries, <br />
the only difference being that we have labelled edges. <br />
<br />
Not only this is simpler, but it has the advantage that we can reuse<br />
most of the algorithms from Data.Graph with only minor changes:<br />
<br />
<haskell><br />
-- | Build a graph from a list of edges.<br />
buildG :: Bounds -> [Edge e] -> Graph e<br />
buildG bounds0 edges0 = accumArray (flip (:)) [] bounds0 [(v, (l,w)) | (v,l,w) <- edges0]<br />
<br />
-- | The graph obtained by reversing all edges.<br />
transposeG :: Graph e -> Graph e<br />
transposeG g = buildG (bounds g) (reverseE g)<br />
<br />
reverseE :: Graph e -> [Edge e]<br />
reverseE g = [ (w, l, v) | (v, l, w) <- edges g ]<br />
</haskell><br />
<br />
<br />
However, as previously said, we'll try to abstract <br />
away from the details of the structure.<br />
This is not always possible, but in such cases, <br />
I believe the array representation to be<br />
a good choice, because it's easy to work with. <br />
If anything, one can readily use all the<br />
standard array functions.<br />
<br />
For example, here's the function to output a graph as a GraphViz file:<br />
<haskell><br />
showGraphViz (LabGraph gr lab) = <br />
"digraph name {\n" ++<br />
"rankdir=LR;\n" ++<br />
(concatMap showNode $ indices gr) ++<br />
(concatMap showEdge $ edges gr) ++<br />
"}\n"<br />
where showEdge (from, t, to) = show from ++ " -> " ++ show to ++<br />
" [label = \"" ++ show t ++ "\"];\n"<br />
showNode v = show v ++ " [label = " ++ (show $ lab v) ++ "];\n"<br />
<br />
edges :: Graph e -> [Edge e]<br />
edges g = [ (v, l, w) | v <- indices g, (l, w) <- g!v ]<br />
</haskell> <br />
<br />
<br />
== Applications ==<br />
<br />
I'll now enumerate a few problems where the "origami" approach can be applied successfully.<br />
<br />
=== Closure ===<br />
<br />
A simple application (special case) of "unfoldG" the <br />
computation of the transitive closure of a non-deterministic function.<br />
<br />
<haskell><br />
closure :: Ord a => (a -> [a]) -> (a -> [a])<br />
closure f i = labels $ snd $ unfoldG f' i <br />
where f' x = (x, [((), fx) | fx <- f x])<br />
</haskell><br />
<br />
In this context, "non deterministic" means that it yields many <br />
values, as a list. As noted before, this will work only when <br />
everything remains finite in size.<br />
<br />
<br />
For example, if we define<br />
<br />
<haskell><br />
interleave (x1:x2:xs) = (x1:x2:xs) : (map (x2:) (interleave (x1:xs)))<br />
interleave xs = [xs]<br />
<br />
interleave "abcd" ==> ["abcd","bacd","bcad","bcda"]<br />
</haskell><br />
<br />
a very bad way to compute the permutations of list can be<br />
<br />
<haskell><br />
permutations = closure interleave<br />
<br />
permutations "abcd" ==> ["abcd","bacd","acbd","cabd","abdc","badc", <br />
"adbc","dabc","dbac","bdac","dacb","adcb",<br />
"dcab","cdab","cadb","acdb","cdba","dcba",<br />
"cbda","bcda","bdca","dbca","bcad","cbad"]<br />
</haskell><br />
<br />
But sometimes the function to 'close' is more complicated than <code>interleave</code> and<br />
then <code>closure</code> becomes really useful.<br />
<br />
<br />
=== Shortest Path ===<br />
<br />
Let us now examine the toy problem of finding the distance <br />
to a given node from all the other nodes of the graph. <br />
Most readers probably know the Dijkstra's algorithm to <br />
compute the solution to the problem. We will not try <br />
to reproduce it here, instead we will define the computation in terms of <code>foldG</code>.<br />
<br />
Here it goes:<br />
<haskell><br />
-- | Compute the distance to v for every vertex of gr.<br />
distsTo :: Vertex -> Graph Float -> Table Float<br />
distsTo v gr = foldGAll infinite distance gr <br />
where infinite = 10000000 -- well, you get the idea<br />
distance v' neighbours <br />
| v == v' = 0<br />
| otherwise = minimum [distV+arcWeight | (distV, arcWeight) <- neighbours]<br />
</haskell><br />
<br />
So clear that it barely needs to be explained. :) <br />
Just notice how the minimize function assumes that the <br />
distance is already computed for all its neighbours. <br />
This works because <code>foldG</code> will iterate until it finds the fixed point.<br />
<br />
On this simple graph,<br />
<br />
<haskell><br />
grDist = buildG (1,5) [(1,5.0,2), (2,5.0,3), (2,7.0,4), (3,5.0,4), (4,5.0,5), (4,3.0,1)]<br />
</haskell><br />
<br />
attachment:grdist.png<br />
<br />
the result of <haskell><br />
dists = distsTo 5 grDist<br />
</haskell> is<br />
<br />
attachment:grdist2.png<br />
<br />
(labeling each node with the its result, ie. distance to vertex <code>5</code>)<br />
<br />
=== Finite Automaton ===<br />
<br />
Finite automatons are basically graphs, so let's see how we can apply the <br />
framework to their analysis.<br />
<br />
First, let's define an automaton. For our purposes, it is a graph <br />
of states/transitions, some of the states being marked as initial or final.<br />
<br />
<haskell><br />
type Automaton t = (Vertex, Graph t, Set Vertex) -- ^ Initial, transitions, finals<br />
</haskell><br />
<br />
For starters, here is how the <code>showGraphViz</code> function can be applied to automaton display:<br />
<br />
<haskell><br />
automatonToGraphviz (i, gr, fs) = showGraphViz (LabGraph gr lab)<br />
where lab :: Labeling String<br />
lab v = (if v == i then (">"++) else id) $ <br />
(if v `Set.member` fs then (++"|") else id) []<br />
</haskell><br />
<br />
Nothing ground breaking. We only label the nodes accordingly to<br />
their final or initial status.<br />
<br />
<haskell><br />
aut1 = (1, buildG (1,3) [(1,'a',2),(2,'a',2),(2,'b',2),(2,'c',3),(1,'a',3)], Set.fromList [3])<br />
</haskell><br />
<br />
attachment:aut1.png<br />
<br />
A more interesting example is how to transform a non-deterministic <br />
automaton to an equivalent deterministic one. The underlying idea <br />
is that non-deterministic execution of the automaton is equivalent <br />
to deterministic execution on all possible transitions at once. <br />
Refer to [[#hop&ull 6]] for details. This is relatively easily done using <code>unfoldG</code>.<br />
<haskell><br />
simpleGenerator f x = (x, f x)<br />
<br />
nfaToDfa :: Ord e => Automaton e -> Automaton e<br />
nfaToDfa (initial1, aut1, finals1) = (initial2, aut2, finals2)<br />
where (initial2, LabGraph aut2 mapping) = unfoldG (simpleGenerator build) seed<br />
seed = Set.singleton initial1<br />
build state = Map.toList $ Map.fromListWith Set.union $ map lift $<br />
concat $ map (aut1 !) $ Set.toList state<br />
lift (t,s) = (t, Set.singleton s)<br />
isFinal = setAny (`Set.member` finals1) . mapping<br />
finals2 = Set.fromList $ filter isFinal $ indices aut2<br />
setAny f = any f . Set.toList<br />
</haskell><br />
<br />
The 'build' function is the tricky part. Yet, it's not as complicated as it seems: all it does is <br />
#Find all reachable nodes from a set of nodes; <br />
#Classify them by transition label<br />
#Build target state-sets accordingly.<br />
<br />
<haskell><br />
aut2 = nfaToDfa aut1<br />
</haskell><br />
<br />
attachment:aut2.png<br />
<br />
Another thing we possibly wish to compute is the set of <br />
strings accepted by the automaton, (aka. the language it <br />
defines). Most of the time this will be infinite, so <br />
we will limit ourselves to strings of length <code>n</code> maximum.<br />
We need finiteness because otherwise <code>foldG</code> would not find<br />
a fixed point: string sets would keep growing idefinitely.<br />
<br />
<haskell><br />
accepted n (initial1, aut1, finals1) = Set.unions [resultTable ! v | v <- Set.toList finals1]<br />
-- gather what's accepted at all final states<br />
where resultTable = foldGAll Set.empty step (transposeG aut1)<br />
step v trans = Set.unions ((if v == initial1 then Set.singleton [] else Set.empty) : <br />
[Set.map ((++[t]) . take (n-1) ) s | (t,s) <- trans])<br />
</haskell><br />
<br />
Notice that we need to reverse the graph arcs, otherwise the information propagates in the wrong direction.<br />
<br />
With <br />
<haskell><br />
accAut1 = accepted 4 aut1<br />
accAut2 = accepted 4 aut2<br />
</haskell><br />
we have <br />
<haskell><br />
accAut1 == accAut2 == {"a","aaac","aabc","aac","abac","abbc","abc","ac"}<br />
</haskell><br />
<br />
=== LALR Automaton ===<br />
<br />
Another area where I applied graph (un)folding is LALR(1) parser generation. The detailed code<br />
depends on just too many things to fit in this paper, <br />
thus we will only sketch how pieces fit<br />
together. Also, since a course on parsing is clearly beyond the scope of this article, <br />
please refer to local copy of the dragon book [[#dragon 7]] for details on the method.<br />
<br />
In the process of generating tables for a LALR automaton, <br />
there are three steps amenable to implementation by <code>foldG</code> and <code>unfoldG</code>.<br />
<br />
1. Construction of the closure of a LR-items kernel. This one is very similar to the <code>closure</code> function described above, except that we don't discard the graph structure. It'll be of use for step 3.<br />
2. LR(0) automaton generation. Then again a use for <code>unfoldG</code>.<br />
3. Propagation of the lookahead. It is a fold over the whole graph of LR-items, basically using set union as coalescing operation. It is very similar to computation of acceptable strings above.<br />
<br />
<br />
== Implementation ==<br />
<br />
=== UnfoldG ===<br />
<br />
For the sake of completeness, here's how to implement the <code>unfoldG</code> function.<br />
<br />
The algorithm effectively a depth-first search, written in imperative style.<br />
The only difference is that the search graph is remembered and returned as result.<br />
<br />
<haskell><br />
<br />
unfoldGManyST :: (Ord a) => (a -> (c, [(b, a)]))<br />
-> [a] -> ST s ([Vertex], LabGraph c b)<br />
unfoldGManyST gen seeds =<br />
do mtab <- newSTRef (Map.empty)<br />
allNodes <- newSTRef []<br />
vertexRef <- newSTRef firstId<br />
let allocVertex = <br />
do vertex <- readSTRef vertexRef<br />
writeSTRef vertexRef (vertex + 1)<br />
return vertex<br />
let cyc src =<br />
do probe <- memTabFind mtab src<br />
case probe of<br />
Just result -> return result<br />
Nothing -> do<br />
v <- allocVertex<br />
memTabBind src v mtab <br />
let (lab, deps) = gen src<br />
ws <- mapM (cyc . snd) deps<br />
let res = (v, lab, [(fst d, w) | d <- deps | w <- ws])<br />
modifySTRef allNodes (res:)<br />
return v<br />
mapM_ cyc seeds<br />
list <- readSTRef allNodes<br />
seedsResult <- (return . map fromJust) =<< mapM (memTabFind mtab) seeds<br />
lastId <- readSTRef vertexRef<br />
let cycamore = array (firstId, lastId-1) [(i, k) | (i, a, k) <- list]<br />
let labels = array (firstId, lastId-1) [(i, a) | (i, a, k) <- list]<br />
return (seedsResult, LabGraph cycamore (labels !))<br />
where firstId = 0::Vertex<br />
memTabFind mt key = return . Map.lookup key =<< readSTRef mt<br />
memTabBind key val mt = modifySTRef mt (Map.insert key val)<br />
<br />
</haskell><br />
<br />
Notice how every time a seed is encountered, its corresponding vertex number stored. <br />
Whenever the seed is encountered again, the stored is just returned.<br />
<br />
<br />
=== FoldG ===<br />
<br />
<haskell><br />
foldGAllImplementation bot f gr = finalTbl<br />
where finalTbl = fixedPoint updateTbl initialTbl<br />
initialTbl = listArray bnds (replicate (rangeSize bnds) bot)<br />
<br />
fixedPoint f x = fp x<br />
where fp z = if z == z' then z else fp z'<br />
where z' = f z<br />
updateTbl tbl = listArray bnds $ map recompute $ indices gr<br />
where recompute v = f v [(b, tbl!k) | (b, k) <- gr!v]<br />
bnds = bounds gr<br />
</haskell><br />
<br />
<br />
The proposed implementation for foldG is rather bold.<br />
It just applies the coalescing <br />
function repeatedly till it converges.<br />
<br />
While this is not an ideal situation, it's perfectly suited for a first-trial <br />
implementation, or when performance is not crucial.<br />
<br />
If execution time becomes critical, then more specialized <br />
versions can be crafted.<br />
In the case of the shortest path algorithm, for example, <br />
it could take advantage<br />
of the nice properties of the coalescing function to use <br />
a priority queue and greedily<br />
find the fixed point. This would restore the optimal O(n * log n) complexity.<br />
<br />
<br />
== Conclusion ==<br />
<br />
The approach presented may not be excellent for controlling details of implementation <br />
and tuning run-time performance, but I think that's not the point <br />
of haskell programming anyway.<br />
On the other hand, it is very good for quick implementation <br />
of a large range of graph algorithms. The fact that it's mostly based on a<br />
generalisation on fold and unfold should appeal to haskell<br />
programmers.<br />
<br />
<br />
== References ==<br />
<br />
* [[Anchor(cycle-therapy)]] [1] ''Cycle Therapy: A Prescription for Fold and Unfold on Regular Trees'', F. Turbak and J.B. Wells, http://cs.wellesley.edu/~fturbak/pubs/ppdp01.pdf<br />
* [[Anchor(king-thesis)]] [2] ''Functional Programming and Graph Algorithms'', D. J. King, http://www.macs.hw.ac.uk/~gnik/publications <br />
* [[Anchor(induct)]] [3] ''Inductive Graphs and Functional Graph Algorithms'', Martin Erwig, http://web.engr.oregonstate.edu/~erwig/papers/abstracts.html<br />
* [[Anchor(dfs)]] [4] , D. J. King and John Launchbury, http://www.cse.ogi.edu/~jl/Papers/dfs.ps<br />
* [[Anchor(bananas-lenses)]] [5] ''Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire'', Erik Meijer, Maarten Fokkinga, Ross Paterson. http://citeseer.ist.psu.edu/meijer91functional.html<br />
* [[Anchor(hop&ull)]] [6] ''Introduction to Automata Theory, Languages, and Computation'', JE Hopcroft, and JD Ullman, http://www-db.stanford.edu/~ullman/ialc.html <br />
* [[Anchor(dragon)]] [7] ''Compilers: Principles, Techniques and Tools'', Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. (Addison-Wesley 1986; ISBN 0-201-10088-6)<br />
<br />
----<br />
CategoryArticle</div>Sergvhttps://wiki.haskell.org/index.php?title=Xmonad/Using_xmonad_in_XFCE&diff=41972Xmonad/Using xmonad in XFCE2011-09-08T19:04:31Z<p>Sergv: Add clarification about location of Autostarted Applications manager in menu.</p>
<hr />
<div>{{xmonad}}<br />
[[Category:XMonad]]<br />
<br />
=Introduction=<br />
<br />
This is an updated and hopefully less confusing version of my original blog post located here: [http://ivanmiljenovic.wordpress.com/2008/01/20/xmonad/ X<sup>2</sup>Monad]<br />
<br />
''Why use Xfce with XMonad?''<br />
<br />
Whilst XMonad is a truly excellent Window Manager, alone it doesn't offer the full convenience of an entire Desktop Environment, such as the provided menus, all-in-one configuration settings, consistent dialogs, etc. Out of the three main DEs available in Unix-like Operating Systems - Gnome, KDE and Xfce - the latter is often touted as the most "nimble" of the lot. In my own opinion, it is simpler to use than KDE but with more configuration options (and a saner environment) than Gnome.<br />
<br />
= Configuring XMonad to work with Xfce =<br />
<br />
This guide assumes that you're using at least version 4.4 of Xfce, and have both XMonad and XMonad-Contrib 0.7 installed. First we need to configure Xfce to play nicely, before adding in XMonad.<br />
<br />
==Xfce Settings==<br />
<br />
We're going to utilise Xfce's Session Manager to make sure that xfwm4 (Xfce's default WM) is no longer started. Preferably, it'd be nice if we could have XMonad started this way as well (or even set in a configuration option like with Gnome or KDE), but the former isn't possible until XMonad supports the required Session Management Protocol and the latter isn't possible in Xfce at all.<br />
<br />
===Backing up your settings===<br />
<br />
If you're already using Xfce, you may wish to first backup your ~/.config directory. Whilst Xfce isn't the only application/library/etc. to utilise the ~/.config directory to store it's settings, it does so in a few different sub-directories (i.e. in some cases specific applications have their own settings directory), it's easier to backup the whole thing.<br />
<br />
Note that as far as I know, it isn't possible to have for the same user both a standard Xfce login and an XMonad Xfce login. Whilst you can have different sessions to determine whether you use XMonad or not, there are some settings - such as the panel layout - that aren't saved on a per-session basis.<br />
<br />
===Start a new session===<br />
<br />
For backup purposes in case something goes wrong with your XMonad settings, it is recommended to create a new Xfce session. To do so, open the "Sessions and Startup" option dialog either from the "Settings" section of the Xfce menu, or in Xfce's Settings Manager (xfce-setting-show). Enable the the session chooser and automatic saving of sessions, then logout and log back in again. When the session chooser appears, choose to create a new session.<br />
<br />
A word of warning: I found that the session chooser kept crashing Xfce on a freshly installed Xfce-4.4.2. If this happens to you, delete the ~/.config/xfce-session directory.<br />
<br />
===Setting up Xfce===<br />
<br />
Open up the Xfce settings manager. There, you can customise Xfce to your hearts content. Note that the following settings dialogs won't be applicable once we start XMonad:<br />
* Window Manager<br />
* Window Manager Tweaks<br />
* Workspaces and Margins<br />
The last option isn't required, as Xfce will happily use XMonad's workspaces.<br />
<br />
These options are recommended:<br />
* Under "Desktop", let Xfce manage the desktop, but under Behaviour set the Desktop Icons to "None". This way, Xfce will control your wallpaper, etc., allowing you to have random lists, different wallpapers for different screens for multi-head setups, etc.<br />
* The Keyboard settings can be used to set keyboard shortcuts for those multimedia keys (e.g. XF86AudioMute) found on many keyboards, since these types of keys are currently difficult to create shortcuts for in XMonad. Simply setup the various keyevents to these keybindinds for use with xmodmap in ~/.Xmodmap and Xfce will read these on start.<br />
* Mouse: if you don't have or want to use a mouse, there's a limited type of mouse emulation available where you can use the Numpad arrow keys to move the cursor.<br />
* For "Preferred Applications", set whichever terminal emulator you plan to use. I find that Xfce's own Terminal application to work quite nicely with XMonad and to resize rather well.<br />
<br />
If you so wish, you can now disable the session chooser, though I suggest you leave it enabled until you've successfully managed to login to your Xfce/XMonad environment several times.<br />
<br />
===Xfce's Panels===<br />
<br />
If you so wish you can now customize the panels and the plugins on them. This can be safely left to later, however. With XMonad, I typically only have one panel rather than the default two. In terms of panel plugins, I've removed the Task List, but kept the pager: with the EWMH settings in XMonad, Xfce's pager acts as a mini-preview of your various layouts!<br />
<br />
===Ensure XMonad gets started===<br />
<br />
Because XMonad doesn't support the session protocol and Xfce is missing an option to specify which Window Manager to use, we must ensure that XMonad is started each time we log in. '''WARNING!!! Make sure you don't log out and back in again after setting this and before you reach the stage where we kill xfwm4, or else you will have two different Window Managers fighting each other!!!''' Run Xfce's Autostarted Applications manager (either under the Settings section in the Xfce menu (Menu -> Settings -> Setting Manager -> Sessin and Startup -> Application Autostart tab) or run <code>xfce4-autostart-editor</code>). There, add a new autostarted application with the command being simply <code>xmonad</code> (this of course assumes XMonad is in your path... otherwise, specify the full path to the binary) with whatever name you want. Ensure that the new autostarted application entry you just created is ticked.<br />
<br />
To repeat the warning: '''After creating this, don't exit Xfce until you've finished configuring XMonad and have killed xfwm4!!!'''<br />
<br />
==XMonad Hacking==<br />
<br />
[[#Using_XMonad.Config.Xfce|see also, Config.Xfce]]<br />
<br />
It's now time to customise XMonad. You have a wide variety of Layouts, Hooks, etc. to experiment with. Here are some basic changes that you ''should'' make:<br />
<br />
* Replace the old gaps setup with ManageDocks (especially since Gaps are deprecated in the current darcs version of XMonad and won't be present from 0.8 onwards). This also involves changing the keybinding to hide/show panels.<br />
* Use the EWMH hooks so that Xfce can obtain Workspace information from XMonad and vice versa. This lets the Pager plugin to the panel act as a mini workspace-previewer, and let you choose which workspace to view by clicking on the pager. Note that by using this, the pager will also automatically add/remove workspaces based upon how many workspaces XMonad has.<br />
* Use ''mod4'' (aka the "Windows" key) as the modifier key. Many applications - including Xfce - use the Alt key for keybindings themselve (e.g. Xfce uses Alt-F2 to show its run dialog).<br />
* With keybindings, ''xfce4-session-logout'' to exit for the Mod-Shift-q keybinding, rather than just exiting XMonad (which will leave Xfce still running).<br />
* Use Xfce's Terminal as the terminal program.<br />
<br />
Here is a minimal ~/.xmonad/xmonad.hs that demonstrates how to do this. Note that it is expected that you copy and edit the sample configuration file to get the remaining keybindings present, or else use something like the EZConfig extension in the Contrib library to selectively add/edit/remove keybindings from the default list. As it stands, this configuration file ''isn't'' sufficient, as it doesn't list all the keybindings and in fact won't compile due to the "..." present in the keybinding list.<br />
<br />
<haskell><br />
import XMonad<br />
import qualified Data.Map as M<br />
import XMonad.Hooks.ManageDocks<br />
import XMonad.Hooks.EwmhDesktops<br />
<br />
myTerminal = "Terminal"<br />
<br />
myKeys conf@(XConfig {XMonad.modMask = modMask}) = M.fromList $<br />
[<br />
...<br />
, ((modMask, xK_b), sendMessage ToggleStruts)<br />
...<br />
, ((modMask .|. shiftMask, xK_q), spawn "xfce4-session-logout")<br />
...<br />
]<br />
<br />
main = xmonad defaultConfig<br />
{ manageHook = manageDocks <+> manageHook defaultConfig<br />
, logHook = ewmhDesktopsLogHook<br />
, layoutHook = avoidStruts $ layoutHook defaultConfig<br />
, handleEventHook = ewmhDesktopsEventHook<br />
, startupHook = ewmhDesktopsStartup<br />
, modMask = mod4Mask<br />
, keys = myKeys<br />
}<br />
</haskell><br />
<br />
If you so wish, you can also use ''xfrun4'' or ''xfce4-appfinder'' as the program launchers for the Mod-p and Mod-Shift-p keybindings instead of the default dmenu and gmrun.<br />
<br />
Here is a complete xmonad.hs and a diff with the template xmonad.hs:<br />
http://gist.github.com/311016 (xmonad.hs)<br />
http://gist.github.com/311022 (diff)<br />
<br />
== Using XMonad.Config.Xfce ==<br />
''with xmonad-contrib-0.8 or greater''<br />
<br />
Use the following in your ~/.xmonad/xmonad.hs to implement all the suggestions made above. This ''is'' a compilable configuration giving you all the basics.<br />
<br />
<haskell><br />
import XMonad<br />
import XMonad.Config.Xfce<br />
<br />
main = xmonad xfceConfig<br />
</haskell><br />
<br />
You may wish to run ghci or something on your ~/.xmonad/xmonad.hs configuration file to make sure it's correct before moving on to the next section.<br />
<br />
With the above configuration, the Meta key still acts as the modMask. I would suggest remapping the <code>modMask</code> to the Super key by the following changes to the above configuration:<br />
<br />
<haskell><br />
main = xmonad xfceConfig<br />
{ modMask = mod4Mask }<br />
</haskell><br />
<br />
=Get XMonad Going!=<br />
<br />
It's now time to ditch Xfce's default Window Manager xfwm4 and replace it with something better (i.e. XMonad). To do so, it's probably easiest to run xfrun4 with Xfce's default key combination Alt-F2 and enter the following:<br />
<code><br />
killall xfwm4 && xmonad<br />
</code><br />
<br />
If all went well, you'll now find yourself in XMonad!<br />
<br />
=That's all folks!=<br />
<br />
Congratulations, you have now successfully configured XMonad and Xfce so that XMonad acts as Xfce's Window Manager! Sample screenshots to come.<br />
<br />
= Note for 4.6.0 =<br />
<br />
The procedure above did not work for me, I had to manually edit the session cache (in ~/.cache/sessions) to remove the entry for xfwm4. This meant removing all the lines with Client0, decrementing the client number for all other clients, and decrementing Count. This needs to be done after logging out so that it isn't clobbered by saving the session on logout.<br />
<br />
= Notes for 4.6.1 =<br />
<br />
*I had to delete ~/.cache/sessions directory(while already logged out from any existing Xfce session) for the above solution to work.<br />
<br />
*The command <code>killall xfwm4 && xmonad</code> didn't manage to kill xfwm4, it was possible to shut it down in "Settings->Session and Startup->Session".<br />
<br />
= Problems =<br />
<br />
Swapping of two windows does not reflect in the xfce panel. Also the order of the windows in the xfce panel does not match with the order of the windows in xmonad's stack.</div>Sergvhttps://wiki.haskell.org/index.php?title=Regex_Posix&diff=40953Regex Posix2011-07-13T09:40:17Z<p>Sergv: Fix some typos.</p>
<hr />
<div>= regex-posix Bugs =<br />
<br />
Executive summary: If you want a bug-free and/or portable POSIX extended regular expression library to use from Haskell, then regex-posix will not help you. You should use the regex-tdfa package instead.<br />
<br />
The regular expressions provided by the GHC bundle (up to 6.10.1) are<br />
through a wrapping of the operating system's native C library. The C<br />
library calls in<br />
[http://www.opengroup.org/onlinepubs/009695399/basedefs/regex.h.html "regex.h"]<br />
have been imported via FFI and wrapped by<br />
[http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-posix regex-posix]<br />
, either the older 0.72.* version or the newer version (> 0.94.*).<br />
Note that the matchAll semantics changed in regex-posix with version<br />
0.94.0 (see mailing list announcement).<br />
<br />
Unfortunately the native platforms provide Posix regular expressions<br />
support that contains bugs and/or violates the<br />
[http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html specification]. This is especially true for the GNU C library (GLIBC)<br />
used by Linux distributions and the BSD C library used by OS X and<br />
FreeBSD and NetBSD (XBSD). More discussion of the standard is<br />
available from<br />
[http://www.research.att.com/~gsf/testregex/re-interpretation.html Glenn Fowler] at AT&T.<br />
<br />
A capsule summary of the Posix rules:<br />
* regular expressions (REs) take the leftmost starting match, and the longest match starting there<br />
* earlier subpatterns have leftmost-longest priority over later subpatterns<br />
* higher-level subpatterns have leftmost-longest priority over their component subpatterns<br />
* REs have right associative concatenation which can be changed with parenthesis<br />
* parenthesized subexpressions return the match from their last usage<br />
* text of component subexpressions must be contained in the text of the higher-level subexpressions<br />
* if "p" and "q" can never match the same text then "p|q" and "q|p" are equivalent, up to trivial renumbering of captured subexpressions<br />
* if "p" in "p*" is used to capture non-empty text then additional repetitions of "p" will not capture an empty string<br />
<br />
All of the above rules will be violated by one or another bug described below.<br />
<br />
= regex-posix-unittest =<br />
<br />
Many of these unittests come from<br />
[http://www.research.att.com/~gsf/testregex/testregex.html testregex by AT&T]<br />
, the rest from the author of regex-tdfa ([[ChrisKuklewicz]]). Note<br />
that all of these tests are passed by regex-tdfa-0.97.1 (but not below<br />
this). Since then additional bugs have been found and fixed, so regex-tdfa is now<br />
on at least version 0.97.4, and a future regex-posix-unittest may include the<br />
additional test cases.<br />
<br />
These problems are documented here, but examples of the bugs are<br />
testable on your system by way of the regex-posix-unittest package on<br />
Hackage. This package can be installed either as "--user" or<br />
"--global", and provides three things:<br />
<br />
# The "regex-posix-unittest" binary executable<br />
# The "test-manifest.txt" file which lists files to load test from<br />
# A set of ".txt" files containing one unit test per line<br />
<br />
Running the regex-posix-unittest program produces output for the<br />
passing or failing of each test in each file listed in "test-manifest.txt",<br />
which is shipped listing all the test files. This is followed a<br />
summary list of failing test id numbers.<br />
<br />
Each unit test line has four fields, separated by white-space (one tab<br />
in the above):<br />
<br />
# The test identification Int, negative if this is expected to fail<br />
# The regular expression pattern (extended regular expression)<br />
# The text to search (no white-space) # The expected result<br />
## NOMATCH if the matching should fail ## "(n,m)(..)" if the match succeeds<br />
##* Each pair of numbers denotes a substring of the text to search as two 0-based indexes<br />
##* The length of the substring is (m-n)<br />
##* The first pair is the whole match, further pairs are for parenthesized subexpression captures<br />
##* Subexpressions that do not match are lists as "(?,?)" instead of with numbers<br />
<br />
You can add and delete but please do not change existing tests.<br />
<br />
If your platform is not mentioned in this wiki page (see below) then<br />
please either add it and your summary results or email the results to<br />
"TestRegexLazy -at- mightyreason -dot- com".<br />
<br />
One can also pass two strings to "regex-posix-unittest" to run your own test:<br />
<pre><br />
user@ubuntu810desktop:~/test/regex-posix-unittest$ ~/.cabal/bin/regex-posix-unittest "abcd" "([^c])+"<br />
Test for [Char], Version {versionBranch = [1,0], versionTags = []}<br />
"abcd"<br />
"([^c])+"<br />
Posix<br />
("","ab","cd",["b"])<br />
POSIX count = (2,2)<br />
array (0,1) [(0,(0,2)),(1,(1,1))]<br />
array (0,1) [(0,(3,1)),(1,(3,1))]<br />
</pre><br />
<br />
This actually is a slightly return format. The strings are echoed and<br />
the ("","ab","cd",["b"]) is the "" before the first match, the "ab" of<br />
the whole first match, and the "cd" after the first match, which the<br />
captures in the list ["b"]. The arrays lines are the results of<br />
consecutive non-overlapping matches, in (capture #, (start index,<br />
length)) format. So (1,(1,1)) is the 1st parenthesis and starts after<br />
the 1st character and is 1 character long: "b". The (1,(3,1)) is "d".<br />
<br />
= Types of failure =<br />
<br />
There are four classes of failures:<br />
# Critical failures that find the wrong whole match (XBSD as of January 2009, Solaris)<br />
# Serious failures that find impossible subexpressions (XBSD)<br />
# Serious failures that violate well-formed subexpressions (GLIBC)<br />
# Failure to choose Posix captures when the answer is ambiguous (GLIBC, XBSD, Solaris)<br />
<br />
Example of an ambiguous situation: Matching "a" with the the pattern<br />
"(.)?(a)?" or the pattern "(.)|(a)". Each pattern can use either<br />
"(.)" or "(a)" to match the "a', but not both; thus only one of the<br />
two subexpressions must succeed and the other must fail. In this case<br />
the right answer is that "(.)" succeeds in both cases, so the full<br />
answer is (0,1)(0,1)(?,?). The first pattern uses the right<br />
associativity rule for concatenation and the second pattern uses the<br />
leftmost subpattern bias.<br />
<br />
= Results and Bugs =<br />
<br />
== OS X, FreeBSD, NetBSD ==<br />
<br />
On a G4 version of OS X 10.5.6 I see:<br />
<pre><br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/basic3.txt",[])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/class.txt",[3,9])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/left-assoc.txt",[])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/right-assoc.txt",[])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/forced-assoc.txt",[5,6,7,8,15,16,21,22])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/nullsub3.txt",[50,51])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/repetition2.txt",[101,102,103,104,105,106,107,<br />
110,111,112,113,114,115,116,117,260,261,262,263,266,267,268,270,271])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/totest.txt",[5,30,33,209])<br />
("/Users/chrisk/local/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[1,-1,2,-2,3,-3,14,-14])<br />
</pre><br />
The "osx-bsd-critical.txt" unexpected failures are:<br />
<pre><br />
############################# Unexpected Fail # 1 #############################<br />
<br />
Searched text: "ab"<br />
Regex pattern: "(()|.)(b)"<br />
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"<br />
Actual result : "(1,2)(1,1)(1,1)(1,2)"<br />
<br />
############################# Unexpected Fail # 2 #############################<br />
<br />
Searched text: "ab"<br />
Regex pattern: "(()|[ab])(b)"<br />
Expected output: "(0,2)(0,1)(-1,-1)(1,2)"<br />
Actual result : "(1,2)(1,1)(1,1)(1,2)"<br />
<br />
############################# Unexpected Fail # 3 #############################<br />
<br />
Searched text: "aaab"<br />
Regex pattern: "(()|[ab])+b"<br />
Expected output: "(0,4)(2,3)(-1,-1)"<br />
Actual result : "(3,4)(3,3)(3,3)"<br />
<br />
############################# Unexpected Fail # 14 #############################<br />
<br />
Searched text: "aaab"<br />
Regex pattern: "([ab]|())+b"<br />
Expected output: "(0,4)(2,3)(-1,-1)"<br />
Actual result : "(0,4)(3,3)(3,3)"<br />
</pre><br />
<br />
Tests 1 and 2 should match the whole (0,2) or "ab" but only match the<br />
(1,2) of "b". Test 3 should match the (0,3) of "aabb" but only<br />
matches the (3,4) of "b". These are critical failures to match the<br />
whole text properly. This problem was found by QuickCheck and<br />
reported to OS X and FreeBSD and NetBSD (the latter two checked over<br />
IRC). OpenBSD reportedly did not show this bug (anecdote on IRC).<br />
<br />
This also infects sed, and can be seen as<br />
<pre><br />
prompt$ echo "ab" | sed -En 's/(()|.)(b)/[&][\1]/p'<br />
a[b][]<br />
</pre><br />
instead of the correct "[ab][a]" output.<br />
<br />
The failure in Test 14 is different. Instead of the whole match being<br />
wrong the pattern being repeated by the "+" operator is matched 3<br />
times against "a" and "a" and "a", followed by a fourth empty match at<br />
(3,3) before the final "b". Once the repeated pattern "[ab]|()" has<br />
matched a non-empty string it should not be used to match an<br />
empty-string.<br />
<br />
OS X also gets this wrong in "totext.txt" #209:<br />
<pre><br />
Searched text: "yyyyyy"<br />
Regex pattern: "(yyy|(x?)){2,4}"<br />
Expected output: "(0,6)(3,6)(-1,-1)"<br />
Actual result : "(0,6)(6,6)(6,6)"<br />
</pre><br />
where the (x?) matches the empty string at (6,6) instead of stopping<br />
after 2 repetitions of "yyy".<br />
<br />
But OS X is getting this right on numerous tests in "nullsub3.txt" such as<br />
<pre><br />
Expected Pass #8<br />
text and pattern: "a"<br />
Regex pattern: "(a*)*"<br />
Outputs agree: "(0,1)(0,1)"<br />
<br />
Expected Pass #47<br />
text and pattern: "ax"<br />
Regex pattern: "(a*)*(x)"<br />
Outputs agree: "(0,2)(0,1)(1,2)"<br />
</pre><br />
<br />
Where (a*) matches "a" and is reported as (0,1) and the no additional repetition<br />
matches the empty string at (1,1).<br />
<br />
The OS X failure in "totest.txt" #33 illustrates the "impossible" capture:<br />
<pre><br />
Searched text: "ababa"<br />
Regex pattern: "(aba|ab|a)*"<br />
Expected output: "(0,5)(2,5)"<br />
Actual result : "(0,5)(0,3)"<br />
</pre><br />
The whole match is (0,5) and the correct repetition has "ab" then<br />
"aba" matching, so the correct last repetition is the "aba" between<br />
indexes (2,5). The OS X code recognizes the whole match then does a<br />
second pass of greedy repetitions, so the first repetition is "aba"<br />
between indexes (0,3). The OS X code then fails to match the "ba" in<br />
this second pass but not care. It is logically impossible that the<br />
last iteration of the above subexpression ends before the whole match,<br />
but OS X reports the last repetition ended at index 3 even though the<br />
whole match ends at index 5.<br />
<br />
This is also shown in numerous failed tests in "repetition2.txt" such as #260:<br />
<pre><br />
Searched text: "ababcd"<br />
Regex pattern: "(a|ab|c|bcd){0,}(d*)"<br />
Expected output: "(0,6)(3,6)(6,6)"<br />
Actual result : "(0,6)(4,5)(6,6)"<br />
</pre><br />
The correct repetitions are "ab" then "a" then "bcd", reported as (3,6). The incorrect greedy pass uses by OS X matches "ab" then "c" as (4,5) and then gives up even though (5,6) is still not matched in the second pass. As a sed test:<br />
<pre><br />
prompt$ echo "ababcd" | sed -En 's/(a|ab|c|bcd)*/[&][\1]/p'<br />
[ababcd][c]<br />
</pre><br />
The above output should be impossible; the correct output is [ababcd][bcd].<br />
<br />
The "forced-assoc.txt" failures show that one cannot use parenthesis<br />
to force the naturally right associative concatenation to be left<br />
associative. This is a violation of the Posix standard, and of the<br />
"man 7 re_format" page form OS X 10.5.6: "Subexpressions also match<br />
the longest possible substrings, subject to the constraint that the<br />
whole match be as long as possible, with subexpressions starting<br />
earlier in the RE taking priority over ones starting later. Note that<br />
higher-level subexpressions thus take priority over their lower-level<br />
component subexpressions". Take #15 as an example:<br />
<pre><br />
Searched text: "abc"<br />
Regex pattern: "((a*)(b|abc))(c*)"<br />
Expected output: "(0,3)(0,3)(0,0)(0,3)(3,3)"<br />
Actual result : "(0,3)(0,2)(0,1)(1,2)(2,3)"<br />
</pre><br />
Here ((a*)(b|abc))(c*) should be parsed as ((a*)(b|abc)) followed by<br />
(c*). As ((a*)(b|abc)) started earlier it should be as long as<br />
possible, at the expense of (c*). The longest match for this is the<br />
correct (0,3) not the incorrect (0,2) returned by OS X. And<br />
((a*)(b|abc)) is a higher-level than its lower-level component (a*),<br />
so (a*) should match with the side constraint that ((a*)(b|abc))<br />
matches (0,3), which means (a*) should match (0,0) instead of the<br />
longer (0,1). OS X gets this associativity this wrong.<br />
<br />
In "repetitions2.txt" there is a series:<br />
<pre><br />
Expected Pass #100<br />
text and pattern: "X1234567Y"<br />
Regex pattern: "X(.?){0,}Y"<br />
Outputs agree: "(0,9)(7,8)"<br />
<br />
############################# Unexpected Fail # 101 #############################<br />
<br />
Searched text: "X1234567Y"<br />
Regex pattern: "X(.?){1,}Y"<br />
Expected output: "(0,9)(7,8)"<br />
Actual result : "(0,9)(8,8)"<br />
...<br />
############################# Unexpected Fail # 107 #############################<br />
<br />
Searched text: "X1234567Y"<br />
Regex pattern: "X(.?){7,}Y"<br />
Expected output: "(0,9)(7,8)"<br />
Actual result : "(0,9)(8,8)"<br />
<br />
Expected Pass #108<br />
text and pattern: "X1234567Y"<br />
Regex pattern: "X(.?){8,}Y"<br />
Outputs agree: "(0,9)(8,8)"<br />
<br />
############################# Unexpected Fail # 110 #############################<br />
<br />
Searched text: "X1234567Y"<br />
Regex pattern: "X(.?){0,8}Y"<br />
Expected output: "(0,9)(7,8)"<br />
Actual result : "(0,9)(8,8)"<br />
<br />
############################# Unexpected Fail # 111 #############################<br />
<br />
Searched text: "X1234567Y"<br />
Regex pattern: "X(.?){1,8}Y"<br />
Expected output: "(0,9)(7,8)"<br />
Actual result : "(0,9)(8,8)"<br />
...<br />
<br />
############################# Unexpected Fail # 117 #############################<br />
<br />
Searched text: "X1234567Y"<br />
Regex pattern: "X(.?){7,8}Y"<br />
Expected output: "(0,9)(7,8)"<br />
Actual result : "(0,9)(8,8)"<br />
<br />
Expected Pass #118<br />
text and pattern: "X1234567Y"<br />
Regex pattern: "X(.?){8,8}Y"<br />
Outputs agree: "(0,9)(8,8)"<br />
</pre><br />
<br />
The good news is that #108 and #118 are correct, as the last<br />
repetition is forced to match the empty string after the "7" and<br />
before the "Y". The {1,} through {7,} and {1,8} through {7,8} are all<br />
wrong: they all should match the "7" at (7,8) but XBSD returns the<br />
null at (8,8). The {0,} answer is right, but {0,8} and {1,} are both<br />
wrong. So {0,} seems to handled specially, probably because {0,} is<br />
precisely the "*" operator, though the wrong {1,} is precisely the "+"<br />
operator.<br />
<br />
The failures of XBSD in "repetitions2.txt" in the 260 to 271 range<br />
show a different ambiguity where {0,} is not matched correctly. It<br />
gets these wrong by being greedy in the expanded form of the {n,m}<br />
operators. The "a|ab|c|bcd" should choose "ab" then "a" then "bcd" at<br />
(3,6) to match "ababcd", then (d*) should match nothing at (6,6). But<br />
XBSD is matching "ab" then "ab" then "c" at (4,5) and then uses (d*)<br />
to match the last "d" at (5,6). Changing the pattern to<br />
"((a|ab|c|bcd){3,10})(d*)" fails to force the grouping and<br />
associativity.<br />
<br />
== OpenBSD ==<br />
<br />
There is no report from OpenBSD. One would be quite welcome!<br />
<br />
== GNU C library (aka GLIBC) ==<br />
<br />
The system being tested is a current Ubuntu 8.10 distribution:<br />
<pre><br />
user@ubuntu810desktop:~$ uname -a<br />
Linux ubuntu810desktop 2.6.27-11-generic #1 SMP Wed Jan 28 00:02:01 UTC 2009 i686 GNU/Linux<br />
</pre><br />
The failing tests are<br />
<pre><br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/basic3.txt",[71])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/class.txt",[3,5,10])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/left-assoc.txt",[-1,-2,-9,-10])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/right-assoc.txt",[1,2,9,10])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/forced-assoc.txt",[7,8,9,10,15,16,21,22,27])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/nullsub3.txt",[41])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/repetition2.txt",[26,28,34,41,42,108,110,111,112,113,114,115,116])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/totest.txt",[5,24,26,27,36,42,83,84,110,206,207,208,209,215,250,251])<br />
("/home/user/.cabal/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[3])<br />
</pre><br />
<br />
The GLIBC engine has an API for its own GNU regular expression<br />
standard which differs from Posix. It also exposes a Posix compatible<br />
"regex.h" API, and GNU sed has a "--posix" switch. But the library is<br />
not really trying to get the Posix answer.<br />
<br />
First is nested subexpressions. In Posix sets of nested parenthesis<br />
that return captures must return nested captures: the inner<br />
subexpressions return substrings of the outer substrings. Consider<br />
"totest.txt" #215:<br />
<pre><br />
Searched text: "ab"<br />
Regex pattern: "((a)|(b)){2,}"<br />
Expected output: "(0,2)(1,2)(-1,-1)(1,2)"<br />
Actual result : "(0,2)(1,2)(0,1)(1,2)"<br />
</pre><br />
Where "(a)" matches the "a" at (0,1), and the next repetion matches<br />
"(b)" matches the "b" at (1,2). Under Posix rules the "(a)" did not<br />
match in the last use of the parent subexpression ((a)|(b)), so it<br />
should be (-1,-1). Under GLIBC rules it should return (0,1). The<br />
above could largely be fixed by a filter that unsets child patterns<br />
that are not properly nested, though I worry about edge cases and<br />
empty matches.<br />
<br />
In GLIBC this is not the case, take "osx-bsd-critical" #3 and #14 as an example:<br />
<pre><br />
############################# Unexpected Fail # 3 #############################<br />
Searched text: "aaab"<br />
Regex pattern: "(()|[ab])+b"<br />
Expected output: "(0,4)(2,3)(-1,-1)"<br />
Actual result : "(0,4)(2,3)(0,0)"<br />
<br />
Expected Pass #14<br />
text and pattern: "aaab"<br />
Regex pattern: "([ab]|())+b"<br />
Outputs agree: "(0,4)(2,3)(-1,-1)"<br />
</pre><br />
<br />
In #3, the "()" does match at the initial position, so it gets its<br />
capture reported, even though (0,0) is not a substring of (2,3). This<br />
is contradicted by #14 where "()" does not match at the initial<br />
position. This is an example of where "p|q" should be the same as<br />
"q|p" but returns a different answer.<br />
<br />
The failure of some of the "right-assoc.txt" tests shows that GLIBC<br />
uses neither right nor left associative concatenation:<br />
<pre><br />
############################# Unexpected Fail # 1 #############################<br />
<br />
Searched text: "abcd"<br />
Regex pattern: "(a|ab)(c|bcd)(d*)"<br />
Expected output: "(0,4)(0,2)(2,3)(3,4)"<br />
Actual result : "(0,4)(0,1)(1,4)(4,4)"<br />
<br />
############################# Unexpected Fail # 2 #############################<br />
<br />
Searched text: "abcd"<br />
Regex pattern: "(a|ab)(bcd|c)(d*)"<br />
Expected output: "(0,4)(0,2)(2,3)(3,4)"<br />
Actual result : "(0,4)(0,1)(1,4)(4,4)"<br />
<br />
Expected Pass #3<br />
text and pattern: "abcd"<br />
Regex pattern: "(ab|a)(c|bcd)(d*)"<br />
Outputs agree: "(0,4)(0,2)(2,3)(3,4)"<br />
<br />
Expected Pass #4<br />
text and pattern: "abcd"<br />
Regex pattern: "(ab|a)(bcd|c)(d*)"<br />
Outputs agree: "(0,4)(0,2)(2,3)(3,4)"<br />
</pre><br />
<br />
Test #1 through #4 check all four orderings of "a" versus "ab" and "c"<br />
versus "bcd". And GLIBC seems to be applying a greedy strategy to the<br />
"a" or "ab" subpattern. Consider these in "forced-assoc.txt":<br />
<pre><br />
Expected Pass #5<br />
text and pattern: "abcd"<br />
Regex pattern: "((a|ab)(c|bcd))(d*)"<br />
Outputs agree: "(0,4)(0,4)(0,1)(1,4)(4,4)"<br />
<br />
Expected Pass #6<br />
text and pattern: "abcd"<br />
Regex pattern: "((a|ab)(bcd|c))(d*)"<br />
Outputs agree: "(0,4)(0,4)(0,1)(1,4)(4,4)"<br />
<br />
############################# Unexpected Fail # 7 #############################<br />
<br />
Searched text: "abcd"<br />
Regex pattern: "((ab|a)(c|bcd))(d*)"<br />
Expected output: "(0,4)(0,4)(0,1)(1,4)(4,4)"<br />
Actual result : "(0,4)(0,3)(0,2)(2,3)(3,4)"<br />
<br />
############################# Unexpected Fail # 8 #############################<br />
<br />
Searched text: "abcd"<br />
Regex pattern: "((ab|a)(bcd|c))(d*)"<br />
Expected output: "(0,4)(0,4)(0,1)(1,4)(4,4)"<br />
Actual result : "(0,4)(0,3)(0,2)(2,3)(3,4)"<br />
</pre><br />
In the above the extra parenthesis around the first two subexpression<br />
ought to have forced the pair to match the longest possible string<br />
"abcd" from (0,4). But #7 and #8 failed to match this because "ab"<br />
was greedily chosen.<br />
<br />
This greedy approach is not entirely consistent, see "repetition2.txt" #108:<br />
<pre><br />
Searched text: "X1234567Y"<br />
Regex pattern: "X(.?){8,}Y"<br />
Expected output: "(0,9)(8,8)"<br />
Actual result : "(0,9)(7,8)"<br />
</pre><br />
The first seven repetitions should match 1234567 and the 8th<br />
repetition would then match an empty string at (8,8) after the "7" and<br />
before the "Y". But GLIBC does something I do not understand and<br />
ensures that the last repetition is not empty.<br />
<br />
Another not-greedy-enough example is "basic3.txt" #71:<br />
<pre><br />
Searched text: "-"<br />
Regex pattern: "(^)*"<br />
Expected output: "(0,0)(0,0)"<br />
Actual result : "(0,0)(-1,-1)"<br />
</pre><br />
The ^ should match and capture at (0,0), but GLIBC uses zero<br />
repetitions of (^). Changing "*" to "+" causes GLIBC to correctly<br />
return the capture at (0,0).<br />
<br />
== Solaris C library ==<br />
<br />
Christian Maeder ran regex-posix-unittest and sent in Solaris 10 results with the message:<br />
<pre><br />
Same results for:<br />
SunOS 5.10 Generic_138889-03 i86pc i386 i86pc<br />
SunOS 5.10 Generic_138888-02 sun4u sparc SUNW,Sun-Fire-280R<br />
</pre><br />
<br />
The summary is <br />
<pre><br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/basic3.txt",[4,65,67,71,76])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/class.txt",[3,5])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/left-assoc.txt",[-1,-2,-9,-10])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/right-assoc.txt",[1,2,9,10])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/forced-assoc.txt",[7,8,9,10,13,14,15,16,19,20,21,22,27])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/nullsub3.txt",[2,18,26,38,40,46])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/repetition2.txt",[])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/totest.txt",[3,5,24,26,27,42,43,44,45,110,204,211,212])<br />
("/home/maeder/.cabal/share/regex-posix-unittest-1.0/osx-bsd-critical.txt",[])<br />
</pre><br />
<br />
The above is similar to but not the same as the GLIBC bugs.<br />
<br />
The "basic3.txt" #4 is a syntax error reported for the "^$" pattern,<br />
which seems allowed to me: Error message: (ReturnCode 22,"unknown regex error")<br />
<br />
The "basic3.txt" #65 is a new bug:<br />
<pre><br />
Searched text: "-"<br />
Regex pattern: "(a*)*"<br />
Expected output: "(0,0)(0,0)"<br />
Actual result : "(0,0)(-1,-1)"<br />
</pre><br />
<br />
The "a*" can accept 0 characters so (a*)* should match once at (0,0). There are several other examples of this in "basic3.txt" and "nullsub3.txt"<br />
<br />
The too-greedy behavior of GLIBC and the lack of forced association is present in the Solaris bugs.<br />
<br />
But in "forced-assoc.txt" #13 I see something far more critical! A critical bug indeed:<br />
<pre><br />
Searched text: "abc"<br />
Regex pattern: "(a*)(b|abc)"<br />
Expected output: "(0,3)(0,0)(0,3)"<br />
Actual result : "(0,2)(0,1)(1,2)"<br />
</pre><br />
Note that the whole match should be "abc" but Solaris matches only "ab". (a*) is far far too greedy!<br />
<br />
This is also caught in "totest.txt" #3, #43, #44, #45, showing #3 below:<br />
<pre><br />
Searched text: "ab"<br />
Regex pattern: "(a?)((ab)?)"<br />
Expected output: "(0,2)(0,0)(0,2)(0,2)"<br />
Actual result : "(0,1)(0,1)(1,1)(-1,-1)"<br />
</pre><br />
<br />
== AT&T Software Technology (aka AST or libast) ==<br />
<br />
AT&T Research has an [http://www.research.att.com/sw/download/ open source suite] of libraries and binaries. This includes a [http://www.research.att.com/~gsf/testregex/re-interpretation.html POSIX regular expression implementation] in the "libast" library. Most of the unit tests used here are from their [http://www.research.att.com/~gsf/testregex/ "testregex"] program.<br />
<br />
I, [[ChrisKuklewicz]], have managed to compile a "regex-ast" Haskell wrapper to this POSIX engine to allow for testing. The difficulties with the AT&T headers prevent me from releasing this to hackage. The AST engine passes the whole set of unit tests.<br />
<br />
But randomized testing has found a few (probably 2, maybe 3) bugs in the AST engine as of 2009-02-24. In return, testing against AST has lead to the regex-tdfa-0.97.4 bug fix release. As regex-tdfa and AST improve they will hopefully reach a state where no differences can be found; this will likely be a fully correct state since their designs and implementations are completely different.<br />
<br />
<pre><br />
Bug in AST engine<br />
Pattern: "((.?)?)*."<br />
Text to search: "x"<br />
Expected: (0,1)(0,0)(0,0)<br />
Returned: (0,1)(0,0)(-1,-1)<br />
<br />
For the next group the Text to search is "x" and<br />
Expected: (0,0)(0,0)(0,0)<br />
Pattern and Returned:<br />
"((.?)?){0,}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){1,}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){2,}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){3,}." (0,0)(0,0)(-1,-1)<br />
<br />
"((.?)?){0,1}." (0,0)(0,0)(0,0)<br />
<br />
"((.?)?){0,2}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){1,2}." (0,0)(0,0)(0,0)<br />
<br />
"((.?)?){0,3}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){1,3}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){2,3}." (0,0)(0,0)(0,0)<br />
<br />
"((.?)?){0,4}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){1,4}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){2,4}." (0,0)(0,0)(-1,-1)<br />
"((.?)?){3,4}." (0,0)(0,0)(0,0) <br />
</pre><br />
<br />
If someone else is interested in "regex-ast" then write to me at "haskell _at_ mightyreason _dot_ com".<br />
<br />
= Other implementations =<br />
<br />
== Boost (C++ library) ==<br />
<br />
There is a [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html Boost Regex] library, which includes [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/ref/posix.html "POSIX Compatible C API's"]. But their definition of the [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/syntax/leftmost_longest_rule.html "The Leftmost Longest Rule"] is a different interpretation than the POSIX standard described above.<br />
<br />
In the POSIX standard, the rule applies to subpatterns, regardless of capturing. Boost only applies it to captured subexpressions. Their [http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/background_information/faq.html FAQ] explains:<br />
<br />
; Q. Why does using parenthesis in a POSIX regular expression change the result of a match?<br />
<br />
: A. For POSIX (extended and basic) regular expressions, but not for perl regexes, parentheses don't only mark; they determine what the best match is as well. When the expression is compiled as a POSIX basic or extended regex then Boost.Regex follows the POSIX standard leftmost longest rule for determining what matched. So if there is more than one possible match after considering the whole expression, it looks next at the first sub-expression and then the second sub-expression and so on. So...<br />
<br />
: "(0*)([0-9]*)" against "00123" would produce $1 = "00" $2 = "123"<br />
<br />
: where as<br />
<br />
: "0*([0-9])*" against "00123" would produce $1 = "00123"<br />
<br />
:If you think about it, had $1 only matched the "123", this would be "less good" than the match "00123" which is both further to the left and longer. If you want $1 to match only the "123" part, then you need to use something like:<br />
<br />
: "0*([1-9][0-9]*)"<br />
<br />
The POSIX subpattern rules would always have the "0*" match "00" and the "[0-9]*" match "123", regardless of whether either is parenthesized.<br />
<br />
== HSRex / Postgresql / TCL ==<br />
<br />
This is from [http://www.arglist.com/regex/] which is named after Henry Spencer. This has been used in TCL 8.2 (at least) and postgresql 7.4 (at least) and wxWidgets. I tested the "Walter Waldo's port" listed on that page (hsrex.tar.gz). It has a different spectrum of non-Posix bugs. An interesting new one is:<br />
<br />
<pre><br />
test '(aba|ab|a)*' 'ababa'<br />
(0,5)(4,5)<br />
</pre><br />
<br />
Where the expected answer is (0,5)(2,5). It has picked a possible match, but has not maximized the second capture, using (ab) instead of (aba). It finds the right answer for explicit repetitions because it does find the longest match:<br />
<br />
<pre><br />
test '(aba|ab|a)(aba|ab|a)' 'ababa'<br />
(0,5)(0,2)(2,5)<br />
</pre><br />
<br />
<pre><br />
test '(aba|ab|a)(aba|ab|a)(aba|ab|a)' 'ababa'<br />
(0,5)(0,2)(2,4)(4,5)<br />
</pre></div>Sergv