Es/Guía de Haskell para autoestopistas

From HaskellWiki
Revision as of 19:29, 10 November 2006 by Miguel (talk | contribs)
Jump to navigation Jump to search


Aviso: Esta página está en proceso de traducción.

Preámbulo: ¡NO CORRER!

Experiencias recientes de algunos compañeros programadores de C++/Java indican que leyeron varios tutoriales sobre Haskell con “velocidad exponencial” (piensa en como las sesiones de TCP/IP aumentan). Al principio son pocas y prudentes, pero cuando ven que las primeras 3-5 páginas no contienen “nada interesante” en términos de código y ejemplos, empiezan a saltarse párrafos, después capítuos, después páginas enteras, tan sólo decelerando - a menudo hasta parar completamente - alrededor de la página 50, encontrándose con el grueso de conceptos como “clases de tipos”, “constructores de tipos”, “IO monádica”, punto en el que normalmente les entra el pánico, piensa en una escusa perfectamente racional para no seguir leyendo, y se olvida alegremente de este triste y escalofriante encuentro con Haskell (ya que los seres humanos tendemos a olvidar las cosas tristas y escalofriantes).

Este texto pretende introducir al lector a los aspectos prácticos de Haskell desde el principio del todo (los planes para el primer capítulo incluyen: I/O, darcs, Parsec, QuickCheck, depurar y perfilar, por mencionar algunos). Se supone que el lector sabe (donde encontrar) por lo menos los primeros pasos de Haskell: como ejecutar “hugs” o “ghci”, ese diseño es 2-dimensional, etc. Aparte de eso, no esperamos avanzar radicalmente, e iremos paso por paso para no perder a los lectores por el camino. Así que NO CORRER, coge la toalla contigo y continua leyendo.

Si te has saltado el párrafo anterior, me gustaría destacar una vez más que Haskell es sensible a la indentación y espaciado, así que presta atención cuando hagas copias o alineación manual de código en el editor de texto con fuentes proporcionales.

Ah, casi me olvido: el autor está muy interesado en CUALQUIER opinión. Escríbele algunas líneas o palabras (mira Adept para la información de contacto) o propón cambios al tutorial mediante darcs (el repositorio está aquí) o directamente a este Wiki.

Capítulo 1: Omnipresente “¡Hola mundo!” y otras formas de hacer IO en Haskell

Cada capítulo será dedicado a una pequeña tarea real que completaremos desde el principio.

Así que aquí está la tarea para este capítulo: para liberar espacio en tu disco duro para todo el código de haskell que vas a escribir en un futuro cercano, vas a guardar algo de la vieja y polvorienta información en CDs y DVDs. Mientras que grabar CDs (o DVDs) es fácil hoy en día, normalmente ocupa algo (o mucho) tiempo decidir como grabar algunos GB de fotos digitales en CD-Rs, cuando los directorios con las imágenes varían desde 10 a 300 Mb de espacio, y no quieres grabar CDs medio llenos (o medio vacíos).

El ejercicio consiste en escribir un programa que nos ayude a poner un conjunto de directorios en la cantidad mínima posible de discos, al mismo tiempo que aprovechamos cada disco lo máximo posible. Llamemos a este programa “cd-fit”.

Oh. Espera. Hagamos el usual programa “hola mundo”, antes de que nos olvidemos, y después sigamos con cosas más interesante:

-- Cogido de 'hola.hs'
-- A partir de ahora, un comentario al principio del trozo de código
-- especificará el archivo que contiene el programa entero del que fue cogido
-- el trozo. Puedes coger el código del repositorio de darcs
-- "http://adept.linux.kiev.ua/repos/hhgtth" introduciendo el comando 
-- "darcs get http://adept.linux.kiev.ua/repos/hhgtth"
module Main where
main = putStrLn "¡Hola mundo!"

Ejecútalo:

$ runhaskell ./hola.hs
¡Hola mundo!

Vale, ya lo hicimos. Sigamos ahora, no hay nada interesante aquí :)

Cualquier desarrollo serio se debe hacer con un sistema de control de versiones, y no haremos una excepción. Usaremos el moderno sistema de control de versiones distribuido “darcs”. “Moderno” significa que está escrito en Haskell, “distribuido” significa que cada copia que funcione es un repositorio en si mismo.

Primero, creemos un directorio vacío para nuestro todo nuestro código, y ejecuta “darcs init” en él, que creará un subdirectorio “_darcs” para guardar todo lo relacionado con el control de versiones dentro de él.

Inicia tu editor favorito y crea un fichero nuevo llamado “cd-fits.hs” en nuestro directorio de trabajo. Ahora pensemos por un momento como funcionará nuestro programa y expresémoslo en pseudocódigo:

main = leer lista de directorio y sus tamaños
       decidir como ajustarlos a los CD-Rs
       imprimir la solución

¿Parece lógio? Pienso que si.

Vamos a simplificar nuestra vida un poco y asumir por ahora que vamos calcular el espacio de los directorios de alguna forma fuera de nuestro programa (por ejemplo, con “du -sb *”) y leer esa información desde stdin. Convirtamos esto a Haskell:

-- Cogido de 'cd-fit-1-1.hs'
module Main where
 
main = do input <- getContents
          putStrLn ("DEBUG: tengo entrada" ++ input)
	  -- calcular solución e imprimirla

No funciona en realidad, pero bastante parecido al Español, ¿no? Paremos por un momento y miremos más de cerca que hemos escrito línea por línea.

Empecemos desde arriba:

-- Cogido de 'cd-fit-1-1.hs'
input <- getContents

Esto es un ejemplo de la sintaxis de Haskell para hacer IO (en este caso, entrada de datos). Esta línea es una instrucción para leer toda la información disponible desde stdin, devolverla como una única cadena, y unirla al símbolo “input”, de forma que podemos procesar esta cadena de la forma que querramos.

¿Cómo lo sabía? ¿Memoricé todas las funciones? ¡Claro que no! Cada función tiene un tipo, que con su nombre, tiende a decir mucho sobre lo que hace la función.

Lanza un entorno interactivo de Haskell y examinemos esta función de cerca:

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.4.1, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base-1.0 ... linking ... done.
Prelude> :type getContents
getContents :: IO String
Prelude> 

Vemos que “getContents” es una función sin argumentos, que devuelve un “IO String”. El prefijo “IO” significa que es una acción de IO. Devolverá un String, cuando se evalue. La acción será evaluada cuando usemos “<-” para enlazar su resultado a un símbolo.

Ten en cuenta que “<-” no es una forma bonita de asignar un valor a una variable. Es una forma de evaluar (ejecutar) acciones de IO, en otras palabras - para hacer alguna operación de I/O y devolver su resultado (si es que tiene).

Podemos escoger no evaluar la acción que obtenemos de “getContents”, y en vez de eso dejarla por ahí y evaluarla más tarde:

let x = getContents
-- 300 líneas de código
input <- x

Como puedes ver, las acciones de IO pueden ser usadas como valores ordinarios. Supón que hemos construído una lista de acciones IO y hemos encontrado una forma de ejecutarlas una por una. Sería una forma de simular programación imperativa con su notación de “orden de ejecución”.

Haskell te permite hacerlo mejor.

La librería estandard del lenguaje (llamada “Prelude”) nos da acceso a muchas funcinoes que devuelven primitivas de acciones de IO muy útiles. Para combinarlas entre ellas para producir acciones más complejas, usamos “do”:

c = do a <- algunaAcción
       b <- algunaOtraAcción
       print (bar b)
       print (foo a)
       putStrLn "hecho"

De esta forma asociamos “c” a una acción con el siguiente “escenario”:

  • evalua la acción “algunaAcción” y asocia su resultado a “a”
  • después, evalua “algunaOtraAcción” y asocia su resultado a “b”
  • después, procesa “b” con la función “bar” e imprime su resultado
  • después, procesa “a” con la función “foo” e imprime su resultado
  • después, imprime la palabra “hecho”

¿Cuando se ejecutará todo esto en realidad? Respuesta: tan rápido como evaluemos “c” usando “<-” (si devuelve un resultado, como hace “getContents”) o usándola como el nombre de una función (si no devuelve un resultado, como hace “print”):

process = do putStrLn "Procesará algo"
             c
             putStrLn "Hecho"

Date cuenta de que hemos cogido unas cuantas funciones (“algunaAcción”, “algunaOtraAcción”, “print”, “putStrLn”) y usando “do” creamos a partir de ellas una nueva función, que enlazamos al símbolo “c”. Ahora podemos usar “c” como una pieza de construcción para construir una función más compleja “proceso”, y podríamos continuar haciendo esto más y más. Finalmente, algunas de las funciones las mencionaremos en el código de la función “main”, función a la cual la última y más importante acción de IO en cualquier programa de Haskell está asociada.

¿Cuándo se ejececutará/evaluará/forzará “main”? Tan rápido como ejecutemos el programa. Lee esto dos veces y trata de comprenderlo:

La ejecución de un programa de Haskell es una evaluación del símbolo “main” a la que hemos asociado una acción de IO. Mediante esta evaluación obtenemos el resultado de la acción.

Los lectores que estén familiriazados con programación en C++ o Java avanzado y ese conjunto de conocimiento arcano llamado “OOP Patrones de Diseño” se darán cuenta de que “construir acciones a partir de acciones” y “evaluar acciones para obtener resultados” es esencialmente un “Patrón de comando” y “Patrón de composición” combinados. Buenas noticias: en Haskell las tienes para todo tu IO, y lo tienes gratis :)


Ejercicio: Fíjate en el siguiente código:

-- Cogido de 'exercise-1-1.hs'
module Main where
c = putStrLn "C!"

combine before after =
  do before
     putStrLn "En el medio"
     after

main = do combine c c
          let b = combine (putStrLn "¡Hola!") (putStrLn "¡Adios!")
          let d = combine (b) (combine c c)
          putStrLn "¡Tanto tiempo!"

¿Te das cuenta de como indentamos las líneas con cuidado para que el código parezca limpio? En realidad, el código de Haskell tiene que estar alineado de esta forma, o sino no compilará. Si usas tabulados para indentar tus códigos fuente, ten en cuenta que los compiladores de Haskell aumen que el tabstop tiene de ancho 8 caracteres.

A menudo las personas se quejan de que es una forma muy difícil de escribir en Haskell porque requiere que alinees el código. En realidad no es verdad. Si alineas tu código, el compilador adivinará cual es el principio y final de los bloques sintácticos. Sin embargo, si no indentas tu código, puedes especificarlos explícitamente en cada una de las expresiones y usar una distribución arbitraria como en este ejemplo:

-- Cogido de 'exercise-1-2.hs'
combine before after = 
 do { before; 
  putStrLn "En el medio"; 
        after; };
 
main = 
  do { combine c c; let { b = combine (putStrLn "¡Hola!") (putStrLn "¡Adios!")};
         let {d = combine (b) (combine c c)}; 
   putStrLn "¡Tanto tiempo!" };

De vuelta al ejercicio - ¿ves como hacemos código desde la nada? Trata de imaginar que hará este código, después ejecútalo y compruébalo por ti mismo.

¿Entiendas por qué “¡Hola!” y “¡Adiós!” no se imprimen?


Examinemos nuestra función “main” más de cerca:

Prelude> :load cd-fit.hs
Compiling Main             ( ./cd-fit.hs, interpreted )
Ok, modules loaded: Main.
*Main> :type main
main :: IO ()
*Main> 

Vemos que “main” es en realidad una acción IO que no devuelve nada cuando la evaluamos. Cuando combinamos acciones con “do”, el tipo del resultado será el tipo de la última acción, y “putStrLn algo” tiene tipo “IO ()”:

*Main> :type putStrLn "¡Hola mundo!"
putStrLn "¡Hola mundo!" :: IO ()
*Main> 

Ah, por cierto: ¿te has dado cuenta que hemos compilado nuestro primer programa de Haskell para examinar “main”? :)

celebrémoslo poniéndolo bajo control de versión: ejecuta “darcs add cd-fit.hs” y “darcs record”, responder “y” a todas las preguntas y dale un comentario al añadido “Esqueleto de cd-fit.hs”

Vamos a probar a ejecutarlo:

$ echo "foo" | runhaskell cd-fit.hs
DEBUG: tengo entrada foo

Ejercicios:

  • Trata de escribir un programa que coja tu nombre de stdin y te felicite (palabras clave: getLine, putStrLn);
  • Trata de escribir un programa que pregunte por tu nombre, lo lea, te felicite, pregunte por tu color favotiro, y lo devuelva (palabras clave: getLine, putStrLn).

Capítulo 2: procesando la entrada

Bien, ahora que tenemos un entendimiento cabal de los poderes de ES en Haskell (y estamos deslumbrados por ellos, espero), nos olvidemos un poco de ES y hagamos algún trabajo útil.

Como tu recordarás, nos habíamos propuesto poner datos tomados de varios directorios en tan pocos discos CD-Rs como sea posible. Asumimos que "du -sb" calculará el tamaño de los directorios de entrada y nos dará como salida algo como lo siguiente:

 65572 /home/adept/photos/raw-to-burn/dir1
 68268 /home/adept/photos/raw-to-burn/dir2
 53372 /home/adept/photos/raw-to-burn/dir3
 713124  /home/adept/photos/raw-to-burn/dir4
 437952  /home/adept/photos/raw-to-burn/dir5

Nuestra siguiente tarea es procesar esa entrada y convertirla en alguna representación interna más cómoda.

Para ello usaremos la poderosa librería de combinadores de parsers llamada "Parsec", que está presente en la mayoría de las implementaciones de Haskell.

Como muchas de las facilidades para ES que hemos visto en el capítulo anterior, esta libreria provee un conjunto de parsers básicos y medios de combinarlos para obtener construcciones de parseo más complejas.

A diferencia de otras herramientas en esta área (lex/yacc o JavaCC, para nombrar algunas), los parsers de Parsec no requieren una etapa de prepocesamiento previo. Dado que Haskell podeoms devolver funciones como resultado de funciones y de esta manera construir funciones a partir de "mero aire", no hay necesidad de una sintáxis separada para la descripción de parsers. Pero ya hicimos suficiente propaganda, hagamos algún parseo:

-- Tomado de 'cd-fit-2-1.hs'
import Text.ParserCombinators.Parsec

-- parseInput parsea la entrada de "du -sb", que consiste de muchas líneas,
-- cada una de ellas describe un solo directorio
parseInput = 
  do dirs <- many dirAndSize
     eof
     return dirs

-- El tipo de dato Dir tiene información sobre un solo directorio:
-- su tamaño y su nombre
data Dir = Dir Int String deriving Show

-- `dirAndSize` parsea la información de un solo directorio, que es:
-- un tamaño en bytes (número), algunos espacios, y luego el nombre del
-- directorio que llega hasta el fin de la línea
dirAndSize = 
  do size <- many1 digit
     spaces
     dir_name <- anyChar `manyTill` newline
     return (Dir (read size) dir_name)

Simplemente agrega esas líneas al inicio de "cd-fit.hs". Aquí vemos muchas cosas nuevas, y muchas que ya conocíamos.

Primero que nada, notemos la conocida construcción "do", que, como sabemos, se usa para combinar acciones de ES para producir nuevas acciones de ES. Aquí la usamos para combinar acciones de "parseo" en nuevas acciones de "parseo". Significa eso que "parseo" implica "hacer ES"? Absolutamente no. Es decir, tengo que admitir que te mentí - "do" no sólo se usa para combinar acciones de ES. "do" se usa para combinar cualquier tipo de las así llamadas acciones monádicas o valores monádicos.

Piensa en las mónadas como un "patrón de diseño" en el mundo funcional. Las mónadas son una forma de ocultar al usuario (programador) toda la maquinaria necesaria para que operen funcionalidades complejas.

Como habrás oido, Haskell no tiene ninguna noción de "assignación", "estado mutable" ni "variables", y es un "lenguaje funcional puro", que significa que toda función invocada con los mismos parámetros devolverá exactamente los mismos resultados. A la vez, "hacer ES" requiere acarrear descriptores de ficheros y sus estados y lidiar con errores de ES. El "parseo" requiere llevar registro de la posición en la entrada y lidiar con errores de parseo.

En ambos casos Hombres Sabios que Escribieron Librerías tomaron las precauciones por nosotros y nos ocultaron toda la complejidad, exponiendo la API de sus librerías (ES y parseo) en la forma de "acciones monádicas", que nosotros podemos combinar libremente como nos parezca.

Piensa en la programación con mónadas como hacer un remodelamiento con ayuda de un grupo de profesionales de remodelamiento. Tu describes secuencias de acciones en el papel (eso somos nosotros escribiendo con la notación "do"), y después, cuando es necesario, esa secuencia será evaluada por los profesionales ("en la mónada") lo que te dará el resultado final, ocultandote toda la complejidad subyacente (cómo preparar la pintuea, qué clavos elegir, etc).

Usemos el entorno interactivo de Haskell para descifrar todas las instrucciones que hemos escritos para la librería de parseo. Como es usual, iremos de arriba hacia abajo:

 *Main> :reload
 Ok, modules loaded: Main.
 *Main> :t parseInput
 parseInput :: GenParser Char st [Dir]
 *Main> :t dirAndSize
 dirAndSize :: GenParser Char st Dir
 *Main> 

Assuming (well, take my word for it) that "GenParser Char st" is our parsing monad, we could see that "parseInput", when evaluated, will produce a list of "Dir", and "dirAndSize", when evaluated, will produce "Dir". Assuming that "Dir" somehow represents information about single directory, that is pretty much what we wanted, isn't it?

Let's see what a "Dir" means. We defined datatype Dir as a record, which holds an Int and a String:

-- Taken from 'cd-fit-2-1.hs'
data Dir = Dir Int String deriving Show

In order to construct such records, we must use data constructor Dir:

 *Main> :t Dir 1 "foo"
 Dir 1 "foo" :: Dir

In order to reduce confusion for newbies, we could have written:

data Dir = D Int String deriving Show

, which would define datatype "Dir" with data constructor "D". However, traditionally name of the datatype and its constructor are chosen to be the same.

Clause "deriving Show" instructs the compiler to make enough code "behind the curtains" to make this datatype conform to the interface of the type class Show. We will explain type classes later, for now let's just say that this will allow us to "print" instances of "Dir".

Exercises:

  • examine types of "digit", "anyChar", "many", "many1" and "manyTill" to see how they are used to build more complex parsers from single ones.
  • compare types of "manyTill", "manyTill anyChar" and "manyTill anyChar newline". Note that "anyChar `manyTill` newline" is just another syntax sugar. Note that when function is supplied with less arguments that it actually needs, we get not a value, but a new function, which is called partial application.


OK. So, we combined a lot of primitive parsing actions to get ourselves a parser for output of "du -sb". How can we actually parse something? the Parsec library supplies us with function "parse":

 *Main> :t parse
 parse :: GenParser tok () a
 	 -> SourceName
 	 -> [tok]
 	 -> Either ParseError a
 *Main> :t parse parseInput
 parse parseInput :: SourceName -> [Char] -> Either ParseError [Dir]
 *Main> 

At first the type might be a bit cryptic, but once we supply "parse" with the parser we made, the compiler gets more information and presents us with a more concise type.

Stop and consider this for a moment. The compiler figured out type of the function without a single type annotation supplied by us! Imagine if a Java compiler deduced types for you, and you wouldn't have to specify types of arguments and return values of methods, ever.

OK, back to the code. We can observe that the "parser" is a function, which, given a parser, a name of the source file or channel (f.e. "stdin"), and source data (String, which is a list of "Char"s, which is written "[Char]"), will either produce parse error, or parse us a list of "Dir".

Datatype "Either" is an example of datatype whose constructor has name, different from the name of the datatype. In fact, "Either" has two constructors:

data Either a b = Left a | Right b

In order to understand better what does this mean consider the following example:

 *Main> :t Left 'a'
 Left 'a' :: Either Char b
 *Main> :t Right "aaa"
 Right "aaa" :: Either a [Char]
 *Main> 

You see that "Either" is a union (much like the C/C++ "union") which could hold value of one of the two distinct types. However, unlike C/C++ "union", when presented with value of type "Either Int Char" we could immediately see whether its an Int or a Char - by looking at the constructor which was used to produce the value. Such datatypes are called "tagged unions", and they are another power tool in the Haskell toolset.

Did you also notice that we provide "parse" with parser, which is a monadic value, but receive not a new monadic value, but a parsing result? That is because "parse" is an evaluator for "Parser" monad, much like the GHC or Hugs runtime is an evaluator for the IO monad. The function "parser" implements all monadic machinery: it tracks errors and positions in input, implements backtracking and lookahead, etc.

let's extend our "main" function to use "parse" and actually parse the input and show us the parsed data structures:

-- Taken from 'cd-fit-2-1.hs'
main = do input <- getContents
          putStrLn ("DEBUG: got input " ++ input)
          let dirs = case parse parseInput "stdin" input of
                          Left err -> error $ "Input:\n" ++ show input ++ 
                                              "\nError:\n" ++ show err
                          Right result -> result
          putStrLn "DEBUG: parsed:"; print dirs

Exercise:

  • In order to understand this snippet of code better, examine (with ghci or hugs) the difference between 'drop 1 ( drop 1 ( drop 1 ( drop 1 ( drop 1 "foobar" ))))' and 'drop 1 $ drop 1 $ drop 1 $ drop 1 $ drop 1 "foobar"'. Examine type of ($).
  • Try putStrLn "aaa" and print "aaa" and see the difference, examine their types.
  • Try print (Dir 1 "foo") and putStrLn (Dir 1 "foo"). Examine types of print and putStrLn to understand the behavior in both cases.

Let's try to run what we have so far:

 $ du -sb * | runhaskell ./cd-fit.hs
 
 DEBUG: got input 22325  Article.txt
 18928   Article.txt~
 1706    cd-fit.hs
 964     cd-fit.hs~
 61609   _darcs
 
 DEBUG: parsed:
 [Dir 22325 "Article.txt",Dir 18928 "Article.txt~",Dir 1706 "cd-fit.hs",Dir 964 "cd-fit.hs~",Dir 61609 "_darcs"]

Seems to be doing exactly as planned. Now let's try some erroneous input:

 $ echo "foo" | runhaskell cd-fit.hs
 DEBUG: got input foo
 
 DEBUG: parsed:
 *** Exception: Input:
 "foo\n"
 Error:
 "stdin" (line 1, column 1):
 unexpected "f"
 expecting digit or end of input

Seems to be doing fine.

If you followed advice to put your code under version control, you could now use "darcs whatsnew" or "darcs diff -u" to examine your changes to the previous version. Use "darcs record" to commit them. As an exercise, first record the changes "outside" of function "main" and then record the changes in "main". Do "darcs changes" to examine a list of changes you've recorded so far.

Chapter 3: Packing the knapsack and testing it with class, too (and don't forget your towel!)

Enough preliminaries already. let's go pack some CDs.

As you might already have recognized, our problem is a classical one. It is called a "knapsack problem" (google it up, if you don't know already what it is. There are more than 100000 links).

let's start from the greedy solution, but first let's slightly modify our "Dir" datatype to allow easy extraction of its components:

-- Taken from 'cd-fit-3-1.hs'
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

Exercise: examine types of "Dir", "dir_size" and "dir_name"


From now on, we could use "dir_size d" to get a size of directory, and "dir_name d" to get its name, provided that "d" is of type "Dir".

The Greedy algorithm sorts directories from the biggest down, and tries to puts them on CD one by one, until there is no room for more. We will need to track which directories we added to CD, so let's add another datatype, and code this simple packing algorithm:

-- Taken from 'cd-fit-3-1.hs'
import Data.List (sortBy)

-- DirPack holds a set of directories which are to be stored on single CD.
-- 'pack_size' could be calculated, but we will store it separately to reduce
-- amount of calculation
data DirPack = DirPack {pack_size::Int, dirs::[Dir]} deriving Show

-- For simplicity, let's assume that we deal with standard 700 Mb CDs for now
media_size = 700*1024*1024

-- Greedy packer tries to add directories one by one to initially empty 'DirPack'
greedy_pack dirs = foldl maybe_add_dir (DirPack 0 []) $ sortBy cmpSize dirs
  where
  cmpSize d1 d2 = compare (dir_size d1) (dir_size d2)

-- Helper function, which only adds directory "d" to the pack "p" when new
-- total size does not exceed media_size
maybe_add_dir p d =
  let new_size = pack_size p + dir_size d
      new_dirs = d:(dirs p)
      in if new_size > media_size then p else DirPack new_size new_dirs

I'll highlight the areas which you could explore on your own (using other nice tutorials out there, of which I especially recommend "Yet Another Haskell Tutorial" by Hal Daume):

  • We choose to import a single function "sortBy" from a module Data.List, not the whole thing.
  • Instead of coding case-by-case recursive definition of "greedy_pack", we go with higher-order approach, choosing "foldl" as a vehicle for list traversal. Examine its type. Other useful function from the same category are "map", "foldr", "scanl" and "scanr". Look them up!
  • To sort list of "Dir" by size only, we use custom sort function and parameterized sort - "sortBy". This sort of setup where the user may provide a custom "modifier" for a generic library function is quite common: look up "deleteBy", "deleteFirstsBy", "groupBy", "insertBy", "intersectBy", "maximumBy", "minimumBy", "sortBy", "unionBy".
  • To code the quite complex function "maybe_add_dir", we introduced several local definitions in the "let" clause, which we can reuse within the function body. We used a "where" clause in the "greedy_pack" function to achieve the same effect. Read about "let" and "where" clauses and the differences between them.
  • Note that in order to construct a new value of type "DirPack" (in function "maybe_add_dir") we haven't used the helper accessor functions "pack_size" and "dirs"

In order to actually use our greedy packer we must call it from our "main" function, so let's add a lines:

-- Taken from 'cd-fit-3-1.hs'
main = do ...
          -- compute solution and print it
          putStrLn "Solution:" ; print (greedy_pack dirs)

Verify integrity of our definitions by (re)loading our code in ghci. Compiles? Thought so :) Now, do "darcs record" and add some sensible commit message.

Now it is time to test our creation. We could do it by actually running it in the wild like this:

 $ du -sb ~/DOWNLOADS/* | runhaskell ./cd-fit.hs

This will prove that our code seems to be working. At least, this once. How about establishing with reasonable degree of certainty that our code, parts and the whole, works properly, and doing so in re-usable manner? In other words, how about writing some test?

Java programmers used to JUnit probably thought about screens of boiler-plate code and hand-coded method invocations. Never fear, we will not do anything as silly :)

Enter QuickCheck.

QuickCheck is a tool to do automated testing of your functions using (semi)random input data. In the spirit of "100b of code examples is worth 1kb of praise" let's show the code for testing the following property: An attempt to pack directories returned by "greedy_pack" should return "DirPack" of exactly the same pack:

-- Taken from 'cd-fit-3-2.hs'
import Test.QuickCheck
import Control.Monad (liftM2)

-- We must teach QuickCheck how to generate arbitrary "Dir"s
instance Arbitrary Dir where
  -- Let's just skip "coarbitrary" for now, ok? 
  -- I promise, we will get back to it later :)
  coarbitrary = undefined
  -- We generate arbitrary "Dir" by generating random size and random name
  -- and stuffing them inside "Dir"
  arbitrary = liftM2 Dir gen_size gen_name
          -- Generate random size between 10 and 1400 Mb
    where gen_size = do s <- choose (10,1400)
                        return (s*1024*1024)
          -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" 
          gen_name = do n <- choose (1,300)
                        sequence $ take (n*10+1) $ repeat (elements "fubar/")

-- For convenience and by tradition, all QuickCheck tests begin with prefix "prop_".
-- Assume that "ds" will be a random list of "Dir"s and code your test.
prop_greedy_pack_is_fixpoint ds =
  let pack = greedy_pack ds 
      in pack_size pack == pack_size (greedy_pack (dirs pack))

let's run the test, after which I'll explain how it all works:

 Prelude> :r
 Compiling Main             ( ./cd-fit.hs, interpreted )
 Ok, modules loaded: Main.
 *Main> quickCheck prop_greedy_pack_is_fixpoint
 [numbers spinning]
 OK, passed 100 tests.
 *Main> 

We've just seen our "greedy_pack" run on a 100 completely (well, almost completely) random lists of "Dir"s, and it seems that property indeed holds.

let's dissect the code. The most intriguing part is "instance Arbitrary Dir where", which declares that "Dir" is an instance of typeclass "Arbitrary". Whoa, that's a whole lot of unknown words! :) Let's slow down a bit.

What is a typeclass? A typeclass is a Haskell way of dealing with the following situation: suppose that you are writing a library of usefull functions and you dont know in advance how exactly they will be used, so you want to make them generic. Now, on one hand you dont want to restrict your users to certain type (e.g. String). On the other hand, you want to enforce the convention that arguments for your function must satisfy a certain set of constraints. That is where typeclass comes in handy.

Think of typeclass as a contract (or "interface", in Java terms) that your type must fulfill in order to be admitted as an argument to certain functions.

Let's examine the typeclass "Arbitrary":

 *Main> :i Arbitrary
 class Arbitrary a where
   arbitrary :: Gen a
   coarbitrary :: a -> Gen b -> Gen b
   	-- Imported from Test.QuickCheck
 instance Arbitrary Dir
   	-- Defined at ./cd-fit.hs:61:0
 instance Arbitrary Bool 	-- Imported from Test.QuickCheck
 instance Arbitrary Double 	-- Imported from Test.QuickCheck
 instance Arbitrary Float 	-- Imported from Test.QuickCheck
 instance Arbitrary Int 	-- Imported from Test.QuickCheck
 instance Arbitrary Integer 	-- Imported from Test.QuickCheck
 -- rest skipped --

It could be read this way: "Any type (let's name it 'a') could be a member of the class Arbitrary as soon as we define two functions for it: "arbitrary" and "coarbitrary", with signatures shown. For types Dir, Bool, Double, Float, Int and Integer such definitions were provided, so all those types are instance of class Arbitrary".

Now, if you write a function which operates on its arguments solely by means of "arbitrary" and "coarbitrary", you can be sure that this function will work on any type which is an instance of "Arbitrary"!

let's say it again. Someone (maybe even you) writes the code (API or library), which requires that input values implement certain interfaces, which is described in terms of functions. Once you show how your type implements this interface you are free to use API or library.

Consider the function "sort" from standard library:

 *Main> :t Data.List.sort
 Data.List.sort :: (Ord a) => [a] -> [a]

We see that it sorts lists of any values which are instance of typeclass "Ord". Let's examine that class:

 *Main> :i Ord
 class Eq a => Ord a where
   compare :: a -> a -> Ordering
   (<) :: a -> a -> Bool
   (>=) :: a -> a -> Bool
   (>) :: a -> a -> Bool
   (<=) :: a -> a -> Bool
   max :: a -> a -> a
   min :: a -> a -> a
 -- skip
 instance Ord Double 	-- Imported from GHC.Float
 instance Ord Float 	-- Imported from GHC.Float
 instance Ord Bool 	-- Imported from GHC.Base
 instance Ord Char 	-- Imported from GHC.Base
 instance Ord Integer 	-- Imported from GHC.Num
 instance Ord Int 	-- Imported from GHC.Base
 -- skip
 *Main> 

We see a couple of interesting things: first, there is an additional requirement listed: in order to be an instance of "Ord", type must first be an instance of typeclass "Eq". Then, we see that there is an awful lot of functions to define in order to be an instance of "Ord". Wait a second, isn't it silly to define both (<) and (>) when one could be expressed via another?

Right you are! Usually, typeclass contains several "default" implementation for its functions, when it is possible to express them through each other (as it is with "Ord"). In this case it is possible to supply only a minimal definition (which in case of "Ord" consists of any single function) and others will be automatically derived. If you supplied fewer functions than are required for minimal implementation, the compiler/interpreter will say so and explain which functions you still have to define.

Once again, we see that a lot of types are already instances of typeclass Ord, and thus we are able to sort them.

Now, let's take a look back to the definition of "Dir":

-- Taken from 'cd-fit-3-2.hs'
data Dir = Dir {dir_size::Int, dir_name::String} deriving Show

See that "deriving" clause? It instructs the compiler to automatically derive code to make "Dir" an instance of typeclass Show. The compiler knows about a bunch of standard typeclasses (Eq, Ord, Show, Enum, Bound, Typeable to name a few) and knows how to make a type into a "suitably good" instance of any of them. If you want to derive instances of more than one typeclass, say it this way: "deriving (Eq,Ord,Show)". Voila! Now we can compare, sort and print data of that type!

Side note for Java programmers: just imagine java compiler which derives code for "implements Storable" for you...

Side note for C++ programmers: just imagine that deep copy constructors are being written for you by compiler....


Exercises:

  • Examine typeclasses Eq and Show
  • Examine types of (==) and "print"
  • Try to make "Dir" instance of "Eq"

OK, back to our tests. So, what we have had to do in order to make "Dir" an instance of "Arbitrary"? Minimal definition consists of "arbitrary". Let's examine it up close:

 *Main> :t arbitrary
 arbitrary :: (Arbitrary a) => Gen a

See that "Gen a"? Reminds you of something? Right! Think of "IO a" and "Parser a" which we've seen already. This is yet another example of action-returning function, which could be used inside "do"-notation. (You might ask yourself, wouldn't it be useful to generalize that convenient concept of actions and "do"? Of course! It is already done, the concept is called "Monad" and we will talk about it in Chapter 400 :) )

Since 'a' here is a type variable which is an instance of "Arbitrary", we could substitute "Dir" here. So, how we can make and return an action of type "Gen Dir"?

Let's look at the code:

-- Taken from 'cd-fit-3-2.hs'
arbitrary = liftM2 Dir gen_size gen_name
        -- Generate random size between 10 and 1400 Mb
  where gen_size = do s <- choose (10,1400)
                      return (s*1024*1024)
        -- Generate random name 1 to 300 chars long, consisting of symbols "fubar/" 
        gen_name = do n <- choose (1,300)
                      sequence $ take (n*10+1) $ repeat (elements "fubar/")

We have used the library-provided functions "choose" and "elements" to build up "gen_size :: Gen Int" and "gen_name :: Gen String" (exercise: don't take my word on that. Find a way to check types of "gen_name" and "gen_size"). Since "Int" and "String" are components of "Dir", we sure must be able to use "Gen Int" and "Gen String" to build "Gen Dir". But where is the "do" block for that? There is none, and there is only single call to "liftM2".

Let's examine it:

 *Main> :t liftM2
 liftM2 :: (Monad m) => (a1 -> a2 -> r) -> m a1 -> m a2 -> m r

Kind of scary, right? Let's provide typechecker with more context:

 *Main> :t liftM2 Dir
 liftM2 Dir :: (Monad m) => m Int -> m String -> m Dir

Since you already heard that "Gen" is a "Monad", you could substitute "Gen" for "m" here, obtaining "liftM2 Dir :: (Monad Gen) => Gen Int -> Gen String -> Gen Dir". Exactly what we wanted!

Consider "liftM2" to be "advanced topic" of this chapter (which we will cover later) and just note for now that:

  • "2" is a number of arguments for data constructor "Dir" and we have used "liftM2" to construct "Gen Dir" out of "Dir"
  • There are also "liftM", "liftM3", "liftM4", "liftM5"
  • "liftM2" is defined as "liftM2 f a1 a2 = do x<-a1; y<-a2; return (f x y)"

Hopefully, this will all make sense after you read it for the third time ;)

Oh, by the way - dont forget to "darcs record" your changes!

Chapter 4: REALLY packing the knapsack this time

In this chapter we are going to write another not-so-trivial packing method, compare packing methods efficiency, and learn something new about debugging and profiling of the Haskell programs along the way.

It might not be immediately obvious whether our packing algorithm is effective, and if yes - in which particular way? Whether it's runtime, memory consumption or result are of sufficient quality, are there any alternative algorithms, and how do they compare to each other?

Let's code another solution to the knapsack packing problem, called the "dynamic programming method" and put both variants to the test.

This time, I'll not dissect the listing and explain it bit by bit. Instead, comments are provided in the code:

-- Taken from 'cd-fit-4-1.hs'
----------------------------------------------------------------------------------
-- Dynamic programming solution to the knapsack (or, rather, disk) packing problem
--
-- Let the `bestDisk x' be the "most tightly packed" disk of total 
-- size no more than `x'.
precomputeDisksFor :: [Dir] -> [DirPack]
precomputeDisksFor dirs = 
      -- By calculating `bestDisk' for all possible disk sizes, we could
      -- obtain a solution for particular case by simple lookup in our list of
      -- solutions :)
  let precomp = map bestDisk [0..] 

      -- How to calculate `bestDisk'? Lets opt for a recursive definition:
      -- Recursion base: best packed disk of size 0 is empty
      bestDisk 0 = DirPack 0 []
      -- Recursion step: for size `limit`, bigger than 0, best packed disk is
      -- computed as follows:
      bestDisk limit = 
         -- 1. Take all non-empty dirs that could possibly fit to that disk by itself.
         --    Consider them one by one. Let the size of particular dir be `dir_size d'.
         --    Let's add it to the best-packed disk of size <= (limit - dir_size d), thus
         --    producing the disk of size <= limit. Lets do that for all "candidate" dirs that
         --    are not yet on our disk:
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs
                                                , dir_size d > 0
                                                , let (DirPack s ds)=precomp!!(limit - dir_size d)
                                                , d `notElem` ds
              ] of
                -- We either fail to add any dirs (probably, because all of them too big).
                -- Well, just report that disk must be left empty:
                [] -> DirPack 0 []
                -- Or we produce some alternative packings. Let's choose the best of them all:
                packs -> maximumBy cmpSize packs

      cmpSize a b = compare (pack_size a) (pack_size b)

      in precomp

-- When we precomputed disk of all possible sizes for the given set of dirs, solution to 
-- particular problem is simple: just take the solution for the required 'media_size' and
-- that's it!
dynamic_pack dirs = (precomputeDisksFor dirs)!!media_size

Notice that it took almost the same amount of text to describe algorithm and to write implementation for it. Nice, eh?


Exercises:

  • Make all necessary amendments to the previously written code to make this example compile. Hints: browse modules Data.List and Data.Ix for functions that are "missing" - maybe you will find them there (use ":browse Module.Name" at ghci prompt). Have you had to define some new instances of some classes? How did you do that?
  • [ other_function local_binding | x <- some_list, x > 0, let local_binding = some_function x ] is called a "list comprehension". This is another example of "syntactic sugar", which could lead to nicely readable code, but, when abused, could lead to syntactic caries :) Do you understand what does this sample do: let solve x = [ y | x <- [0..], y<-[0..], y == x * x ]? Could write (with help of decent tutorial) write de-sugared version of this? (Yes, I know that finding a square root does not require list traversals, but for the sake of self-education try and do it)
  • Notice that in order to code quite complex implementation of precomputeDisksFor we split it up in several smaller pieces and put them as a local bindings inside let clause.
  • Notice that we use pattern matching to both define bestKnap on case-by-case basis and to "peer into" (de-construct) DirPack in the let (DirPack s ds)=precomp!!(limit - dir_size d) line
  • Notice how we use function composition to compose complex condition to filter the list of dirs

Before we move any further, let's do a small cosmetic change to our code. Right now our solution uses 'Int' to store directory size. In Haskell, 'Int' is a platform-dependent integer, which imposes certain limitations on the values of this type. Attempt to compute the value of type 'Int' that exceeds the bounds will result in overflow error. Standard Haskell libraries have special typeclass Bounded, which allows to define and examine such bounds:

 Prelude> :i Bounded     
 class Bounded a where
   minBound :: a
   maxBound :: a
 -- skip --
 instance Bounded Int    -- Imported from GHC.Enum

We see that 'Int' is indeed bounded. Let's examine the bounds:

 Prelude> minBound :: Int    
 -2147483648
 Prelude> maxBound :: Int
 2147483647
 Prelude>  
 

Those of you who are C-literate, will spot at once that in this case the 'Int' is so-called "signed 32-bit integer", which means that we would run into errors trying to operate on directories/directory packs which are bigger than 2 Gb.

Luckily for us, Haskell has integers of arbitrary precision (limited only by the amount of available memory). The appropriate type is called 'Integer':

 Prelude> (2^50) :: Int
 0 -- overflow
 Prelude> (2^50) :: Integer
 1125899906842624 -- no overflow
 Prelude>
 

Lets change definitions of 'Dir' and 'DirPack' to allow for bigger directory sizes:

-- Taken from 'cd-fit-4-2.hs'
data Dir = Dir {dir_size::Integer, dir_name::String} deriving (Eq,Show)
data DirPack = DirPack {pack_size::Integer, dirs::[Dir]} deriving Show

Try to compile the code or load it into ghci. You will get the following errors:

 cd-fit-4-2.hs:73:79:
     Couldn't match `Int' against `Integer'
       Expected type: Int
       Inferred type: Integer
     In the expression: limit - (dir_size d)
     In the second argument of `(!!)', namely `(limit - (dir_size d))'
 
 cd-fit-4-2.hs:89:47:
     Couldn't match `Int' against `Integer'
       Expected type: Int
       Inferred type: Integer
     In the second argument of `(!!)', namely `media_size'
     In the definition of `dynamic_pack':
         dynamic_pack dirs = (precomputeDisksFor dirs) !! media_size
 

It seems like Haskell have some troubles using 'Integer' with '(!!)'. Let's see why:

 Prelude> :t (!!)
 (!!) :: [a] -> Int -> a

Seems like definition of '(!!)' demands that index will be 'Int', not 'Integer'. Haskell never converts any type to some other type automatically - programmer have to explicitly ask for that.

I will not repeat the section "Standard Haskell Classes" from the Haskell Report and explain, why typeclasses for various numbers organized the way they are organized. I will just say that standard typeclass Num demands that numeric types implement method fromInteger:

 Prelude> :i Num
 class (Eq a, Show a) => Num a where
   (+) :: a -> a -> a
   (*) :: a -> a -> a
   (-) :: a -> a -> a
   negate :: a -> a
   abs :: a -> a
   signum :: a -> a
   fromInteger :: Integer -> a
         -- Imported from GHC.Num
 instance Num Float      -- Imported from GHC.Float
 instance Num Double     -- Imported from GHC.Float
 instance Num Integer    -- Imported from GHC.Num
 instance Num Int        -- Imported from GHC.Num
 

We see that Integer is a member of typeclass Num, thus we could use fromInteger to make the type errors go away:

-- Taken from 'cd-fit-4-2.hs'
-- snip
         case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs
                                                , dir_size d > 0
                                                , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))
                                                , d `notElem` ds
              ] of
-- snip
dynamic_pack dirs = (precomputeDisksFor dirs)!!(fromInteger media_size)
-- snip

Type errors went away, but careful reader will spot at once that when expression (limit - dir_size d) will exceed the bounds for Int, overflow will occur, and we will not access the correct list element. Dont worry, we will deal with this in a short while.

Now, lets code the QuickCheck test for this function along the lines of the test for greedy_pack:

-- Taken from 'cd-fit-4-2.hs'
prop_dynamic_pack_is_fixpoint ds =
  let pack = dynamic_pack ds 
      in pack_size pack == pack_size (dynamic_pack (dirs pack))

Now, lets try to run (DONT PANIC and save all you work in other applications first!):

 *Main> quickCheck prop_dynamic_pack_is_fixpoint

Now, you took my advice seriously, don't you? And you did have your Ctrl-C handy, didn't you? Most probably, the attempt to run the test resulted in all your memory being taken by ghci process, which you hopefully interrupted soon enough by pressing Ctrl-C.

What happened? Who ate all the memory? How to debug this problem? GHC comes with profiling abilities, but we could not use them - they produce report after program terminates, and our doesn't seem to do so without consuming several terabytes of memory first. Still, there is a lot of room for maneuver.

Let's see. Since the have called dynamic_pack and it ate all the memory, let's not do this again. Instead, let's see what this function does and tweak it a bit to explore it's behavior.

Since we already know that random lists of "Dir"s generated for our QuickCheck tests are of modest size (after all, greedy_pack munches them without significant memory consumption), the size of the input most probably is not the issue. However, dynamic_pack_is_fixpoint is building quite a huge list internally (via precomputeDisksFor). Could this be a problem?

Let's turn the timing/memory stats on (":set +s" on ghci prompt) and try to peek into various elements of list returned by precomputeDisksFor:

 Prelude> :l cd-fit.hs
 Compiling Main             ( cd-fit.hs, interpreted )
 Ok, modules loaded: Main.
 *Main> :set +s
 *Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 0
 DirPack {pack_size = 0, dirs = []}
 (0.06 secs, 1277972 bytes)
 *Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10
 DirPack {pack_size = 0, dirs = []}
 (0.00 secs, 0 bytes)
 *Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100
 DirPack {pack_size = 0, dirs = []}
 (0.01 secs, 1519064 bytes)
 *Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 1000
 DirPack {pack_size = 0, dirs = []}
 (0.03 secs, 1081808 bytes)
 *Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 10000
 DirPack {pack_size = 0, dirs = []}
 (1.39 secs, 12714088 bytes)
 *Main> (precomputeDisksFor [Dir 1 "aaa"]) !! 100000
 Interrupted.
 

Aha! This seems to be a problem, since computation of 100000 fails to terminate in "reasonable" time, and to think that we have tried to compute 700*1024*1024th element...

Lets modify our code a bit, to allow disk size to be tweaked:

-- Taken from 'cd-fit-4-3.hs'
dynamic_pack limit dirs = (precomputeDisksFor dirs)!!(fromInteger limit)

prop_dynamic_pack_is_fixpoint ds =
  let pack = dynamic_pack media_size ds 
      in pack_size pack == pack_size (dynamic_pack media_size (dirs pack))

prop_dynamic_pack_small_disk ds =
  let pack = dynamic_pack 50000 ds
      in pack_size pack == pack_size (dynamic_pack 50000 (dirs pack))

-- rename "old" main to "moin"
main = quickCheck prop_dynamic_pack_small_disk

Compute a profiling version of you code with ghc -O --make -prof -auto-all -o cd-fit cd-fit.hs and run it like this:

 $ ./cd-fit +RTS -p
 OK, passed 100 tests.

First thing, note that our code satisfies at least one simple property. Good. Now let's examine profile. Look into file "cd-fit.prof", which was produced in your current directory.

Most probably, you'll see something like this:

            cd-fit +RTS -p -RTS
 
         total time  =        2.18 secs   (109 ticks @ 20 ms)
         total alloc = 721,433,008 bytes  (excludes profiling overheads)
 
 COST CENTRE                    MODULE               %time %alloc
 
 precomputeDisksFor             Main                  88.1   99.8
 dynamic_pack                   Main                  11.0    0.0
                                                                                                individual    inherited
 COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc
 
 MAIN                     MAIN                                                   1           0   0.0    0.0   100.0  100.0
  CAF                     Main                                                 174          11   0.9    0.2   100.0  100.0
   prop_dynamic_pack_small_disk Main                                           181         100   0.0    0.0    99.1   99.8
    dynamic_pack          Main                                                 182         200  11.0    0.0    99.1   99.8
     precomputeDisksFor   Main                                                 183         200  88.1   99.8    88.1   99.8
   main                   Main                                                 180           1   0.0    0.0     0.0    0.0

Examine column of "individual %alloc". As we thought, all memory was allocated within precomputeDisksFor. However, amount of memory allocated (more than 700 mb, according to the line "total alloc") seems to be a little too much for our simple task. We will dig deeper and find where we a wasting it.

Let's examine memory consumption a little closer via so-called "heap profiles". Run ./cd-fit +RTS -hb. This produces "biographical heap profile", which tells us how various parts of the memory were used during the program run time. Heap profile was saved to "cd-fit.hp". It is next to impossible to read and comprehend it as is, so use "hp2ps cd-fit.hp" to produce a nice PostScript picture which is worth a thousand words. View it with "gv" or "ghostview" or "full Adobe Acrobat (not Reader)". (This and subsequent pictures are not attached here).

Notice that most of the graph is taken up by region marked as "VOID". This means that memory allocated was never used. Notice that there is no areas marked as "USE", "LAG" or "DRAG". Seems like our program hardly uses any of the allocated memory at all. Wait a minute! How could that be? Surely it must use something when it packs to the imaginary disks of 50000 bytes those random-generated directories which are 10 to 1400 Mb in size.... Oops. Severe size mismatch. We should have spotted it earlier, when we were timing precomputeDisksFor. Scroll back and observe how each run returned the very same result - empty directory set.

Our random directories are too big, but nevertheless code spends time and memory trying to "pack" them. Obviously, precomputeDisksFor (which is responsible for 90% of total memory consumption and run time) is flawed in some way.

Let's take a closer look at what takes up so much memory. Run ./cd-fit +RTS -h -hbvoid and produce PostScript picture for this memory profile. This will give us detailed breakdown of all memory whose "biography" shows that it's been "VOID" (unused). My picture (and I presume that yours as well) shows that VOID memory comprises of "thunks" labeled "precomputeDisksFor/pre...". We could safely assume that second word would be "precomp" (You wonder why? Look again at the code and try to find function named "pre.*" which is called from inside precomputeDisksFor)

This means that memory has been taken by the list generated inside "precomp". Rumor has it that memory leaks with Haskell are caused by either too little laziness or too much laziness. It seems like we have too little laziness here: we evaluate more elements of the list that we actually need and keep them from being garbage-collected.

Note how we look up element from "precomp" in this piece of code:

case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs
                                       , dir_size d > 0
                                       , let (DirPack s ds)=precomp!!(fromInteger (limit - dir_size d))
                                       , d `notElem` ds


Obviously, the whole list generated by "precomp" must be kept in memory for such lookups, since we can't be sure that some element could be garbage collected and will not be needed again.

Let's rewrite the code to eliminate the list:

-- Taken from 'cd-fit-4-4.hs'
-- Let the `bestDisk x' be the "most tightly packed" disk of total 
-- size no more than `x'.
-- How to calculate `bestDisk'? Lets opt for a recursive definition:
-- Recursion base: best packed disk of size 0 is empty and best-packed
-- disk for empty list of directories on it is also empty.
bestDisk 0 _  = DirPack 0 []
bestDisk _ [] = DirPack 0 []
-- Recursion step: for size `limit`, bigger than 0, best packed disk is
-- comptued as follows:
bestDisk limit dirs =
   -- Take all non-empty dirs that could possibly fit to that disk by itself.
   -- Consider them one by one. Let the size of particular dir be `dir_size d'.
   -- Let's add it to the best-packed disk of size <= (limit - dir_size d), thus
   -- producing the disk of size <= limit. Lets do that for all "candidate" dirs that
   -- are not yet on our disk:
   case [ DirPack (dir_size d + s) (d:ds) | d <- filter ( (inRange (1,limit)).dir_size ) dirs
                                          , dir_size d > 0
                                          , let (DirPack s ds)= bestDisk (limit - dir_size d) dirs 
                                          , d `notElem` ds
        ] of
          -- We either fail to add any dirs (probably, because all of them too big).
          -- Well, just report that disk must be left empty:
          [] -> DirPack 0 []
          -- Or we produce some alternative packings. Let's choose the best of them all:
          packs -> maximumBy cmpSize packs

cmpSize a b = compare (pack_size a) (pack_size b)

dynamic_pack limit dirs = bestDisk limit dirs


Compile the profiling version of this code and obtain the overall execution profile (with "+RTS -p"). You'll get something like this:

            cd-fit +RTS -p -RTS
 
         total time  =        0.00 secs   (0 ticks @ 20 ms)
         total alloc =   1,129,520 bytes  (excludes profiling overheads)
 
 COST CENTRE                    MODULE               %time %alloc
 
 CAF                            GHC.Float              0.0    4.4
 main                           Main                   0.0   93.9
 
                                                                                                individual    inherited
 COST CENTRE              MODULE                                               no.    entries  %time %alloc   %time %alloc
 MAIN                     MAIN                                                   1           0   0.0    0.0     0.0  100.0
  main                    Main                                                 180           1   0.0   93.9     0.0   94.2
   prop_dynamic_pack_small_disk Main                                                 181         100   0.0    0.0     0.0    0.3
    dynamic_pack          Main                                                 182         200   0.0    0.2     0.0    0.3
     bestDisk             Main                                                 183         200   0.0    0.1     0.0    0.1

We achieved the major improvement: memory consumption is reduced by factor of 700! Now we could test the code on the "real task" - change the code to run the test for packing the full-sized disk:

main = quickCheck prop_dynamic_pack_is_fixpoint

Compile with profiling and run (with "+RTS -p"). If you are not lucky and a considerably big test set would be randomly generated for your runs, you'll have to wait. And wait even more. And more.

Go make some tea. Drink it. Read some Tolstoi (Do you have "War and peace" handy?). Chances are that by the time you are done with Tolstoi, program will still be running (just take my word on it, don't check).

If you are lucky, your program will finish fast enough and leave you with profile. According to a profile, program spends 99% of its time inside bestDisk. Could we speed up bestDisk somehow?

Note that bestDisk performs several simple calculation for which it must call itself. However, it is done rather inefficiently - each time we pass to bestDisk the exact same set of directories as it was called with, even if we have already "packed" some of them. Let's amend this:

-- Taken from 'cd-fit-4-5.hs'
case [ DirPack (dir_size d + s) (d:ds) | let small_enough = filter ( (inRange (0,limit)).dir_size ) dirs
                                       , d <- small_enough
                                       , dir_size d > 0
                                       , let (DirPack s ds)= bestDisk (limit - dir_size d) (delete d small_enough)
     ] of

Recompile and run again. Runtimes could be lengthy, but bearable, and number of times bestDisk is called (according to the profile) should decrease significantly.

Finally, let's compare both packing algorithms. Intuitively, we feel that greedy algorithm should produce worse results, don't we? Lets put this feeling to the test:

-- Taken from 'cd-fit-4-5.hs'
prop_greedy_pack_is_no_better_than_dynamic_pack ds =
  pack_size (greedy_pack ds) <= pack_size (dynamic_pack media_size ds)

Verify that it is indeed so by running quickCheck for this test several time. I feel that this concludes our knapsacking exercises.

Adventurous readers could continue further by implementing so-called "scaling" for dynamic_pack where we divide all directory sizes and medium size by the size of the smallest directory to proceed with smaller numbers (which promises faster runtimes).

Chapter 5: (Ab)using monads and destructing constructors for fun and profit

We already mentioned monads quite a few times. They are described in numerous articles and tutorial (See Chapter 400). It's hard to read a daily dose of any Haskell mailing list and not to come across a word "monad" a dozen times.

Since we already made quite a progress with Haskell, it's time we revisit the monads once again. I will let the other sources teach you theory behind the monads, overall usefulness of the concept, etc. Instead, I will focus on providing you with examples.

Let's take a part of the real world program which involves XML processing. We will work with XML tag attributes, which are essentially named values:

-- Taken from 'chapter5-1.hs'
type Attribute = (Name, AttValue)

'Name' is a plain string, and value could be either string or references (also strings) to another attributes which holds the actual value (now, this is not a valid XML thing, but for the sake of providing a nice example, let's accept this). Word "either" suggests that we use 'Either' datatype:

type AttValue = Either Value [Reference]
type Name = String
type Value = String
type Reference = String

-- Sample list of simple attributes:
simple_attrs = [ ( "xml:lang", Left "en" )
               , ( "xmlns", Left "jabber:client" )
               , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]

-- Sample list of attributes with references:
complex_attrs = [ ( "xml:lang", Right ["lang"] )
                , ( "lang", Left "en" )
                , ( "xmlns", Right ["ns","subns"] )
                , ( "ns", Left "jabber" )
                , ( "subns", Left "client" )
                , ( "xmlns:stream", Left "http://etherx.jabber.org/streams" ) ]

Our task is: to write a function that will look up a value of attribute by it's name from the given list of attributes. When attribute contains reference(s), we resolve them (looking for the referenced attribute in the same list) and concatenate their values, separated by semicolon. Thus, lookup of attribute "xmlns" form both sample sets of attributes should return the same value.

Following the example set by the Data.List.lookup from the standard libraries, we will call our function lookupAttr and it will return Maybe Value, allowing for lookup errors:

-- Taken from 'chapter5-1.hs'
lookupAttr :: Name -> [Attribute] -> Maybe Value
-- Since we dont have code for 'lookupAttr', but want
-- to compile code already, we use the function 'undefined' to
-- provide default, "always-fail-with-runtime-error" function body.
lookupAttr = undefined

Let's try to code lookupAttr using lookup in a very straightforward way:

-- Taken from 'chapter5-1.hs'
import Data.List

lookupAttr :: Name -> [Attribute] -> Maybe Value
lookupAttr nm attrs = 
  -- First, we lookup 'Maybe AttValue' by name and
  -- check whether we are successful:
  case (lookup nm attrs) of
       -- Pass the lookup error through.
       Nothing   -> Nothing 
       -- If given name exist, see if it is value of reference:
       Just attv -> case attv of
                         -- It's a value. Return it!
                         Left val -> Just val
                         -- It's a list of references :(
                         -- We have to look them up, accounting for
                         -- possible failures.
                         -- First, we will perform lookup of all references ...
                         Right refs -> let vals = [ lookupAttr ref attrs | ref <- refs ]
                                           -- .. then, we will exclude lookup failures
                                           wo_failures = filter (/=Nothing) vals
                                           -- ... find a way to remove annoying 'Just' wrapper
                                           stripJust (Just v) = v
                                           -- ... use it to extract all lookup results as strings
                                           strings = map stripJust wo_failures
                                           in
                                           -- ... finally, combine them into single String. 
                                           -- If all lookups failed, we should pass failure to caller.
                                           case null strings of
                                             True  -> Nothing
                                             False -> Just (concat (intersperse ":" strings))

Testing:

 *Main> lookupAttr "xmlns" complex_attrs
 Just "jabber:client"
 *Main> lookupAttr "xmlns" simple_attrs
 Just "jabber:client"
 *Main>

It works, but ... It seems strange that such a boatload of code required for quite simple task. If you examine the code closely, you'll see that the code bloat is caused by:

  • the fact that after each step we check whether the error occurred
  • unwrapping Strings from Maybe and Either data constructors and wrapping them back.

At this point C++/Java programmers would say that since we just pass errors upstream, all those cases could be replaced by the single "try ... catch ..." block, and they would be right. Does this mean that Haskell programmers are reduced to using "case"s, which were already obsolete 10 years ago?

Monads to the rescue! As you can read elsewhere (see section 400), monads are used in advanced ways to construct computations from other computations. Just what we need - we want to combine several simple steps (lookup value, lookup reference, ...) into function lookupAttr in a way that would take into account possible failures.

Lets start from the code and dissect in afterwards:

-- Taken from 'chapter5-2.hs'
import Control.Monad

lookupAttr' nm attrs = do
  -- First, we lookup 'AttValue' by name
  attv <- lookup nm attrs
  -- See if it is value of reference:
  case attv of
    -- It's a value. Return it!
    Left val -> Just val
    -- It's a list of references :(
    -- We have to look them up, accounting for
    -- possible failures.
    -- First, we will perform lookup of all references ...
    Right refs -> do vals <- sequence $ map (flip lookupAttr' attrs) refs
                     -- ... since all failures are already excluded by "monad magic",
                     -- ... all all 'Just's have been removed likewise,
                     -- ... we just combine values into single String,
                     -- ... and return failure if it is empty. 
                     guard (not (null vals))
                     return (concat (intersperse ":" vals))

Exercise: compile the code, test that lookupAttr and lookupAttr' really behave in the same way. Try to write a QuickCheck test for that, defining the instance Arbitrary Name such that arbitrary names will be taken from names available in simple_attrs.

Well, back to the story. Noticed the drastic reduction in code size? If you drop comments, the code will occupy mere 7 lines instead of 13 - almost two-fold reduction. How we achieved this?

First, notice that we never ever check whether some computation returns Nothing anymore. Yet, try to lookup some non-existing attribute name, and lookupAttr' will return Nothing. How does this happen? Secret lies in the fact that type constructor Maybe is a "monad".

We use keyword do to indicate that following block of code is a sequence of monadic actions, where monadic magic have to happen when we use '<-', 'return' or move from one action to another.

Different monads have different magic. Library code says that type constructor Maybe is such a monad that we could use <- to "extract" values from wrapper Just and use return to put them back in form of Just some_value. When we move from one action in the "do" block to another a check happens. If the action returned Nothing, all subsequent computations will be skipped and the whole "do" block will return Nothing.

Try this to understand it all better:

*Main> let foo x = do v <- x; return (v+1) in foo (Just 5)
Just 6
*Main> let foo x = do v <- x; return (v+1) in foo Nothing 
Nothing
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo (Just 'a')
Just 97
*Main> let foo x = do v <- x; return (Data.Char.ord v) in foo Nothing   
Nothing
*Main>

Do not mind sequence and guard just for now - we will get to them in the little while.

Since we already removed one reason for code bloat, it is time to deal with the other one. Notice that we have to use case to deconstruct the value of type Either Value [Reference]. Surely we are not the first to do this, and such use case have to be quite a common one.

Indeed, there is a simple remedy for our case, and it is called either:

 *Main> :t either
 either :: (a -> c) -> (b -> c) -> Either a b -> c

Scary type signature, but here are examples to help you grok it:

 *Main> :t either (+1) (length)              
 either (+1) (length) :: Either Int [a] -> Int
 *Main> either (+1) (length) (Left 5)
 6
 *Main> either (+1) (length) (Right "foo")
 3
 *Main> 

Seems like this is exactly our case. Let's replace the case with invocation of either:

-- Taken from 'chapter5-3.hs'
lookupAttr'' nm attrs = do
  attv <- lookup nm attrs
  either Just (dereference attrs) attv
  where
    dereference attrs refs = do 
      vals <- sequence $ map (flip lookupAttr'' attrs) refs
      guard (not (null vals))
      return (concat (intersperse ":" vals))

It keeps getting better and better :)

Now, as semi-exercise, try to understand the meaning of "sequence", "guard" and "flip" looking at the following ghci sessions:

 *Main> :t sequence
 sequence :: (Monad m) => [m a] -> m [a]
 *Main> :t [Just 'a', Just 'b', Nothing, Just 'c']
 [Just 'a', Just 'b', Nothing, Just 'c'] :: [Maybe Char]
 *Main> :t sequence [Just 'a', Just 'b', Nothing, Just 'c']
 sequence [Just 'a', Just 'b', Nothing, Just 'c'] :: Maybe [Char]
 *Main> sequence [Just 'a', Just 'b', Nothing, Just 'c']
 Nothing
 *Main> sequence [Just 'a', Just 'b', Nothing]
 Nothing
 *Main> sequence [Just 'a', Just 'b']
 Just "ab"
 *Main> :t [putStrLn "a", putStrLn "b"]
 [putStrLn "a", putStrLn "b"] :: [IO ()]
 *Main> :t sequence [putStrLn "a", putStrLn "b"]
 sequence [putStrLn "a", putStrLn "b"] :: IO [()]
 *Main> sequence [putStrLn "a", putStrLn "b"]
 a
 b
 *Main> :t [putStrLn "a", fail "stop here", putStrLn "b"]
 [putStrLn "a", fail "stop here", putStrLn "b"] :: [IO ()]
 *Main> :t sequence [putStrLn "a", fail "stop here", putStrLn "b"]
 sequence [putStrLn "a", fail "stop here", putStrLn "b"] :: IO [()]
 *Main> sequence [putStrLn "a", fail "stop here", putStrLn "b"]
 a
 *** Exception: user error (stop here)

Notice that for monad Maybe sequence continues execution until the first Nothing. The same behavior could be observed for IO monad. Take into account that different behaviors are not hardcoded into the definition of sequence!

Now, let's examine guard:

 *Main> let foo x = do v <- x; guard (v/=5); return (v+1) in map foo [Just 4, Just 5, Just 6] 
 [Just 5,Nothing,Just 7]

As you can see, it's just a simple way to "stop" execution at some condition.

If you have been hooked on monads, I urge you to read "All About Monads" right now (link in Chapter 400).

Chapter 6: Where do you want to go tomorrow?

As the name implies, the author is open for proposals - where should we go next? I had networking + xml/xmpp in mind, but it might be too heavy and too narrow for most of the readers.

What do you think? Drop me a line.

Chapter 400: Monads up close

Read this wikibook chapter. Then, read "All about monads". 'Nuff said :)

Chapter 500: IO up close

Shows that:

c = do a <- someAction
       b <- someOtherAction
       print (bar b)
       print (foo a)
       print "done"

really is just a syntax sugar for:

c = someAction >>= \a ->
    someOtherAction >>= \b ->
    print (bar b) >>
    print (foo a) >>
    print "done"

and explains about ">>=" and ">>". Oh wait. This was already explained in Chapter 400 :)

Chapter 9999: Installing Haskell Compiler/Interpreter and all necessary software

Plenty of material on this on the web and this wiki. Just go get yourself installation of GHC (6.4 or above) or Hugs (v200311 or above) and "darcs", which we will use for version control.

Chapter 10000: Thanks!

Thanks for comments, proofreading, good advice and kind words go to: Helge, alt, dottedmag, Paul Moore, Ben Rudiak-Gould, Jim Wilkinson, Andrew Zhdanov (avalez), Martin Percossi, SpellingNazi, Davor Cubranic, Brett Giles, Stdrange, Brian Chrisman, Nathan Collins, Anastasia Gornostaeva (ermine), Remi, Ptolomy, Zimbatm, HenkJanVanTuyl, Miguel, Mforbes, Kartik Agaram.

If I should have mentioned YOU and forgot - tell me so.

Without you I would have stopped after Chapter 1 :)


Nota: Esta es una traducción del artículo original en Inglés : Hitchhikers guide to Haskell