Tutorials/Programming Haskell/String IO

This is part two in a series of tutorials on programming Haskell. You can get up to speed by reading yesterday's introductory article.

Today we'll look more into the basic tools at our disposal in the Haskell language, in particular, operations for doing I/O and playing with files and strings.

Administrivia

Before we get started, I should clarify a small point raised by yesterday's article. One issue I forgot to mention was that there are slight differences between running Haskell in GHCi, the bytecode interpreter, and compiling it to native code with GHC.

Haskell programs are executed by evaluating the special main definition:

    import Data.List

    mylength = foldr (const (+1)) 0
    main = print (mylength "haskell")

To compile this to native code, we would feed the source file to the compiler:

   $ ghc A.hs
   $ ./a.out
   7

For a faster turnaround, we can run the code directly through the bytecode interpreter, GHCi, using the runhaskell program:

   $ runhaskell A.hs
   7

GHCi, the interactive Haskell environment, is a little bit different. As it is an interactive system, GHCi must execute your code sequentially, as you define each line. This is different to normal Haskell, where the order of definition is irrelevant. GHCi effectively executes your code inside a do-block. Therefore you can use the do-notation at the GHCi prompt to define new functions:

   $ ghci
   Prelude> :m + Data.List

   Prelude> let mylength = foldr (const (+1)) 0

   Prelude> :t mylength
   mylength :: [a] -> Integer

   Prelude> mylength "haskell"
   7

For this tutorial I will be developing code in a source file, and either compiling it as above, or loading the source file into GHCi for testing. To load a source file into GHCi, we do:

   $ ghci
   Prelude> :load A.hs

   *Main> :t main
   main :: IO ()

   *Main> :t mylength
   mylength :: [a] -> Integer

   *Main> mylength "foo"
   3

   *Main> main
   7

Now, let's get into the code!

I/O

As the Camel Book says:

Unless you're using artificial intelligence to model a solipsistic philosopher, your program needs some way to communicate with the outside world.

In yesterday's tutorial, I briefly introduced readFile, for reading a string from a file on disk. Let's consider now I/O in more detail. The most common I/O operations are defined in the System.IO library.

For the most basic stdin/stdout Unix-style programs in Haskell, we can use the interact function:

    interact    ::  (String -> String) -> IO ()

This higher-order function takes, as an argument, some function for processing a string (of type String -> String). It runs this function over the standard input stream, printing the result to standard output. A surprisingly large number of useful programs can be written this way. For example, we can write the cat unix program as:

    main = interact id

Yes, that's it! Let's compile and run this program:

   $ ghc -O A.hs

   $ cat A.hs | ./a.out
   main = interact id

How does this work? Firstly, interact is defined as:

    interact f = do s <- getContents
                    putStr (f s)

So it reads a string from standard input, and writes to standard output the result of applying its argument function to that string.

As for id, it has the type:

    id :: a -> a

id is a function of one argument, of any type (the lowercase a in the type means any type can be used in that position, i.e. it is a polymorphic function, or a generic function in some other languages). id takes a value of some type a, and returns a value of the same type. There's only one (non-trivial) function of this type:

    id a = a

So interact id will print to the input string to standard output unmodified.

Let's now write a basic version of the wc program:

    main    = interact count
    count s = show (length s) ++ "\n"

This will print the length of the input string, that is, the number of chars:

   $ runhaskell A.hs < A.hs
   57

Line oriented I/O

Only a small number of programs operate on unstructured input streams. It is far more common to treat an input stream as a list of lines. So let's do that. To break a string up into lines, we'll use the ... lines function, defined in the Data.List library:

    lines :: String -> [String]

The type, once again, tells the story. lines takes a string, and breaks it up into a list of strings, splitting on newlines. To join a list of strings back into a single string, inserting newlines, we'd use the ... unlines function:

    unlines :: [String] -> String

There are also similar functions for splitting on words, namely words and unwords. Now, an example. To count the number of lines in a file:

    main = interact (count . lines)

We can run this as:

   $ ghc -O A.hs

   $ ./a.out < A.hs
   3

Here we reuse the 'count' function from above, by composing it with the lines function.

On composition

This nice code reuse via composition is achieved using the (.) function, pronounced "compose". Let's look at how that works. (Feel free to skip this section, if you want to just get things done).

The (.) function is just a normal everyday Haskell function, defined as:

    (.) f g x = f (g x)

This looks a little like magic (or line noise), but it's pretty easy. The (.) function simply takes two functions as arguments, along with another value. It applies the g function to the value x, and then applies f to the result, returning this final value. The functions may be of any type. The type of (.) is actually:

    (.) :: (b -> c) -> (a -> b) -> a -> c

which might look a bit hairy, but it essentially specifies what types of arguments make sense to compose. That is, only those where:

    f :: b -> c
    g :: a -> b
    x :: a

can be composed, yielding a new function of type:

    (f . g) :: a -> c

The nice thing is that this composition makes sense (and works) for all types a, b and c.

How does this relate to code reuse? Well, since our count function is polymorphic, it works equally well counting the length of a string, or the length of a list of strings. Our littler wc program is the epitome of the phrase: "higher order + polymorphic = reusable". That is, functions which take other functions as arguments, when combined with functions that work over any type, produce great reusable "glue". You only need vary the argument function to gain excellent code reuse (and the strong type checking ensures you can only reuse code in ways that work).

More on lines

Another little example, let's reverse each line of a file (like the unix rev command):

    main = interact (unlines . map reverse . lines)

Which when run, reverses the input lines:

   $ ./a.out < B.hs
   rahC.ataD tropmi
   ebyaM.ataD tropmi
   tsiL.ataD tropmi

So we take the input string, split it into lines, and the loop over that list of lines, reversing each of them, using the map function. Finally, once we've reversed each line, we join them back into a single string with unlines, and print it out.

The map function is a fundamental control structure of functional programming, similar to the foreach keyword in a number of imperative languages. map however is just a function on lists, not built-in syntax, and has the type:

    map :: (a -> b) -> [a] -> [b]

That is, it takes some function, and a list, and applies that function to each element of the list, returning a new list as a result. Since loops are so common in programming, we'll be using map a lot. Just for reference, map is implemented as:

    map _ []     = []
    map f (x:xs) = f x : map f xs

File I/O

Operating on stdin/stdout is good for scripts (and this is how tools like sed or perl -p work), but for "real" programs we'll at least need to do some file I/O. The basic operations of files are:

readFile  :: FilePath -> IO String
writeFile :: FilePath -> String -> IO ()

readFile takes a file name as an argument, does some I/O, and returns the file's contents as a string. writeFile takes a file name, a string, and does some I/O (writing that string to the file), before returning the void (or unit) value, ().

We could implement a basic cp program on files, as:

    import System.Environment

    main = do
        [f,g] <- getArgs
        s     <- readFile f
        writeFile g s

Running this program:

   $ ghc -O A.hs

   $ ./a.out A.hs Z.hs

   $ cat Z.hs
   import System.Environment

   main = do
       [f,g] <- getArgs
       s     <- readFile f
       writeFile g s

Since we're doing I/O (the type of readFile and writeFile enforce this), the code runs inside a do-block, using the monadic IO type. "Using the IO monad" just means that we wish to use an imperative, sequential order of evaluation. (As an aside, a wide range of other monads exist, for programming different program evaluation strategies, such as Prolog-style backtracking, or continuation-based evaluation. All of imperative programming is just one subset of possible evaluation strategies you can use in Haskell, via monads).

In do-notation, whenever you wish to run an action, for its side effect, and save the result to a variable, you would write:

    v <- action

For example, to run the readFile action, which has the side effect of reading a file from disk, we say:

    s <- readFile f

Finally, we can use the appendFile function to append to an existing file.

File Handles

The most generic interface to files is provided via handles. Sometimes we need to keep a file open, for multiple reads or writes. To do this we use the Handle type, an abstraction much like the underlying system's file handles.

To open up a file, and get its handle, we use:

    openFile :: FilePath -> IOMode -> IO Handle

So to open a file for reading only, in GHCi:

    Prelude System.IO> h <- openFile "A.hs" ReadMode
    {handle: A.hs}

Which returns a handle onto the file "A.hs". We can read a line from this handle:

    Prelude System.IO> hGetLine h
    "main = do"

To close a handle, and flush the buffer:

    hClose :: Handle -> IO ()

Once a handle is closed, we can no longer read from it:

    Prelude System.IO> hClose h
    Prelude System.IO> hGetLine h
    *** Exception: A.hs: hGetLine: illegal operation (handle is closed)

We can also flush explicitly with:

    hFlush :: Handle -> IO ()

Other useful operations for reading from handles:

    hGetChar     :: Handle -> IO Char
    hGetLine     :: Handle -> IO [Char]
    hGetContents :: Handle -> IO [Char]

To write to handles:

    hPutChar    :: Handle -> Char -> IO ()
    hPutStr     :: Handle -> [Char] -> IO ()
    hPutStrLn   :: Handle -> [Char] -> IO ()
    hPrint      :: Show a => Handle -> a -> IO ()

Some other useful actions:

    hSeek     :: Handle -> SeekMode -> Integer -> IO ()
    hTell     :: Handle -> IO Integer
    hFileSize :: Handle -> IO Integer
    hIsEOF    :: Handle -> IO Bool

An example: spell-checking

Here is a small example of combining the Data.Set and List data structures from yesterday's tutorial, with more I/O operations. We'll implement a little spell-checker, building the dictionary in a Set data type. First, some libraries to import:

    import System.Environment
    import Control.Monad
    import Data.Set

And the complete program:

    main = do
        [s] <- getArgs
        f   <- readFile "/usr/share/dict/words"
        g   <- readFile s
        let dict = fromList (lines f)
        mapM_ (spell dict) (words g)

    spell d w = when (w `notMember` d) (putStrLn w)

Running this program, on its own source, and it reports the following words are not found in the dictionary:

   $ ghc -O Spell.hs -o spell

   $ ./spell A.hs
   Data.Char
   =
   <-
   (map
   toUpper
   n)
   =
   <-
   getLine
   1

Writing the results out

If we wanted to write the results out to a temporary file, we can do so. Let's import a couple of other modules:

    import Data.Set
    import Data.Maybe
    import Text.Printf
    import System.IO
    import System.Environment
    import System.Posix.Temp

Refactoring the main definition to separate out the reading and writing phases into their own function, we end up with the core code:

    main = do
        (f, g) <- readFiles
        let dict = fromList (lines f)
            errs = mapMaybe (spell dict) (words g)
        write errs

    spell d w | w `notMember` d = Just w
              | otherwise       = Nothing

Where reading is now its own function:

    readFiles = do
        [s] <- getArgs
        f   <- readFile "/usr/share/dict/words"
        g   <- readFile s
        return (f,g)

And writing errors out to their own file:

    write errs = do
        (t,h) <- mkstemp "/tmp/spell.XXXXXX"
        mapM_ (hPutStrLn h) errs
        hClose h
        printf "%d spelling errors written to '%s'\n" (length errs) t

Pretty simple! Running this program:

   $ ghc --make -O Spell.hs -o myspell
   [1 of 1] Compiling Main             ( Spell.hs, Spell.o )
   Linking myspell ...

   $ ./myspell Spell.hs
   67 spelling errors written to '/tmp/spell.ia8256'

Extension: using SMP parallelism

Finally, just for some bonus fun ... and hold on to your hat 'cause I'm going to go fast ... we'll add some parallelism to the mix.

Haskell was designed from the start to support easy parallelisation, and since GHC 6.6, multithreaded code will run transparently on multicore systems using as many cores as you specify. Let's look at how we'd parallelise our little program to exploit multiple cores. We'll use an explicit threading model, via Control.Concurrent. You can also make your code implicitly parallel, using Control.Parallel.Strategies, but we'll leave that for another tutorial.

Here's the source, for you to ponder. First some imports:

    import Data.Set hiding (map)
    import Data.Maybe
    import Data.Char
    import Text.Printf
    import System.IO
    import System.Environment
    import Control.Concurrent
    import Control.Monad

The entry point, modified to break the word list into chunks, and then dispatching a chunk to each thread:

    main = do
        (f, g, n) <- readFiles
        let dict = fromList (lines f)
            work = chunk n (words g)
        run n dict work

    chunk :: Int -> [a] -> [[a]]
    chunk _ [] = []
    chunk n xs = (take n xs) : (chunk n (drop n xs))

The run function sets up a channel between the main thread and all children thread (errs), and prints spelling errors as they arrive on the channel from the children. It then forks off n children threads on each piece of the work list:

    run n dict work = do
        chan <- newChan
        errs <- getChanContents chan    -- errors returned back to main thread
        mapM_ (forkIO . thread chan dict) (zip [1..n] work)
        wait n errs 0

The main thread then just waits on all the threads to finish, printing any spelling errors they pass up:

    wait n xs i = when (i < n) $ case xs of
        Nothing : ys -> wait n ys $! i+1
        Just s  : ys -> putStrLn s >> wait n ys i

Each thread spell-checks its own piece of the work list. If it finds a spelling error, it passes the offending word back over the channel to the main thread.

    thread chan dict (me, xs) = do
        mapM_ spellit xs
        writeChan chan Nothing

     where
        spellit w = when (spell dict w) $
            writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w

The spell function is simplified a little:

    spell d w = w `notMember` d

which we could also write as:

    spell = flip notMember

We modify the readFiles phase to take an additional numeric command line argument, specifying the number of threads to run:

    readFiles = do
        [s,n] <- getArgs
        f     <- readFile "/usr/share/dict/words"
        g     <- readFile s
        return (f,g, read n)

We compile this with the GHC SMP parallel runtime system:

   $ ghc -O --make -threaded Spell.hs -o spell

Now, we can run n worker threads (lightweight Haskell threads), mapped onto m OS threads. Since I'm using a 4-core server, we'll play around with 4 OS threads. First, running everything in a single thread:

   $ time ./spell test.txt 1 +RTS -N1
   ...
   Thread 1: week:
   Thread 1: IO!
   ./spell test.txt 1 +RTS -N1 99% cpu 2.533 total

Ok, now we change the command line flag to run it with 4 OS threads, to try to utilise all 4 cores:

   $ time ./spell 4 +RTS -N4
   ...
   Thread 2: week:
   Thread 3: IO!
   ./spell test.txt 4 +RTS -N4 111% cpu 2.335 total

Ok. Good... A little bit faster, uses a little bit more cpu. It turns out however the program is bound currently by the time spent in the main thread building the initial dictionary. Actual searching time is only some 10% of the running time. Nonetheless, it was fairly painless to break up the initial simple program into a parallel version.

If the program running time was extended (as the case for a server), the parallelism would be a win. Additionally, should we buy more cores for the server, all we need to is change the +RTS -N argument to the program, to start utilising these extra cores.

Next week

In upcoming tutorials we'll look more into implicitly parallel programs, and the use of the new high performance ByteString data type for string processing.