Tutorials/Programming Haskell/String IO
This is part two in a series of tutorials on programming Haskell. You can get up to speed by reading yesterday's introductory article.
Today we'll look more into the basic tools at our disposal in the Haskell language, in particular, operations for doing I/O and playing with files and strings.
Administrivia
Before we get started, I should clarify a small point raised by yesterday's article. One issue I forgot to mention was that there are slight differences between running Haskell in GHCi, the bytecode interpreter, and compiling it to native code with GHC.
Haskell programs are executed by evaluating the special main
definition:
import Data.List
mylength = foldr (const (+1)) 0
main = print (mylength "haskell")
To compile this to native code, we would feed the source file to the compiler:
$ ghc A.hs $ ./a.out 7
For a faster turnaround, we can run the code directly through the bytecode interpreter, GHCi, using the runhaskell program:
$ runhaskell A.hs 7
GHCi, the interactive Haskell environment, is a little bit different.
As it is an interactive system, GHCi must execute your code
sequentially, as you define each line. This is different to normal
Haskell, where the order of definition is irrelevant. GHCi effectively
executes your code inside a do
-block. Therefore you can use the
do
-notation at the GHCi prompt to define new functions:
$ ghci Prelude> :m + Data.List
Prelude> let mylength = foldr (const (+1)) 0
Prelude> :t mylength mylength :: [a] -> Integer
Prelude> mylength "haskell" 7
For this tutorial I will be developing code in a source file, and either compiling it as above, or loading the source file into GHCi for testing. To load a source file into GHCi, we do:
$ ghci Prelude> :load A.hs
*Main> :t main main :: IO ()
*Main> :t mylength mylength :: [a] -> Integer
*Main> mylength "foo" 3
*Main> main 7
Now, let's get into the code!
I/O
As the Camel Book says:
Unless you're using artificial intelligence to model a solipsistic philosopher, your program needs some way to communicate with the outside world.
In yesterday's tutorial, I briefly introduced readFile
, for reading a
string from a file on disk. Let's consider now I/O in more detail.
The most common I/O operations are defined in the System.IO library.
For the most basic stdin/stdout Unix-style programs in Haskell, we can
use the interact
function:
interact :: (String -> String) -> IO ()
This higher-order function takes, as an argument, some function for
processing a string (of type String -> String
). It runs this function
over the standard input stream, printing the result to standard output.
A surprisingly large number of useful programs can be written this way.
For example, we can write the cat unix program as:
main = interact id
Yes, that's it! Let's compile and run this program:
$ ghc -O A.hs
$ cat A.hs | ./a.out main = interact id
How does this work? Firstly, interact
is defined as:
interact f = do s <- getContents
putStr (f s)
So it reads a string from standard input, and writes to standard output the result of applying its argument function to that string.
As for id
, it has the type:
id :: a -> a
id
is a function of one argument, of any type (the lowercase a
in
the type means any type can be used in that position, i.e. it is a
polymorphic function, or a generic function in some
other languages). id
takes a value of some type a
, and returns a value of
the same type. There's only one (non-trivial) function of this type:
id a = a
So interact id
will print to the input string to standard output
unmodified.
Let's now write a basic version of the wc program:
main = interact count
count s = show (length s) ++ "\n"
This will print the length of the input string, that is, the number of chars:
$ runhaskell A.hs < A.hs 57
Line oriented I/O
Only a small number of programs operate on unstructured input streams.
It is far more common to treat an input stream as a list of lines. So
let's do that. To break a string up into lines, we'll use the ...
lines
function, defined in the Data.List library:
lines :: String -> [String]
The type, once again, tells the story. lines
takes a string, and
breaks it up into a list of strings, splitting on newlines.
To join a list of strings back into a single string, inserting newlines,
we'd use the ... unlines
function:
unlines :: [String] -> String
There are also similar functions for splitting on words, namely words
and unwords
. Now, an example. To count the number of lines in a file:
main = interact (count . lines)
We can run this as:
$ ghc -O A.hs
$ ./a.out < A.hs 3
Here we reuse the 'count' function from above, by composing it with the lines function.
On composition
This nice code reuse via composition is achieved using the (.)
function,
pronounced "compose". Let's look at how that works. (Feel free to skip
this section, if you want to just get things done).
The (.)
function is just a normal everyday Haskell function, defined as:
(.) f g x = f (g x)
This looks a little like magic (or line noise), but it's pretty easy. The
(.)
function simply takes two functions as arguments, along with
another value. It applies the g
function to the value x
, and then
applies f
to the result, returning this final value. The functions may
be of any type. The type of (.)
is actually:
(.) :: (b -> c) -> (a -> b) -> a -> c
which might look a bit hairy, but it essentially specifies what types of arguments make sense to compose. That is, only those where:
f :: b -> c
g :: a -> b
x :: a
can be composed, yielding a new function of type:
(f . g) :: a -> c
The nice thing is that this composition makes sense (and works) for all types a
, b
and c
.
How does this relate to code reuse? Well, since our count
function is
polymorphic, it works equally well counting the length of a
string, or the length of a list of strings. Our littler wc program is
the epitome of the phrase: "higher order + polymorphic =
reusable". That is, functions which take other functions as
arguments, when combined with functions that work over any type, produce
great reusable "glue". You only need vary the argument function to gain
excellent code reuse (and the strong type checking ensures you can only
reuse code in ways that work).
More on lines
Another little example, let's reverse each line of a file (like the unix rev command):
main = interact (unlines . map reverse . lines)
Which when run, reverses the input lines:
$ ./a.out < B.hs rahC.ataD tropmi ebyaM.ataD tropmi tsiL.ataD tropmi
So we take the input string, split it into lines, and the loop over that
list of lines, reversing each of them, using the map
function.
Finally, once we've reversed each line, we join them back into a single
string with unlines
, and print it out.
The map
function is a fundamental control structure of functional
programming, similar to the foreach keyword in a number of imperative
languages. map
however is just a function on lists, not built-in
syntax, and has the type:
map :: (a -> b) -> [a] -> [b]
That is, it takes some function, and a list, and applies that function
to each element of the list, returning a new list as a result. Since
loops are so common in programming, we'll be using map
a lot.
Just for reference, map
is implemented as:
map _ [] = []
map f (x:xs) = f x : map f xs
File I/O
Operating on stdin/stdout is good for scripts (and this is how tools like sed or perl -p work), but for "real" programs we'll at least need to do some file I/O. The basic operations of files are:
readFile :: FilePath -> IO String
writeFile :: FilePath -> String -> IO ()
readFile
takes a file name as an argument, does some I/O, and returns the
file's contents as a string. writeFile
takes a file name, a string,
and does some I/O (writing that string to the file), before returning the
void (or unit) value, ()
.
We could implement a basic cp program on files, as:
import System.Environment
main = do
[f,g] <- getArgs
s <- readFile f
writeFile g s
Running this program:
$ ghc -O A.hs
$ ./a.out A.hs Z.hs
$ cat Z.hs import System.Environment
main = do [f,g] <- getArgs s <- readFile f writeFile g s
Since we're doing I/O (the type of readFile
and writeFile
enforce this),
the code runs inside a do
-block, using the monadic IO
type. "Using the
IO monad" just means that we wish to use an imperative, sequential order
of evaluation. (As an aside, a wide range of other monads exist, for
programming different program evaluation strategies, such as
Prolog-style backtracking, or continuation-based evaluation. All of
imperative programming is just one subset of possible evaluation
strategies you can use in Haskell, via monads).
In do
-notation, whenever you wish to run an action, for its side
effect, and save the result to a variable, you would write:
v <- action
For example, to run the readFile
action, which has the side effect of
reading a file from disk, we say:
s <- readFile f
Finally, we can use the appendFile
function to append to an existing
file.
File Handles
The most generic interface to files is provided via handles. Sometimes
we need to keep a file open, for multiple reads or writes. To do this we
use the Handle
type, an abstraction much like the underlying system's file
handles.
To open up a file, and get its handle, we use:
openFile :: FilePath -> IOMode -> IO Handle
So to open a file for reading only, in GHCi:
Prelude System.IO> h <- openFile "A.hs" ReadMode
{handle: A.hs}
Which returns a handle onto the file "A.hs". We can read a line from this handle:
Prelude System.IO> hGetLine h
"main = do"
To close a handle, and flush the buffer:
hClose :: Handle -> IO ()
Once a handle is closed, we can no longer read from it:
Prelude System.IO> hClose h
Prelude System.IO> hGetLine h
*** Exception: A.hs: hGetLine: illegal operation (handle is closed)
We can also flush explicitly with:
hFlush :: Handle -> IO ()
Other useful operations for reading from handles:
hGetChar :: Handle -> IO Char
hGetLine :: Handle -> IO [Char]
hGetContents :: Handle -> IO [Char]
To write to handles:
hPutChar :: Handle -> Char -> IO ()
hPutStr :: Handle -> [Char] -> IO ()
hPutStrLn :: Handle -> [Char] -> IO ()
hPrint :: Show a => Handle -> a -> IO ()
Some other useful actions:
hSeek :: Handle -> SeekMode -> Integer -> IO ()
hTell :: Handle -> IO Integer
hFileSize :: Handle -> IO Integer
hIsEOF :: Handle -> IO Bool
An example: spell-checking
Here is a small example of combining the Data.Set
and List data structures from yesterday's tutorial, with more I/O
operations. We'll implement a little spell-checker, building the
dictionary in a Set
data type. First, some libraries to import:
import System.Environment
import Control.Monad
import Data.Set
And the complete program:
main = do
[s] <- getArgs
f <- readFile "/usr/share/dict/words"
g <- readFile s
let dict = fromList (lines f)
mapM_ (spell dict) (words g)
spell d w = when (w `notMember` d) (putStrLn w)
Running this program, on its own source, and it reports the following words are not found in the dictionary:
$ ghc -O Spell.hs -o spell
$ ./spell A.hs Data.Char = <- (map toUpper n) = <- getLine 1
Writing the results out
If we wanted to write the results out to a temporary file, we can do so. Let's import a couple of other modules:
import Data.Set
import Data.Maybe
import Text.Printf
import System.IO
import System.Environment
import System.Posix.Temp
Refactoring the main
definition to separate out the reading and writing phases
into their own function, we end up with the core code:
main = do
(f, g) <- readFiles
let dict = fromList (lines f)
errs = mapMaybe (spell dict) (words g)
write errs
spell d w | w `notMember` d = Just w
| otherwise = Nothing
Where reading is now its own function:
readFiles = do
[s] <- getArgs
f <- readFile "/usr/share/dict/words"
g <- readFile s
return (f,g)
And writing errors out to their own file:
write errs = do
(t,h) <- mkstemp "/tmp/spell.XXXXXX"
mapM_ (hPutStrLn h) errs
hClose h
printf "%d spelling errors written to '%s'\n" (length errs) t
Pretty simple! Running this program:
$ ghc --make -O Spell.hs -o myspell [1 of 1] Compiling Main ( Spell.hs, Spell.o ) Linking myspell ...
$ ./myspell Spell.hs 67 spelling errors written to '/tmp/spell.ia8256'
Extension: using SMP parallelism
Finally, just for some bonus fun ... and hold on to your hat 'cause I'm going to go fast ... we'll add some parallelism to the mix.
Haskell was designed from the start to support easy parallelisation, and
since GHC 6.6, multithreaded code will run transparently on multicore
systems using as many cores as you specify. Let's look at how we'd
parallelise our little program to exploit multiple cores. We'll use an
explicit threading model, via Control.Concurrent. You can also make your code implicitly
parallel, using Control.Parallel.Strategies,
but we'll leave that for another tutorial.
Here's the source, for you to ponder. First some imports:
import Data.Set hiding (map)
import Data.Maybe
import Data.Char
import Text.Printf
import System.IO
import System.Environment
import Control.Concurrent
import Control.Monad
The entry point, modified to break the word list into chunks, and then dispatching a chunk to each thread:
main = do
(f, g, n) <- readFiles
let dict = fromList (lines f)
work = chunk n (words g)
run n dict work
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = (take n xs) : (chunk n (drop n xs))
The run
function sets up a channel between the main thread and all
children thread (errs
), and prints spelling errors as they arrive on
the channel from the children. It then forks off n
children threads on
each piece of the work list:
run n dict work = do
chan <- newChan
errs <- getChanContents chan -- errors returned back to main thread
mapM_ (forkIO . thread chan dict) (zip [1..n] work)
wait n errs 0
The main thread then just waits on all the threads to finish, printing
any spelling errors they pass up:
wait n xs i = when (i < n) $ case xs of
Nothing : ys -> wait n ys $! i+1
Just s : ys -> putStrLn s >> wait n ys i
Each thread spell-checks its own piece of the work list. If it finds a spelling error, it passes the offending word back over the channel to the main thread.
thread chan dict (me, xs) = do
mapM_ spellit xs
writeChan chan Nothing
where
spellit w = when (spell dict w) $
writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w
The spell
function is simplified a little:
spell d w = w `notMember` d
which we could also write as:
spell = flip notMember
We modify the readFiles phase to take an additional numeric command line argument, specifying the number of threads to run:
readFiles = do
[s,n] <- getArgs
f <- readFile "/usr/share/dict/words"
g <- readFile s
return (f,g, read n)
We compile this with the GHC SMP parallel runtime system:
$ ghc -O --make -threaded Spell.hs -o spell
Now, we can run n worker threads (lightweight Haskell threads), mapped onto m OS threads. Since I'm using a 4-core server, we'll play around with 4 OS threads. First, running everything in a single thread:
$ time ./spell test.txt 1 +RTS -N1 ... Thread 1: week: Thread 1: IO! ./spell test.txt 1 +RTS -N1 99% cpu 2.533 total
Ok, now we change the command line flag to run it with 4 OS threads, to try to utilise all 4 cores:
$ time ./spell 4 +RTS -N4 ... Thread 2: week: Thread 3: IO! ./spell test.txt 4 +RTS -N4 111% cpu 2.335 total
Ok. Good... A little bit faster, uses a little bit more cpu. It turns out however the program is bound currently by the time spent in the main thread building the initial dictionary. Actual searching time is only some 10% of the running time. Nonetheless, it was fairly painless to break up the initial simple program into a parallel version.
If the program running time was extended (as the case for a server), the
parallelism would be a win. Additionally, should we buy more cores for
the server, all we need to is change the +RTS -N argument to the
program, to start utilising these extra cores.
Next week
In upcoming tutorials we'll look more into implicitly parallel programs,
and the use of the new high performance ByteString
data type for string
processing.