Difference between revisions of "Tutorials/Programming Haskell/String IO"

From HaskellWiki
Jump to: navigation, search
(Add chunk definition.)
m (Various formatting and other minor changes)
 
Line 4: Line 4:
   
 
Today we'll look more into the basic tools at our disposal in the
 
Today we'll look more into the basic tools at our disposal in the
[[Haskell]] language, in particular, operations for doing IO and playing
+
[[Haskell]] language, in particular, operations for doing I/O and playing
 
with files and strings.
 
with files and strings.
  +
   
 
==Administrivia==
 
==Administrivia==
Line 12: Line 13:
 
[[Tutorials/Programming Haskell/Introduction|yesterday's article]].
 
[[Tutorials/Programming Haskell/Introduction|yesterday's article]].
 
One issue I forgot to mention was that there are slight differences
 
One issue I forgot to mention was that there are slight differences
between running Haskell in ghci, the bytecode interpreter, and compiling
+
between running Haskell in GHCi, the bytecode interpreter, and compiling
 
it to native code with GHC.
 
it to native code with GHC.
   
Haskell programs are executed by evaluating the special 'main' function.
+
Haskell programs are executed by evaluating the special <code>main</code> definition:
   
 
<haskell>
 
<haskell>
Line 31: Line 32:
   
 
For a faster turnaround, we can run the code directly through
 
For a faster turnaround, we can run the code directly through
the bytecode interpreter, GHCi, using the 'runhaskell' program:
+
the bytecode interpreter, GHCi, using the <tt>runhaskell</tt> program:
   
 
$ runhaskell A.hs
 
$ runhaskell A.hs
Line 40: Line 41:
 
sequentially, as you define each line. This is different to normal
 
sequentially, as you define each line. This is different to normal
 
Haskell, where the order of definition is irrelevant. GHCi effectively
 
Haskell, where the order of definition is irrelevant. GHCi effectively
executes your code inside a <i>do</i>-block. Therefore you can use the
+
executes your code inside a <code>do</code>-block. Therefore you can use the
<i>do</i>-notation at the GHCi prompt to define new functions:
+
<code>do</code>-notation at the GHCi prompt to define new functions:
   
 
$ ghci
 
$ ghci
Line 75: Line 76:
 
Now, let's get into the code!
 
Now, let's get into the code!
   
==IO==
 
   
As the Camel Book says:
 
  +
==I/O==
   
<blockquote>
 
  +
As the [https://www.oreilly.com/library/view/programming-perl-4th/9781449321451 Camel Book] says:
Unless you're using artificial intelligence to model a solipsistic
 
philosopher, your program needs some way to communicate with the
 
outside world.
 
</blockquote>
 
   
  +
<div style="border-left:1px solid lightgray; padding: 1em" alt="blockquote">
  +
Unless you're using artificial intelligence to model a solipsistic
  +
philosopher, your program needs some way to communicate with the
  +
outside world.
  +
</div>
   
In yesterday's tutorial, I briefly introduced 'readFile', for reading a
+
In yesterday's tutorial, I briefly introduced <code>readFile</code>, for reading a
String from a file on disk. Let's consider now IO in more detail.
+
string from a file on disk. Let's consider now I/O in more detail.
The most common IO operations are defined in the [http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html System.IO] library.
+
The most common I/O operations are defined in the [http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html System.IO] library.
   
For the most basic stdin/stdout Unix-style programs in Haskell, we can
+
For the most basic <tt>stdin/stdout</tt> Unix-style programs in Haskell, we can
use the 'interact' function:
+
use the <code>interact</code> function:
   
 
<haskell>
 
<haskell>
Line 97: Line 98:
 
</haskell>
 
</haskell>
   
This <i>higher order</i> function takes, as an argument, some function for
+
This <i>higher-order</i> function takes, as an argument, some function for
processing a string (of type String -> String). It runs this function
+
processing a string (of type <code>String -> String</code>). It runs this function
 
over the standard input stream, printing the result to standard output.
 
over the standard input stream, printing the result to standard output.
 
A surprisingly large number of useful programs can be written this way.
 
A surprisingly large number of useful programs can be written this way.
For example, we can write the 'cat' unix program as:
+
For example, we can write the <tt>cat</tt> unix program as:
   
 
<haskell>
 
<haskell>
Line 114: Line 115:
 
main = interact id
 
main = interact id
   
How does this work? Firstly, 'interact' is defined as:
+
How does this work? Firstly, <code>interact</code> is defined as:
   
 
<haskell>
 
<haskell>
Line 122: Line 123:
   
 
So it reads a string from standard input, and writes to standard output
 
So it reads a string from standard input, and writes to standard output
the result of applying its argument function to that string. The 'id'
+
the result of applying its argument function to that string.
function itself has the type:
+
  +
As for <code>id</code>, it has the type:
   
 
<haskell>
 
<haskell>
Line 129: Line 130:
 
</haskell>
 
</haskell>
   
'id' is a function of one argument, of any type (the lowercase 'a' in
+
<code>id</code> is a function of one argument, of any type (the lowercase <code>a</code> in
 
the type means any type can be used in that position, i.e. it is a
 
the type means any type can be used in that position, i.e. it is a
polymorphic function (also called a generic function in some
+
polymorphic function, or a <i>generic function</i> in some
languages)). 'id' takes a value of some type 'a', and returns a value of
+
other languages). <code>id</code> takes a value of some type <code>a</code>, and returns a value of
the same type. There's is only one (non-trivial) function of this type:
+
the same type. There's only one (non-trivial) function of this type:
   
 
<haskell>
 
<haskell>
Line 139: Line 140:
 
</haskell>
 
</haskell>
   
So 'interact id' will print to the input string to standard output
+
So <code>interact id</code> will print to the input string to standard output
 
unmodified.
 
unmodified.
   
Let's now write the 'wc' program:
+
Let's now write a basic version of the <tt>wc</tt> program:
   
 
<haskell>
 
<haskell>
Line 155: Line 156:
 
57
 
57
   
==Line oriented IO==
 
  +
  +
==Line oriented I/O==
   
 
Only a small number of programs operate on unstructured input streams.
 
Only a small number of programs operate on unstructured input streams.
 
It is far more common to treat an input stream as a list of lines. So
 
It is far more common to treat an input stream as a list of lines. So
 
let's do that. To break a string up into lines, we'll use the ...
 
let's do that. To break a string up into lines, we'll use the ...
'lines' function, defined in the [http://haskell.org/ghc/docs/latest/html/libraries/base/Data-List.html Data.List] library:
+
<code>lines</code> function, defined in the [http://haskell.org/ghc/docs/latest/html/libraries/base/Data-List.html Data.List] library:
   
 
<haskell>
 
<haskell>
Line 166: Line 168:
 
</haskell>
 
</haskell>
   
The type, once again, tells the story. 'lines' takes a String, and
+
The type, once again, tells the story. <code>lines</code> takes a string, and
 
breaks it up into a list of strings, splitting on newlines.
 
breaks it up into a list of strings, splitting on newlines.
 
To join a list of strings back into a single string, inserting newlines,
 
To join a list of strings back into a single string, inserting newlines,
we'd use the ... 'unlines' function:
+
we'd use the ... <code>unlines</code> function:
   
 
<haskell>
 
<haskell>
Line 175: Line 177:
 
</haskell>
 
</haskell>
   
There are also similar functions for splitting on words, namely 'words'
+
There are also similar functions for splitting on words, namely <code>words</code>
and 'unwords'. Now, an example. To count the number of lines in a file:
+
and <code>unwords</code>. Now, an example. To count the number of lines in a file:
   
 
<haskell>
 
<haskell>
Line 192: Line 194:
 
with the lines function.
 
with the lines function.
   
===On composition===
 
   
  +
===On composition===
   
This nice code reuse via composition is achieved using the (.) function,
+
This nice code reuse via composition is achieved using the <code>(.)</code> function,
pronounced 'compose'. Let's look at how that works. (Feel free to skip
+
pronounced <i>"compose"</i>. Let's look at how that works. (Feel free to skip
 
this section, if you want to just get things done).
 
this section, if you want to just get things done).
   
   
The (.) function is just a normal everyday Haskell function, defined as:
+
The <code>(.)</code> function is just a normal everyday Haskell function, defined as:
   
 
<haskell>
 
<haskell>
Line 206: Line 208:
 
</haskell>
 
</haskell>
   
This looks a little like magic (or line noise), but its pretty easy. The
+
This looks a little like magic (or line noise), but it's pretty easy. The
(.) function simply takes <i>two functions</i> as arguments, along with
+
<code>(.)</code> function simply takes <i>two functions</i> as arguments, along with
another value. It applies the 'g' function to the value 'x', and then
+
another value. It applies the <code>g</code> function to the value <code>x</code>, and then
applies 'f' to the result, returning this final value. The functions may
+
applies <code>f</code> to the result, returning this final value. The functions may
be of any type. The type of (.) is actually:
+
be of any type. The type of <code>(.)</code> is actually:
   
 
<haskell>
 
<haskell>
Line 231: Line 233:
 
</haskell>
 
</haskell>
   
The nice thing is that this composition makes sense (and works) <i>for all types a, b and
+
The nice thing is that this composition makes sense (and works) <i>for all types</i> <code>a</code>, <code>b</code> and <code>c</code>.
c</i>.
 
   
 
  +
How does this relate to code reuse? Well, since our <code>count</code> function is
How does this relate to code reuse? Well, since our 'count' function is
 
 
<i>polymorphic</i>, it works equally well counting the length of a
 
<i>polymorphic</i>, it works equally well counting the length of a
string, or the length of a list of strings. Our littler 'wc' program is
+
string, or the length of a list of strings. Our littler <tt>wc</tt> program is
 
the epitome of the phrase: <i>"higher order + polymorphic =
 
the epitome of the phrase: <i>"higher order + polymorphic =
 
reusable"</i>. That is, functions which take other functions as
 
reusable"</i>. That is, functions which take other functions as
 
arguments, when combined with functions that work over any type, produce
 
arguments, when combined with functions that work over any type, produce
great reusable 'glue'. You only need vary the argument function to gain
+
great reusable "glue". You only need vary the argument function to gain
terrific code reuse (and the strong type checking ensures you can only
+
excellent code reuse (and the strong type checking ensures you can only
 
reuse code in ways that work).
 
reuse code in ways that work).
   
===More on lines===
 
   
  +
===More on lines===
   
 
Another little example, let's reverse each line of a file (like the unix
 
Another little example, let's reverse each line of a file (like the unix
'rev' command):
+
<tt>rev</tt> command):
   
 
<haskell>
 
<haskell>
Line 262: Line 263:
   
 
So we take the input string, split it into lines, and the loop over that
 
So we take the input string, split it into lines, and the loop over that
list of lines, reversing each of them, using the 'map' function.
+
list of lines, reversing each of them, using the <code>map</code> function.
 
Finally, once we've reversed each line, we join them back into a single
 
Finally, once we've reversed each line, we join them back into a single
string with unlines, and print it out.
+
string with <code>unlines</code>, and print it out.
   
   
The 'map' function is a fundamental control structure of functional
+
The <code>map</code> function is a fundamental control structure of functional
programming, similar to the 'foreach' keyword in a number of imperative
+
programming, similar to the <tt>foreach</tt> keyword in a number of imperative
languages. 'map' however is just a function on lists, not built in
+
languages. <code>map</code> however is just a function on lists, not built-in
 
syntax, and has the type:
 
syntax, and has the type:
   
Line 278: Line 279:
 
That is, it takes some function, and a list, and applies that function
 
That is, it takes some function, and a list, and applies that function
 
to each element of the list, returning a new list as a result. Since
 
to each element of the list, returning a new list as a result. Since
loops are so common in programming, we'll be using 'map' a lot.
+
loops are so common in programming, we'll be using <code>map</code> a lot.
Just for reference, 'map' is implemented as:
+
Just for reference, <code>map</code> is implemented as:
   
 
<haskell>
 
<haskell>
Line 286: Line 287:
 
</haskell>
 
</haskell>
   
==File IO==
 
   
Operating on stdin/stdout is good for scripts (and this is how tools
 
  +
==File I/O==
like sed or perl -p work), but for 'real' programs we'll at least need
 
  +
to do some file IO. The basic operations of files are:
 
  +
Operating on <tt>stdin/stdout</tt> is good for scripts (and this is how tools
  +
like <tt>sed</tt> or <tt>perl -p</tt> work), but for "real" programs we'll at least need
  +
to do some file I/O. The basic operations of files are:
   
 
<haskell>
 
<haskell>
Line 297: Line 299:
 
</haskell>
 
</haskell>
   
'readFile' takes a file name as an argument, does some IO, and returns the
+
<code>readFile</code> takes a file name as an argument, does some I/O, and returns the
file's contents as a string. 'writeFile' takes a file name, a string,
+
file's contents as a string. <code>writeFile</code> takes a file name, a string,
and does some IO (writing that string to the file), before returning the
+
and does some I/O (writing that string to the file), before returning the
void (or unit) value, ().
+
void (or unit) value, <code>()</code>.
   
   
We could implement a 'cp' program on files, as:
+
We could implement a basic <tt>cp</tt> program on files, as:
   
 
<haskell>
 
<haskell>
Line 328: Line 330:
 
writeFile g s
 
writeFile g s
   
Since we're doing IO (the type of readFile and writeFile enforce this),
+
Since we're doing I/O (the type of <code>readFile</code> and <code>writeFile</code> enforce this),
the code runs inside a do-block, using the IO <i>monad</i>. "Using the
+
the code runs inside a <code>do</code>-block, using the <i>monadic</i> <code>IO</code> type. <i>"Using the
IO monad" just means that we wish to use an imperative, sequential order
+
</i>IO<i> monad"</i> just means that we wish to use an imperative, sequential order
 
of evaluation. (As an aside, a wide range of other monads exist, for
 
of evaluation. (As an aside, a wide range of other monads exist, for
 
programming different program evaluation strategies, such as
 
programming different program evaluation strategies, such as
Line 338: Line 340:
   
   
In <i>do</i>-notation, whenever you wish to run an action, for its side
+
In <code>do</code>-notation, whenever you wish to run an action, for its side
 
effect, and save the result to a variable, you would write:
 
effect, and save the result to a variable, you would write:
   
Line 345: Line 347:
 
</haskell>
 
</haskell>
   
For example, to run the 'readFile' action, which has the side effect of
+
For example, to run the <code>readFile</code> action, which has the side effect of
 
reading a file from disk, we say:
 
reading a file from disk, we say:
   
Line 352: Line 354:
 
</haskell>
 
</haskell>
   
Finally, we can use the 'appendFile' function to append to an existing
+
Finally, we can use the <code>appendFile</code> function to append to an existing
 
file.
 
file.
   
==File Handles==
 
   
  +
==File Handles==
   
The most generic interface to files is provided via Handles. Sometimes
+
The most generic interface to files is provided via handles. Sometimes
 
we need to keep a file open, for multiple reads or writes. To do this we
 
we need to keep a file open, for multiple reads or writes. To do this we
use Handles, an abstraction much like the underlying system's file
+
use the <code>Handle</code> type, an abstraction much like the underlying system's file
 
handles.
 
handles.
   
   
To open up a file, and get its Handle, we use:
+
To open up a file, and get its handle, we use:
   
 
<haskell>
 
<haskell>
Line 377: Line 379:
 
</haskell>
 
</haskell>
   
Which returns a Handle onto the file "A.hs". We can read a line from this handle:
+
Which returns a handle onto the file <tt>"A.hs"</tt>. We can read a line from this handle:
   
 
<haskell>
 
<haskell>
Line 384: Line 386:
 
</haskell>
 
</haskell>
   
To close a Handle, and flush the buffer:
+
To close a handle, and flush the buffer:
   
 
<haskell>
 
<haskell>
Line 390: Line 392:
 
</haskell>
 
</haskell>
   
Once a Handle is closed, we can no longer read from it:
+
Once a handle is closed, we can no longer read from it:
   
 
<haskell>
 
<haskell>
Line 404: Line 406:
 
</haskell>
 
</haskell>
   
Other useful operations for reading from Handles:
+
Other useful operations for reading from handles:
   
 
<haskell>
 
<haskell>
Line 412: Line 414:
 
</haskell>
 
</haskell>
   
To write to Handles:
+
To write to handles:
   
 
<haskell>
 
<haskell>
Line 430: Line 432:
 
</haskell>
 
</haskell>
   
==An example: spell checking==
 
   
  +
==An example: spell-checking==
   
 
Here is a small example of combining the [http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Set.html Data.Set]
 
Here is a small example of combining the [http://haskell.org/ghc/docs/latest/html/libraries/base/Data-Set.html Data.Set]
and List data structures from yesterday's tutorial, with more IO
+
and <tt>List</tt> data structures from yesterday's tutorial, with more I/O
operations. We'll implement a little spell checker, building the
+
operations. We'll implement a little spell-checker, building the
dictionary in a Set data type. First, some libraries to import:
+
dictionary in a <code>Set</code> data type. First, some libraries to import:
   
 
<haskell>
 
<haskell>
Line 476: Line 478:
   
 
===Writing the results out===
 
===Writing the results out===
 
   
 
If we wanted to write the results out to a temporary file, we can do
 
If we wanted to write the results out to a temporary file, we can do
Line 490: Line 491:
 
</haskell>
 
</haskell>
   
Refactoring the main code to separate out the reading and writing phases
+
Refactoring the <code>main</code> definition to separate out the reading and writing phases
in to their own function, we end up with the core code:
+
into their own function, we end up with the core code:
   
 
<haskell>
 
<haskell>
Line 533: Line 534:
 
67 spelling errors written to '/tmp/spell.ia8256'
 
67 spelling errors written to '/tmp/spell.ia8256'
   
==Extension: using SMP parallelism==
 
   
  +
==Extension: using SMP parallelism==
   
 
Finally, just for some bonus fun ... and hold on to your hat 'cause I'm
 
Finally, just for some bonus fun ... and hold on to your hat 'cause I'm
Line 545: Line 546:
 
parallelise our little program to exploit multiple cores. We'll use an
 
parallelise our little program to exploit multiple cores. We'll use an
 
explicit threading model, via [http://haskell.org/ghc/docs/latest/html/libraries/base/Control-Concurrent.html Control.Concurrent]. You can also make your code implicitly
 
explicit threading model, via [http://haskell.org/ghc/docs/latest/html/libraries/base/Control-Concurrent.html Control.Concurrent]. You can also make your code implicitly
parallel, using [http://haskell.org/ghc/docs/latest/html/libraries/base/Control-Parallel-Strategies.html">Control.Parallel.Strategies],
+
parallel, using [http://haskell.org/ghc/docs/latest/html/libraries/base/Control-Parallel-Strategies.html Control.Parallel.Strategies],
 
but we'll leave that for another tutorial.
 
but we'll leave that for another tutorial.
   
Line 578: Line 579:
   
   
The 'run' function sets up a channel between the main thread and all
+
The <code>run</code> function sets up a channel between the main thread and all
children thread ('errs'), and prints spelling errors as they arrive on
+
children thread (<code>errs</code>), and prints spelling errors as they arrive on
the channel from the children. It then forks off 'n' children threads on
+
the channel from the children. It then forks off <code>n</code> children threads on
 
each piece of the work list:
 
each piece of the work list:
   
Line 601: Line 602:
 
</haskell>
 
</haskell>
   
Each thread spell checks its own piece of the work list. If it finds a
+
Each thread spell-checks its own piece of the work list. If it finds a
 
spelling error, it passes the offending word back over the channel to
 
spelling error, it passes the offending word back over the channel to
 
the main thread.
 
the main thread.
Line 615: Line 616:
 
</haskell>
 
</haskell>
   
The 'spell' function is simplified a little:
+
The <code>spell</code> function is simplified a little:
   
 
<haskell>
 
<haskell>
Line 627: Line 628:
 
</haskell>
 
</haskell>
   
We modify the readFiles phase to take an additional numeric command line
+
We modify the <tt>readFiles</tt> phase to take an additional numeric command line
 
argument, specifying the number of threads to run:
 
argument, specifying the number of threads to run:
   
Line 642: Line 643:
 
$ ghc -O --make -threaded Spell.hs -o spell
 
$ ghc -O --make -threaded Spell.hs -o spell
   
Now, we can run 'n' worker threads (lightweight Haskell threads), mapped
+
Now, we can run <tt>n</tt> worker threads (lightweight Haskell threads), mapped
onto 'm' OS threads. Since I'm using a 4 core linux server, we'll play
+
onto <tt>m</tt> OS threads. Since I'm using a 4-core server, we'll play
 
around with 4 OS threads. First, running everything in a single thread:
 
around with 4 OS threads. First, running everything in a single thread:
   
Line 670: Line 671:
 
If the program running time was extended (as the case for a server), the
 
If the program running time was extended (as the case for a server), the
 
parallelism would be a win. Additionally, should we buy more cores for
 
parallelism would be a win. Additionally, should we buy more cores for
the server, all we need to is change the +RTS -N argument to the
+
the server, all we need to is change the <tt>+RTS -N</tt> argument to the
 
program, to start utilising these extra cores.
 
program, to start utilising these extra cores.
   
==Next week==
 
   
  +
==Next week==
   
 
In upcoming tutorials we'll look more into implicitly parallel programs,
 
In upcoming tutorials we'll look more into implicitly parallel programs,
and the use of the new high performance ByteString data type for string
+
and the use of the new high performance <code>ByteString</code> data type for string
 
processing.
 
processing.

Latest revision as of 04:56, 22 September 2021

This is part two in a series of tutorials on programming Haskell. You can get up to speed by reading yesterday's introductory article.

Today we'll look more into the basic tools at our disposal in the Haskell language, in particular, operations for doing I/O and playing with files and strings.


Administrivia

Before we get started, I should clarify a small point raised by yesterday's article. One issue I forgot to mention was that there are slight differences between running Haskell in GHCi, the bytecode interpreter, and compiling it to native code with GHC.

Haskell programs are executed by evaluating the special main definition:

    import Data.List

    mylength = foldr (const (+1)) 0
    main = print (mylength "haskell")

To compile this to native code, we would feed the source file to the compiler:

   $ ghc A.hs
   $ ./a.out
   7

For a faster turnaround, we can run the code directly through the bytecode interpreter, GHCi, using the runhaskell program:

   $ runhaskell A.hs
   7

GHCi, the interactive Haskell environment, is a little bit different. As it is an interactive system, GHCi must execute your code sequentially, as you define each line. This is different to normal Haskell, where the order of definition is irrelevant. GHCi effectively executes your code inside a do-block. Therefore you can use the do-notation at the GHCi prompt to define new functions:

   $ ghci
   Prelude> :m + Data.List
   Prelude> let mylength = foldr (const (+1)) 0
   Prelude> :t mylength
   mylength :: [a] -> Integer
   Prelude> mylength "haskell"
   7

For this tutorial I will be developing code in a source file, and either compiling it as above, or loading the source file into GHCi for testing. To load a source file into GHCi, we do:

   $ ghci
   Prelude> :load A.hs
   *Main> :t main
   main :: IO ()
   *Main> :t mylength
   mylength :: [a] -> Integer
   *Main> mylength "foo"
   3
   *Main> main
   7

Now, let's get into the code!


I/O

As the Camel Book says:

Unless you're using artificial intelligence to model a solipsistic philosopher, your program needs some way to communicate with the outside world.

In yesterday's tutorial, I briefly introduced readFile, for reading a string from a file on disk. Let's consider now I/O in more detail. The most common I/O operations are defined in the System.IO library.

For the most basic stdin/stdout Unix-style programs in Haskell, we can use the interact function:

    interact    ::  (String -> String) -> IO ()

This higher-order function takes, as an argument, some function for processing a string (of type String -> String). It runs this function over the standard input stream, printing the result to standard output. A surprisingly large number of useful programs can be written this way. For example, we can write the cat unix program as:

    main = interact id

Yes, that's it! Let's compile and run this program:

   $ ghc -O A.hs
   $ cat A.hs | ./a.out
   main = interact id

How does this work? Firstly, interact is defined as:

    interact f = do s <- getContents
                    putStr (f s)

So it reads a string from standard input, and writes to standard output the result of applying its argument function to that string.

As for id, it has the type:

    id :: a -> a

id is a function of one argument, of any type (the lowercase a in the type means any type can be used in that position, i.e. it is a polymorphic function, or a generic function in some other languages). id takes a value of some type a, and returns a value of the same type. There's only one (non-trivial) function of this type:

    id a = a

So interact id will print to the input string to standard output unmodified.

Let's now write a basic version of the wc program:

    main    = interact count
    count s = show (length s) ++ "\n"

This will print the length of the input string, that is, the number of chars:

   $ runhaskell A.hs < A.hs
   57


Line oriented I/O

Only a small number of programs operate on unstructured input streams. It is far more common to treat an input stream as a list of lines. So let's do that. To break a string up into lines, we'll use the ... lines function, defined in the Data.List library:

    lines :: String -> [String]

The type, once again, tells the story. lines takes a string, and breaks it up into a list of strings, splitting on newlines. To join a list of strings back into a single string, inserting newlines, we'd use the ... unlines function:

    unlines :: [String] -> String

There are also similar functions for splitting on words, namely words and unwords. Now, an example. To count the number of lines in a file:

    main = interact (count . lines)

We can run this as:

   $ ghc -O A.hs
   $ ./a.out < A.hs
   3

Here we reuse the 'count' function from above, by composing it with the lines function.


On composition

This nice code reuse via composition is achieved using the (.) function, pronounced "compose". Let's look at how that works. (Feel free to skip this section, if you want to just get things done).


The (.) function is just a normal everyday Haskell function, defined as:

    (.) f g x = f (g x)

This looks a little like magic (or line noise), but it's pretty easy. The (.) function simply takes two functions as arguments, along with another value. It applies the g function to the value x, and then applies f to the result, returning this final value. The functions may be of any type. The type of (.) is actually:

    (.) :: (b -> c) -> (a -> b) -> a -> c

which might look a bit hairy, but it essentially specifies what types of arguments make sense to compose. That is, only those where:

    f :: b -> c
    g :: a -> b
    x :: a

can be composed, yielding a new function of type:

    (f . g) :: a -> c

The nice thing is that this composition makes sense (and works) for all types a, b and c.

How does this relate to code reuse? Well, since our count function is polymorphic, it works equally well counting the length of a string, or the length of a list of strings. Our littler wc program is the epitome of the phrase: "higher order + polymorphic = reusable". That is, functions which take other functions as arguments, when combined with functions that work over any type, produce great reusable "glue". You only need vary the argument function to gain excellent code reuse (and the strong type checking ensures you can only reuse code in ways that work).


More on lines

Another little example, let's reverse each line of a file (like the unix rev command):

    main = interact (unlines . map reverse . lines)

Which when run, reverses the input lines:

   $ ./a.out < B.hs
   rahC.ataD tropmi
   ebyaM.ataD tropmi
   tsiL.ataD tropmi

So we take the input string, split it into lines, and the loop over that list of lines, reversing each of them, using the map function. Finally, once we've reversed each line, we join them back into a single string with unlines, and print it out.


The map function is a fundamental control structure of functional programming, similar to the foreach keyword in a number of imperative languages. map however is just a function on lists, not built-in syntax, and has the type:

    map :: (a -> b) -> [a] -> [b]

That is, it takes some function, and a list, and applies that function to each element of the list, returning a new list as a result. Since loops are so common in programming, we'll be using map a lot. Just for reference, map is implemented as:

    map _ []     = []
    map f (x:xs) = f x : map f xs


File I/O

Operating on stdin/stdout is good for scripts (and this is how tools like sed or perl -p work), but for "real" programs we'll at least need to do some file I/O. The basic operations of files are:

readFile  :: FilePath -> IO String
writeFile :: FilePath -> String -> IO ()

readFile takes a file name as an argument, does some I/O, and returns the file's contents as a string. writeFile takes a file name, a string, and does some I/O (writing that string to the file), before returning the void (or unit) value, ().


We could implement a basic cp program on files, as:

    import System.Environment

    main = do
        [f,g] <- getArgs
        s     <- readFile f
        writeFile g s

Running this program:

   $ ghc -O A.hs
   $ ./a.out A.hs Z.hs
   $ cat Z.hs
   import System.Environment
   main = do
       [f,g] <- getArgs
       s     <- readFile f
       writeFile g s

Since we're doing I/O (the type of readFile and writeFile enforce this), the code runs inside a do-block, using the monadic IO type. "Using the IO monad" just means that we wish to use an imperative, sequential order of evaluation. (As an aside, a wide range of other monads exist, for programming different program evaluation strategies, such as Prolog-style backtracking, or continuation-based evaluation. All of imperative programming is just one subset of possible evaluation strategies you can use in Haskell, via monads).


In do-notation, whenever you wish to run an action, for its side effect, and save the result to a variable, you would write:

    v <- action

For example, to run the readFile action, which has the side effect of reading a file from disk, we say:

    s <- readFile f

Finally, we can use the appendFile function to append to an existing file.


File Handles

The most generic interface to files is provided via handles. Sometimes we need to keep a file open, for multiple reads or writes. To do this we use the Handle type, an abstraction much like the underlying system's file handles.


To open up a file, and get its handle, we use:

    openFile :: FilePath -> IOMode -> IO Handle

So to open a file for reading only, in GHCi:

    Prelude System.IO> h <- openFile "A.hs" ReadMode
    {handle: A.hs}

Which returns a handle onto the file "A.hs". We can read a line from this handle:

    Prelude System.IO> hGetLine h
    "main = do"

To close a handle, and flush the buffer:

    hClose :: Handle -> IO ()

Once a handle is closed, we can no longer read from it:

    Prelude System.IO> hClose h
    Prelude System.IO> hGetLine h
    *** Exception: A.hs: hGetLine: illegal operation (handle is closed)

We can also flush explicitly with:

    hFlush :: Handle -> IO ()

Other useful operations for reading from handles:

    hGetChar     :: Handle -> IO Char
    hGetLine     :: Handle -> IO [Char]
    hGetContents :: Handle -> IO [Char]

To write to handles:

    hPutChar    :: Handle -> Char -> IO ()
    hPutStr     :: Handle -> [Char] -> IO ()
    hPutStrLn   :: Handle -> [Char] -> IO ()
    hPrint      :: Show a => Handle -> a -> IO ()

Some other useful actions:

    hSeek     :: Handle -> SeekMode -> Integer -> IO ()
    hTell     :: Handle -> IO Integer
    hFileSize :: Handle -> IO Integer
    hIsEOF    :: Handle -> IO Bool


An example: spell-checking

Here is a small example of combining the Data.Set and List data structures from yesterday's tutorial, with more I/O operations. We'll implement a little spell-checker, building the dictionary in a Set data type. First, some libraries to import:

    import System.Environment
    import Control.Monad
    import Data.Set

And the complete program:

    main = do
        [s] <- getArgs
        f   <- readFile "/usr/share/dict/words"
        g   <- readFile s
        let dict = fromList (lines f)
        mapM_ (spell dict) (words g)

    spell d w = when (w `notMember` d) (putStrLn w)

Running this program, on its own source, and it reports the following words are not found in the dictionary:

   $ ghc -O Spell.hs -o spell
   $ ./spell A.hs
   Data.Char
   =
   <-
   (map
   toUpper
   n)
   =
   <-
   getLine
   1


Writing the results out

If we wanted to write the results out to a temporary file, we can do so. Let's import a couple of other modules:

    import Data.Set
    import Data.Maybe
    import Text.Printf
    import System.IO
    import System.Environment
    import System.Posix.Temp

Refactoring the main definition to separate out the reading and writing phases into their own function, we end up with the core code:

    main = do
        (f, g) <- readFiles
        let dict = fromList (lines f)
            errs = mapMaybe (spell dict) (words g)
        write errs

    spell d w | w `notMember` d = Just w
              | otherwise       = Nothing

Where reading is now its own function:

    readFiles = do
        [s] <- getArgs
        f   <- readFile "/usr/share/dict/words"
        g   <- readFile s
        return (f,g)

And writing errors out to their own file:

    write errs = do
        (t,h) <- mkstemp "/tmp/spell.XXXXXX"
        mapM_ (hPutStrLn h) errs
        hClose h
        printf "%d spelling errors written to '%s'\n" (length errs) t

Pretty simple! Running this program:

   $ ghc --make -O Spell.hs -o myspell
   [1 of 1] Compiling Main             ( Spell.hs, Spell.o )
   Linking myspell ...
   $ ./myspell Spell.hs
   67 spelling errors written to '/tmp/spell.ia8256'


Extension: using SMP parallelism

Finally, just for some bonus fun ... and hold on to your hat 'cause I'm going to go fast ... we'll add some parallelism to the mix.


Haskell was designed from the start to support easy parallelisation, and since GHC 6.6, multithreaded code will run transparently on multicore systems using as many cores as you specify. Let's look at how we'd parallelise our little program to exploit multiple cores. We'll use an explicit threading model, via Control.Concurrent. You can also make your code implicitly parallel, using Control.Parallel.Strategies, but we'll leave that for another tutorial.


Here's the source, for you to ponder. First some imports:

    import Data.Set hiding (map)
    import Data.Maybe
    import Data.Char
    import Text.Printf
    import System.IO
    import System.Environment
    import Control.Concurrent
    import Control.Monad

The entry point, modified to break the word list into chunks, and then dispatching a chunk to each thread:

    main = do
        (f, g, n) <- readFiles
        let dict = fromList (lines f)
            work = chunk n (words g)
        run n dict work

    chunk :: Int -> [a] -> [[a]]
    chunk _ [] = []
    chunk n xs = (take n xs) : (chunk n (drop n xs))


The run function sets up a channel between the main thread and all children thread (errs), and prints spelling errors as they arrive on the channel from the children. It then forks off n children threads on each piece of the work list:

    run n dict work = do
        chan <- newChan
        errs <- getChanContents chan    -- errors returned back to main thread
        mapM_ (forkIO . thread chan dict) (zip [1..n] work)
        wait n errs 0


The main thread then just waits on all the threads to finish, printing any spelling errors they pass up:

    wait n xs i = when (i < n) $ case xs of
        Nothing : ys -> wait n ys $! i+1
        Just s  : ys -> putStrLn s >> wait n ys i

Each thread spell-checks its own piece of the work list. If it finds a spelling error, it passes the offending word back over the channel to the main thread.

    thread chan dict (me, xs) = do
        mapM_ spellit xs
        writeChan chan Nothing

     where
        spellit w = when (spell dict w) $
            writeChan chan . Just $ printf "Thread %d: %-25s" (me::Int) w

The spell function is simplified a little:

    spell d w = w `notMember` d

which we could also write as:

    spell = flip notMember

We modify the readFiles phase to take an additional numeric command line argument, specifying the number of threads to run:

    readFiles = do
        [s,n] <- getArgs
        f     <- readFile "/usr/share/dict/words"
        g     <- readFile s
        return (f,g, read n)

We compile this with the GHC SMP parallel runtime system:

   $ ghc -O --make -threaded Spell.hs -o spell

Now, we can run n worker threads (lightweight Haskell threads), mapped onto m OS threads. Since I'm using a 4-core server, we'll play around with 4 OS threads. First, running everything in a single thread:

   $ time ./spell test.txt 1 +RTS -N1
   ...
   Thread 1: week:
   Thread 1: IO!
   ./spell test.txt 1 +RTS -N1 99% cpu 2.533 total

Ok, now we change the command line flag to run it with 4 OS threads, to try to utilise all 4 cores:

   $ time ./spell 4 +RTS -N4
   ...
   Thread 2: week:
   Thread 3: IO!
   ./spell test.txt 4 +RTS -N4 111% cpu 2.335 total

Ok. Good... A little bit faster, uses a little bit more cpu. It turns out however the program is bound currently by the time spent in the main thread building the initial dictionary. Actual searching time is only some 10% of the running time. Nonetheless, it was fairly painless to break up the initial simple program into a parallel version.


If the program running time was extended (as the case for a server), the parallelism would be a win. Additionally, should we buy more cores for the server, all we need to is change the +RTS -N argument to the program, to start utilising these extra cores.


Next week

In upcoming tutorials we'll look more into implicitly parallel programs, and the use of the new high performance ByteString data type for string processing.