Difference between revisions of "UTF-8"

Revision as of 07:49, 3 February 2007

A small example showing how to read and write UTF-8 in Haskell.

Do whatever you want; it's going in the public domain (Eric Kow on 2007-02-02 says so, anyway)

> module Main where

> import Control.Monad (mapM_)
> import Data.Word (Word8)
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
> import System.Environment (getArgs)
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile,
>                   IOMode(ReadMode, WriteMode))

We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ). Note that Pugs also has a UTF-8 library of its own, which if I believe to handle ByteStrings, but I'm sticking with this one because it's what I know.

> import UTF8

We perform the demonstration on a list of files, specified as command line arguments. What we want to show is that we can both read and write UTF-8, so the demonstration will be of reading a file in, reverse every one of its lines, and writing it back out with the extension '.reversed'

> main :: IO ()
> main =
>  do args <- getArgs
>     mapM_ reverseUTF8File args

> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do fb <- readFileBytes f
>      case decode fb of
>        (cs, []) -> writeFileBytes (f ++ ".reverse") $ encode $ reverseLines cs
>        (_,  xs) -> fail $ show xs
>   where
>     reverseLines = unlines . map reverse . lines

For this to work, we need to have some helper functions for reading and writing [Word8]. It would be nice is if there were some standard functions for reading and writing [Word8] in files.

> readFileBytes :: FilePath -> IO [Word8]
> readFileBytes f =
>   do h <- openBinaryFile f ReadMode
>      hsize <- fromIntegral `fmap` hFileSize h
>      hGetBytes h hsize
>
> writeFileBytes :: FilePath -> [Word8] -> IO ()
> writeFileBytes f ws =
>  do h <- openBinaryFile f WriteMode
>     hPutBytes h (length ws) ws

> hGetBytes :: Handle -> Int -> IO [Word8]
> hGetBytes h c = allocaArray c $ \p ->
>                   do c' <- hGetBuf h p c
>                      peekArray c' p
>
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
> hPutBytes h c ws = allocaArray c $ \p ->
>                      do pokeArray p ws
>                         hPutBuf h p c

@@ Line 15: / Line 15: @@
 </haskell>
-We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg.  It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ).  Note that Pugs also has a UTF-8 library of its own, which if I believe to handle ByteStrings.
+We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg.  It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ).  Note that Pugs also has a UTF-8 library of its own, which if I believe to handle ByteStrings, but I'm sticking with this one because it's what I know.
 <haskell>

Difference between revisions of "UTF-8"

Revision as of 07:49, 3 February 2007

Navigation menu

Search