Difference between revisions of "UTF-8"
(+cat) |
(rm irritating self-consciousness (I still don't know, but stop saying it all the time)) |
||
Line 1: | Line 1: | ||
[[Category:Code]] |
[[Category:Code]] |
||
− | A small example showing how to |
+ | A small example showing how to read and write UTF-8 in Haskell. |
Do whatever you want; it's going in the public domain (Eric Kow on 2007-02-02 says so, anyway) |
Do whatever you want; it's going in the public domain (Eric Kow on 2007-02-02 says so, anyway) |
||
− | |||
− | Note that I don't really know what the best practices are wrt to reading |
||
− | and writing UTF-8, but here's what works for me. |
||
<haskell> |
<haskell> |
||
Line 18: | Line 15: | ||
</haskell> |
</haskell> |
||
− | We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. |
+ | We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ). Note that Pugs also has a UTF-8 library of its own, which if I understand, handles ByteStrings. |
− | I don't know if this is the best one, but it's what darcs uses. |
||
− | See http://abridgegame.org/repos/darcs/UTF8.lhs |
||
<haskell> |
<haskell> |
||
Line 46: | Line 41: | ||
For this to work, we need to have some helper functions for reading and |
For this to work, we need to have some helper functions for reading and |
||
− | writing [Word8]. |
+ | writing [Word8]. It would be nice is if there were some standard functions for reading and writing [Word8] in files. |
<haskell> |
<haskell> |
Revision as of 07:45, 3 February 2007
A small example showing how to read and write UTF-8 in Haskell.
Do whatever you want; it's going in the public domain (Eric Kow on 2007-02-02 says so, anyway)
> module Main where
> import Control.Monad (mapM_)
> import Data.Word (Word8)
> import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
> import System.Environment (getArgs)
> import System.IO (hFileSize, Handle, hGetBuf, hPutBuf, openBinaryFile,
> IOMode(ReadMode, WriteMode))
We're going to be using the 2002 UTF-8 implementation by Sven Moritz Hallberg. It happens to be the one that darcs uses ( http://abridgegame.org/repos/darcs/UTF8.lhs ). Note that Pugs also has a UTF-8 library of its own, which if I understand, handles ByteStrings.
> import UTF8
We perform the demonstration on a list of files, specified as command line arguments. What we want to show is that we can both read and write UTF-8, so the demonstration will be of reading a file in, reverse every one of its lines, and writing it back out with the extension '.reversed'
> main :: IO ()
> main =
> do args <- getArgs
> mapM_ reverseUTF8File args
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
> do fb <- readFileBytes f
> case decode fb of
> (cs, []) -> writeFileBytes (f ++ ".reverse") $ encode $ reverseLines cs
> (_, xs) -> fail $ show xs
> where
> reverseLines = unlines . map reverse . lines
For this to work, we need to have some helper functions for reading and writing [Word8]. It would be nice is if there were some standard functions for reading and writing [Word8] in files.
> readFileBytes :: FilePath -> IO [Word8]
> readFileBytes f =
> do h <- openBinaryFile f ReadMode
> hsize <- fromIntegral `fmap` hFileSize h
> hGetBytes h hsize
>
> writeFileBytes :: FilePath -> [Word8] -> IO ()
> writeFileBytes f ws =
> do h <- openBinaryFile f WriteMode
> hPutBytes h (length ws) ws
> hGetBytes :: Handle -> Int -> IO [Word8]
> hGetBytes h c = allocaArray c $ \p ->
> do c' <- hGetBuf h p c
> peekArray c' p
>
> hPutBytes :: Handle -> Int -> [Word8] -> IO ()
> hPutBytes h c ws = allocaArray c $ \p ->
> do pokeArray p ws
> hPutBuf h p c