Difference between revisions of "UTF-8"
Latest revision as of 15:20, 6 February 2021
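The simplest solution seems to be to use the utf8-string package from Galois (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string). It provides System.IO.UTF8, a drop-in replacement for the String I/O functions of System.IO.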
What about other string encodings?
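One possible answer, sketched under assumptions that are not part of the text above: in GHC's base library (6.12 and later) a Handle can be given an explicit TextEncoding with System.IO.hSetEncoding, so encodings such as Latin-1 or UTF-16 can be handled without an extra package. The helper names readFileWith, writeFileWith and transcodeFile are hypothetical and chosen only for this illustration.

```haskell
import System.IO

-- Hypothetical helper: read a whole file as a String, decoding its bytes
-- with the given encoding (for example latin1, utf8 or utf16).
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- hGetContents is lazy: force the whole String before withFile
    -- closes the handle.
    length s `seq` return s

-- Hypothetical helper: write a String to a file, encoding it with the
-- given encoding.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h s

-- Convert a file from one encoding to another.
transcodeFile :: TextEncoding -> TextEncoding -> FilePath -> FilePath -> IO ()
transcodeFile from to src dst =
  readFileWith from src >>= writeFileWith to dst
```

For instance, transcodeFile latin1 utf8 "in.txt" "out.txt" would rewrite a Latin-1 file as UTF-8. System.IO also offers mkTextEncoding, which looks up an encoding by name, subject to what the platform supports.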
Example
If we use a function from System.IO.UTF8, we should also hide the function of the same name from the Prelude, so that the two do not clash. (Alternatively, we could import the UTF8 module qualified.)
> import System.IO.UTF8
> import Prelude hiding (readFile, writeFile)
> import System.Environment (getArgs)
The readFile and writeFile functions are used exactly like their Prelude counterparts; only the encoding differs.
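> -- Reverse every line of each file named on the command line.
> main :: IO ()
> main =
>   do args <- getArgs
>      mapM_ reverseUTF8File args
>
> -- Read f as UTF-8, reverse each line, and write the result to f ++ ".rev".
> reverseUTF8File :: FilePath -> IO ()
> reverseUTF8File f =
>   do c <- readFile f
>      writeFile (f ++ ".rev") $ reverseLines c
>   where
>     reverseLines = unlines . map reverse . lines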
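For comparison, here is a minimal sketch of the same line-reversing step written against base alone, assuming GHC 6.12 or later, where System.IO exports hSetEncoding and utf8. This is a swapped-in technique, not the utf8-string approach shown above; the explicit forcing of the input is an assumption made to keep lazy I/O from outliving the handle.

```haskell
import System.IO

-- Reverse the characters of every line.
reverseLines :: String -> String
reverseLines = unlines . map reverse . lines

-- Variant of reverseUTF8File that sets the encoding on each handle
-- explicitly instead of importing System.IO.UTF8.
reverseUTF8File :: FilePath -> IO ()
reverseUTF8File f = do
  c <- withFile f ReadMode $ \h -> do
    hSetEncoding h utf8
    s <- hGetContents h
    -- Force the lazy String before the handle is closed.
    length s `seq` return s
  withFile (f ++ ".rev") WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h (reverseLines c)
```

It can be driven from the command line as before, for example with getArgs >>= mapM_ reverseUTF8File. Because reverse operates on decoded characters rather than bytes, multi-byte characters stay intact in the output.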