UnicodeByteString
Motivation
ByteString provides a faster and more memory-efficient data type than [Word8] for processing raw bytes. By creating a Unicode data type similar to ByteString that deals in units of characters instead of units of bytes, we can achieve similar performance improvements over String for text processing. A Unicode data type also removes the error-prone process of keeping track of the encoding of strings stored as raw bytes in ByteStrings. Functions such as length just work on a Unicode string, even though different encodings use different numbers of bytes to represent a character.
Specification
A new module, Text.Unicode, defines the efficient Unicode string data type:
data UnicodeString
Functions to encode and decode Unicode strings to and from ByteStrings are provided, together with Data.List-like functions.
data Encoding = Ascii | Utf8 | Utf16 | Iso88591
decode :: Encoding -> ByteString -> UnicodeString
encode :: Encoding -> UnicodeString -> ByteString
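The signatures above can be made concrete with a minimal sketch. It assumes a stand-in representation (a newtype around String, which the proposal does not specify and which is not the efficient one it aims for), covers only Iso88591, and introduces a hypothetical lengthU to stand for a Data.List-like function.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- Stand-in representation (an assumption): a plain list of characters.
newtype UnicodeString = UnicodeString String
  deriving (Eq, Show)

data Encoding = Ascii | Utf8 | Utf16 | Iso88591
  deriving (Eq, Show)

-- ISO 8859-1 maps each byte to the code point with the same value.
decode :: Encoding -> B.ByteString -> UnicodeString
decode Iso88591 bytes = UnicodeString (map (chr . fromIntegral) (B.unpack bytes))
decode enc _ = error ("decode: " ++ show enc ++ " is not covered by this sketch")

-- Characters above U+00FF have no ISO 8859-1 byte; this sketch writes '?'.
encode :: Encoding -> UnicodeString -> B.ByteString
encode Iso88591 (UnicodeString chars) = B.pack (map toByte chars)
  where
    toByte c
      | ord c <= 0xFF = fromIntegral (ord c)
      | otherwise = 0x3F -- the byte for '?'
encode enc _ = error ("encode: " ++ show enc ++ " is not covered by this sketch")

-- Hypothetical Data.List-like function: counts characters, not bytes.
lengthU :: UnicodeString -> Int
lengthU (UnicodeString chars) = length chars
```

For text that the target encoding can represent, encode enc (decode enc bytes) should give back the original bytes; a real implementation would need to state what happens otherwise, which is the subject of the next section.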
Error handling
When a ByteString is decoded using the wrong codec, several error handling strategies are possible:
- An exception is raised using error. This may be fine for many cases.
- Unknown byte sequences are replaced with some character (e.g. '?'). This is useful for debugging, etc., where some input/output is better than none.
- The decode function returns values of type Either CodecError UnicodeString, where CodecError contains some useful error information.
The final API should provide at least a few error handling strategies of different sophistication.
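As an illustration of how the three strategies could coexist, here is a minimal sketch for ASCII only, where every byte of 0x80 or above is invalid. The UnicodeString representation, the fields of CodecError, and all function names are assumptions made for this example, not part of the proposal.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr)
import Data.Word (Word8)

-- Stand-in representation (an assumption).
newtype UnicodeString = UnicodeString String
  deriving (Eq, Show)

-- Hypothetical error type: where decoding failed and on which byte.
data CodecError = InvalidByte
  { errorOffset :: Int
  , errorByte :: Word8
  } deriving (Eq, Show)

-- Strategy 3: report the first invalid byte as a value.
decodeAsciiEither :: B.ByteString -> Either CodecError UnicodeString
decodeAsciiEither bytes =
  UnicodeString <$> traverse step (zip [0 ..] (B.unpack bytes))
  where
    step (offset, byte)
      | byte < 0x80 = Right (chr (fromIntegral byte))
      | otherwise = Left (InvalidByte offset byte)

-- Strategy 2: replace every invalid byte with '?'.
decodeAsciiReplace :: B.ByteString -> UnicodeString
decodeAsciiReplace = UnicodeString . map step . B.unpack
  where
    step byte
      | byte < 0x80 = chr (fromIntegral byte)
      | otherwise = '?'

-- Strategy 1: raise an exception with error, built on top of strategy 3.
decodeAsciiOrError :: B.ByteString -> UnicodeString
decodeAsciiOrError = either (error . show) id . decodeAsciiEither
```

The exception-raising variant is a thin wrapper over the Either-returning one, which suggests that the most informative strategy can serve as the primitive and the simpler ones can be derived from it.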
One example in this design space for error handling is the iconv library (http://haskell.org/~duncan/iconv/). Its most general conversion function has the type:
  :: EncodingName  -- ^ Name of input string encoding
  -> EncodingName  -- ^ Name of output string encoding
  -> L.ByteString  -- ^ Input text
  -> [Span]

data Span =
    -- | An ordinary output span in the target encoding
    Span !S.ByteString
    -- | An error in the conversion process. If this occurs it will be the
    -- last span.
  | ConversionError !ConversionError
The other, simpler error handling strategies are then wrappers over this interface. One converts strictly and returns Either L.ByteString ConversionError; the other converts lazily and uses exceptions. There is also a fuzzy mode where conversion errors are ignored or transliterated, using similar replacement characters or '?'.
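To show how the strict and lazy strategies might be layered over a list of spans, here is a minimal sketch. It redefines Span locally, uses a hypothetical ConversionError stand-in, and the names collectStrictly and collectLazily are invented for this example; none of this is the iconv library's actual code.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- Hypothetical stand-in for the library's error type: the input offset
-- at which the conversion failed.
newtype ConversionError = ConversionFailedAt Int
  deriving (Eq, Show)

data Span
  = Span !S.ByteString
  | ConversionError !ConversionError

-- Strict wrapper: walks the whole span list before returning anything.
-- The result follows the order given in the text: Left carries the
-- converted text, Right carries the error.
collectStrictly :: [Span] -> Either L.ByteString ConversionError
collectStrictly = go []
  where
    go chunks [] = Left (L.fromChunks (reverse chunks))
    go chunks (Span chunk : rest) = go (chunk : chunks) rest
    go _ (ConversionError err : _) = Right err

-- Lazy wrapper: yields chunks as they are demanded and raises an
-- exception only if the consumer reaches the error span.
collectLazily :: [Span] -> L.ByteString
collectLazily = L.fromChunks . foldr step []
  where
    step (Span chunk) rest = chunk : rest
    step (ConversionError err) _ = error ("conversion failed: " ++ show err)
```

The trade-off is visible in the types: the strict wrapper can report failure as a value because it has seen all the input, while the lazy wrapper has to fall back on an exception because it hands out output before knowing whether a later span is an error.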
I/O
Several I/O functions that deal with UnicodeStrings might be needed. All text-based I/O should require an explicit encoding or use the default encoding (as set by the user's locale). Example:
readFile :: Encoding -> FilePath -> IO UnicodeString
Open Issues
API duplication
The Data.List API is already duplicated in large parts in Data.ByteString. It will be duplicated again here. Will keeping the APIs in sync be a huge pain in the future?
New I/O library
How many new I/O functions are needed? Would it be enough to use ByteString's I/O interface:
import qualified Data.ByteString as B

-- decode and encode are the functions proposed in the Specification section.
echo :: IO ()
echo = do
  content <- decode Utf8 <$> B.readFile "myfile.txt"
  B.putStrLn $ encode Utf8 content
Different representations
Should the encoding used to represent Unicode code points be included in the type?
data Encoding e => UnicodeString e
This might save some recoding compared with always using the same internal encoding for UnicodeString. UnicodeString must remain usable across different text processing libraries. Is this possible, or will each library end up specifying a particular value for Encoding e and thus make it harder to interact with that library?
This approach is used by CompactString: http://twan.home.fmf.nl/compact-string/
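To make the trade-off concrete, here is a minimal sketch of an encoding-indexed string type. The class methods toChars and fromChars, the tags Latin1 and Utf32BE, and the helpers rawBytes and recode are all hypothetical names chosen for this example; they are not taken from CompactString or from the proposal.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- The type parameter records which encoding the underlying bytes use.
newtype UnicodeString e = UnicodeString B.ByteString

-- Hypothetical encoding tags; they have no values and exist only as types.
data Latin1
data Utf32BE

class Encoding e where
  toChars :: UnicodeString e -> String
  fromChars :: String -> UnicodeString e

instance Encoding Latin1 where
  toChars (UnicodeString bytes) = map (chr . fromIntegral) (B.unpack bytes)
  fromChars = UnicodeString . B.pack . map toByte
    where
      -- Characters above U+00FF are written as '?' in this sketch.
      toByte c
        | ord c <= 0xFF = fromIntegral (ord c)
        | otherwise = 0x3F

instance Encoding Utf32BE where
  -- Each character is four big-endian bytes; trailing partial groups and
  -- validity checks are ignored in this sketch.
  toChars (UnicodeString bytes) = go (B.unpack bytes)
    where
      go (a : b : c : d : rest) =
        chr (foldl (\n w -> n * 256 + fromIntegral w) 0 [a, b, c, d]) : go rest
      go _ = []
  fromChars = UnicodeString . B.pack . concatMap charBytes
    where
      charBytes ch =
        [fromIntegral ((ord ch `div` (256 ^ k)) `mod` 256) | k <- [3, 2, 1, 0 :: Int]]

-- Access to the underlying bytes, e.g. for writing to a file.
rawBytes :: UnicodeString e -> B.ByteString
rawBytes (UnicodeString bytes) = bytes

-- Libraries that fix different encodings interoperate only through an
-- explicit conversion such as this one.
recode :: (Encoding a, Encoding b) => UnicodeString a -> UnicodeString b
recode = fromChars . toChars
```

A library that exposes UnicodeString Latin1 in its API forces callers holding a UnicodeString Utf32BE to call recode at the boundary, which is the interoperability cost raised above; in exchange, data that stays in one encoding never has to be converted.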
References
Python 3000 will see an overhaul of its Unicode approach, including a new bytes type, a merge of str and unicode, and a new I/O library. This proposal takes many ideas from that overhaul. The relevant PEPs are:
- http://www.python.org/dev/peps/pep-0358/ - PEP 358 -- The "bytes" Object
- http://python.org/dev/peps/pep-3116/ - PEP 3116 -- New I/O
- http://www.python.org/dev/peps/pep-3137/ - PEP 3137 -- Immutable Bytes and Mutable Buffer