UnicodeByteString
Motivation
ByteString provides a faster and more memory-efficient data type than [Word8] for processing raw bytes. By creating a Unicode data type similar to ByteString that deals in units of characters instead of units of bytes, we can achieve similar performance improvements over String for text processing. A Unicode data type also removes the error-prone process of keeping track of the encoding of strings stored as raw bytes in ByteStrings. Functions such as length just work on a Unicode string, even though different encodings use different numbers of bytes to represent a character.
Specification
A new module, Text.Unicode, defines the efficient Unicode string data type:
data UnicodeString
Functions to encode and decode Unicode strings to and from ByteStrings are provided, together with Data.List-like functions.
data Encoding = Ascii | Utf8 | Utf16 | Iso88591
decode :: Encoding -> ByteString -> UnicodeString
encode :: Encoding -> UnicodeString -> ByteString
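As a rough illustration of how this interface might be used, the sketch below backs UnicodeString with Data.Text from the text package. That backing, the ulength helper, the reading of Utf16 as little-endian without a byte order mark, and the '?' substitution when encoding to a narrower character set are all assumptions made for the example, not part of the proposal.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Assumption: Data.Text stands in for the proposed efficient representation.
newtype UnicodeString = UnicodeString T.Text
  deriving (Eq, Show)

data Encoding = Ascii | Utf8 | Utf16 | Iso88591
  deriving (Eq, Show)

-- Decode raw bytes into characters.
-- Assumptions: Ascii is decoded as its Latin-1 superset without validation,
-- and Utf16 means little-endian without a byte order mark.
-- The Utf8 and Utf16 cases throw an exception on malformed input.
decode :: Encoding -> B.ByteString -> UnicodeString
decode enc = UnicodeString . go enc
  where
    go Ascii    = TE.decodeLatin1
    go Iso88591 = TE.decodeLatin1
    go Utf8     = TE.decodeUtf8
    go Utf16    = TE.decodeUtf16LE

-- Encode characters back into raw bytes.
encode :: Encoding -> UnicodeString -> B.ByteString
encode enc (UnicodeString t) =
  case enc of
    Utf8     -> TE.encodeUtf8 t
    Utf16    -> TE.encodeUtf16LE t
    Ascii    -> narrow 0x7F
    Iso88591 -> narrow 0xFF
  where
    -- Assumption: characters above the limit are replaced by '?' (0x3F).
    narrow limit = B.pack (map (toByte limit) (T.unpack t))
    toByte limit c
      | fromEnum c <= limit = fromIntegral (fromEnum c)
      | otherwise           = 0x3F

-- Hypothetical helper: length in characters, independent of any encoding.
ulength :: UnicodeString -> Int
ulength (UnicodeString t) = T.length t
```

With these definitions, decoding the two bytes 0xC3 0xA9 as Utf8 yields a string whose ulength is 1, and encoding that string as Iso88591 produces the single byte 0xE9.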
Error handling
When a ByteString is decoded using the wrong codec, several error handling strategies are possible:
- The program exits using error. This may be fine for script-like programs.
- Unknown byte sequences are replaced with some character (e.g. '?'). This is useful for debugging, etc., where some input/output is better than none.
- An exception is raised.
- The decode function returns values of type Either CodecError UnicodeString, where CodecError contains some useful error information.
The final API should provide at least a few error handling strategies of different sophistication.
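The four strategies listed above can be sketched with functions that already exist in the text package, restricted here to UTF-8 input. The wrapper names (decodeOrDie, decodeReplacing, decodeThrowing, decodeEither) are hypothetical, and using Data.Text for UnicodeString and UnicodeException for CodecError is an assumption of this example rather than something the proposal specifies.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException, replace)

-- Assumption: the text package's exception type stands in for CodecError.
type CodecError = UnicodeException

-- Strategy 1: abort with `error` on malformed input.
decodeOrDie :: B.ByteString -> T.Text
decodeOrDie = either (error . show) id . TE.decodeUtf8'

-- Strategy 2: substitute '?' for input that cannot be decoded.
decodeReplacing :: B.ByteString -> T.Text
decodeReplacing = TE.decodeUtf8With (replace '?')

-- Strategy 3: throw a UnicodeException, which callers can catch in IO.
decodeThrowing :: B.ByteString -> T.Text
decodeThrowing = TE.decodeUtf8

-- Strategy 4: report the failure in the result type.
decodeEither :: B.ByteString -> Either CodecError T.Text
decodeEither = TE.decodeUtf8'
```

One possible design is to expose only the Either-returning variant as the primitive and derive the others from it, as decodeOrDie does here.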
I/O
Several I/O functions that deal with UnicodeStrings might be needed. All text-based I/O should require an explicit encoding or use the default encoding (as set by the user's locale). Example:
readFile :: Encoding -> FilePath -> IO UnicodeString
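The "explicit encoding or locale default" rule can be approximated today with the handle encodings in base's System.IO. The sketch below is only an analogy: it returns a plain String rather than the proposed UnicodeString, and both the name readFileWith and the Maybe parameter are hypothetical choices for this example.

```haskell
import Control.Exception (evaluate)
import Data.Maybe (fromMaybe)
import System.IO

-- Read a whole text file, decoding with the given encoding,
-- or with the locale's default encoding when none is given.
readFileWith :: Maybe TextEncoding -> FilePath -> IO String
readFileWith mEnc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h (fromMaybe localeEncoding mEnc)
    contents <- hGetContents h
    -- Force the whole string before withFile closes the handle.
    _ <- evaluate (length contents)
    return contents
```

A call such as readFileWith (Just utf8) "myfile.txt" pins the encoding, while readFileWith Nothing "myfile.txt" defers to the user's locale.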
Open Issues
API duplication
The Data.List API is already duplicated in large parts in Data.ByteString. It will be duplicated again here. Will keeping the APIs in sync be a huge pain in the future?
New I/O library
How many new I/O functions are needed? Would it be enough to use ByteString's I/O interface:
import qualified Data.ByteString as B

echo :: IO ()
echo = do
  content <- decode Utf8 <$> B.readFile "myfile.txt"
  B.putStrLn $ encode Utf8 content
Different representations
Should the encoding used to represent Unicode code points be included in the type?
data Encoding e => UnicodeString e
This might save some recoding as opposed to always using the same internal encoding for UnicodeString. It's necessary that UnicodeString can be used between different text processing libraries. Is this possible, or will any library end up specifying a particular value for Encoding e and thus make it harder to interact with that library?
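To make the trade-off concrete, here is a minimal sketch of an encoding-indexed string using a phantom type parameter. The tag types Utf8 and Utf16LE, the class methods toText and fromText, the recode and rawBytes helpers, and the use of Data.Text as the common intermediate form are all assumptions of this example, not part of the proposal.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The phantom parameter e records which encoding the stored bytes use.
newtype UnicodeString e = UnicodeString B.ByteString
  deriving (Eq, Show)

-- Hypothetical tag types; they have no values.
data Utf8
data Utf16LE

-- Each encoding knows how to convert to and from a common form.
class Encoding e where
  toText   :: UnicodeString e -> T.Text
  fromText :: T.Text -> UnicodeString e

instance Encoding Utf8 where
  toText (UnicodeString b) = TE.decodeUtf8 b
  fromText = UnicodeString . TE.encodeUtf8

instance Encoding Utf16LE where
  toText (UnicodeString b) = TE.decodeUtf16LE b
  fromText = UnicodeString . TE.encodeUtf16LE

-- Convert between representations; this is the cost paid at library
-- boundaries when two libraries fix different values of e.
recode :: (Encoding a, Encoding b) => UnicodeString a -> UnicodeString b
recode = fromText . toText

-- Expose the underlying bytes, for inspection only.
rawBytes :: UnicodeString e -> B.ByteString
rawBytes (UnicodeString b) = b
```

Under this design, a library written against UnicodeString Utf16LE forces a caller holding a UnicodeString Utf8 to call recode first, whereas a library that stays polymorphic in e avoids the conversion.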
References
Python 3000 will see an overhaul of its Unicode approach, including a new bytes type, a merge of str and unicode, and a new I/O library. This proposal takes many ideas from that overhaul. The relevant PEPs are:
- http://www.python.org/dev/peps/pep-0358/ - PEP 358 -- The "bytes" Object
- http://python.org/dev/peps/pep-3116/ - PEP 3116 -- New I/O