== Motivation ==

<hask>ByteString</hask> provides a faster and more memory-efficient data type than <hask>[Word8]</hask> for processing raw bytes. By creating a Unicode data type similar to <hask>ByteString</hask>, one that deals in units of characters instead of units of bytes, we can achieve similar performance improvements over <hask>String</hask> for text processing. A Unicode data type also removes the error-prone process of keeping track of strings encoded as raw bytes stored in <hask>ByteString</hask>s. Functions such as <hask>length</hask> just work on a Unicode string, even though different encodings use different numbers of bytes to represent a character.
== Specification ==

A new module, <hask>Text.Unicode</hask>, defines the efficient Unicode string data type:

<haskell>data UnicodeString</haskell>

Functions to encode and decode Unicode strings to and from <hask>ByteString</hask>s are provided, together with <hask>Data.List</hask>-like functions:

<haskell>
data Encoding = Ascii | Utf8 | Utf16 | Iso88591

decode :: Encoding -> ByteString -> UnicodeString
encode :: Encoding -> UnicodeString -> ByteString
</haskell>
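As a rough illustration of how this API might be used, here is a sketch of a UTF-8 round trip. The conversion function <hask>fromString</hask> is a hypothetical helper, not part of the proposal; it merely stands in for whatever way a <hask>UnicodeString</hask> would be built from a <hask>String</hask>.

<haskell>
import Data.ByteString (ByteString)
import Text.Unicode (UnicodeString, Encoding(..), encode, decode, fromString)

-- Encode a short piece of text as UTF-8 bytes...
greetingBytes :: ByteString
greetingBytes = encode Utf8 (fromString "héllo")

-- ...and decode it again, recovering the original characters.
greeting :: UnicodeString
greeting = decode Utf8 greetingBytes
</haskell>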
=== Error handling ===

When a <hask>ByteString</hask> is decoded using the wrong codec, several error handling strategies are possible:

* The program exits using <hask>error</hask>. This may be fine for script-like programs.
* Unknown byte sequences are replaced with some character (e.g. <hask>'?'</hask>). This is useful for debugging and other situations where some input/output is better than none.
* An exception is raised.
* The decode function returns values of type <hask>Either CodecError UnicodeString</hask>, where <hask>CodecError</hask> contains some useful error information. A sketch of this strategy is given below.

The final API should provide at least a few error handling strategies of varying sophistication.
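The following is a minimal sketch of the <hask>Either</hask>-based strategy. The fields of <hask>CodecError</hask> and the name <hask>decodeEither</hask> are only illustrative assumptions; the proposal does not fix them.

<haskell>
import Data.ByteString (ByteString)
import Text.Unicode (UnicodeString, Encoding(..))

-- Illustrative only: these names are assumptions, not part of the
-- proposed API.
data CodecError = CodecError
  { codecUsed  :: Encoding  -- the codec the caller asked for
  , byteOffset :: Int       -- position of the offending byte sequence
  }

decodeEither :: Encoding -> ByteString -> Either CodecError UnicodeString
decodeEither _enc _bytes = undefined  -- implementation left open by the proposal

-- A caller can then recover, report, or rethrow as it sees fit:
decodeOrDie :: ByteString -> UnicodeString
decodeOrDie bytes =
  case decodeEither Utf8 bytes of
    Left err  -> error ("invalid UTF-8 at byte " ++ show (byteOffset err))
    Right txt -> txt
</haskell>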
=== I/O ===

Several I/O functions that deal with <hask>UnicodeString</hask>s might be needed. All text-based I/O should require an explicit encoding or use the default encoding (as set by the user's locale). Example:

<haskell>
readFile :: Encoding -> FilePath -> IO UnicodeString
</haskell>
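A usage sketch, assuming <hask>Text.Unicode</hask> also exports a matching <hask>writeFile</hask> (an assumption, not something the proposal specifies): transcoding a file from one encoding to another becomes a matter of naming the encodings explicitly.

<haskell>
import Prelude hiding (readFile, writeFile)
import Text.Unicode (Encoding(..), readFile, writeFile)

-- Transcode a file from ISO-8859-1 to UTF-8; 'writeFile' is assumed
-- to exist alongside 'readFile' with an analogous signature.
transcode :: FilePath -> FilePath -> IO ()
transcode src dst = do
  contents <- readFile Iso88591 src
  writeFile Utf8 dst contents
</haskell>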
== Open Issues ==

=== API duplication ===

The <hask>Data.List</hask> API is already duplicated in large part in <hask>Data.ByteString</hask>. It will be duplicated again here. Will keeping the APIs in sync be a huge pain in the future?
=== New I/O library ===

How many new I/O functions are needed? Would it be enough to use <hask>ByteString</hask>'s I/O interface?

<haskell>
import Control.Applicative ((<$>))
import qualified Data.ByteString as B
import Text.Unicode (Encoding(..), decode, encode)

-- Read a UTF-8 encoded file and echo it back out, again UTF-8 encoded.
echo :: IO ()
echo = do
  content <- decode Utf8 <$> B.readFile "myfile.txt"
  B.putStrLn $ encode Utf8 content
</haskell>
=== Different representations ===

Should the encoding used to represent Unicode code points be included in the type?

<haskell>
data Encoding e => UnicodeString e
</haskell>

This might save some recoding, as opposed to always using a single internal encoding for <hask>UnicodeString</hask>. It is necessary that <hask>UnicodeString</hask>s can be passed between different text processing libraries. Is this possible, or will every library end up fixing a particular choice of <hask>e</hask> and thus make it harder to interact with that library?
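One way such an encoding-indexed type could look is sketched below, using a phantom type parameter and a class of encoding tags. All of the names here (<hask>Utf8E</hask>, <hask>Utf16E</hask>, <hask>EncodingOf</hask>) are made up for illustration and are not part of the proposal.

<haskell>
{-# LANGUAGE EmptyDataDecls #-}
import Data.ByteString (ByteString)
import Text.Unicode (Encoding(..))

-- Hypothetical encoding tags; purely illustrative.
data Utf8E
data Utf16E

-- The payload is the raw bytes, interpreted according to the tag 'e'.
newtype UnicodeString e = US ByteString

-- Relates an encoding tag to its runtime value.
class EncodingOf e where
  encodingOf :: UnicodeString e -> Encoding

instance EncodingOf Utf8E  where encodingOf _ = Utf8
instance EncodingOf Utf16E where encodingOf _ = Utf16
</haskell>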
== References ==

Python 3000 will see an overhaul of its Unicode approach, including a new <code>bytes</code> type, a merge of <code>str</code> and <code>unicode</code>, and a new I/O library. This proposal takes many ideas from that overhaul. The relevant PEPs are:

# http://www.python.org/dev/peps/pep-0358/ - PEP 358 -- The "bytes" Object
# http://python.org/dev/peps/pep-3116/ - PEP 3116 -- New I/O