UnicodeByteString

''Page history: revisions by JohanTibell (talk | contribs); deleted by Tomjaguarpaw (talk | contribs) as a page that hadn't been updated for over 10 years.''
== Motivation ==

<hask>ByteString</hask> provides a faster and more memory-efficient data type than <hask>[Word8]</hask> for processing raw bytes. By creating a Unicode data type, similar to <hask>ByteString</hask>, that deals in units of characters instead of units of bytes, we can achieve similar performance improvements over <hask>String</hask> for text processing. A Unicode data type also removes the error-prone process of keeping track of strings encoded as raw bytes stored in <hask>ByteString</hask>s. A function such as <hask>length</hask> on a Unicode string just works, even though different encodings use different numbers of bytes to represent a character.
== Specification ==

A new module, <hask>Text.Unicode</hask>, defines the efficient Unicode string data type:

<haskell>data UnicodeString</haskell>

Functions to encode and decode Unicode strings to and from <hask>ByteString</hask>s are provided, together with <hask>Data.List</hask>-like functions.
<haskell>
data Encoding = Ascii | Utf8 | Utf16 | Iso88591

decode :: Encoding -> ByteString -> UnicodeString
encode :: Encoding -> UnicodeString -> ByteString
</haskell>
=== Error handling ===

When a <hask>ByteString</hask> is decoded using the wrong codec, several error handling strategies are possible:

* An exception is raised using <hask>error</hask>. This may be fine in many cases.
* Unknown byte sequences are replaced with some character (e.g. <hask>'?'</hask>). This is useful for debugging and other situations where some input/output is better than none.
* The decode function returns values of type <hask>Either CodecError UnicodeString</hask>, where <hask>CodecError</hask> contains some useful error information.

The final API should provide at least a few error handling strategies of different sophistication.
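The latter two strategies could be exposed as separate entry points alongside <hask>decode</hask>. The following signatures are illustrative, not part of the proposal:

<haskell>
-- Error information for the Either-based strategy.
data CodecError = CodecError
    { errorOffset  :: Int     -- byte offset of the offending sequence
    , errorMessage :: String
    }

-- Total variant of decode: never calls error.
decodeEither :: Encoding -> ByteString -> Either CodecError UnicodeString

-- Lossy variant: replaces undecodable sequences with the given character.
decodeWith :: Char -> Encoding -> ByteString -> UnicodeString
</haskell>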
− | |||
− | One example in this design space for error handling is this iconv library: |
||
− | http://haskell.org/~duncan/iconv/ |
||
− | It provides a most general conversion function with type: |
||
<haskell>
   :: EncodingName   -- ^ Name of input string encoding
   -> EncodingName   -- ^ Name of output string encoding
   -> L.ByteString   -- ^ Input text
   -> [Span]

data Span =
    -- | An ordinary output span in the target encoding
    Span !S.ByteString
    -- | An error in the conversion process. If this occurs it will be the
    -- last span.
  | ConversionError !ConversionError
</haskell>
− | |||
− | Then the other simpler error handling strategies are wrappers over this interface. One converts strictly and returns Either L.ByteString ConversionError, the other converts lazily and uses exceptions. There is also a fuzzy mode where conversion errors are ignored or transliterated, using similar replacement characters or <hask>'?'</hask>. |
||
− | |||
=== I/O ===

Several I/O functions that deal with <hask>UnicodeString</hask>s might be needed. All text-based I/O should require an explicit encoding or use the default encoding (as set by the user's locale). Example:

<haskell>
readFile :: Encoding -> FilePath -> IO UnicodeString
</haskell>
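By analogy, a writer and a locale-default reader could round out the interface. These names are illustrative, not part of the proposal:

<haskell>
writeFile :: Encoding -> FilePath -> UnicodeString -> IO ()

-- Uses the default encoding from the user's locale.
readFileLocale :: FilePath -> IO UnicodeString
</haskell>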
− | |||
− | == Open Issues == |
||
− | |||
− | === API duplication === |
||
− | |||
− | The <hask>Data.List</hask> API is already duplicated in large parts in <hask>Data.ByteString</hask>. It will be duplicated again here. Will keeping the APIs in sync be a huge pain in the future? |
||
− | |||
=== New I/O library ===

How many new I/O functions are needed? Would it be enough to use <hask>ByteString</hask>'s I/O interface?

<haskell>
import Control.Applicative ((<$>))
import qualified Data.ByteString as B

echo = do
    content <- decode Utf8 <$> B.readFile "myfile.txt"
    B.putStrLn (encode Utf8 content)
</haskell>
=== Different representations ===

Should the encoding used to represent Unicode code points be included in the type?

<haskell>
data Encoding e => UnicodeString e
</haskell>

This might save some recoding, as opposed to always using the same internal encoding for <hask>UnicodeString</hask>. It is necessary that <hask>UnicodeString</hask> values can be passed between different text-processing libraries. Is this possible, or will each library end up specifying a particular value for <hask>Encoding e</hask> and thus make it harder to interact with that library?

This approach is used by <hask>CompactString</hask>:

http://twan.home.fmf.nl/compact-string/
== References ==

Python 3000 will see an overhaul of its Unicode approach, including a new <code>bytes</code> type, a merge of <code>str</code> and <code>unicode</code>, and a new I/O library. This proposal takes many ideas from that overhaul. The relevant PEPs are:

# http://www.python.org/dev/peps/pep-3116/ - PEP 3116 -- New I/O
# http://www.python.org/dev/peps/pep-0358/ - PEP 358 -- The "bytes" Object
# http://www.python.org/dev/peps/pep-3137/ - PEP 3137 -- Immutable Bytes and Mutable Buffer