Difference between revisions of "UnicodeByteString"

@@ Line 1: / Line 1: @@
-== Motivation ==
-<hask>ByteString</hask> provides a faster and more memory-efficient data type than <hask>[Word8]</hask> for processing raw bytes.  A Unicode data type modeled on <hask>ByteString</hask>, but operating on characters instead of bytes, could achieve similar performance improvements over <hask>String</hask> for text processing.  It would also remove the error-prone need to keep track of which encoding was used for text stored as raw bytes in a <hask>ByteString</hask>.  Functions such as <hask>length</hask> would count characters, regardless of how many bytes a given encoding uses to represent each one.
-== Specification ==
-A new module, <hask>Text.Unicode</hask>, defines the efficient Unicode string data type:
-<haskell>data UnicodeString</haskell>
-Functions to encode and decode Unicode strings to and from <hask>ByteString</hask>s are provided, together with <hask>Data.List</hask>-like functions.
-<haskell>
-data Encoding = Ascii | Utf8 | Utf16 | Iso88591
-decode :: Encoding -> ByteString -> UnicodeString
-encode :: Encoding -> UnicodeString -> ByteString
-</haskell>
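As a rough illustration of how <hask>decode</hask> and <hask>encode</hask> could behave, here is a minimal sketch covering only the two single-byte encodings. The <hask>String</hask>-backed representation of <hask>UnicodeString</hask> and the choice of replacement characters are assumptions made for this example; the proposal leaves both open.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- Hypothetical stand-in representation; the proposal keeps the type abstract.
newtype UnicodeString = UnicodeString String deriving (Eq, Show)

-- Only the single-byte encodings are covered in this sketch.
data Encoding = Ascii | Iso88591 deriving (Eq, Show)

decode :: Encoding -> B.ByteString -> UnicodeString
-- Every ISO 8859-1 byte maps to the code point with the same number.
decode Iso88591 = UnicodeString . map (chr . fromIntegral) . B.unpack
-- Assumption: non-ASCII bytes become U+FFFD instead of raising an error.
decode Ascii = UnicodeString . map toChar . B.unpack
  where
    toChar w
      | w < 0x80 = chr (fromIntegral w)
      | otherwise = '\xFFFD'

encode :: Encoding -> UnicodeString -> B.ByteString
encode enc (UnicodeString s) = B.pack (map toByte s)
  where
    -- First code point that the target encoding cannot represent.
    limit = case enc of { Ascii -> 0x80; Iso88591 -> 0x100 }
    -- Assumption: unrepresentable characters are written as '?' (0x3F).
    toByte c
      | ord c < limit = fromIntegral (ord c)
      | otherwise = 0x3F
```

With these definitions, decoding ISO 8859-1 bytes and re-encoding them as ASCII replaces each non-ASCII character with <hask>'?'</hask>.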
-=== Error handling ===
-When a <hask>ByteString</hask> is decoded using the wrong codec, several error handling strategies are possible:
-* An exception is raised using <hask>error</hask>.  This may be fine for many cases.
-* Unknown byte sequences are replaced with some character (e.g. <hask>'?'</hask>).  This is useful for debugging, etc. where some input/output is better than none.
-* The decode function returns values of type <hask>Either CodecError UnicodeString</hask> where <hask>CodecError</hask> contains some useful error information.
-The final API should provide at least a few error handling strategies of different sophistication.
-One example in this design space for error handling is this iconv library:
-http://haskell.org/~duncan/iconv/
-It provides a most general conversion function with type:
-<haskell>
- :: EncodingName  -- ^ Name of input string encoding
- -> EncodingName  -- ^ Name of output string encoding
- -> L.ByteString  -- ^ Input text
- -> [Span]
-data Span =
-    -- | An ordinary output span in the target encoding
-    Span !S.ByteString
-    -- | An error in the conversion process. If this occurs it will be the
-    -- last span.
-  | ConversionError !ConversionError
-</haskell>
-The simpler error handling strategies are then wrappers over this interface. One converts strictly and returns <hask>Either L.ByteString ConversionError</hask>; the other converts lazily and uses exceptions. There is also a fuzzy mode where conversion errors are ignored or transliterated, using similar replacement characters or <hask>'?'</hask>.
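The three strategies listed earlier can be layered in the same way, with the <hask>Either</hask>-returning decoder as the base. The sketch below does this for ASCII only. The names <hask>decodeAsciiEither</hask>, <hask>decodeAsciiOrError</hask> and <hask>decodeAsciiLenient</hask>, the shape of <hask>CodecError</hask>, and the use of <hask>String</hask> in place of <hask>UnicodeString</hask> are all assumptions made for illustration; none of them comes from the proposal or from the iconv library.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical error type: offset and value of the first invalid byte.
data CodecError = InvalidByte Int Word8 deriving (Eq, Show)

-- Report the failure to the caller; stops at the first invalid byte.
decodeAsciiEither :: B.ByteString -> Either CodecError String
decodeAsciiEither = traverse check . zip [0 ..] . B.unpack
  where
    check (i, w)
      | w < 0x80 = Right (chr (fromIntegral w))
      | otherwise = Left (InvalidByte i w)

-- Raise an exception via 'error', built on the Either version.
decodeAsciiOrError :: B.ByteString -> String
decodeAsciiOrError = either (error . show) id . decodeAsciiEither

-- Replace every undecodable byte with '?' and keep going.
decodeAsciiLenient :: B.ByteString -> String
decodeAsciiLenient = map toChar . B.unpack
  where
    toChar w
      | w < 0x80 = chr (fromIntegral w)
      | otherwise = '?'
```

Keeping the <hask>Either</hask> version as the primitive means callers who need diagnostics get them, while the other two remain one-line conveniences.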
-=== I/O ===
-Several I/O functions that deal with <hask>UnicodeString</hask>s might be needed. All text-based I/O should either require an explicit encoding or use the default encoding (as set by the user's locale).  Example:
-<haskell>
-readFile :: Encoding -> FilePath -> IO UnicodeString
-</haskell>
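To show what I/O with an explicit encoding looks like in practice, the following sketch uses the <hask>TextEncoding</hask> support that already exists in <hask>System.IO</hask>, with <hask>String</hask> standing in for <hask>UnicodeString</hask>. The function names <hask>readFileWith</hask> and <hask>writeFileWith</hask> are hypothetical and are not part of the proposal.

```haskell
import System.IO

-- Read a whole file, decoding it with the given encoding rather than
-- the locale default.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy contents before 'withFile' closes the handle.
    length s `seq` return s

-- Write a string to a file, encoding it with the given encoding.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h s
```

Passing <hask>localeEncoding</hask> as the first argument would give the "default encoding" behaviour described above, while <hask>utf8</hask> or <hask>latin1</hask> makes the choice explicit.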
-== Open Issues ==
-=== API duplication ===
-The <hask>Data.List</hask> API is already duplicated in large parts in <hask>Data.ByteString</hask>.  It will be duplicated again here.  Will keeping the APIs in sync be a huge pain in the future?
-=== New I/O library ===
-How many new I/O functions are needed?  Would it be enough to reuse <hask>ByteString</hask>'s I/O interface, as in the following example?
-<haskell>
-import qualified Data.ByteString as B
-echo = do
-  content <- decode Utf8 <$> B.readFile "myfile.txt"
-  B.putStrLn $ encode Utf8 content
-</haskell>
-=== Different representations ===
-Should the encoding used to represent Unicode code points be included in the type?
-<haskell>
-data Encoding e => UnicodeString e
-</haskell>
-This might save some recoding compared with always using the same internal encoding for <hask>UnicodeString</hask>.  However, <hask>UnicodeString</hask> must remain usable across different text processing libraries.  Is this possible, or will each library end up fixing a particular value for <hask>Encoding e</hask> and thus make it harder to interact with that library?
-This approach is used by <hask>CompactString</hask>:
-http://twan.home.fmf.nl/compact-string/
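A minimal sketch of the encoding-in-the-type idea is shown below, using a phantom type parameter. It is not the <hask>CompactString</hask> API; the tag types <hask>Latin1</hask> and <hask>Ascii</hask>, the class methods <hask>toChars</hask> and <hask>fromChars</hask>, and the <hask>recode</hask> helper are all assumptions made for this example.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- Hypothetical encoding tags; they have no values and exist only in types.
data Latin1
data Ascii

-- The phantom parameter 'e' records which encoding the bytes are in.
newtype UnicodeString e = UnicodeString B.ByteString

class Encoding e where
  toChars   :: UnicodeString e -> String
  fromChars :: String -> UnicodeString e

instance Encoding Latin1 where
  toChars (UnicodeString b) = map (chr . fromIntegral) (B.unpack b)
  fromChars = UnicodeString . B.pack . map toByte
    where
      -- Unrepresentable characters are written as '?' (0x3F).
      toByte c
        | ord c < 0x100 = fromIntegral (ord c)
        | otherwise = 0x3F

instance Encoding Ascii where
  toChars (UnicodeString b) = map toChar (B.unpack b)
    where
      toChar w
        | w < 0x80 = chr (fromIntegral w)
        | otherwise = '?'
  fromChars = UnicodeString . B.pack . map toByte
    where
      toByte c
        | ord c < 0x80 = fromIntegral (ord c)
        | otherwise = 0x3F

-- Crossing between libraries that chose different encodings needs an
-- explicit recoding step; this is the interoperability cost in question.
recode :: (Encoding a, Encoding b) => UnicodeString a -> UnicodeString b
recode = fromChars . toChars
```

The type checker now rejects passing a <hask>UnicodeString Latin1</hask> where a <hask>UnicodeString Ascii</hask> is expected, so every conversion shows up as a visible call to <hask>recode</hask>.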
-== References ==
-Python 3000 will see an overhaul of its Unicode approach, including a new <code>bytes</code> type, a merge of <code>str</code> and <code>unicode</code>, and a new I/O library.  This proposal takes many ideas from that overhaul.  The relevant PEPs are:
-# http://www.python.org/dev/peps/pep-0358/ - PEP 358 -- The "bytes" Object
-# http://python.org/dev/peps/pep-3116/ - PEP 3116 -- New I/O
-# http://www.python.org/dev/peps/pep-3137/ - PEP 3137 -- Immutable Bytes and Mutable Buffer

Revision as of 15:04, 6 February 2021
