'''This draft proposal for a new Unicode layer on top of <hask>ByteString</hask> is still being written.'''
 
 
 
== Motivation ==

<hask>ByteString</hask> provides a faster and more memory-efficient data type than <hask>[Word8]</hask> for processing raw bytes. By creating a Unicode data type similar to <hask>ByteString</hask> that deals in units of characters instead of units of bytes, we can achieve similar performance improvements over <hask>String</hask> for text processing. A Unicode data type also removes the error-prone process of keeping track of strings encoded as raw bytes stored in <hask>ByteString</hask>s. Using functions such as <hask>length</hask> on a Unicode string just works, even though different encodings use different numbers of bytes to represent a character.
   
 
== Specification ==

A new module, <hask>Text.Unicode</hask>, defines the efficient Unicode string data type:

<haskell>
data UnicodeString
</haskell>

Functions to encode and decode Unicode strings to and from <hask>ByteString</hask>s are provided, together with <hask>Data.List</hask>-like functions.

<haskell>
data Encoding = Ascii | Utf8 | Utf16 | Iso88591

decode :: Encoding -> ByteString -> UnicodeString
encode :: Encoding -> UnicodeString -> ByteString
</haskell>
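As a sketch of how these signatures could fit together, here is a toy model using plain lists as stand-ins for the real packed types, and ISO 8859-1 because it maps bytes to code points one-to-one. All names here are illustrative, not an existing library.

```haskell
-- Illustrative model only: lists stand in for the packed representations.
import Data.Char (chr, ord)
import Data.Word (Word8)

data Encoding = Ascii | Utf8 | Utf16 | Iso88591

type ByteString = [Word8]    -- stand-in for Data.ByteString.ByteString
type UnicodeString = String  -- stand-in for the proposed packed type

-- ISO 8859-1 maps each byte directly to the code point of the same value.
decode :: Encoding -> ByteString -> UnicodeString
decode Iso88591 = map (chr . fromIntegral)
decode _        = error "only Iso88591 is modelled in this sketch"

encode :: Encoding -> UnicodeString -> ByteString
encode Iso88591 = map (fromIntegral . ord)
encode _        = error "only Iso88591 is modelled in this sketch"
```

Round-tripping, e.g. <hask>decode Iso88591 (encode Iso88591 s)</hask>, yields the original string, and <hask>length</hask> on the decoded value counts characters rather than bytes.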

=== Error handling ===

When a <hask>ByteString</hask> is decoded using the wrong codec, several error handling strategies are possible:

* The program exits using <hask>error</hask>. This may be fine for script-like programs.
* Unknown byte sequences are replaced with some character (e.g. <hask>'?'</hask>). This is useful for debugging and similar situations where some input/output is better than none.
* An exception is raised.
* The decode function returns values of type <hask>Either CodecError UnicodeString</hask>, where <hask>CodecError</hask> contains some useful error information.

The final API should provide at least a few error handling strategies of different sophistication.
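The <hask>Either</hask>-based strategy could look something like the following sketch. The names are hypothetical, list types again stand in for the packed ones, and only the trivial ASCII case is implemented; a real decoder for a multi-byte codec would be considerably more involved.

```haskell
import Data.Word (Word8)

data Encoding = Ascii | Utf8 | Utf16 | Iso88591

type ByteString = [Word8]   -- stand-in for the real packed type
type UnicodeString = String -- stand-in for the proposed type

-- Information about where and why decoding failed.
data CodecError = CodecError
  { errOffset :: Int     -- byte offset of the offending sequence
  , errBytes  :: [Word8] -- the bytes that could not be decoded
  } deriving Show

-- Report failures in the result type instead of calling 'error'.
decodeEither :: Encoding -> ByteString -> Either CodecError UnicodeString
decodeEither Ascii = go 0
  where
    go _ []     = Right []
    go i (b:bs)
      | b < 0x80  = fmap (toEnum (fromIntegral b) :) (go (i + 1) bs)
      | otherwise = Left (CodecError i [b])
decodeEither _ = \_ -> Left (CodecError 0 [])  -- other codecs omitted
```

The other strategies fall out of this one: a wrapper can turn <hask>Left</hask> into a call to <hask>error</hask>, a thrown exception, or a substitution character.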
=== I/O ===

Several I/O functions that deal with <hask>UnicodeString</hask>s might be needed. All text-based I/O should require an explicit encoding or use the default encoding (as set by the user's locale). Example:

<haskell>
readFile :: Encoding -> FilePath -> IO UnicodeString
</haskell>
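One way such a function could be layered on top of <hask>Data.ByteString</hask>'s existing byte-level I/O is sketched below. The name <hask>readFileU</hask> (to avoid clashing with the Prelude) and the toy <hask>decode</hask> are hypothetical.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr)

data Encoding = Ascii | Utf8 | Utf16 | Iso88591

type UnicodeString = String  -- stand-in for the proposed type

-- Toy decoder: ISO 8859-1 maps each byte directly to a code point.
decode :: Encoding -> B.ByteString -> UnicodeString
decode Iso88591 = map (chr . fromIntegral) . B.unpack
decode _        = error "only Iso88591 is modelled in this sketch"

-- Encoding-aware file reading, reusing ByteString's byte-level I/O.
readFileU :: Encoding -> FilePath -> IO UnicodeString
readFileU enc path = fmap (decode enc) (B.readFile path)
```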
   
 
== Open Issues ==
 
=== API duplication ===

The <hask>Data.List</hask> API is already duplicated in large part in <hask>Data.ByteString</hask>, and it would be duplicated again here. Will keeping the APIs in sync be a huge pain in the future?

=== New I/O library ===

How many new I/O functions are needed? Would it be enough to use <hask>ByteString</hask>'s I/O interface?

<haskell>
import qualified Data.ByteString as B

echo = do
  content <- decode Utf8 <$> B.readFile "myfile.txt"
  B.putStrLn $ encode Utf8 content
</haskell>

=== Different representations ===

Should the encoding used to represent Unicode code points be included in the type?

<haskell>
data Encoding e => UnicodeString e
</haskell>

This might save some recoding compared to always using the same internal encoding for <hask>UnicodeString</hask>. It's necessary that <hask>UnicodeString</hask> values can be passed between different text processing libraries. Is this possible, or will each library end up requiring a particular encoding for <hask>e</hask>, making it harder to interact with that library?
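For comparison, here is a phantom-type sketch of what an encoding-indexed type could look like. This is purely hypothetical (note also that <hask>data Encoding e => UnicodeString e</hask> as written above would need at least one constructor to be legal Haskell).

```haskell
{-# LANGUAGE EmptyDataDecls #-}
import Data.Word (Word8)

-- Phantom types naming the internal encoding (no runtime representation).
data Utf8
data Utf16

-- 'e' records the internal encoding at the type level only; the
-- runtime representation is always a sequence of bytes.
newtype UnicodeString e = UnicodeString { bytes :: [Word8] }

-- A function that insists on the UTF-8 representation:
byteLengthUtf8 :: UnicodeString Utf8 -> Int
byteLengthUtf8 = length . bytes
```

A library whose functions fix <hask>e</hask> to one encoding forces recoding at its boundary; functions kept polymorphic in <hask>e</hask> avoid that.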
   
 
== References ==

Python 3000 will see an overhaul of its Unicode approach, including a new <code>bytes</code> type, a merge of <code>str</code> and <code>unicode</code>, and a new I/O library. This proposal takes many ideas from that overhaul. The relevant PEPs are:
   
 
# http://www.python.org/dev/peps/pep-3116/ - PEP 3116 -- New I/O
# http://www.python.org/dev/peps/pep-0358/ - PEP 358 -- The "bytes" Object

Revision as of 22:41, 24 September 2007
