Difference between revisions of "Encoding issues"
(How IO should work) |
Tomjaguarpaw (talk | contribs) (Deleting page that hasn't been edited for over 10 years) |
||
Line 1: | Line 1: | ||
− | ==Introduction== |
||
− | |||
− | Character encoding is a tricky issue. Different systems have different conventions, and they may not be correct for a particular use. There is a distinction between characters (<hask>Char</hask>) and bytes (<hask>Word8</hask>). Conversion between the two can be done in many ways. This page gives an overview of how these issues should be handled. |
||
− | |||
− | It is important to note that the issue of encoding is ''completely'' orthogonal to the use of ByteString/PackedString/WhateverString. Such a type is either a list of bytes, equivalent to <hask>[Word8]</hask>, or it is a string, that is, a list of characters, <hask>[Char]</hask>. |
||
− | |||
− | Any such type should be similar either to <hask>[Char]</hask> or to <hask>[Word8]</hask>, and behave accordingly. |
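To make the distinction between characters and bytes concrete, here is a minimal sketch of one possible conversion, UTF-8 encoding, written against <hask>base</hask> only. The names <hask>encodeCharUtf8</hask> and <hask>utf8Length</hask> are hypothetical helpers introduced for illustration; they are not part of any proposal on this page.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one character as UTF-8 bytes (hypothetical helper).
-- Assumption: surrogate code points are not rejected; this is only a sketch.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]                          -- 1 byte (ASCII)
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                   -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]          -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0] -- 4 bytes
  where
    n = ord c
    -- Payload bits of the leading byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Number of bytes a string occupies once encoded as UTF-8.
utf8Length :: String -> Int
utf8Length = length . concatMap encodeCharUtf8
```

One <hask>Char</hask> may therefore correspond to between one and four <hask>Word8</hask> values, which is why a type must commit to being either character-like or byte-like.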
||
− | |||
− | ==Encoding/Decoding/Conversion== |
||
− | |||
− | There are three kinds of operations... |
||
− | |||
− | ==IO== |
||
− | |||
− | I/O operations should be split into two parts: binary IO and character IO. |
||
− | |||
− | ===Binary IO=== |
||
− | Binary IO operates on <hask>[Word8]</hask>. |
||
− | <haskell> |
||
− | put :: Handle -> [Word8] -> IO () |
||
− | get :: Handle -> IO [Word8] |
||
− | </haskell> |
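As a rough illustration, the two signatures above could be implemented on top of <hask>System.IO</hask> from <hask>base</hask> by putting the handle into binary mode, where each <hask>Char</hask> read or written stands for exactly one byte. This is a sketch under that assumption, not the implementation the page proposes; <hask>roundTrip</hask> is a hypothetical helper added only to exercise the two functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO

-- | Write raw bytes. In binary mode no encoding or newline translation
-- is applied, so each Char in the range 0..255 is written as one byte.
put :: Handle -> [Word8] -> IO ()
put h ws = do
  hSetBinaryMode h True
  hPutStr h (map (chr . fromIntegral) ws)

-- | Read all remaining bytes (lazily, via hGetContents).
get :: Handle -> IO [Word8]
get h = do
  hSetBinaryMode h True
  s <- hGetContents h
  return (map (fromIntegral . ord) s)

-- | Hypothetical helper: write bytes to a file and read them back.
roundTrip :: FilePath -> [Word8] -> IO [Word8]
roundTrip path ws = do
  withBinaryFile path WriteMode (\h -> put h ws)
  withBinaryFile path ReadMode $ \h -> do
    bytes <- get h
    -- Force the lazy read before the handle is closed.
    length bytes `seq` return bytes
```

A production version would more likely use a packed byte type than a list, but the list form matches the signatures given here.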
||
− | |||
− | ===Character IO=== |
||
− | Character IO is layered on top of the binary IO functions, using the default encoding and decoding appropriate for the platform. An encoding error will result in an exception. |
||
− | <haskell> |
||
− | putStr :: Handle -> [Char] -> IO () |
||
− | getStr :: Handle -> IO [Char] |
||
− | </haskell> |
||
− | |||
− | The encoding used depends on the platform. On Unix it will be the encoding from the current locale (usually UTF-8). On Windows, the input encoding for file IO will be determined from a byte order mark, while the output encoding can be any Unicode encoding, again with a byte order mark. UTF-8 is probably the safest bet. Handles other than files may have different requirements. |
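For comparison, GHC's <hask>System.IO</hask> attaches a <hask>TextEncoding</hask> to each text-mode handle (defaulting to <hask>localeEncoding</hask>) and lets it be overridden with <hask>hSetEncoding</hask>. The sketch below uses that mechanism to show character IO producing encoded bytes; <hask>writeTextWith</hask>, <hask>readTextWith</hask> and <hask>readBytes</hask> are hypothetical helper names, and the code is an illustration rather than the design described on this page.

```haskell
import Data.Char (ord)
import Data.Word (Word8)
import System.IO

-- | Write a string to a file using an explicitly chosen encoding.
writeTextWith :: TextEncoding -> FilePath -> String -> IO ()
writeTextWith enc path str =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hSetNewlineMode h noNewlineTranslation  -- keep '\n' as a single byte
    hPutStr h str

-- | Read a whole file as text using an explicitly chosen encoding.
readTextWith :: TextEncoding -> FilePath -> IO String
readTextWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    hSetNewlineMode h noNewlineTranslation
    s <- hGetContents h
    -- Force the lazy read before the handle is closed.
    length s `seq` return s

-- | Read the raw bytes of a file, bypassing any decoding.
readBytes :: FilePath -> IO [Word8]
readBytes path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    length s `seq` return (map (fromIntegral . ord) s)
```

Passing <hask>utf8</hask> pins the encoding regardless of platform, while passing <hask>localeEncoding</hask> gives the platform-dependent behaviour described above.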
||
− | |||
− | ===Advanced character IO=== |
||
− | In situations where the default encoding is not correct, or where a different form of error handling is required, encoding/decoding must be done manually. |
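A minimal sketch of what manual decoding with explicit error handling might look like, assuming UTF-8 input and using only <hask>base</hask>. All names (<hask>decodeOne</hask>, <hask>decodeStrict</hask>, <hask>decodeLenient</hask>) are hypothetical. The strict variant reports the offset of the first malformed sequence instead of throwing; the lenient variant substitutes U+FFFD.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode one character: the character, bytes consumed, and the rest.
-- Returns Nothing for empty or malformed input.
decodeOne :: [Word8] -> Maybe (Char, Int, [Word8])
decodeOne [] = Nothing
decodeOne (b : bs)
  | b < 0x80           = Just (chr (fromIntegral b), 1, bs)
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise          = Nothing  -- stray continuation or invalid lead byte
  where
    multi :: Int -> Int -> Int -> Maybe (Char, Int, [Word8])
    multi k lead minCode =
      let (conts, rest) = splitAt k bs
          step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
          code = foldl step lead conts
          valid = length conts == k                        -- not truncated
               && all (\c -> c .&. 0xC0 == 0x80) conts     -- 10xxxxxx bytes
               && code >= minCode                          -- not overlong
               && code <= 0x10FFFF
               && not (code >= 0xD800 && code <= 0xDFFF)   -- no surrogates
      in if valid then Just (chr code, k + 1, rest) else Nothing

-- | Strict decoding: Left carries the byte offset of the first error.
decodeStrict :: [Word8] -> Either String String
decodeStrict = go 0
  where
    go :: Int -> [Word8] -> Either String String
    go _ [] = Right []
    go off bs = case decodeOne bs of
      Just (c, n, rest) -> (c :) <$> go (off + n) rest
      Nothing -> Left ("invalid UTF-8 sequence at byte offset " ++ show off)

-- | Lenient decoding: each undecodable byte becomes U+FFFD.
decodeLenient :: [Word8] -> String
decodeLenient [] = []
decodeLenient bs@(_ : rest) = case decodeOne bs of
  Just (c, _, more) -> c : decodeLenient more
  Nothing -> '\xFFFD' : decodeLenient rest
```

The caller chooses the error policy by choosing the function, which is the kind of control the default character IO layer does not offer.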