Encoding issues

Introduction

Character encoding is a tricky issue. Different systems have different conventions, and they may not be correct for a particular use. There is a distinction between characters (Char) and bytes (Word8). Conversion between the two can be done in many ways. This page gives an overview of how these issues should be handled.

It is important to note that the issue of encoding is completely orthogonal to the choice of ByteString/PackedString/WhateverString. Such a type is either a list of bytes, equivalent to [Word8], or it is a string, that is, a list of characters, [Char].

Any such type should be similar to either [Char] or [Word8], and behave accordingly.
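To make the distinction concrete, here is a minimal sketch of two different conversions from [Char] to [Word8]. The names encodeLatin1 and encodeUtf8 are hypothetical helpers written for this illustration, not functions from any particular library, and the UTF-8 encoder makes no attempt to reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode as Latin-1: one byte per character.
-- Returns Nothing if any character is above U+00FF.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing

-- | Encode as UTF-8: one to four bytes per character.
-- Assumption: surrogates (U+D800..U+DFFF) are not filtered out here.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . go . ord)
  where
    go :: Int -> [Int]
    go n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont :: Int -> Int
    cont x = 0x80 .|. (x .&. 0x3F)
```

The same character yields different bytes under each conversion: encodeLatin1 "\233" gives Just [233], while encodeUtf8 "\233" gives [195,169]. A character such as the euro sign ('\8364') has no Latin-1 representation at all, so encodeLatin1 returns Nothing for it.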

Encoding/Decoding/Conversion

There are three kinds of operations...

IO

I/O operations should be split into two parts: binary IO and character IO.

Binary IO

Binary IO operates on [Word8].

put :: Handle -> [Word8] -> IO ()
get :: Handle -> IO [Word8]
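One possible way to realise these two signatures with today's libraries is sketched below, assuming the bytestring package is available. The definitions of put and get here are illustrative only; roundTrip is a hypothetical helper added to show usage.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import System.IO (Handle, IOMode (..), hSetBinaryMode, withBinaryFile)

-- | Write raw bytes; no encoding or newline translation is applied.
put :: Handle -> [Word8] -> IO ()
put h bytes = do
  hSetBinaryMode h True
  B.hPut h (B.pack bytes)

-- | Read all remaining bytes. Note that B.hGetContents reads strictly
-- and closes the handle when it is done.
get :: Handle -> IO [Word8]
get h = do
  hSetBinaryMode h True
  B.unpack <$> B.hGetContents h

-- | Usage sketch: write bytes to a file and read them back unchanged.
roundTrip :: FilePath -> [Word8] -> IO [Word8]
roundTrip path bytes = do
  withBinaryFile path WriteMode (\h -> put h bytes)
  withBinaryFile path ReadMode get
```

Because no decoding takes place, every byte value, including 0, 10, 13 and 255, should come back exactly as written.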

Character IO

Character IO is layered on top of the binary IO functions, using the default encoding and decoding appropriate for the platform. An encoding error will result in an exception.

putStr :: Handle -> [Char] -> IO ()
getStr :: Handle -> IO [Char]

The encoding used depends on the platform. On Unix it will be the encoding from the current locale (usually UTF-8). On Windows it will be based on a byte order mark for file IO, while the output encoding can be any Unicode encoding, again with a byte order mark. UTF-8 is probably the safest bet. Handles other than files may have different requirements.
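Where the platform default is not wanted, GHC's System.IO lets a handle's encoding be set explicitly. The following is a minimal sketch, assuming GHC; writeUtf8 and readUtf8 are hypothetical helper names chosen for this example.

```haskell
import System.IO

-- | Write a string to a file as UTF-8, regardless of the current locale.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- | Read a whole file as UTF-8. Invalid input raises an exception.
readUtf8 :: FilePath -> IO String
readUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  s <- hGetContents h
  -- Force the lazy read to finish before withFile closes the handle.
  length s `seq` return s
```

The encoding a handle would use by default can be inspected through localeEncoding, also exported from System.IO.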

Advanced character IO

In situations where the default encoding is not correct, or where a different form of error handling is required, encoding/decoding must be done manually.
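As one illustration of a different error-handling policy, the sketch below decodes UTF-8 by hand and substitutes U+FFFD (the replacement character) for each offending byte instead of raising an exception. decodeUtf8Lenient is a hypothetical name, and this is only one of several reasonable policies; a stricter variant might return an Either that reports the position of the first error.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode UTF-8, replacing each byte that cannot start or complete
-- a valid sequence with U+FFFD and resuming at the next byte.
decodeUtf8Lenient :: [Word8] -> String
decodeUtf8Lenient [] = []
decodeUtf8Lenient (b : bs)
  | b < 0x80              = chr (fromIntegral b) : decodeUtf8Lenient bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise             = '\xFFFD' : decodeUtf8Lenient bs
  where
    -- n continuation bytes expected; minCode rejects overlong forms.
    multi :: Int -> Int -> Int -> String
    multi n lead minCode =
      let (conts, rest) = splitAt n bs
          code = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F))
                       lead conts
          valid = length conts == n
               && all (\c -> c .&. 0xC0 == 0x80) conts
               && code >= minCode
               && code <= 0x10FFFF
               && not (code >= 0xD800 && code <= 0xDFFF)
      in if valid
           then chr code : decodeUtf8Lenient rest
           else '\xFFFD' : decodeUtf8Lenient bs
```

With this policy, the input [0x61, 0xFF, 0x62] decodes to "a\65533b": the stray byte 0xFF is replaced, and decoding continues with the bytes that follow it.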