JohanTibell/Tar

Abstract

The "tar" package is a library for working with ".tar" archive files. It can read and write a range of common variations of the archive format, including V7, USTAR, POSIX and GNU formats. It provides support for packing and unpacking portable archives. This makes it suitable for distribution but not for backup, because details like file ownership and exact permissions are not preserved.

Specification

The full API documentation is here: http://hackage.haskell.org/package/tar

The API is structured so that simple uses only need to

import qualified Codec.Archive.Tar as Tar

Use cases that need more intimate access to the details of the tar format (such as file times, permissions etc) may also use

import qualified Codec.Archive.Tar.Entry as Tar
import qualified Codec.Archive.Tar.Check as Tar

This protects the casual user against the complexity of the details of the tar format and the various versions of the tar format. Note that the API uses short names and is designed to be used qualified.

Conceptually, ".tar" format files are just a sequence of entries. Entries represent things like files, directories and symlinks. Each entry has a name, and some have content data. All entries have file meta-data like ownership and permissions.

There are four key operations. High level convenience functions and user-defined variations are defined in terms of these.

Firstly there are functions for converting between internal and external representations. The external representation is a lazy ByteString. The internal representation is as a sequence of 'Entry' values:

read  :: ByteString -> Entries
write :: [Entry] -> ByteString

The 'Entries' type is almost just [Entry] but it also handles the case of format errors.
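To make that idea concrete, here is a simplified, self-contained model of such a type. It is a sketch for illustration, not the library's actual definition: the one-field Entry and the helper entriesToList are assumptions made for this example, and the real Entry carries far more information.

```haskell
-- Simplified stand-in for an archive entry (assumption: the real type
-- also holds content, permissions, timestamps and so on).
newtype Entry = Entry { entryPath :: FilePath }
  deriving (Eq, Show)

-- A list of entries that ends either normally or with a format error.
data Entries
  = Next Entry Entries   -- one more entry, followed by the rest
  | Done                 -- clean end of the archive
  | Fail String          -- decoding failed at this point
  deriving (Eq, Show)

-- Right fold, like foldr but with an extra case for the failure ending.
foldEntries :: (Entry -> a -> a) -> a -> (String -> a) -> Entries -> a
foldEntries next done failed = go
  where
    go (Next e es) = next e (go es)
    go Done        = done
    go (Fail err)  = failed err

-- Hypothetical helper: collapse to an ordinary list, or report the error.
entriesToList :: Entries -> Either String [Entry]
entriesToList = foldEntries (\e rest -> fmap (e :) rest) (Right []) Left
```

Converting to Either forces the whole sequence before any entry is available, so it suits small archives; streaming consumers would fold over the sequence directly.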

The other key pair of operations are for packing and unpacking actual disk files, to and from this internal representation:

pack   :: FilePath -> [FilePath] -> IO [Entry]
unpack :: FilePath -> Entries    -> IO ()

There are various functions provided, or that the user may define, that operate on 'Entries'. This is the main way that the API provides flexibility. In particular, checks for certain security or portability conditions can be written as passes of type Entries -> Entries.
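A pass of this shape might look like the following sketch. It uses a simplified model of the types rather than the library's own, and the names checkEntries, unsafePath and rejectUnsafePaths are hypothetical, chosen only for this example. The check itself (reject absolute paths and ".." components) is one plausible security condition, not a statement of what the library checks.

```haskell
-- Simplified stand-ins for the library's types (assumption).
newtype Entry = Entry { entryPath :: FilePath }
  deriving (Eq, Show)

data Entries = Next Entry Entries | Done | Fail String
  deriving (Eq, Show)

-- Turn a per-entry check into a pass: the first offending entry
-- replaces the remainder of the sequence with a failure.
checkEntries :: (Entry -> Maybe String) -> Entries -> Entries
checkEntries check = go
  where
    go (Next e es) = case check e of
      Nothing  -> Next e (go es)
      Just err -> Fail err
    go Done       = Done
    go (Fail err) = Fail err

-- Split a path on '/' (tar paths use forward slashes).
pathComponents :: FilePath -> [String]
pathComponents p = case break (== '/') p of
  (c, [])       -> [c]
  (c, _ : rest) -> c : pathComponents rest

-- Hypothetical check: Nothing means the entry is acceptable.
unsafePath :: Entry -> Maybe String
unsafePath (Entry p)
  | take 1 p == "/"              = Just ("absolute path: " ++ p)
  | ".." `elem` pathComponents p = Just ("path escapes target: " ++ p)
  | otherwise                    = Nothing

rejectUnsafePaths :: Entries -> Entries
rejectUnsafePaths = checkEntries unsafePath
```

Because the pass is lazy, entries before the offending one are still delivered to the consumer, which then sees the failure in place of the rest.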

For convenience there are also high level "all in one" operations:

create  :: FilePath -> FilePath -> [FilePath] -> IO ()
extract :: FilePath -> FilePath -> IO ()

It is instructive to see how these are defined since they demonstrate the use of the above primitives:

create tar base paths = BS.writeFile tar . Tar.write =<< Tar.pack base paths

extract dir tar = Tar.unpack dir . Tar.read =<< BS.readFile tar

The following are examples of variations on the above that the user may define:

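import qualified Codec.Archive.Tar      as Tar
import qualified Codec.Compression.GZip as GZip  -- from the zlib package
import qualified Data.ByteString.Lazy   as BS

createTarGz tar base paths =
  BS.writeFile tar . GZip.compress . Tar.write =<< Tar.pack base paths

extractTarGz dir tar =
  Tar.unpack dir . Tar.read . GZip.decompress =<< BS.readFile tar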

Note: these two are not provided by the library because the tar package does not depend on the zlib package. One could argue that it should but it would only save the above trivial compositions and the same argument would apply to a dependency on the bzlib package or other popular compression codecs.

A further example of use is the 'htar' package which is an implementation of a subset of the features of the common 'tar' command line tool. It is a short demo program at only 200 lines (including command line handling) and covers creating, extracting, (de)compression in .tar.gz and .tar.bz2 formats and listing file contents (simple or extended).


Motivation

Manipulating tar files is a fairly common need. The tar format and its variants are not trivial, so using an external library or program is sensible. Many existing uses call an external "tar" program. This is not satisfactory because the tar program differs between platforms and Windows does not come with one. In particular, the archive format that the system "tar" program produces varies between systems. A better solution is to use a library where we have control over the format and we use the same code on all platforms.

A further advantage of using a library is that it allows tar files to be used without unpacking them. It also gives greater flexibility in the relationship between the location of files on disk and the file name paths within a ".tar" file. In particular programs that currently construct .tar files by preparing a temporary directory of file copies with the desired layout may be able to eliminate the extra set of temporary files and construct the tar file directly.

Some particular use cases which come to mind:

  • darcs
  • cabal-install
  • hackage-server

The "darcs dist" command calls an external "tar" program. On Windows this does not work unless the user has specially installed a tar.exe. On GNU systems the GNU tar program produces .tar files in GNU tar format which is not as widely portable as the standard USTAR/POSIX format.

The cabal-install package already uses the code from the tar package, but as a bundled copy rather than as an external dependency. Adding the tar package to the platform would allow this to become an external dependency rather than bundled code.

The new hackage-server implementation has to handle uploaded .tar files, checking them for portability etc. It uses the tar package.


Rationale

What motivated the design

The current design evolved from an original tar package written by Bjorn Bringert in 2007. The flaws in that API, as I saw it, are that it provides too many functions none of which are obviously reusable primitives. Also too many of the operations it provides are in IO. By contrast the current API has only two basic functions in IO.

The original implementation used the binary package and the unix-compat package to handle packing/unpacking tar files and preserving file meta-data. The current implementation does not depend on binary and does not attempt to preserve file meta-data like ownership or permissions. This matches the use cases we have encountered so far. The library exposes everything necessary to write alternative implementations of pack/unpack that preserve more file meta-data for use-cases that may require it.

Particular design decisions

Separation of IO and pure operations. Intermediate data type provides flexibility.

The encoding and decoding of the tar format is completely pure. It uses an intermediate data structure and pure operations on it. This gives the API great flexibility without requiring a large number of primitives since it is possible to inspect, consume or modify the intermediate representation before doing an IO operation like packing or unpacking.
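As a small illustration of modifying the intermediate representation, the sketch below rewrites and filters a list of entries purely. The two-field Entry is a simplified model, and underPrefix, excluding and prepare are hypothetical names; in real use, functions like these would sit between packing and writing.

```haskell
-- Simplified stand-in for an archive entry (assumption).
data Entry = Entry
  { entryPath    :: FilePath
  , entryContent :: String   -- stands in for the lazy ByteString payload
  } deriving (Eq, Show)

-- Pure: place every entry under a top-level directory such as "pkg-1.0".
underPrefix :: FilePath -> [Entry] -> [Entry]
underPrefix dir = map (\e -> e { entryPath = dir ++ "/" ++ entryPath e })

-- Pure: drop entries whose path the caller does not want in the archive.
excluding :: (FilePath -> Bool) -> [Entry] -> [Entry]
excluding unwanted = filter (not . unwanted . entryPath)

-- Hypothetical pipeline: no IO is involved until the result is written.
prepare :: [Entry] -> [Entry]
prepare = underPrefix "pkg-1.0" . excluding (== "dist/setup-config")
```

Since both steps are ordinary list functions, they compose with each other and with any other pass without new primitives.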

Most operations are on lazy sequences

The API is fairly compositional yet allows constant space operations in many cases because it uses lazy sequences. This matches the tar format quite well which is designed to be processed linearly and using constant space.

API partitioned into "simple" and "full" modules.

Christian Maeder complained that my original API was too complex. The problem is that unfortunately the tar format is more complex than we would wish and some applications do need to know about some of the details. In particular some applications need to know the format in use (V7, USTAR or GNU) and need access to meta-data like permissions and timestamps.

The solution we arrived at is the partition. Use cases that need more can import an extra module to get access to the details of what a tar Entry actually consists of.

API allows constant-space operations and pure exception handling.

There is no need to use exceptions (catch) to handle errors in decoding tar files. It can be done purely. At the same time it is possible to process large tar files in constant space; e.g. create / extract. This is done using the Entries type which is essentially a list data type but with an extra alternative for decoding errors.
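The interaction between laziness and pure error reporting can be sketched with a toy decoder. Everything here is an assumption made for illustration: the "format" is one entry name per line with an empty line counting as corrupt, and decode, countEntries and firstPaths are hypothetical names, not library functions.

```haskell
-- Simplified stand-ins for the library's types (assumption).
newtype Entry = Entry { entryPath :: FilePath }
  deriving (Eq, Show)

data Entries = Next Entry Entries | Done | Fail String

-- Toy decoder: constructors are produced on demand, so input beyond
-- the consumed prefix is never examined.
decode :: [String] -> Entries
decode []       = Done
decode ("" : _) = Fail "empty entry name"
decode (l : ls) = Next (Entry l) (decode ls)

-- Strict left fold: the accumulator is forced at each step, so space
-- use does not grow with the number of entries.
countEntries :: Entries -> Either String Int
countEntries = go 0
  where
    go n (Next _ es) = let n' = n + 1 in n' `seq` go n' es
    go n Done        = Right n
    go _ (Fail err)  = Left err

-- Take the first k paths without looking at the rest of the archive.
firstPaths :: Int -> Entries -> [FilePath]
firstPaths k es
  | k <= 0    = []
  | otherwise = case es of
      Next e rest -> entryPath e : firstPaths (k - 1) rest
      _           -> []
```

A consumer that needs only a prefix never encounters a later error, while one that walks the whole sequence receives it as an ordinary value.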

This same approach of marrying exceptions and laziness is now used in the zlib package because the previous approach of using exceptions proved insufficient for some applications (notably darcs).

Evidence of consensus

The current design has been through several iterations of API review with Christian Maeder. The thread starts here: http://www.haskell.org/pipermail/libraries/2009-February/011320.html Some of the discussion is not on the public mailing list. I think we have addressed almost all the concerns that he and I raised in our discussions.

One remaining issue is that the package provides

getDirectoryContentsRecursive :: FilePath -> IO [FilePath]

which might be better placed in System.Directory. It is useful when constructing a tar archive with a non-default file path mapping, or when filtering the list of files to include.
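For readers unfamiliar with this kind of helper, the following is a minimal sketch of a recursive listing followed by a filter. It is not the library's implementation: listFilesRecursive and withoutDist are hypothetical names, this version is eager and returns only files, and the library's function may differ on both points.

```haskell
import qualified System.Directory as Dir
import System.FilePath ((</>), splitDirectories)

-- Hypothetical helper: all files below root, as paths relative to root.
-- Assumes the tree contains no symlink cycles.
listFilesRecursive :: FilePath -> IO [FilePath]
listFilesRecursive root = go ""
  where
    go rel = do
      names <- Dir.listDirectory (root </> rel)
      fmap concat (mapM (visit rel) names)
    visit rel name = do
      let path = rel </> name
      isDir <- Dir.doesDirectoryExist (root </> path)
      if isDir then go path else return [path]

-- Example filter: skip anything that lives under a "dist" directory.
withoutDist :: [FilePath] -> [FilePath]
withoutDist = filter (("dist" `notElem`) . splitDirectories)
```

The filtered list of relative paths is the kind of value one would then hand to a packing function together with the base directory.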

The design and implementation have also been tested in real-world use cases in the cabal-install and hackage-server programs.

The cabal-install tool uses (a bundled copy of) the tar code for:

  • the "cabal sdist" feature
  • unpacking .tar.gz cabal packages
  • the hackage index file (00-index.tar.gz)

In particular the last case is one where we need more than simply unpacking a tar file. We read and examine the index file every time the user runs the configure or install commands to discover the set of available packages.