Difference between revisions of "Library/Compression"

From HaskellWiki
Jump to navigation Jump to search
 
(formatting)
Line 1: Line 1:
==Compression-2005 library==
+
== Compression-2005 library ==
   
 
Compression-2005 library includes the most competitive compression libraries as of April'05:
 
Compression-2005 library includes the most competitive compression libraries as of April'05:
   
- LZMA 4.06 by Igor Pavlov. You can download original library at www.7-zip.org. This library features fast extraction speed, minimal memory requirements for extraction, best compression for non-text files
+
- LZMA 4.06 by Igor Pavlov. You can download original library at
  +
http://www.7-zip.org. This library features fast extraction speed, minimal memory
  +
requirements for extraction, best compression for non-text files
   
- PPMd var.I by Dmitry Shkarin, http://www.compression.ru/ds. This library gives best compression for plain text files
+
- PPMd var.I by Dmitry Shkarin, http://www.compression.ru/ds. This library
  +
gives best compression for plain text files
   
- GRZipLib 0.2.4 by Grebnov Ilya, http://magicssoft.ru/. This library provides fastest compression whereas compress better than bzip2
+
- GRZipLib 0.2.4 by Grebnov Ilya, http://magicssoft.ru/. This library
  +
provides fastest compression while compress better than bzip2
   
All these libraries were assembled together by me and simple but powerful common interface to use them all was added. This interface consists of pairs of functions, which performs compression and decompression.
+
All these libraries were assembled together by me and simple but powerful
  +
common interface to use them all was added. This interface consists of
  +
pairs of routines, which performs compression and decompression.
   
  +
First pair of functions is compress and decompress. Their usage is similar:
 
  +
== General compression routines ==
  +
  +
First pair of routines are 'compress' and 'decompress'. Their usage is similar:
 
compress method reader writer
 
compress method reader writer
 
decompress method reader writer
 
decompress method reader writer
   
  +
What is the 'method' in these routines? Each compression algorithm has its own
What is "method" in these functions? Each compression algorithm has its own set of parameters, tuning them has effect of dramatically changing compression ratio and compression/decompression speed. To unify access to all these parameters, compression algorithm together with all it's parameters are represented as single string, such as "lzma:8mb:fast:hc4:fb32" or "grzip:8mb:m4". Each method has default parameters, which will be used unless overridden. For example, string "grzip" means using method GRZip with all default parameters, "grzip:m2" - using all default parameters, with exception that parameter "m" set to value "2", "grzip:m2:h20" means changing two parameters from their defaults and so on: See Description of compression methods parameters.
 
  +
set of parameters, tuning them has effect of dramatically changing
  +
compression ratio and compression/decompression speed. To unify access to
  +
all these parameters, compression algorithm together with all it's
  +
parameters is represented by single string, such as
  +
"lzma:8mb:fast:hc4:fb32" or "grzip:8mb:m4". Each method has default
  +
parameters, which will be used unless overridden. For example, string
  +
"grzip" means using method GRZip with all default parameters, "grzip:m2" -
  +
using all default parameters, with exception that parameter "m" set to
  +
value "2", "grzip:m2:h20" means changing two parameters from their defaults
  +
and so on... Section [[#Description of compression methods parameters]] shows
  +
all the compression methods supported, their parameters and its' default values.
  +
  +
'reader' and 'writer' are callbacks, used by (de)compression code to read
  +
input data and write output data. Their types are identical:
   
"reader" and "writer" are callbacks, used by (de)compression code to read input data and write output data. Their types are identical:
 
 
reader, writer :: Ptr CChar -> Int -> IO Int
 
reader, writer :: Ptr CChar -> Int -> IO Int
   
When called, "reader" gets pointer to input buffer and its size. It must fill buffer with input data and return number of bytes read. If there is no more input data, it must return 0, if error occurred - it must return error code < 0. In the last case this error code will be returned as a result of entire call to (de)compress.
+
When called, 'reader' gets pointer to input buffer and its size. It must
  +
fill buffer with input data and return number of bytes read. If there is no
  +
more input data, it must return 0, if error occurred - it must return error
  +
code < 0. In the last case this error code will be returned as a result of
  +
entire call to '(de)compress' routine.
   
"writer", when called, gets pointer to output buffer and its size. It must write output data and return value >= 0 if all ok and error code < 0, if there is an error. In last case this code will be returned as a result of (de)compress.
+
'writer', when called, gets pointer to output buffer and its size. It must
  +
write output data and return value >= 0 if all ok and error code < 0, if
  +
there is an error. In last case this code will be returned as a result of
  +
(de)compress.
   
 
For example, simplest compression program can be written just as:
 
For example, simplest compression program can be written just as:
compress "ppmd:10:48mb" (hGetBuf stdin) (\buf size -> hPutBuf stdout buf size >> return size)
+
let doPutBuf h buf size = do hPutBuf h buf size; return size
  +
compress "ppmd:10:48mb" (hGetBuf stdin) (doPutBuf stdout)
  +
  +
This program will work as compressing filter: read data from stdin,
  +
compress them and write result to stdout.
   
This program will work as compressing filter: read data from stdin, compress them and write result to stdout.
 
 
Corresponding decompressor, as you can guess:
 
Corresponding decompressor, as you can guess:
decompress "ppmd:10:48mb" (hGetBuf stdin) (\buf size -> hPutBuf stdout buf size >> return size)
+
decompress "ppmd:10:48mb" (hGetBuf stdin) (doPutBuf stdout)
Of course, under Windows you need to add "hSetBinaryMode stdin True", "hSetBinaryMode stdout True" lines to these simple programs.
 
   
  +
Of course, under Windows you need to add "hSetBinaryMode stdin True",
Another paired functions that you can use - compressWithHeader and decompressWithHeader. They work just as previous ones, but name of used compression algorithm is saved in compressed output. Decompression function just reads method name from input data instead of receiving it as a parameter. This is advantageous if data can be compressed with different algorithms or even one algorithm with different parameters, because some of these parameters are needed for decompression too. You can see example of using these functions in Example-Haskell.hs
 
  +
"hSetBinaryMode stdout True" operations to these simple programs.
   
Using callbacks to read and write data provides you ability to (de)compress data completely located in memory as well as data read/written to files, received/send to network sockets, another program or even another process in your own program. I use the last technique to get compressed on-the-fly the data generated by some complex algorithm, which is impractically to write as reader callback.
 
   
  +
Another paired routines that you can use are 'compressWithHeader' and
There is also some more functions - canonizeCompressionMethod returns canonical representation of compression method (two methods are equal if their canonical representations are equal), and a set of low-level (de)compression functions, which calls concrete algorithms and pass all their parameters as separate numbers, for example: ppmd_compress 10 (48*mb) 0 reader writer - will work just the same as compress "ppmd:10:48mb:r0" :
 
  +
'decompressWithHeader'. They work just as previous ones, but name of used
  +
compression algorithm is saved in compressed output. Decompression routine
  +
just reads method name from input data instead of receiving it as a
  +
parameter. This is advantageous if data can be compressed with different
  +
algorithms or even one algorithm with different parameters, because some of
  +
these parameters are needed for decompression too. You can see example of
  +
using these functions in Example-Haskell.hs
   
  +
Using callbacks to read and write data provides you ability to (de)compress
==Compressing data in memory==
 
  +
data completely located in memory as well as data read/written to files,
  +
received/send to network sockets, another program or even another process
  +
in your own program. I use the last technique to get compressed on-the-fly
  +
the data generated by some complex algorithm, which is impractically to
  +
write as reader callback.
   
  +
There are also some more functions - 'canonizeCompressionMethod' returns
compressMem method inBuf inBufSize outBuf outBufSize will compress data in inBuf having length inBufSize. Output data will be written in outBuf. This function returns size of compressed data or negative value - error code. If output buffer is too small to hold all compressed data, then appropriate error code will be returned.
 
  +
canonical representation of compression method (two methods are equal if
  +
their canonical representations are equal), and a set of low-level
  +
(de)compression routines, which calls concrete algorithms and pass all
  +
their parameters as separate numbers, for example:
   
  +
ppmd_compress 10 (48*mb) 0 reader writer
decompressMem method inBuf inBufSize outBuf outBufSize will decompress data in inBuf and return number of bytes decompressed or error code.
 
compressMemWithHeader method inBuf inBufSize outBuf outBufSize works like compressMemWithHeader but adds description of method used to the data written to outBuf. decompressMemWithHeader inBuf inBufSize outBuf outBufSize must be used to decompress such data.
 
   
  +
will work just the same as
==Manipulating compression method parameters ==
 
   
  +
compress "ppmd:10:48mb:r0" reader writer
There are a number of functions what can be used to query/modify compression method descriptions in general fashion, without respect to concrete method used. This functions can query/set memory usage for compression and decompression, dictionary size (if this have some meaning for particular compression method) and block size (also if this have a meaning). These functions are:
 
getCompressionMem, getDecompressionMem, getDictionary, getBlockSize :: String -> Int
 
setCompressionMem, setDecompressionMem, setDictionary, setBlockSize :: String -> Int -> String
 
limitCompressionMem, limitDecompressionMem, limitDictionary, limitBlockSize :: String -> Int -> String
 
   
  +
First group of functions query appropriate parameter of compression method. For example, "printLn (getDictionary "lzma:d1000")" will print 1000.
 
  +
== Compressing data in memory ==
Second group of functions sets appropriate parameter of compression method. For example, "printLn (setDictionary 1000 "lzma")" will print "lzma:1000b".
 
  +
Third group of functions limits appropriate parameter of compression method. For example, "printLn (limitDictionary 1000 "lzma")" will print "lzma:1000b", but "printLn (limitDictionary (100*mb) "lzma")" will print "lzma".
 
  +
In order to simplify simple in-memory compression, the following
  +
specialized variants of abovementioned functions are provided:
  +
  +
  +
compressMem method inBuf inBufSize outBuf outBufSize
  +
  +
will compress data in 'inBuf' having length 'inBufSize'. Output data will
  +
be written to 'outBuf' of size 'outBufSize'. This function returns size
  +
of compressed data or negative value - error code. If output buffer is too
  +
small to hold all compressed data, then appropriate error code will be
  +
returned.
  +
  +
decompressMem method inBuf inBufSize outBuf outBufSize
  +
  +
will decompress data in 'inBuf' to 'outBuf' and return number of bytes
  +
decompressed or error code.
  +
  +
  +
compressMemWithHeader method inBuf inBufSize outBuf outBufSize
  +
  +
works like 'compressMem' but adds description of method used to the
  +
data written to 'outBuf'.
  +
  +
decompressMemWithHeader inBuf inBufSize outBuf outBufSize
  +
  +
must be used to decompress such data.
  +
  +
  +
== Manipulating compression method parameters ==
  +
  +
There are a number of functions what can be used to query/modify
  +
compression method descriptions in general fashion, without respect to
  +
concrete method used. This functions can query/set memory usage for
  +
compression and decompression, dictionary size (if this have some meaning
  +
for particular compression method) and block size (also if this have a
  +
meaning). These functions are:
  +
  +
getCompressionMem, getDecompressionMem, getDictionary, getBlockSize
  +
:: String -> Int
  +
setCompressionMem, setDecompressionMem, setDictionary, setBlockSize
  +
:: String -> Int -> String
  +
limitCompressionMem, limitDecompressionMem, limitDictionary, limitBlockSize
  +
:: String -> Int -> String
  +
  +
First group of functions query appropriate parameter of compression method.
  +
For example, 'print (getDictionary "lzma:d1000")' will print '1000'.
  +
  +
Second group of functions sets appropriate parameter of compression method.
  +
For example, 'print (setDictionary 1000 "lzma")' will print 'lzma:1000b'.
  +
  +
Third group of functions limits appropriate parameter of compression
  +
method. For example, 'print (limitDictionary 1000 "lzma")' will print
  +
'lzma:1000b', but 'print (limitDictionary (100*mb) "lzma")' will print
  +
'lzma'.
   
 
Using these methods, you can for example limit memory required for compression:
 
Using these methods, you can for example limit memory required for compression:
compress (limitCompressionMem (20*mb) "lzma") reader writer
 
   
  +
compress (limitCompressionMem (20*mb) "lzma") reader writer
==Compiling library==
 
  +
C part of library consist of modules Compression.cpp, PPMD/PPMD_Parser.cpp, PPMD/C_PPMD_Compress.cpp, PPMD/C_PPMD_Decompress.cpp, PPMD/C_LZP.cpp, LZMA/C_LZMA.cpp, LZMA/C_BCJ.cpp, GRZip/C_GRZip.cpp. All these modules can be compiled with compile[.cmd] if you have GHC or with Example-C.mak, if you have only GCC. To use other C++ compilers, make the appropriate changes in Example-C.mak. In any case, don't forget to change value of TEMPDIR in common.mak to some directory for temporaries existing on your machine.
 
  +
To compile provided Example-Haskell.hs, use compile-Example-Haskell[.cmd]. This batch file also contains reference to "c:/temp/ghc", which must be changed to the same value as TEMPDIR in common.mak.
 
  +
== Compiling library ==
  +
  +
C part of library consist of modules Compression.cpp, PPMD/PPMD_Parser.cpp,
  +
PPMD/C_PPMD_Compress.cpp, PPMD/C_PPMD_Decompress.cpp, PPMD/C_LZP.cpp,
  +
LZMA/C_LZMA.cpp, LZMA/C_BCJ.cpp, GRZip/C_GRZip.cpp. All these modules can
  +
be compiled with compile[.cmd] if you have GHC or with Example-C.mak, if
  +
you have only GCC. To use other C++ compilers, make the appropriate changes
  +
in Example-C.mak. In any case, don't forget to change value of TEMPDIR in
  +
common.mak to some directory for temporaries existing on your machine.
  +
  +
To compile provided Example-Haskell.hs, use compile-Example-Haskell[.cmd].
  +
This batch file also contains reference to "c:/temp/ghc", which must be
  +
changed to the same value as TEMPDIR in common.mak.
  +
  +
  +
If you want to use only decompression code, exclude from above list of
  +
files PPMD/C_PPMD_Compress.cpp (all other compression methods contain both
  +
compressor and decompressor in one file) and define preprocessor macro
  +
FREEARC_DECOMPRESS_ONLY. This will compile only decompression code, which
  +
is about 4 times smaller (50 kb instead of 200 kb). Remember, that when
  +
compiling C sources with GHC, you must use "-optc-DFREEARC_DECOMPRESS_ONLY"
  +
option to define preprocessor symbol.
  +
  +
Also for reducing size of code you can try to turn off optimization, and
  +
exclude modules for unused compression algorithms from compilation. For
  +
example, if you want to use GRZip only, include only Compression.cpp and
  +
GRZip/C_GRZip.cpp in your compilation.
  +
  +
  +
== Description of compression methods parameters ==
  +
  +
I've written default value of each parameter together with it's name
  +
  +
=== PPMD ===
  +
  +
o10 set model order to N
  +
  +
mem48mb use N MB memory. Can also be set as "m48mb" or even "48m"
  +
  +
r0 method of model restoration at memory insufficiency:
  +
  +
-r0 - restart model from scratch (default)
  +
-r1 - cut off model (slow)
  +
-r2 - freeze model (dangerous)
  +
  +
"r" alone is equivalent to "r1"
  +
  +
  +
=== LZMA ===
  +
  +
a1 compression mode: "a0" - fast, "a1" - normal, "a2" - max.
  +
Can also be set as "fast", "normal" and "max", appropriately
  +
  +
d8mb dictionary size
  +
  +
fb32 number of fast bytes - [3, 255]
  +
  +
lc3 number of literal context bits - [0, 8]
  +
  +
lp0 number of literal pos bits - [0, 4]
  +
  +
pb2 number of pos bits - [0, 4]
  +
  +
mfbt4 Match Finder: [bt2, bt3, bt4, bt4b, pat2r, pat2, pat2h, pat3h, pat4h, hc3, hc4].
  +
Can also be set as "bt4". For "normal" and "max" mode most appropriate match
  +
finders are bt3, bt4, bt4b. For "fast" mode - hc3 and hc4.
  +
  +
  +
=== GRZip ===
  +
  +
m1 Selection of compression algorithm, from tightest to fastest:
  +
  +
m1 - LZP + BWT + WFC + EC
  +
m2 - LZP + BWT + MTF + EC
  +
m3 - LZP + ST4 + WFC + EC
  +
m4 - LZP + ST4 + MTF + EC
  +
  +
b8mb Maximum block size. Can be also set as "8mb" or even "8m". 8 mb is the maximum.
  +
  +
l32 LZP Minimum Matched Len. Can also be set as "32"
  +
  +
h15 LZP Hash table size, actual memory usage for hash table will be 4*2^Size bytes.
  +
Greater values increase compression ratio, but decreases speed of compression
  +
and decompression, because hash table will no more fit in Level-2 processor cache.
  +
  +
s Use alternative BWT Sorting algorithm (faster for repetitive blocks)
  +
  +
a Enable Adaptive block size reduction
  +
  +
l Disable LZP preprocessing. On some data (executables, plain text books) LZP
  +
preprocessor don't gives any improvements and only wastes CPU time
  +
  +
d Enable Delta filter
  +
  +
p Disable all Preprocessing techniques (LZP, Delta filter, Adaptive block size
  +
reduction). By default, all preprocessing techniques, except for LZP, are already disabled
  +
  +
  +
=== LZP ===
  +
  +
LZP algorithm is just an LZP preprocessor from GRZip algorithm, optimized by Dmitry Shkarin. So it has the same parameters:
  +
  +
b8mb Maximum block size. Can be also set as "8mb" or even "8m"
  +
  +
l64 Minimum Matched Len. Can also be set as "64"
  +
  +
h18 Hash table size, actual memory usage for hash table will be 4*2^Size bytes.
  +
Greater values increase compression ratio, but decreases speed of compression
  +
and decompression, because hash table will no more fit in Level-2 processor cache.
  +
  +
=== EXE ===
  +
  +
This method just preprocess Win32 executables to allow better compression
   
If you want to use only decompression code, exclude from above list of files PPMD/C_PPMD_Compress.cpp (all other compression methods contain both compressor and decompressor in one file) and define preprocessor macro FREEARC_DECOMPRESS_ONLY. This will compile only decompression code, which is about 4 times smaller (50 kb instead of 200 kb). Remember, that when compiling C sources with GHC, you must use "-optc-DFREEARC_DECOMPRESS_ONLY" option to define preprocessor symbol.
 
Also for reducing size of code you can try to turn off optimization, and exclude modules for unused compression algorithms from compilation. For example, if you want to use GRZip only, include only Compression.cpp and GRZip/C_GRZip.cpp in your compilation.
 
   
  +
In order to know more about how each parameter influence compression ratio,
==Description of compression methods parameters==
 
  +
compression/decompression speed and memory usage, please download original
  +
libraries and study their documentation. LZMA parameters are best described
  +
in documentation for 7-zip, see description of "-m" option in 7-zip.chm.
   
Parameter name and default value Description
 
PPMD
 
o10 set model order to N
 
mem48mb use N MB memory. Can also be set as "m48mb" or even "48m"
 
r0 method of model restoration at memory insufficiency:
 
-r0 - restart model from scratch (default)
 
-r1 - cut off model (slow)
 
-r2 - freeze model (dangerous)
 
"r" equivalent to "r1"
 
LZMA
 
a1 compression mode: "a0" - fast, "a1" - normal, "a2" - max. Can also be set as "fast", "normal" and "max", appropriately
 
d8mb dictionary size
 
fb32 number of fast bytes - [3, 255]
 
lc3 number of literal context bits - [0, 8]
 
lp0 number of literal pos bits - [0, 4]
 
pb2 number of pos bits - [0, 4]
 
mfbt4 Match Finder: [bt2, bt3, bt4, bt4b, pat2r, pat2, pat2h, pat3h, pat4h, hc3, hc4]. Can also be set as "bt4". For "normal" and "max" mode most appropriate match finders are bt3, bt4, bt4b. For "fast" mode - hc3 and hc4.
 
GRZip
 
m1 Selection of compression algorithm, from tightest to fastest:
 
m1 - LZP + BWT + WFC + EC
 
m2 - LZP + BWT + MTF + EC
 
m3 - LZP + ST4 + WFC + EC
 
m4 - LZP + ST4 + MTF + EC
 
b8mb Maximum block size. Can be also set as "8mb" or even "8m". 8 mb is the maximum.
 
l32 LZP Minimum Matched Len. Can also be set as "32"
 
h15 LZP Hash table size, actual memory usage for hash table will be 4*2^Size bytes. Greater values increase compression ratio, but decreases speed of compression and decompression, because hash table will no more fit in Level-2 processor cache.
 
s Use alternative BWT Sorting algorithm (faster for repetitive blocks)
 
a Enable Adaptive block size reduction
 
l Disable LZP preprocessing. On some data (executables, plain text books) LZP preprocessor don't gives any improvements and only wastes CPU time
 
d Enable Delta filter
 
p Disable all Preprocessing techniques (LZP, Delta filter, Adaptive block size reduction). By default, all preprocessing techniques, except for LZP, are already disabled
 
LZP
 
LZP algorithm is just an LZP preprocessor from GRZip algorithm, optimized by Dmitry Shkarin. So it has all the same parameters
 
b8mb Maximum block size. Can be also set as "8mb" or even "8m"
 
l64 Minimum Matched Len. Can also be set as "64"
 
h18 Hash table size, actual memory usage for hash table will be 4*2^Size bytes. Greater values increase compression ratio, but decreases speed of compression and decompression, because hash table will no more fit in Level-2 processor cache.
 
   
  +
== Features planned for next versions ==
In order to know more about how each parameter influence compression ratio, compression/decompression speed and memory usage, please download original libraries and study their documentation. LZMA parameters are best described in documentation for 7-zip, see in 7-zip.chm description of "-m" option.
 
   
  +
- Ability to use several compression algorithms sequentially, for example:
==Features planned for next versions==
 
  +
compress "lzp+exe+lzma". This will improve compression ratio
   
  +
- Passing to C compression/decompression functions additional pointers,
" Ability to use several compression algorithms sequentially, for example: compress "lzp+exe+lzma" : This will improve compression ratio
 
  +
which then will be passed in each call to reader and writer functions. It's
" Passing to C compression/decompression functions additional pointers, which then will be passed in each call to reader and writer functions. It's needed to throw away using global variables in readers/writers, which will allow writing multi-thread friendly code. If you use only Haskell interfaces, this is completely not a problem for you (actually, I use this library in my own multi-threaded Haskell program)
 
  +
needed to throw away using global variables in readers/writers, which will
  +
allow writing multi-thread friendly code. If you use only Haskell
  +
interfaces, this is completely not a problem for you (actually, I use this
  +
library in my own multi-threaded Haskell program)

Revision as of 16:53, 20 May 2006

Compression-2005 library

Compression-2005 library includes the most competitive compression libraries as of April'05:

- LZMA 4.06 by Igor Pavlov. You can download original library at http://www.7-zip.org. This library features fast extraction speed, minimal memory requirements for extraction, best compression for non-text files

- PPMd var.I by Dmitry Shkarin, http://www.compression.ru/ds. This library gives best compression for plain text files

- GRZipLib 0.2.4 by Grebnov Ilya, http://magicssoft.ru/. This library provides fastest compression while compress better than bzip2

All these libraries were assembled together by me and simple but powerful common interface to use them all was added. This interface consists of pairs of routines, which performs compression and decompression.


General compression routines

First pair of routines are 'compress' and 'decompress'. Their usage is similar:

compress method reader writer
decompress method reader writer

What is the 'method' in these routines? Each compression algorithm has its own set of parameters, tuning them has effect of dramatically changing compression ratio and compression/decompression speed. To unify access to all these parameters, compression algorithm together with all it's parameters is represented by single string, such as "lzma:8mb:fast:hc4:fb32" or "grzip:8mb:m4". Each method has default parameters, which will be used unless overridden. For example, string "grzip" means using method GRZip with all default parameters, "grzip:m2" - using all default parameters, with exception that parameter "m" set to value "2", "grzip:m2:h20" means changing two parameters from their defaults and so on... Section #Description of compression methods parameters shows all the compression methods supported, their parameters and its' default values.

'reader' and 'writer' are callbacks, used by (de)compression code to read input data and write output data. Their types are identical:

reader, writer :: Ptr CChar -> Int -> IO Int

When called, 'reader' gets pointer to input buffer and its size. It must fill buffer with input data and return number of bytes read. If there is no more input data, it must return 0, if error occurred - it must return error code < 0. In the last case this error code will be returned as a result of entire call to '(de)compress' routine.

'writer', when called, gets pointer to output buffer and its size. It must write output data and return value >= 0 if all ok and error code < 0, if there is an error. In last case this code will be returned as a result of (de)compress.

For example, simplest compression program can be written just as:

let doPutBuf h buf size = do hPutBuf h buf size; return size
compress "ppmd:10:48mb" (hGetBuf stdin) (doPutBuf stdout)

This program will work as compressing filter: read data from stdin, compress them and write result to stdout.

Corresponding decompressor, as you can guess:

decompress "ppmd:10:48mb" (hGetBuf stdin) (doPutBuf stdout)

Of course, under Windows you need to add "hSetBinaryMode stdin True", "hSetBinaryMode stdout True" operations to these simple programs.


Another paired routines that you can use are 'compressWithHeader' and 'decompressWithHeader'. They work just as previous ones, but name of used compression algorithm is saved in compressed output. Decompression routine just reads method name from input data instead of receiving it as a parameter. This is advantageous if data can be compressed with different algorithms or even one algorithm with different parameters, because some of these parameters are needed for decompression too. You can see example of using these functions in Example-Haskell.hs

Using callbacks to read and write data provides you ability to (de)compress data completely located in memory as well as data read/written to files, received/send to network sockets, another program or even another process in your own program. I use the last technique to get compressed on-the-fly the data generated by some complex algorithm, which is impractically to write as reader callback.

There are also some more functions - 'canonizeCompressionMethod' returns canonical representation of compression method (two methods are equal if their canonical representations are equal), and a set of low-level (de)compression routines, which calls concrete algorithms and pass all their parameters as separate numbers, for example:

ppmd_compress 10 (48*mb) 0 reader writer

will work just the same as

compress "ppmd:10:48mb:r0" reader writer


Compressing data in memory

In order to simplify simple in-memory compression, the following specialized variants of abovementioned functions are provided:


compressMem method inBuf inBufSize outBuf outBufSize

will compress data in 'inBuf' having length 'inBufSize'. Output data will be written to 'outBuf' of size 'outBufSize'. This function returns size of compressed data or negative value - error code. If output buffer is too small to hold all compressed data, then appropriate error code will be returned.

decompressMem method inBuf inBufSize outBuf outBufSize

will decompress data in 'inBuf' to 'outBuf' and return number of bytes decompressed or error code.


compressMemWithHeader method inBuf inBufSize outBuf outBufSize

works like 'compressMem' but adds description of method used to the data written to 'outBuf'.

decompressMemWithHeader inBuf inBufSize outBuf outBufSize

must be used to decompress such data.


Manipulating compression method parameters

There are a number of functions what can be used to query/modify compression method descriptions in general fashion, without respect to concrete method used. This functions can query/set memory usage for compression and decompression, dictionary size (if this have some meaning for particular compression method) and block size (also if this have a meaning). These functions are:

getCompressionMem, getDecompressionMem, getDictionary, getBlockSize 
    :: String -> Int
setCompressionMem, setDecompressionMem, setDictionary, setBlockSize 
    :: String -> Int -> String
limitCompressionMem, limitDecompressionMem, limitDictionary, limitBlockSize 
    :: String -> Int -> String

First group of functions query appropriate parameter of compression method. For example, 'print (getDictionary "lzma:d1000")' will print '1000'.

Second group of functions sets appropriate parameter of compression method. For example, 'print (setDictionary 1000 "lzma")' will print 'lzma:1000b'.

Third group of functions limits appropriate parameter of compression method. For example, 'print (limitDictionary 1000 "lzma")' will print 'lzma:1000b', but 'print (limitDictionary (100*mb) "lzma")' will print 'lzma'.

Using these methods, you can for example limit memory required for compression:

compress (limitCompressionMem (20*mb) "lzma") reader writer


Compiling library

C part of library consist of modules Compression.cpp, PPMD/PPMD_Parser.cpp, PPMD/C_PPMD_Compress.cpp, PPMD/C_PPMD_Decompress.cpp, PPMD/C_LZP.cpp, LZMA/C_LZMA.cpp, LZMA/C_BCJ.cpp, GRZip/C_GRZip.cpp. All these modules can be compiled with compile[.cmd] if you have GHC or with Example-C.mak, if you have only GCC. To use other C++ compilers, make the appropriate changes in Example-C.mak. In any case, don't forget to change value of TEMPDIR in common.mak to some directory for temporaries existing on your machine.

To compile provided Example-Haskell.hs, use compile-Example-Haskell[.cmd]. This batch file also contains reference to "c:/temp/ghc", which must be changed to the same value as TEMPDIR in common.mak.


If you want to use only decompression code, exclude from above list of files PPMD/C_PPMD_Compress.cpp (all other compression methods contain both compressor and decompressor in one file) and define preprocessor macro FREEARC_DECOMPRESS_ONLY. This will compile only decompression code, which is about 4 times smaller (50 kb instead of 200 kb). Remember, that when compiling C sources with GHC, you must use "-optc-DFREEARC_DECOMPRESS_ONLY" option to define preprocessor symbol.

Also for reducing size of code you can try to turn off optimization, and exclude modules for unused compression algorithms from compilation. For example, if you want to use GRZip only, include only Compression.cpp and GRZip/C_GRZip.cpp in your compilation.


Description of compression methods parameters

I've written default value of each parameter together with it's name

PPMD

o10       set model order to N
mem48mb   use N MB memory. Can also be set as "m48mb" or even "48m"
r0        method of model restoration at memory insufficiency:
          -r0 - restart model from scratch (default)
          -r1 - cut off model (slow)
          -r2 - freeze model (dangerous)
          "r" alone is equivalent to "r1"


LZMA

a1      compression mode: "a0" - fast, "a1" - normal, "a2" - max.
        Can also be set as "fast", "normal" and "max", appropriately
d8mb    dictionary size
fb32    number of fast bytes - [3, 255]
lc3     number of literal context bits - [0, 8]
lp0     number of literal pos bits - [0, 4]
pb2     number of pos bits - [0, 4]
mfbt4   Match Finder: [bt2, bt3, bt4, bt4b, pat2r, pat2, pat2h, pat3h, pat4h, hc3, hc4].
        Can also be set as "bt4". For "normal" and "max" mode most appropriate match
        finders are bt3, bt4, bt4b. For "fast" mode - hc3 and hc4.


GRZip

m1      Selection of compression algorithm, from tightest to fastest:
        m1 - LZP + BWT + WFC + EC
        m2 - LZP + BWT + MTF + EC
        m3 - LZP + ST4 + WFC + EC
        m4 - LZP + ST4 + MTF + EC
b8mb    Maximum block size. Can be also set as "8mb" or even "8m". 8 mb is the maximum.
l32     LZP Minimum Matched Len. Can also be set as "32"
h15     LZP Hash table size, actual memory usage for hash table will be 4*2^Size bytes.
        Greater values increase compression ratio, but decreases speed of compression
        and decompression, because hash table will no more fit in Level-2 processor cache.
s       Use alternative BWT Sorting algorithm (faster for repetitive blocks)
a       Enable Adaptive block size reduction
l       Disable LZP preprocessing. On some data (executables, plain text books) LZP
        preprocessor don't gives any improvements and only wastes CPU time
d       Enable Delta filter
p       Disable all Preprocessing techniques (LZP, Delta filter, Adaptive block size
        reduction). By default, all preprocessing techniques, except for LZP, are already disabled


LZP

LZP algorithm is just an LZP preprocessor from GRZip algorithm, optimized by Dmitry Shkarin. So it has the same parameters:

b8mb    Maximum block size. Can be also set as "8mb" or even "8m"
l64     Minimum Matched Len. Can also be set as "64"
h18     Hash table size, actual memory usage for hash table will be 4*2^Size bytes.
        Greater values increase compression ratio, but decreases speed of compression
        and decompression, because hash table will no more fit in Level-2 processor cache.

EXE

This method just preprocess Win32 executables to allow better compression


In order to know more about how each parameter influence compression ratio, compression/decompression speed and memory usage, please download original libraries and study their documentation. LZMA parameters are best described in documentation for 7-zip, see description of "-m" option in 7-zip.chm.


Features planned for next versions

- Ability to use several compression algorithms sequentially, for example: compress "lzp+exe+lzma". This will improve compression ratio

- Passing to C compression/decompression functions additional pointers, which then will be passed in each call to reader and writer functions. It's needed to throw away using global variables in readers/writers, which will allow writing multi-thread friendly code. If you use only Haskell interfaces, this is completely not a problem for you (actually, I use this library in my own multi-threaded Haskell program)