Numeric Haskell: A Vector Tutorial
Vector is a Haskell library for working with arrays. It has an emphasis on very high performance through loop fusion, whilst retaining a rich interface. The main data types are boxed and unboxed arrays, and arrays may be immutable (pure), or mutable. Arrays may hold Storable elements, suitable for passing to and from C, and you can convert between the array types. Arrays are indexed by non-negative Int
values.
The vector library has an API similar to the famous Haskell list library, with many of the same names.
This tutorial is modelled on the NumPy tutorial.
Quick Tour
Here is a quick overview to get you started.
Importing the library
Download the vector package:
$ cabal install vector
and import it as, for boxed arrays:
import qualified Data.Vector as V
or:
import qualified Data.Vector.Unboxed as V
for unboxed arrays. The library needs to be imported qualified as it shares the same function names as list operations in the Prelude.
Generating Vectors
New vectors can be generated in many ways:
$ ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> :m + Data.Vector
-- Generating a vector from a list:
Prelude Data.Vector> let a = fromList [10, 20, 30, 40]
Prelude Data.Vector> a
fromList [10,20,30,40] :: Data.Vector.Vector
-- Or filled from a sequence
Prelude Data.Vector> enumFromStepN 10 10 4
fromList [10,20,30,40] :: Data.Vector.Vector
-- A vector created from four consecutive values
Prelude Data.Vector> enumFromN 10 4
fromList [10,11,12,13] :: Data.Vector.Vector
You can also build vectors using operations similar to lists:
-- The empty vector
Prelude Data.Vector> empty
fromList [] :: Data.Vector.Vector
-- A vector of length one
Prelude Data.Vector> singleton 2
fromList [2] :: Data.Vector.Vector
-- A vector of length 10, filled with the value '2'
-- Note that to disambiguate names,
-- and avoid a clash with the Prelude,
-- with use the full path to the Vector module
Prelude Data.Vector> Data.Vector.replicate 10 2
fromList [2,2,2,2,2,2,2,2,2,2] :: Data.Vector.Vector
In general, you may construct new vectors by applying a function to the index space:
Prelude Data.Vector> generate 10 (^2)
fromList [0,1,4,9,16,25,36,49,64,81] :: Data.Vector.Vector
Vectors may have more than one dimension:
-- Here we create a two dimensional vector, 10 columns,
-- each row filled with the row index.
Prelude Data.Vector> let x = generate 10 (\n -> Data.Vector.replicate 10 n)
-- The type is "Vector of Vector of Ints"
Prelude Data.Vector> :t x
x :: Vector (Vector Int)
Vectors may be grown or shrunk arbitrarily:
Prelude Data.Vector> let y = Data.Vector.enumFromTo 0 11
Prelude Data.Vector> y
fromList [0,1,2,3,4,5,6,7,8,9,10,11] :: Data.Vector.Vector
-- Take the first 3 elements as a new vector
Prelude Data.Vector> Data.Vector.take 3 y
fromList [0,1,2] :: Data.Vector.Vector
-- Duplicate and join the vector
Prelude Data.Vector> y Data.Vector.++ y
fromList [0,1,2,3,4,5,6,7,8,9,10,11,0,1,2,3,4,5,6,7,8,9,10,11] :: Data.Vector.Vector
Modifying vectors
Just as with lists, you can iterate (map) over arrays, reduce them (fold), filter them, or join them in various ways:
-- mapping a function over the elements of a vector
Prelude Data.Vector> Data.Vector.map (^2) y
fromList [0,1,4,9,16,25,36,49,64,81,100,121] :: Data.Vector.Vector
-- Extract only the odd elements from a vector
Prelude Data.Vector> Data.Vector.filter odd y
fromList [1,3,5,7,9,11] :: Data.Vector.Vector
-- Reduce a vector
Prelude Data.Vector> Data.Vector.foldl (+) 0 y
66
-- Take a scan (partial results from a reduction):
Prelude Data.Vector> Data.Vector.scanl (+) 0 y
fromList [0,0,1,3,6,10,15,21,28,36,45,55,66] :: Data.Vector.Vector
-- Zip two arrays pairwise, into an array of pairs
Prelude Data.Vector> Data.Vector.zip y y
fromList [(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9),(10,10),(11,11)] :: Data.Vector.Vector
Indexing vectors
And like all good arrays, you can index them in various ways:
-- Take an arbitrary element
Prelude Data.Vector> y ! 4
4
-- Take the last element
Prelude Data.Vector> Data.Vector.last y
11
-- Take the first element
Prelude Data.Vector> Data.Vector.head y
0
-- Take the rest
Prelude Data.Vector> Data.Vector.tail y
fromList [1,2,3,4,5,6,7,8,9,10,11] :: Data.Vector.Vector
The Tutorial
The vector package provides several types of array. The most general interface is via Data.Vector, which provides for boxed arrays, holding any type.
There are also more specialized array types:
which provide unboxed arrays (i.e. no closures) and storable arrays (data that is pinned, and may be passed to and from C via a Ptr).
In all cases, the operations are subject to loop fusion. That is, if you compose two functions,
map f . map g
the compiler will rewrite it into a single traversal:
map (f . g)
saving time and space.
Simple example
You can create the arrays in many ways, for example, from a regular Haskell list:
Prelude Data.Vector> let a = fromList [2,3,4]
Prelude Data.Vector> a
fromList [2,3,4] :: Data.Vector.Vector
Prelude Data.Vector> :t a
a :: Vector Integer
GHCi will print the contents of the vector as executable code.
To create a multidimensional array, you can use a nested list generator to fill it:
Prelude Data.Vector> let x = fromList [ fromList [1 .. x] | x <- [1..10] ]
Prelude Data.Vector> :t x
x :: Vector (Vector Integer)
-- XXX TODO need a better printing function for multidimensional arrays.
You can also just create arrays filled with zeroes:
Prelude Data.Vector> Data.Vector.replicate 10 0
fromList [0,0,0,0,0,0,0,0,0,0] :: Data.Vector.Vector
And you can fill arrays from a sequence generator:
Prelude Data.Vector> enumFromN 1 10
fromList [1,2,3,4,5,6,7,8,9,10] :: Data.Vector.Vector
Prelude Data.Vector> enumFromStepN 0 10 10
fromList [0,10,20,30,40,50,60,70,80,90] :: Data.Vector.Vector
Array Types
The vector package provides several array types, with an identical interface. They have different flexibility with respect to the types of values that may be stored in them, and different performance characteristics.
In general:
- End users should use Data.Vector.Unboxed for most cases
- If you need to store more complex structures, use Data.Vector
- If you need to pass to C, use Data.Vector.Storable
For library writers;
- Use the generic interface, to ensure your library is maximally flexible: Data.Vector.Generic
Boxed Arrays: Data.Vector
The most flexible type is Data.Vector.Vector, which provides *boxed* arrays: arrays of pointers to Haskell values.
- Data.Vector.Vector's are fully polymorphic: they can hold any valid Haskell type
These arrays are suitable for storing complex Haskell types (sum types, or algebraic data types), but a better choice for simple data types is Data.Vector.Unboxed.
Unboxed Arrays: Data.Vector.Unboxed
Simple, atomic types, and pair types can be stored in a more efficient manner: consecutive memory slots without pointers. The Data.Vector.Unboxed.Vector type provides unboxed arrays of types that are members of the Unbox class, including:
- Bool
- ()
- Char
- Double
- Float
- Int
- Int8, 16, 32, 64
- Word
- Word8, 16, 32, 64
- Complex a's, where 'a' is in Unbox
- Tuple types, where the elements are unboxable
Unboxed arrays should be preferred when you have unboxable elements, as they are generally more efficient.
Storable Arrays: passing data to C
Storable arrays (Data.Vector.Storable.Vector) are vectors of any type in the Storable class.
These arrays are pinned, and may be converted to and from pointers, that may be passed to C functions, using a number of functions:
unsafeFromForeignPtr
:: Storable a
=> ForeignPtr a
-> Int
-> Int
-> Vector a
-- Create a vector from a ForeignPtr with an offset and a length. The data may
-- not be modified through the ForeignPtr afterwards.
unsafeToForeignPtr
:: Storable a
=> Vector a
-> (ForeignPtr a, Int, Int)
-- Yield the underlying ForeignPtr together with the offset to the data and its
-- length. The data may not be modified through the ForeignPtr.
unsafeWith
:: Storable a
=> Vector a
-> (Ptr a -> IO b)
-> IO b
-- Pass a pointer to the vector's data to the IO action. The data may not be
-- modified through the 'Ptr.
Storing your own types in Storable vectors
You can store new data types in Storable vectors, beyond those for instances that already exist, by writing a Storable instance for the type.
Here we store a 4-element Double vector as the array element.
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
{-# UNPACK #-} !CFloat
{-# UNPACK #-} !CFloat
{-# UNPACK #-} !CFloat
Ensure we can store it in an array
instance Storable Vec4 where
sizeOf _ = sizeOf (undefined :: CFloat) * 4
alignment _ = alignment (undefined :: CFloat)
{-# INLINE peek #-}
peek p = do
a <- peekElemOff q 0
b <- peekElemOff q 1
c <- peekElemOff q 2
d <- peekElemOff q 3
return (Vec4 a b c d)
where
q = castPtr p
{-# INLINE poke #-}
poke p (Vec4 a b c d) = do
pokeElemOff q 0 a
pokeElemOff q 1 b
pokeElemOff q 2 c
pokeElemOff q 3 d
where
q = castPtr p
And now we can write operations on the new vector, with very good performance.
a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6
add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')
mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')
vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d
multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
| count <= 0 = src
| otherwise = multList (count-1) $ V.map (\v -> add (mult v m) a) src
main = do
print $ Data.Vector.Storable.sum
$ Data.Vector.Storable.map vsum
$ multList repCount
$ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)
repCount, arraySize :: Int
repCount = 10000
arraySize = 20000
Pure Arrays
Impure Arrays
Arrays can be created and operated on in a mutable fashion -- using destructive updates, as in an imperative language. Once all operations are complete, the mutable array can be "frozen" to a pure array, which changes its type.
Mutable arrays plus freezing are quite useful for initializing arrays from data in the outside world.
For example, to fill a generic array, we first
- allocate an empty vector of size
n
- destructively update the cells using a generator function
- freeze the array and return it as a pure value.
import qualified System.Random.Mersenne as R
import qualified Data.Vector.Generic as G
import qualified Data.Vector.Generic.Mutable as GM
random :: (R.MTRandom a, G.Vector v a) => R.MTGen -> Int -> IO (v a)
random g n = do
v <- GM.new n
fill v 0
G.unsafeFreeze v
where
fill v i
| i < n = do
x <- R.random g
GM.unsafeWrite v i x
fill v (i+1)
| otherwise = return ()
Here we use the Data.Vector.Generic.Mutable.new to allocate a new, uninitialized array, which we then write elements to (using unsafeWrite).
By using the generic interface, we can construct boxed, storable or unboxed arrays all from the same code.
Some examples
The most important attributes of an array are available in O(1) time, such as the size (length),
-- how big is the array?
Prelude Data.Vector> let a = fromList [1,2,3,4,5,6,7,8,9,10]
Prelude Data.Vector> Data.Vector.length a
10
-- is the array empty?
Prelude Data.Vector> Data.Vector.null a
False
Array Creation
Enumerations
The most common way to generate a vector is via an enumeration function:
- enumFromN
- enumFromStepN
And the list-like:
- enumFromTo
- enumFromThenTo
The enumFrom*N functions are guaranteed to optimize well for any type. The enumFromTo functions might fall back to generating from lists if there is no specialization for your type. They are currently specialized to most Int/Word/Double/Float generators.
> enumFromN 1 10
fromList [1,2,3,4,5,6,7,8,9,10]
> enumFromStepN 1 3 4
fromList [1,4,7,10]
> Data.Vector.enumFromTo 1 10
fromList [1,2,3,4,5,6,7,8,9,10]
-- counting backwards
> Data.Vector.enumFromThenTo 10 9 1
fromList [10,9,8,7,6,5,4,3,2,1]
A note on fusion
As for almost all vector functions, if an enumerator is composed with a traversal or fold, they will fuse into a single loop.
For example, we can fuse generation of an array of doubles, with computing the product of the square roots. The source program consists of two loops:
import qualified Data.Vector as V
test :: V.Vector Int -> Double
test = V.foldl (\ a b -> a * sqrt (fromIntegral b)) 0
create :: Int -> V.Vector Int
create n = (V.enumFromTo 1 n)
main = print (test (create 1000000))
And after optimization (revealed with the ghc-core tool), we have only one loop:
main_$s$wfoldlM_loop :: Int# -> Double# -> Double#
main_$s$wfoldlM_loop =
\ (sc_sWA :: Int#) (sc1_sWB :: Double#) ->
case <=# sc_sWA 1000000 of _ {
False -> sc1_sWB;
True ->
main_$s$wfoldlM_loop
(+# sc_sWA 1)
(*##
sc1_sWB (sqrtDouble# (int2Double# sc_sWA)))
}
Doubling the performance, by halving the number of traversals. Fusion also means we can avoid any intermediate data structure allocation.
An example: filling a vector from a file
We often want to populate a vector using a external data file. The easiest way to do this is with bytestring IO, and Data.Vector.unfoldr (or the equivalent functions in Data.Vector.Unboxed or Data.Vector.Storable:
Parsing Ints
The simplest way to parse a file of Int or Integer types is with a strict or lazy ByteString, and the readInt or readInteger functions:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
Note the use of bang patterns to ensure the parsing accumulated state is produced strictly.
Create a data file filled with 1 million integers:
$ seq 1 1000000 > data
Compile with -Odph (enables special optimizations to help fusion):
$ ghc -Odph --make vector.hs
Run:
$ time ./vector data 500000500000 ./vector data 0.08s user 0.01s system 98% cpu 0.088 total
Parsing Floating Point Values
To load a file of floating point values into a vector, you can use bytestrings and the bytestring-lexing package, which provides readDouble and readFloat functions.
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.ByteString.Lex.Lazy.Double as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readDouble s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
Parsing Binary Data
The best way to parse binary data is via bytestrings and the Data.Binary package.
There are instances of Binary and Serialize available in the vector-binary-instances package.
An example: parsing a list of integers in text form, serializing them back in binary form, then loading that binary file:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Binary
import Data.Binary
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector.Unboxed as V
main = do
s <- L.readFile "dat"
let v = parse s :: V.Vector Int
encodeFile "dat2" v
v' <- decodeFile "dat2" :: IO (V.Vector Int)
print (v == v')
-- Fill a new vector from a file containing a list of numbers.
parse = V.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
Random numbers
If we can parse from a file, we can also fill a vector with random numbers. We'll use the mersenne-random package:
$ cabal install mersenne-random
We can then use MTRandom class to generate random vectors of different types:
import qualified Data.Vector.Unboxed as U
import System.Random.Mersenne
import Control.Monad
main = do
-- create a new source of randomness
-- and an infinite list of randoms
g <- newMTGen Nothing
rs <- randoms g
-- fill a vector with the first 10 random Ints
let a = U.fromList (take 10 rs) :: U.Vector Int
-- print the sum
print (U.sum a)
-- print each element
forM_ (U.toList a) print
Running it:
$ runhaskell B.hs
-56426044567146682
-2144043065897064806
-2361203915295681429
1023751035988638668
-5145147152582103336
6545758323081548799
-7630294342751488332
-5861937811333342436
-3198510304070719259
8949914511577398116
-8681457396993884283
We can also just use the vector-random package to generate new vectors initialized with the mersenne twister generator:
For example, to generate 100 million random Doubles and sum them:
import qualified Data.Vector.Unboxed as U
import System.Random.Mersenne
import qualified Data.Vector.Random.Mersenne as G
main = do
g <- newMTGen Nothing
a <- G.random g 10000000 :: IO (U.Vector Double) -- 100 M
print (U.sum a)
Filling with a monadic action
We might want to fill a vector with a monadic action, and have a pure vector at the end. The Vector API now contains a standard replicateM for this purpose, but if your monadic action is in IO, the following code is more efficient:
replicateMIO :: (G.Vector v a) => Int -> IO a -> IO (v a)
replicateMIO n a = do
v <- M.new n
fill v 0
G.unsafeFreeze v
where
fill v i
| i < n = do
x <- a
M.unsafeWrite v i x
fill v (i+1)
| otherwise = return ()
Transformations on Vectors
A primary operation on arrays are the zip class of functions.
> let a = fromList [20,30,40,50]
> let b = enumFromN 0 4
> a
fromList [20,30,40,50]
> b
fromList [0,1,2,3]
> Data.Vector.zipWith (-) a b
fromList [20,29,38,47]
We can also, of course, apply a function to each element of a vector (map):
> Data.Vector.map (^2) b
fromList [0,1,4,9]
> Data.Vector.map (\e -> 10 * sin (fromIntegral e)) a
fromList [9.129452507276277,-9.880316240928618,7.451131604793488,-2.6237485370392877]
> Data.Vector.map (< 35) a
fromList [True,True,False,False]
Folds: Sums, Products, Min, Max
Many special purpose folds (reductions) on vectors are available:
> let a = enumFromN 1 100
> Data.Vector.sum a
5050
> Data.Vector.product a
93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000
> Data.Vector.minimum a
1
> Data.Vector.maximum a
100
Indexing, Slicing and Iterating
One dimensional arrays can be indexed, sliced and iterated over pretty much like lists.
Because Haskell values are by default immutable, all slice operations are zero-copying.
> let a = enumFromN 0 10
> a
fromList [0,1,2,3,4,5,6,7,8,9]
> let b = Data.Vector.map (^3) a
> b ! 2
8
> slice 2 3 b
fromList [8,27,64]
slice takes 3 arguments: the initial index to slice from, the number of elements to slice, and the vector to operate on.
A number of special-purpose slices are also available:
> Data.Vector.init b
fromList [0,1,8,27,64,125,216,343,512]
> Data.Vector.tail b
fromList [1,8,27,64,125,216,343,512,729]
> Data.Vector.take 3 b
fromList [0,1,8] :: Data.Vector.Vector
> Data.Vector.drop 3 b
fromList [27,64,125,216,343,512,729] :: Data.Vector.Vector
Unsafe Slices
For performance reasons you may wish to avoid bounds checks, when you can prove that the substring or index will be in bounds. For these cases there are unsafe operations, that let you skip the bounds check:
> let a = fromList [1..10]
> unsafeSlice 2 4 a
fromList [3,4,5,6]
> unsafeInit a
fromList [1,2,3,4,5,6,7,8,9]
> unsafeTail a
fromList [2,3,4,5,6,7,8,9,10]
and also unsafeTake, unsafeDrop.
Examples
Sum of Squares
Take the sum of the squares of the elements of a vector:
sumsq :: U.Vector Int -> Int
sumsq v = U.sum (U.map (\x -> x * x) v)