This page is intended, initially, as a place to discuss extending GHC to take advantage of CPU SIMD instructions such as SSE and AltiVec.
SSE provides 'packed' data types of floats and integers that fit into the 128-bit xmm registers.
The operations on these data types include the standard mathematical operations (Add/Mul/...). There are also additional mathematical operations (reciprocal, reciprocal-square-root) and packed-specific operations such as dot-product, horizontal add/sub/add-sub.
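To make the lane-wise and horizontal operations concrete, here is a minimal reference model in plain Haskell. It is only a sketch of the semantics: `FloatX4`, `zipLanes`, `addX4`, `mulX4`, `haddX4` and `dotX4` are hypothetical names invented for illustration, not existing GHC types or primops, and a real implementation would keep the four lanes in a single xmm register rather than in a boxed data type.

```haskell
-- Hypothetical reference model of a packed value holding four 32-bit floats.
-- Lane 0 is the lowest lane of the register.
data FloatX4 = FloatX4 !Float !Float !Float !Float
  deriving (Eq, Show)

-- Apply a binary operation lane by lane (the shape of a packed add or multiply).
zipLanes :: (Float -> Float -> Float) -> FloatX4 -> FloatX4 -> FloatX4
zipLanes f (FloatX4 a0 a1 a2 a3) (FloatX4 b0 b1 b2 b3) =
  FloatX4 (f a0 b0) (f a1 b1) (f a2 b2) (f a3 b3)

addX4, mulX4 :: FloatX4 -> FloatX4 -> FloatX4
addX4 = zipLanes (+)
mulX4 = zipLanes (*)

-- Horizontal add: adjacent pairs within each operand are summed.
-- The two low result lanes come from the first operand,
-- the two high result lanes from the second.
haddX4 :: FloatX4 -> FloatX4 -> FloatX4
haddX4 (FloatX4 a0 a1 a2 a3) (FloatX4 b0 b1 b2 b3) =
  FloatX4 (a0 + a1) (a2 + a3) (b0 + b1) (b2 + b3)

-- Dot product built from a lane-wise multiply and two horizontal adds.
dotX4 :: FloatX4 -> FloatX4 -> Float
dotX4 a b =
  let p               = mulX4 a b
      q               = haddX4 p p
      FloatX4 s _ _ _ = haddX4 q q
  in s
```

Loaded into GHCi, `dotX4 (FloatX4 1 2 3 4) (FloatX4 5 6 7 8)` multiplies lane by lane and then reduces the four products with two horizontal adds.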
Also, to support data-streaming operations, there are memory operations that bypass the cache and move data directly between memory and the xmm registers.
xmm registers are 128 bits and hold both packed integer and packed float types. I suggest that a new `PackedReg` data constructor be added.
In terms of an implementation plan:
- Add new packed data types and 'standard' operations on those types to Cmm and primops.txt.pp:
  - Int32Packed4#, ...
  - Width = ... | W32_4 | ...
- Implement the new types and operations in the backends (C/LLVM/ASM).
So far this is straightforward.
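As a rough illustration of the `Width` extension in the plan above, the following standalone sketch shows how a packed width could carry both a lane size and a lane count. It is not GHC's actual Cmm definition: only `W32_4` comes from the plan, while `W64_2`, `laneBits`, `laneCount` and `isPacked` are assumptions added for the example, and the list of scalar widths is abbreviated.

```haskell
-- Standalone sketch: this is NOT GHC's actual Cmm definition.
data Width
  = W8 | W16 | W32 | W64   -- scalar widths (abbreviated)
  | W32_4                  -- four 32-bit lanes, as proposed in the plan
  | W64_2                  -- assumption: two 64-bit lanes, added by analogy
  deriving (Eq, Show)

-- Size of a single lane in bits.
laneBits :: Width -> Int
laneBits W8    = 8
laneBits W16   = 16
laneBits W32   = 32
laneBits W64   = 64
laneBits W32_4 = 32
laneBits W64_2 = 64

-- Number of lanes; scalar widths have exactly one.
laneCount :: Width -> Int
laneCount W32_4 = 4
laneCount W64_2 = 2
laneCount _     = 1

-- Total size in bits, which is what a register allocator would care about.
widthInBits :: Width -> Int
widthInBits w = laneBits w * laneCount w

-- True for widths that describe a packed (multi-lane) value.
isPacked :: Width -> Bool
isPacked w = laneCount w > 1
```

Keeping the lane size and lane count separate means an operation that is already parameterised by a width could, in principle, be reused for packed values without a new constructor per operation.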
- As has been mentioned on the developers' wiki, a 'packed-size' agnostic optimising layer of vector operations would be great. It seems that this could be implemented on top of the CPU-specific primops, without adding further primops.
- What mechanism should be used for constructing/accessing elements of a packed data type? (LLVM has a `<n x type>` vector type with `insertelement` and `extractelement` instructions.)
- Stream fusion would allow complex operations built from `map` and `zip` over vectors of Floats, etc., to be optimised to make use of CPU vector instructions.
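The last three points can be sketched together in plain Haskell. This is an illustration under stated assumptions, not a proposed API: `FloatX4`, `extract`, `insert`, `addX4` and `zipWithPacked` are hypothetical names, the element accessors are loosely modelled on LLVM's `extractelement` and `insertelement`, and the list-based loop is not stream fusion itself. It only shows the shape a size-agnostic layer might take: process four elements at a time with a packed operation and fall back to a scalar operation for the tail.

```haskell
-- Hypothetical packed value of four 32-bit floats (lane 0 first).
data FloatX4 = FloatX4 !Float !Float !Float !Float
  deriving (Eq, Show)

-- Read one lane; indices outside 0..3 are a programming error.
extract :: Int -> FloatX4 -> Float
extract 0 (FloatX4 a _ _ _) = a
extract 1 (FloatX4 _ b _ _) = b
extract 2 (FloatX4 _ _ c _) = c
extract 3 (FloatX4 _ _ _ d) = d
extract i _                 = error ("extract: bad lane " ++ show i)

-- Replace one lane, leaving the others unchanged.
insert :: Int -> Float -> FloatX4 -> FloatX4
insert 0 x (FloatX4 _ b c d) = FloatX4 x b c d
insert 1 x (FloatX4 a _ c d) = FloatX4 a x c d
insert 2 x (FloatX4 a b _ d) = FloatX4 a b x d
insert 3 x (FloatX4 a b c _) = FloatX4 a b c x
insert i _ _                 = error ("insert: bad lane " ++ show i)

-- Lane-wise addition, standing in for a CPU-specific packed primop.
addX4 :: FloatX4 -> FloatX4 -> FloatX4
addX4 (FloatX4 a0 a1 a2 a3) (FloatX4 b0 b1 b2 b3) =
  FloatX4 (a0 + b0) (a1 + b1) (a2 + b2) (a3 + b3)

-- Size-agnostic zip: use the packed operation while both inputs have at
-- least four elements left, then finish the remainder with the scalar one.
zipWithPacked
  :: (FloatX4 -> FloatX4 -> FloatX4)  -- packed operation
  -> (Float -> Float -> Float)        -- matching scalar operation
  -> [Float] -> [Float] -> [Float]
zipWithPacked vf sf (a0:a1:a2:a3:as) (b0:b1:b2:b3:bs) =
  let FloatX4 r0 r1 r2 r3 = vf (FloatX4 a0 a1 a2 a3) (FloatX4 b0 b1 b2 b3)
  in r0 : r1 : r2 : r3 : zipWithPacked vf sf as bs
zipWithPacked _ sf as bs = zipWith sf as bs
```

The caller supplies the packed and scalar versions of the same operation, for example `zipWithPacked addX4 (+) xs ys`, so user code never mentions the packed size; a fusion framework would aim to generate this kind of loop automatically.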