|Part of Yhc|
This page lists entirely weird and crazy ideas, that probably shouldn't be implemented. However, if you don't write a thought down it might never come back.
1 Have a via C generator
Could Yhc generate C? Maybe from the .hbc file? It should speed things up considerably. Perhaps have a yho (York Haskell Optimiser) that takes the G-machine code and converts it into C? I guess a simple one should be quite easy to do, a more complex one that exploits stuff (i.e. no point pushing then adding, just add directly) would give better performance, without too much hassle.
2 Proof of concept port to Cell architecture
People have sometimes commented that a language like Haskell might be a better language for programming a Cell-like cpu architecture than C since it is further from the low level details and so could adapt easier.
The Cell cpu architecture is difficult to code for in C because:
- Highly threaded. The Cell has one main cpu (Actually, the architecture I'm familiar with has 2 PPUs and 8 SPUs.) and 8 small cpus. Threaded C programming is difficult.
- Local memory. Each of the 8 small cpus do not directly address main memory but their own small local memory (about 256K) and then do block memory operations to and from main memory. In essence this is L1/L2 cache but under software control.
- Limited IO. Only the central cpu can do IO operations.
So it would be interesting to see if Haskell really can be implemented on this kind of architecture. For just a proof-of-concept it would be a great deal easier to modify yhc to try this than ghc since the latter is much more complex.
So what might it look like? Well obviously yhc would need to be extended to support concurrent Haskell.
One problem with the 8 SPUs is that their local memory is very small. You would not want to waste space by having a full runtime system on each one. Each SPU should only run the "mutator" parts of the runtime system. Allocation of memory blocks and major GC collections would be deferred to the central CPU.
Like GHC's IO design (at least in the threaded rts), we would want to use an IO manager thread running on the central CPU to perform all IO operations on behalf of each SPU. This would also eliminate much of the rts that is needed to run code on each SPU.
The RTS on each SPU would have to deal with minor GC collections. The GC would use a generational scheme where some proportion of the SPU local memory would used for the first generation. The SPU local memory would be split between the first generaton and those bits of older generations which were currently being read.
The RTS of each SPU would need to be able to determine if a Haskell heap object was currently in local memory or if it needed to be fetched from main memory. Perhaps some software implementation of a TLB would be appropriate. It would also need to deal with 'paging' chunks between local and main memory to deal with accessing objects from older GC generations.
If the RTS were particularly clever it might arrange for threads which are in a MVar reader/writer relationship to be scheduled on ajacent SPUs since these have faster access to each others local memory.
3 Bytecode Optimiser
Hopefully this isn't that crazy... This should collapse multiple bytecode files into one, so you have the standalone executable. It should do optimisation, much more agressively. Particularly inlining could be used. Also duplicate and redundant code should be detected and rejected.
4 HsNub - Haskell Duplicator
Two functions are equivalent if they have the same bytecode. This program should grab a load of bytecode files, and look for duplication. In particular, functions from the prelude might often be duplicated. There is a certain level of isomorphism that could be detected - i.e. constant table in different order, but even just a bit-for-bit identical would be useful as a first pass.
For more details, see the Yhc blog
5 .hbc -> .abc Compiler
The Clean language (http://www.cs.ru.nl/~clean/) has a portable byte code format, .abc. It also has a separate and stand alone .abc -> native code compiler. I'm not sure what optimisations are applied at each level, but this might be a fairly easy of getting a huge speed boost. Clean is known for being uber-fast.
I couldn't find a reference on .abc bytecode, but its plain text, so maybe someone could reverse engineer it. http://www.cs.ru.nl/~clean/contents/Addison__Wesley_book/addison__wesley_book.html