Difference between revisions of "GHC/CloudAndHPCHaskell"

Revision as of 13:55, 7 October 2011

This page serves to discuss the architecture of Cloud and HPC frameworks, possibly with an emphasis on the distributed execution and network model.

Subpages:

http://haskell.org/haskellwiki/GHC/CloudAndHPCHaskell/Transport

Peter Braam's thoughts about HPC & Cloud Compute Frameworks

I want to describe here what is taking place in HPC and cloud computing, and how I think it could impact the architecture of the Haskell packages we are discussing.

In HPC the 90's saw the emergence of clusters for computing with a distributed memory model. MPI became the de-facto standard for this some 10 years later. Then multicore entered the game for which OpenMP became the software and communication "standard", and more recently GPGPU's (which have CUDA and OpenCL) and other accelerators (FPGA boards are seeing a revival). Moreover, other many core chips which have a NOC (network on chip) are becoming important. It is likely that these hybrid architectures which are sometimes even re-configurable will remain prominent for the next 10 years or so.

The HPC community is looking for a software platform to program these complex systems within one model - things like CUDA + OpenMP + MPI are way too complicated. The architecture of the system must be made a first class citizen in further considerations for that to succeed. The architecture should be some kind of data structure, influenced by the hardware industry, that describes the networks with their interfaces and topologies that connect nodes, and include the various kinds of memories present in the node, the CPU chips with cores and accelerators with their memories and NOC's. As an example of this interest, AMD is developing the so called FSA architecture that does this at least for a node. It provides very low level interfaces to the components for instruction / data / thread dispatch and scheduling. Elements of such architecture descriptions exist (e.g. LLNL "genders" and resources associated with schedulers), but nowhere close to what we need.

A second key element is the scheduler. Until recently, things like HPC or the Hadoop job schedulers started jobs. Now, much more needs to be scheduled, on different cores, on accelerator boards and so on. The scheduler is responsible to create running entities on the components of the architecture, importantly these running entities normally read input data from a distributed file system or database. Currently scheduling is another messy story - one thing starts processes on remote nodes, another element spawns threads, and yet other facilities deal with the accelerators.

The networks are increasingly complicated too. There are now facilities to send packets directly from CPU caches to Infiniband networks, and people are working to set up networking between the accelerators and the cluster network (often IB). Moreover, most of the architectures are NUMA now. There are API's (like CCI) that aim to unify these various transports as one. About MPI, one gets about 50% of the bandwidth that is available, it's not a good platform anymore (see http://gasnet.cs.berkeley.edu/ for comparison with a PGAS oriented transport - created for UPC, or see ETI's web pages about SWARM).

Unlike client server jobs and unlike user applications running a 3rd party network protocols, the cloud and HPC jobs would probably be well served with an SPMD like model even with these complex architectures. I think that in SPMD models much communication is the exchange of memory, with just one caveat that sending small messages is very costly. With that we can hide the network almost completely from the programmer, just like DPH hides the cores.

Combining DPH with (i) networking, (ii) a BSP (Bulk Synchronous Processing) mode (iii) and beginning to look at other dispatch after vectorizing in DPH (e.g. to Accelerate) - probably totally obvious ideas to you would provide the parallel programs the Architecture and Scheduler can run. Encapsulating network and node failures in some Monad should be possible, and would as mentioned be an important component, or hopefully a perpendicular extension. The OTP failure model is perfect for purely functional situations, but with state involved things like checkpoints and exactly once semantics make it more involved.

This is in some sense just elaborating what SPJ said at a CUFP 2011 breakfast last week. I think that getting rid of compiler flags and runtime hints and instead leverage a built in scheduler / architecture model would be a good first step.

Difference between revisions of "GHC/CloudAndHPCHaskell"

Revision as of 13:55, 7 October 2011

Navigation menu

Search

@@ Line 15: / Line 15: @@
 A second key element is the scheduler.  Until recently, things like HPC or the Hadoop job schedulers started jobs.  Now, much more needs to be scheduled, on different cores, on accelerator boards and so on.    The scheduler is responsible to create running entities on the components of the architecture, importantly these running entities normally read input data from a distributed file system or database.  Currently scheduling is another messy story - one thing starts processes on remote nodes, another element spawns threads, and yet other facilities deal with the accelerators.
-The networks are increasingly complicated too.  There are now facilities to send packets directly from registers to Infiniband networks, and people are working to set up networking between the accelerators and the cluster network (often IB).  Moreover, most of the architectures are NUMA now.  There are API's (like CCI) that aim to unify these various transports as one.  About MPI, one gets about 50% of the bandwidth that is available, it's not a good platform anymore (see http://gasnet.cs.berkeley.edu/ for comparison with a PGAS oriented transport - created for UPC, or see ETI's web pages about SWARM).
+The networks are increasingly complicated too.  There are now facilities to send packets directly from CPU caches to Infiniband networks, and people are working to set up networking between the accelerators and the cluster network (often IB).  Moreover, most of the architectures are NUMA now.  There are API's (like CCI) that aim to unify these various transports as one.  About MPI, one gets about 50% of the bandwidth that is available, it's not a good platform anymore (see http://gasnet.cs.berkeley.edu/ for comparison with a PGAS oriented transport - created for UPC, or see ETI's web pages about SWARM).
 Unlike client server jobs and unlike user applications running a 3rd party network protocols, the cloud and HPC jobs would probably be well served with an SPMD like model even with these complex architectures.  I think that in SPMD models much communication is the exchange of memory, with just one caveat that sending small messages is very costly.  With that we can hide the network almost completely from the programmer, just like DPH hides the cores.