Nowadays, the memory levels of multiprocessor machines are more and more hierarchical: Opteron systems are NUMA machines, dual-core chips can have a shared cache level, HyperThreaded logical processors share almost everything, and accelerator boards add yet another level. How can irregular scientific computations be scheduled properly on such machines?
The basic idea I developed during my PhD thesis is to give programmers a way to express how the threads of their application relate to each other: bubbles. A bubble expresses, for instance, that some threads work on the same set of data or that they often communicate with each other, so that they should be scheduled in the same "corner" of the machine. Bubbles can be nested, so these relations are expressed in a hierarchical manner.
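To make the idea concrete, here is a minimal Python sketch of nested bubbles. It is only an illustration of the concept, not the actual API: the `Thread` and `Bubble` classes and their methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Thread:
    """A thread, identified by name only (hypothetical stand-in)."""
    name: str


@dataclass
class Bubble:
    """A group of threads and/or nested bubbles that belong together."""
    name: str
    children: List[Union["Bubble", Thread]] = field(default_factory=list)

    def add(self, child):
        """Put a thread or a sub-bubble into this bubble and return it."""
        self.children.append(child)
        return child

    def threads(self):
        """Return every thread held by this bubble, recursively."""
        result = []
        for child in self.children:
            if isinstance(child, Bubble):
                result.extend(child.threads())
            else:
                result.append(child)
        return result


# An application whose threads work in two groups on two data sets.
app = Bubble("app")
left = app.add(Bubble("left-half"))
right = app.add(Bubble("right-half"))
for i in range(2):
    left.add(Thread(f"left-{i}"))
    right.add(Thread(f"right-{i}"))

print([t.name for t in app.threads()])
```

The nesting is the whole point: a scheduler that sees `left-half` as one entity knows that its threads should stay close together, without knowing anything about the data they share.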
I have developed an API for manipulating these bubbles at a high level of abstraction. That way, people can experiment with different distribution schedulers without having to care about hardware details, and can really focus on algorithmic issues.
I have developed "bubble schedulers" that manipulate such hierarchies of bubbles: spreading the computation load while taking affinities into account, gang scheduling, work stealing. Trainees were able to experiment with other strategies: favoring affinities above all, taking into account the size of the data, how it is shared, and the access rate, ... All this in a way that automatically adapts to any hierarchical machine! The PhD thesis of François Broquedis developed these schedulers further, experimenting with them on OpenMP applications.
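The load-spreading strategy can be sketched as follows. This is a simplified Python illustration under my own assumptions, not the algorithm implemented in Marcel: bubbles are modeled as nested lists of thread names, the machine as nested lists of core names, the load of a bubble is just its number of threads, and the function names are hypothetical. At each level of the machine, the biggest bubbles are burst until there are enough entities to feed every child, and the entities are then distributed greedily.

```python
def load(entity):
    """Number of threads in an entity (a thread name or a nested list)."""
    if isinstance(entity, str):
        return 1
    return sum(load(child) for child in entity)


def flatten(entity):
    """List all thread names contained in an entity."""
    if isinstance(entity, str):
        return [entity]
    result = []
    for child in entity:
        result.extend(flatten(child))
    return result


def spread(entities, machine, placement=None):
    """Map each thread to a core, keeping bubbles together when possible."""
    if placement is None:
        placement = {}
    if isinstance(machine, str):
        # Leaf of the machine tree: a core runs everything it was given.
        for entity in entities:
            for thread in flatten(entity):
                placement[thread] = machine
        return placement
    entities = list(entities)
    # Burst the biggest bubbles until every child can receive something.
    while len(entities) < len(machine):
        bubbles = [e for e in entities if not isinstance(e, str)]
        if not bubbles:
            break
        biggest = max(bubbles, key=load)
        entities.remove(biggest)
        entities.extend(biggest)
    # Greedy distribution: biggest entity first, onto the least loaded child.
    bins = [[] for _ in machine]
    for entity in sorted(entities, key=load, reverse=True):
        min(bins, key=load).append(entity)
    for child, assigned in zip(machine, bins):
        spread(assigned, child, placement)
    return placement


# Two NUMA nodes with two cores each; one application made of two bubbles.
machine = [["core0", "core1"], ["core2", "core3"]]
app = [["a0", "a1"], ["b0", "b1"]]
print(spread([app], machine))
```

In this example the threads of each inner bubble end up on the same NUMA node, and the same code adapts to a deeper or wider machine tree without any change.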
My thesis is available as PDF.
This is being developed within Marcel, the efficient, portable and flexible thread library of the PM2 project.
From my work on the hierarchy of a machine, we extracted a software component, HwLoc, which abstracts away the details of detecting and representing the hierarchy of a machine, modeled as an annotated tree. Computation software can thus easily and portably manipulate “cores” and “sockets” explicitly, but can also treat the machine as a generic hierarchy, without caring about architectural details. This component is now used in all the main implementations of the MPI communication interface and in numerous computation projects. It is thus installed in the majority of computation centers.
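The following Python sketch illustrates what "annotated tree" means here. It is not the API of the component described above: the `TopoNode` class, its fields, and the `make_machine` helper are hypothetical, and the topology is built by hand instead of being detected.

```python
from dataclasses import dataclass, field


@dataclass
class TopoNode:
    """One object of the machine tree, annotated with a kind and attributes."""
    kind: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def by_kind(self, kind):
        """Explicit view: all objects of a given kind, e.g. "Core"."""
        found = [self] if self.kind == kind else []
        for child in self.children:
            found.extend(child.by_kind(kind))
        return found

    def depth(self):
        """Generic view: number of levels, whatever they are called."""
        return 1 + max((c.depth() for c in self.children), default=0)


def make_machine(sockets, cores_per_socket):
    """Build a hand-written Machine > Socket > Core tree (no detection)."""
    machine = TopoNode("Machine")
    for s in range(sockets):
        socket = TopoNode("Socket", {"index": s})
        for c in range(cores_per_socket):
            index = s * cores_per_socket + c
            socket.children.append(TopoNode("Core", {"index": index}))
        machine.children.append(socket)
    return machine


machine = make_machine(sockets=2, cores_per_socket=4)
print(len(machine.by_kind("Core")), "cores on", machine.depth(), "levels")
```

The two methods correspond to the two usages: `by_kind` for software that wants to reason about cores or sockets explicitly, and `depth` (or any plain tree traversal) for software that only sees a generic hierarchy.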
Cédric Augonnet, during his PhD under my co-direction, designed StarPU, a framework for scheduling tasks over heterogeneous machines. The idea is to try to perform all optimizations at runtime: data transfers are minimized, performed in advance, and overlapped with computation, and they interact with the task scheduling decisions. The latter take into account performance models of the tasks, which makes it possible to capture the heterogeneous aspect of the machine, and even to benefit from it! StarPU is being integrated into scientific computation libraries, such as the reference linear algebra implementations PLASMA and MAGMA.
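The interplay between performance models, data transfers, and scheduling decisions can be illustrated with a small greedy heuristic: each task goes to the worker on which it is expected to finish earliest. The Python sketch below is an assumption-laden illustration, not StarPU's API or its actual scheduling policies: the `schedule` function, the fixed `transfer_cost`, the read-only data accesses, and the timings are all hypothetical.

```python
def schedule(tasks, workers, exec_time, transfer_cost):
    """Assign each task to the worker with the earliest expected finish.

    tasks:         list of (task name, task kind, data name)
    workers:       dict worker name -> memory node it computes from
    exec_time:     dict (task kind, worker name) -> modeled duration
    transfer_cost: assumed fixed cost of copying one data item to a node
    """
    ready_at = {worker: 0.0 for worker in workers}
    valid_on = {}  # data name -> memory nodes holding a valid copy
    plan = []
    for name, kind, data in tasks:
        # Assumption: all data starts in main memory ("ram").
        copies = valid_on.setdefault(data, {"ram"})

        def finish_time(worker):
            move = 0.0 if workers[worker] in copies else transfer_cost
            return ready_at[worker] + move + exec_time[(kind, worker)]

        best = min(workers, key=finish_time)
        ready_at[best] = finish_time(best)
        copies.add(workers[best])  # read-only access: the copy stays valid
        plan.append((name, best))
    return plan


workers = {"cpu0": "ram", "gpu0": "gpu_mem"}
exec_time = {
    ("gemm", "cpu0"): 10.0, ("gemm", "gpu0"): 1.0,
    ("scan", "cpu0"): 1.0, ("scan", "gpu0"): 5.0,
}
tasks = [("t1", "gemm", "A"), ("t2", "gemm", "A"), ("t3", "scan", "B")]
print(schedule(tasks, workers, exec_time, transfer_cost=2.0))
```

With these made-up timings, both `gemm` tasks go to the GPU (the second one reuses the copy of `A` already transferred) while the `scan` task stays on the CPU. The sketch deliberately leaves out the overlapping of transfers with computation, which a real runtime would add on top.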