Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today’s machines. Future systems are expected to feature billions of threads and tens of millions of CPUs. The nodes and networks of these systems will be hierarchical, and ignoring this hardware hierarchy will lead to poor utilization. Failure will be a constant companion, and it is unlikely that checkpointing the entire system, with its petabytes of memory, will be practical. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis.

Absence of Reliability

Most failures on HPC systems are hardware failures. As machines grow in size and semiconductor feature sizes shrink, overall system reliability is expected to become much worse, and the mean time between faults and interrupts will decrease. In fact, we expect applications will have to function in an environment of near-continuous failure: something, somewhere in the machine, will always be failing.
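
The arithmetic behind that claim can be sketched quickly. This is an illustrative model (our numbers, not the document's), assuming independent node failures where any single node failure interrupts the whole system:

```python
# Illustrative sketch: with independent node failures, the mean time
# between failures (MTBF) of the whole system shrinks linearly with
# the number of components.

def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF when any single node failure interrupts the system."""
    return node_mtbf_hours / num_nodes

# A per-node MTBF of 10 years (~87,600 hours) sounds comfortable...
node_mtbf = 10 * 365 * 24
for nodes in (1_000, 100_000, 10_000_000):
    print(f"{nodes:>10} nodes -> system MTBF "
          f"{system_mtbf(node_mtbf, nodes):10.4f} hours")
```

With ten million components, even a ten-year per-node MTBF yields a system interrupt roughly every 30 seconds.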

On a single node, hardware failures are so routine that the hardware is designed not to react to failure but to bypass it: paths known to fail are over-designed to work around failure. For example, single-error-correcting, double-error-detecting (SECDED) Error Correcting Codes add seven check bits per 32 bits of data, but largely eliminate the need to react to memory failure: the days when a machine had to reboot because of a single-bit memory error are far behind us. On a single node, the hardware provides an interface that allows the runtime to be oblivious to hardware faults, i.e., a fault-oblivious interface. Our parallel computing runtimes, however, provide no such interface to the application. In fact, the runtime requires the application to be fault-aware.
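
A toy single-error-correcting Hamming code makes the "bypass, don't react" idea concrete: the decoder repairs a flipped bit in-line, and the layers above never see the fault. This is an illustrative sketch, not any real memory controller's implementation:

```python
# Toy Hamming code: parity bits at positions 1, 2, 4, ... (1-indexed)
# correct any single flipped bit, so single-bit faults stay invisible
# to the software above.

def hamming_encode(data_bits):
    """Encode a list of data bits, reserving power-of-two positions for parity."""
    code, i, pos = [], 0, 1
    while i < len(data_bits):
        if pos & (pos - 1) == 0:          # power of two: parity placeholder
            code.append(0)
        else:
            code.append(data_bits[i])
            i += 1
        pos += 1
    for k in range(len(code).bit_length()):
        p = 1 << k
        if p > len(code):
            break
        parity = 0
        for j in range(1, len(code) + 1):  # parity p covers positions with bit p set
            if j & p and j != p:
                parity ^= code[j - 1]
        code[p - 1] = parity
    return code

def hamming_correct(code):
    """Return (corrected codeword, error position or 0)."""
    syndrome = 0
    for j in range(1, len(code) + 1):      # XOR of positions holding a 1
        if code[j - 1]:
            syndrome ^= j
    if syndrome and syndrome <= len(code):
        code = code[:]
        code[syndrome - 1] ^= 1            # silently repair the flipped bit
    return code, syndrome

word = [1, 0, 1, 1]                        # 4 data bits
cw = hamming_encode(word)
cw[2] ^= 1                                 # a cosmic ray flips one bit
fixed, where = hamming_correct(cw)
assert fixed == hamming_encode(word)       # the fault never reaches software
```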

In response to the lack of a fault-oblivious runtime, application programmers have begun devising ways to make programs continue to run correctly even when a subset of the computation fails. Doing this within the context of existing programming practices is a major challenge because the number of possible code paths explodes when faults must also be considered. Development of a complex scientific application is already a major endeavour, and ensuring that the application performs correctly along every code path a failure could cause would be a daunting task.
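
The path explosion is easy to quantify under a simplifying assumption (ours, not the original text's): if each of n operations can independently succeed or fail, a correct fault-aware program must behave sensibly across every combination of outcomes.

```python
# Illustrative counting argument: n independently fallible operations
# yield 2**n distinct outcome combinations, each of which a fault-aware
# program must handle correctly.

def fault_paths(num_fallible_ops: int) -> int:
    """Each operation either succeeds or fails: 2**n outcome combinations."""
    return 2 ** num_fallible_ops

for n in (10, 20, 40):
    print(f"{n} fallible operations -> {fault_paths(n):,} code paths")
```

By 40 fallible operations there are over a trillion combinations, which is why exhaustively validating fault-aware applications by hand is hopeless.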

Programmers have to program around failure because the programming models they use do not tolerate failure. Checkpointing of the entire system is required because the programming models do not tolerate failure. Builders of supercomputers, in turn, must design the peak performance of their networks and file systems around failure because the programming models do not tolerate failure. A huge amount of research has gone into optimizing checkpointing, including the creation of specialized checkpointing file systems. Our fault-intolerant programming models drive every aspect of HPC.
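
The pattern that fault-intolerant models force on every application can be sketched in a few lines. This is a minimal application-level checkpoint/restart loop in Python using pickle (the file name is hypothetical); real HPC checkpointing coordinates thousands of processes and petabytes of state, but the shape is the same:

```python
# Minimal checkpoint/restart sketch: persist state periodically so a
# restarted run resumes where the failed one left off.
import os
import pickle

CKPT = "state.ckpt"                        # hypothetical checkpoint file

def load_or_init():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)          # resume after a failure
    return {"step": 0, "acc": 0}           # fresh start

def checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)                  # atomic rename: never a torn file

state = load_or_init()
while state["step"] < 100:
    state["acc"] += state["step"]          # the "science"
    state["step"] += 1
    if state["step"] % 10 == 0:
        checkpoint(state)

print(state["acc"])                        # sum of 0..99 = 4950
```

Killing and rerunning this script at any point produces the same final answer; the price is that every application must carry this machinery, and the I/O system must absorb the checkpoint traffic.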

The Lack of Tools

Another issue is the management of data. Programmers have a plethora of tools to manage data – outside the HPC system. Inside the system, they have their own program and whatever modules they can pull in via, e.g., Python. The HPC system is data-rich but tool-poor; the users’ desktop systems are tool-rich but data-poor. To get data from the HPC system to the user, it must first be parked in a file system once the computation has finished, at which point it can be accessed. There is still no generally available HPC software for interactive use: we still use a batch model, even as we will soon be celebrating a half-century of time-shared, interactive computer access. To access data while it is in the HPC machine, special code must be compiled to work in the context of the MPI libraries and limited-function operating systems. The programming interface used on the HPC system is nothing like that used on non-HPC systems. Programmers end up living in two very different worlds, which ultimately limits their productivity.

Inadequacy of Programming Models

Scientific computing is dominated by imperative programming languages such as C, C++, and FORTRAN. Such languages require the programmer to specify where all data resides and when all computations are performed. This approach becomes more and more inadequate as the gap between memory speeds and processor speeds continues to grow. This von Neumann bottleneck has been known for several decades; however, no satisfactory solution exists for overcoming it in scientific programming. Currently, developers seeking high performance must painstakingly optimize for the memory hierarchy or use subroutine libraries that have already performed such optimizations.

Introducing parallelism greatly complicates an already difficult situation. Effectively, several more layers are added to the memory hierarchy, which causes additional data synchronization and consistency issues. The Message Passing Interface (MPI) has enabled great strides in high-performance computing by providing a portable interface for distributed-memory applications; however, as with imperative languages, it encourages programmers to specify, more or less, where data resides and what computations will be done on which nodes. A further problem is that MPI programming styles typically require the use of collective operations such as barriers, reductions, and broadcasts. Such operations are not even weakly scalable: they take longer to execute as the number of processors increases, and they further impact performance by introducing load imbalance. Dynamic load-balancing schemes can relieve some of these issues, but do not fully resolve the problems introduced by memory hierarchies and synchronization.
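
A back-of-the-envelope model (our illustration, not anything from the MPI specification) shows both scaling problems at once: a tree-based reduction needs a number of message rounds that grows with the process count, and a barrier completes only when the slowest rank arrives, so any load imbalance is paid by every process.

```python
# Cost-model sketch for MPI-style collectives: O(log p) rounds for a
# tree reduction, and barrier latency set by the slowest participant.
import math
import random

def reduction_steps(p: int) -> int:
    """Message rounds for a binary-tree reduction over p ranks."""
    return math.ceil(math.log2(p))

def barrier_time(per_rank_times):
    """A barrier completes only when the slowest rank arrives."""
    return max(per_rank_times)

random.seed(0)
for p in (1_000, 1_000_000):
    # hypothetical: each rank's work time jitters +/-10% around 1.0s
    times = [1.0 + random.uniform(-0.1, 0.1) for _ in range(p)]
    print(f"p={p:>9}: {reduction_steps(p)} reduction rounds, "
          f"barrier waits {barrier_time(times):.3f}s for the slowest rank")
```

Doubling the machine adds another round to every reduction, and with a million ranks the barrier almost surely waits the full worst-case jitter; neither cost is constant in p, which is what "not even weakly scalable" means.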

The Role of System Software

Current HPC system software follows one of two main thrusts: it is either a “heavyweight” kernel, based on the Unix time-sharing kernel design of the 1970s, or a “lightweight” kernel, which provides little more than the ability to start a program and let it take over the hardware completely.

Neither of these designs provides the capabilities we need for our environment. Lightweight kernels, by design, put the application in control of everything on the system. They were developed for HPC in the 1990s due to the failings of the then-extant Unix-like kernels, which exacted too high a price for the mostly-unused capabilities they provided.

The Unix-like systems, now mostly represented by Linux, are really a time-sharing kernel brought to HPC. Their capabilities are also not a good match to our needs (the reason the lightweight kernels were developed in the first place), although they can be made to work, as witnessed by the fact that the fastest machine in the world runs Linux. At the same time, Linux plays much the same role as the lightweight kernels: its main job is to start programs and get out of the way, handing control of the network interface to the application. The irony of the success of Linux in HPC is that it is achieved in part by disabling so much of what Linux does.

The lightweight kernels, typified by the Compute Node Kernel (CNK) on Blue Gene, are evolving over time to provide more and more of the services Linux provides. The CNK now offers limited multiple-process support via the Linux clone() system call, as well as shared-library support. These kernels are becoming hybrids.

Neither heavyweight nor lightweight kernels support the capabilities we are using today in the HARE project. These kernels support systems-software environments that are batch-oriented. Applications, not kernels, control the networks, which limits the richness of the tools available to users on HPC nodes. Any tool that runs in these environments and wishes to share data with the application while the application is running has to be linked into the application and use the libraries the application is using.

In fact, HPC systems software is stuck. It has changed little in 20 years. Users are given the choice of heavyweight, time-sharing-oriented kernels that severely impact application performance while providing few capabilities useful to HPC, or lightweight kernels that leave the application in charge of everything and limit the types of programs that can run on the system.

Additional Background Material and Prior Work

Kittyhawk: Enabling cooperation and competition in a global shared computational system. Appavoo, et al.

Experiences porting the Plan 9 research operating system to the IBM Blue Gene supercomputers. Minnich, et al.

A Simulator for Large-scale Parallel Computer Architectures. Janssen, et al.

Selective recovery from failures in a task parallel programming model. Krishnamoorthy, et al.