Fluid HPC: How Extreme-Scale Computing Should Respond to Meltdown and Spectre
The Meltdown and Spectre vulnerabilities are proving difficult to fix, and initial experiments suggest security patches will cause significant performance penalties to HPC applications. Even as these patches are rolled out to current HPC platforms, it is worth exploring how future HPC systems could be better insulated from CPU or operating system security flaws that could cause massive disruptions. Surprisingly, most of the core concepts needed to build supercomputers that are resistant to a wide range of threats have already been invented and deployed in HPC systems over the past 20 years.

Combining these technologies, concepts, and approaches would not only improve cybersecurity but also bring broader benefits: improving HPC performance, streamlining scientific software development, easing the adoption of advanced hardware such as neuromorphic chips, and enabling easy-to-deploy data and analysis services. This new form of “Fluid HPC” would do more than solve current vulnerabilities. As an enabling technology, Fluid HPC would be transformative, dramatically improving extreme-scale code development in the same way that virtual machine and container technologies made cloud computing possible and built a new industry.

In today’s extreme-scale platforms, compute nodes are essentially embedded computing devices that are given to a specific user during a job and then cleaned up and provided to the next user and job. This “space-sharing” model, where the supercomputer is divided up and shared by doling out whole nodes to users, has been common for decades. Several non-HPC research projects over the years have explored providing whole nodes, as raw hardware, to applications. In fact, the cloud computing industry uses software stacks to support this “bare-metal provisioning” model, and Ethernet switch vendors have also embraced the functionality required to support it.
Several classic supercomputers, such as the Cray T3D and the IBM Blue Gene/P, provided nodes to users in a lightweight and fluid manner. By carefully separating the management of compute node hardware from the software executed on those nodes, an out-of-band control system can provide many benefits, from improved cybersecurity to shorter Exascale Computing Project (ECP) software development cycles.
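The out-of-band idea can be made concrete with a small sketch. This is a hypothetical model, not the actual control software of the T3D or Blue Gene/P: a controller on a separate management plane loads images onto nodes and power-cycles them from outside, so nothing running on a node, compromised or not, can interfere with its own lifecycle. All names below are illustrative assumptions.

```python
from enum import Enum, auto

class NodeState(Enum):
    OFF = auto()
    BOOTED = auto()

class ManagedNode:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.state = NodeState.OFF
        self.image = None  # OS/kernel image, chosen by the control plane only

class ControlSystem:
    """Out-of-band controller: node hardware is managed entirely from outside
    the software executing on the node."""

    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def boot(self, node_id: int, image: str) -> None:
        # Push a fresh, known-good image on every boot, so a patched kernel
        # (e.g. one with Meltdown mitigations) can be rolled out fleet-wide
        # without touching any node-resident software.
        node = self.nodes[node_id]
        node.image = image
        node.state = NodeState.BOOTED

    def reclaim(self, node_id: int) -> None:
        # Power-cycle from the management plane, regardless of what the
        # node's OS is doing; nothing from the previous job survives.
        node = self.nodes[node_id]
        node.state = NodeState.OFF
        node.image = None

ctl = ControlSystem([ManagedNode(i) for i in range(4)])
ctl.boot(0, image="patched-kernel")
ctl.reclaim(0)
```

Because the node never manages itself, the control system can swap kernels, apply security patches, or wipe state between jobs in one out-of-band operation, which is exactly the fluidity the classic machines demonstrated.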