ETP4HPC White Papers

While preparing the next edition of our Strategic Research Agenda (SRA), ETP4HPC has also opened a process of delivering White Papers, i.e. short documents tackling technical issues pertaining to European HPC. The findings of the White Papers will be re-used in the SRA and in other streams of work such as the TransContinuum Initiative.

  • Processing in Memory: the Tipping Point
  • Towards Integrated Hardware/Software Ecosystems for the Edge-Cloud-HPC Continuum
  • Task-Based Performance Portability in HPC
  • < QC | HPC > Quantum for HPC
  • HPC for Urgent Decision-Making
  • Heterogeneous High Performance Computing
  • Federated HPC, Cloud and Data Infrastructures
  • Unconventional HPC Architectures
  • Modular Supercomputing Architecture

 

Processing in Memory: the Tipping Point

Decades after being first explored in the 1970s, Processing in Memory (PIM) is currently experiencing a renaissance. By moving part of the computation to the memory devices, PIM addresses a fundamental issue in the design of modern computing systems: the mismatch between the von Neumann architecture and the requirements of important data-centric applications. A number of industrial prototypes and products are under development or already available in the marketplace, and these devices show the potential for cost-effective and energy-efficient acceleration of HPC, AI and data analytics workloads. This paper reviews the reasons for the renewed interest in PIM and surveys industrial prototypes and products, discussing their technological readiness.
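To make the execution model concrete, here is a minimal Python sketch (pure NumPy, no real PIM hardware or vendor SDK; all names are illustrative) contrasting a conventional reduction, in which every element crosses the memory bus, with a PIM-style reduction, in which each memory bank reduces its local slice and only small partial results travel back to the host.

    import numpy as np

    def host_sum(data):
        """Conventional path: all data crosses the memory bus to the CPU."""
        return data.sum()

    def pim_style_sum(data, n_banks=8):
        """PIM-style path (simulated): each 'bank' reduces its local slice,
        so only n_banks partial sums travel back to the host."""
        slices = np.array_split(data, n_banks)   # data as laid out across banks
        partials = [s.sum() for s in slices]     # near-memory per-bank kernels
        return sum(partials)                     # tiny host-side reduction

    data = np.arange(1_000_000, dtype=np.float64)
    assert np.isclose(host_sum(data), pim_style_sum(data))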

Wide adoption of PIM in production, however, depends on our ability to create an ecosystem to drive and coordinate innovations and co-design across the whole stack. European companies and research centres should be involved in all aspects, from technology, hardware, system software and programming environments, to the updating of algorithms and applications. In this paper, we identify the main challenges that must be addressed and provide guidelines to prioritise the research efforts and funding. We aim to help make PIM a reality in production HPC, AI and data analytics.

Read online below or download the PDF.

 

Towards Integrated Hardware/Software Ecosystems for the Edge-Cloud-HPC Continuum

Modern use cases such as autonomous vehicles, digital twins, smart buildings and precision agriculture greatly increase the complexity of application workflows. They typically combine physics-based simulations, analysis of large data volumes and machine learning, and require a hybrid execution infrastructure: edge devices create streams of input data, which are processed by data analytics and machine learning applications in the Cloud, while simulations on large, specialised HPC systems provide insights into and predictions of future system state. From these results, additional steps create and communicate output data across the infrastructure levels and, for some use cases, control devices or cyber-physical systems in the real world (as in the case of smart factories). All of these steps pose different requirements for the best-suited execution platforms, and they need to be connected in an efficient and secure way. This assembly is called the Computing Continuum (CC). It raises challenges at multiple levels: at the application level, innovative algorithms are needed to bridge simulations, machine learning and data-driven analytics; at the middleware level, adequate tools must enable efficient deployment, scheduling and orchestration of the workflow components across the whole distributed infrastructure; and, finally, a capable resource management system must allocate a suitable set of infrastructure components to run the application workflow, preferably in a dynamic and adaptive way, taking into account the specific capabilities of each component of the underlying heterogeneous infrastructure.
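As a schematic illustration of the loop described above (not taken from the white paper; all functions and tier names are hypothetical placeholders), a Python sketch of the edge-to-cloud-to-HPC-and-back pattern might look as follows:

    import random, statistics

    def edge_sensor_stream(n):
        """Edge tier: devices emit raw measurements (here, synthetic samples)."""
        for _ in range(n):
            yield 20.0 + random.gauss(0.0, 1.0)

    def cloud_analytics(samples):
        """Cloud tier: aggregate and filter the stream into features."""
        return {"mean": statistics.mean(samples), "stdev": statistics.pstdev(samples)}

    def hpc_simulation(features):
        """HPC tier: a stand-in for a physics-based forecast of future state."""
        return features["mean"] + 3.0 * features["stdev"]

    def actuate(prediction, threshold=24.0):
        """Back to the edge/cyber-physical system: act on the prediction."""
        return "open_valve" if prediction > threshold else "no_action"

    samples = list(edge_sensor_stream(1000))
    features = cloud_analytics(samples)
    print(actuate(hpc_simulation(features)))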

To address these challenges, we foresee an increasing need for integrated software ecosystems which combine current “island” solutions and bridge the gaps between them. These ecosystems must facilitate the full lifecycle of CC use cases, including initial modelling, programming, deployment, execution, optimisation, as well as monitoring and control. It will be important to ensure adequate reproducibility of workflow results and to find ways of creating and managing trust when sharing systems, software and data. All of these will in turn require novel or improved hardware capabilities. This white paper provides an initial discussion of the gaps. Our objective is to speed up progress in both the hardware and software infrastructures needed to build CC use cases, with the ultimate goals of accelerating scientific discovery, improving the timeliness, quality and sustainability of engineering artefacts, and supporting decisions in complex and potentially urgent situations.

Read online below or download the PDF.

 

Task-Based Performance Portability in HPC

As HPC hardware continues to evolve and diversify and workloads become more dynamic and complex, applications need to be expressed in a way that facilitates high performance across a range of hardware and situations. The main application code should be platform-independent, malleable and asynchronous, with an open, clean, stable and dependable interface between the higher levels of the application, library or programming model, and the kernels and software layers tuned for the machine. The platform-independent part should avoid direct references to specific resources and their availability, and should instead provide the information needed to optimise behaviour.

This paper summarises how task abstraction, which first appeared in the 1990s and is already mainstream in HPC, should be the basis for a composable and dynamic performance-portable interface. It outlines the innovations that are required in the programming model and runtime layers, and highlights the need for a greater degree of trust among application developers in the ability of the underlying software layers to extract full performance. These steps will help realise the vision for performance portability across current and future architectures and problems.
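As a hedged illustration of this separation of concerns (not an example from the paper), the following sketch uses Python's standard concurrent.futures as a stand-in for an HPC tasking runtime: the application expresses tasks and their data dependencies, while the executor decides where and when they run.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def make_block(seed, n=512):
        return np.random.default_rng(seed).standard_normal((n, n))

    def multiply(a, b):
        return a @ b

    def reduce_trace(*blocks):
        return sum(np.trace(b) for b in blocks)

    # Platform-independent part: declare tasks and pass data via futures.
    # The executor (here a thread pool; on an HPC system, a tasking runtime)
    # owns the mapping of tasks to resources and the scheduling order.
    # A production runtime would track these dependencies without the
    # explicit result() calls used here for simplicity.
    with ThreadPoolExecutor() as runtime:
        a, b, c = (runtime.submit(make_block, s) for s in (1, 2, 3))
        ab = runtime.submit(multiply, a.result(), b.result())
        bc = runtime.submit(multiply, b.result(), c.result())
        total = runtime.submit(reduce_trace, ab.result(), bc.result())
        print(total.result())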

Read online below or download the PDF.

 

< QC | HPC > Quantum for HPC 

Quantum Computing (QC) describes a new way of computing based on the principles of quantum mechanics. From a High Performance Computing (HPC) perspective, QC needs to be integrated: 

  • at a system level, where quantum computer technologies need to be integrated in HPC clusters;
  • at a programming level, where the new disruptive ways of programming devices call for a full hardware-software stack to be built;
  • at an application level, where QC is bound to lead to disruptive changes in the complexity of some applications so that compute-intensive or intractable problems in the HPC domain might become tractable in the future.

The White Paper QC for HPC focuses on the technology integration of QC in HPC clusters, gives an overview of the full hardware-software stack and of QC emulators, and highlights promising customised QC algorithms for near-term quantum computers and their impact on HPC applications. In addition to universal quantum computers, we will describe non-universal QC where appropriate. Recent research references will be used to cover the basic concepts. The target audience of this paper is the European HPC community: members of HPC centres, HPC algorithm developers, scientists interested in co-design for quantum hardware, benchmarking, etc.
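A schematic sketch of the hybrid classical-quantum loop typical of near-term variational algorithms is shown below. evaluate_on_qpu is a hypothetical placeholder for a call to a quantum device or emulator; it is replaced here by a classical surrogate so the sketch runs as-is, and is not taken from the white paper.

    import numpy as np
    from scipy.optimize import minimize

    def evaluate_on_qpu(params):
        """Placeholder for submitting a parameterised circuit to a QPU or
        emulator and returning the measured expectation value of a cost
        Hamiltonian. Here: a classical surrogate, runnable without hardware."""
        return float(np.sum((params - 0.5) ** 2))

    # Classical outer loop on the HPC side: optimise the circuit parameters
    # using the expectation values returned by the quantum (or emulated) backend.
    result = minimize(evaluate_on_qpu, x0=np.zeros(4), method="COBYLA")
    print("optimal parameters:", result.x)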

Read online below or download the PDF.

 

HPC for Urgent Decision-Making

Emerging use cases from incident response planning and broad-scope European initiatives (e.g. Destination Earth, the European Green Deal and the Digital Package) are expected to require federated, distributed infrastructures combining computing and data platforms. These will provide elasticity, enabling users to build applications and integrate data for thematic specialisation and decision support within ever-shortening response-time windows.

For prompt and, in particular, for urgent decision support, the conventional usage modes of HPC centres are not adequate: these rely on relatively long-term arrangements for time-scheduled exclusive use of HPC resources, and enforce well-established yet time-consuming policies for granting access. In urgent decision support scenarios, managers or members of incident response teams must initiate processing and control the required resources based on their real-time judgement of how a complex situation evolves over time. This circle of clients is distinct from the regular users of HPC centres, and they must interact with HPC workflows on demand and in real time, while engaging significant HPC and data processing resources in or across HPC centres.

This white paper considers the technical implications of supporting urgent decisions through establishing flexible usage modes for computing, analytics and AI/ML-based applications using HPC and large, dynamic assets.

The target decision support use cases will involve ensembles of jobs, data staging to support workflows, and interactions with services/facilities external to HPC systems/centres. Our analysis identifies the need for flexible and interactive access to HPC resources, particularly in the context of dynamic workflows processing large datasets. This poses several technical and organisational challenges: short-notice secure access to HPC and data resources, dynamic resource allocation and scheduling, coordination of resource managers, support for data-intensive workflows (including data staging on node-local storage), preemption of already running workloads and interactive steering of simulations. Federation of services and resources across multiple sites will help to increase availability, provide elasticity for time-varying resource needs and make it possible to exploit data locality.
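To illustrate one of the mechanisms listed above, the following toy Python scheduler shows how an urgent job might preempt lower-priority batch work. All class, job and parameter names are invented for the example and do not correspond to any existing resource manager.

    import heapq
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Job:
        priority: int                 # lower value = more urgent
        name: str = field(compare=False)
        nodes: int = field(compare=False)

    class ToyScheduler:
        def __init__(self, total_nodes):
            self.free = total_nodes
            self.queue, self.running = [], []

        def submit(self, job):
            heapq.heappush(self.queue, job)
            self._dispatch()

        def _dispatch(self):
            # Start queued jobs, most urgent first, while nodes are free.
            while self.queue and self.queue[0].nodes <= self.free:
                job = heapq.heappop(self.queue)
                self.free -= job.nodes
                self.running.append(job)
            # Preempt lower-priority running jobs if a more urgent job waits.
            while self.queue and self.running:
                urgent = self.queue[0]
                victim = max(self.running, key=lambda j: j.priority)
                if victim.priority <= urgent.priority:
                    break
                self.running.remove(victim)    # checkpoint/requeue in reality
                self.free += victim.nodes
                heapq.heappush(self.queue, victim)
                self._dispatch()

    sched = ToyScheduler(total_nodes=100)
    sched.submit(Job(priority=9, name="batch_climate", nodes=80))
    sched.submit(Job(priority=0, name="urgent_flood_forecast", nodes=60))
    print([j.name for j in sched.running])   # urgent job displaces the batch job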

Read online below or download the PDF.

 

Heterogeneous High Performance Computing

Modern HPC systems are becoming increasingly heterogeneous, and this affects all their components, from the processing units, through memory hierarchies and network components, to storage systems. The trend is driven on the one hand by the need to build larger yet more energy-efficient systems, and on the other hand by the need to optimise (parts of) the systems for certain workloads. In fact, it is not only the systems themselves that are becoming more heterogeneous: scientific and industrial applications are also increasingly combining different technologies into complex workflows, including simulation, data analytics, visualisation, and artificial intelligence/machine learning. Different steps in these workflows call for different hardware, and thus today’s HPC systems are often composed of different modules optimised to suit certain stages in these workflows.
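The mapping problem this creates can be sketched very simply (module and requirement names below are hypothetical, not a description of any real system): each workflow stage declares what it needs, and a dispatcher picks the best-matching module.

    # Hypothetical modules of a modular/heterogeneous system and their strengths.
    MODULES = {
        "cpu_cluster":    {"general", "high_memory"},
        "gpu_booster":    {"dense_linear_algebra", "ml_training"},
        "data_analytics": {"high_memory", "io_throughput"},
    }

    WORKFLOW = [
        ("preprocessing",  {"io_throughput"}),
        ("simulation",     {"dense_linear_algebra"}),
        ("training",       {"ml_training"}),
        ("postprocessing", {"general"}),
    ]

    def place(stage_needs):
        """Pick the module covering most of the stage's requirements."""
        return max(MODULES, key=lambda m: len(MODULES[m] & stage_needs))

    for stage, needs in WORKFLOW:
        print(f"{stage:15s} -> {place(needs)}")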

While the trend towards heterogeneity is certainly helpful in many respects, it makes the task of programming these systems and using them efficiently much more complicated. Often, a combination of different programming models is required, and selecting suitable technologies for certain tasks or even parts of an algorithm is difficult. Novel methods might be needed to exploit heterogeneous components, or might only become feasible because of them. And the trend is continuing, with new technologies around the corner that will further increase heterogeneity, e.g. neuromorphic or quantum accelerators, in-memory computing, and other non-von Neumann approaches.

In this paper, we present an overview of the different levels of heterogeneity found in HPC technologies and provide recommendations for research directions to help deal with the challenges they pose. We also point out opportunities from which applications in particular can profit by exploiting these technologies. Research efforts will be needed across the full spectrum, from system architecture, compilers and programming models/languages, to runtime systems, algorithms and novel mathematical approaches.

Read online below or download the PDF.

 

Federated HPC, Cloud and Data Infrastructures

There is increasing interest in making a diversity of geographically spread compute and storage resources available in a federated manner. A common services layer can facilitate easier access, more elasticity, lower response times and improved utilisation of the underlying resources. In this white paper, current trends are analysed from both an infrastructure-provider and an end-user perspective. The focus here is on federated e-infrastructures that include, among others, high-performance computing (HPC) systems as compute resources. Two initiatives, namely Fenix and GAIA-X, are presented as illustrative examples. Based on a more detailed exploration of selected topical areas related to federated e-infrastructures, various R&I challenges are identified and recommendations for further efforts are formulated.
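As a toy illustration of one federation concern, namely choosing where to run based on data locality and current load, consider the sketch below; the sites, datasets and weights are invented for the example and are not taken from Fenix, GAIA-X or the white paper.

    # Invented example sites of a federated infrastructure.
    SITES = {
        "site_a": {"datasets": {"climate_2024"}, "load": 0.9},
        "site_b": {"datasets": {"climate_2024", "genomes"}, "load": 0.4},
        "site_c": {"datasets": set(), "load": 0.1},
    }

    def choose_site(dataset, transfer_penalty=0.5):
        """Prefer sites that already hold the data; break ties by load."""
        def cost(site):
            info = SITES[site]
            return info["load"] + (0 if dataset in info["datasets"] else transfer_penalty)
        return min(SITES, key=cost)

    print(choose_site("climate_2024"))   # site_b: data is local and load is moderate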

Read online below or download the PDF.

 

Unconventional HPC Architectures

As CMOS scaling faces increasing challenges in delivering performance gains by conventional means, a range of new unconventional architectures is emerging.

Moore’s Law, which stated that “the complexity for minimum component costs has increased at a rate of roughly a factor of two per year”, is slowing down due to the enormous cost of developing new process generations along with feature sizes approaching silicon interatomic distances. With the end of Dennard scaling (i.e. the roughly constant power density across technology nodes, which allowed operating frequency to increase with each node) more than 15 years ago, using more transistors to build increasingly parallel architectures became a key focus. Now, other ways need to be found to deliver further advances in performance. This can be achieved by a combination of innovations at all levels: technology (3D stacking, using physics to carry out computations, etc.), architecture (e.g. specialisation, computing in/near memory, dataflow), software, algorithms and new ways to represent information (e.g. neuromorphic approaches, coding information “in time” with “spikes”; quantum computing; mixed precision). A closely related challenge is the energy required to move data, which can be orders of magnitude higher than the energy of the computation itself.
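One of the directions mentioned above, mixed precision, can be illustrated with a classical technique: iterative refinement solves a linear system in low precision and corrects the result using residuals computed in high precision. The following minimal NumPy sketch is illustrative only and not tied to any particular hardware.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    A64 = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
    b64 = rng.standard_normal(n)

    # Low-precision solve (stand-in for a fast reduced-precision factorisation).
    A32, b32 = A64.astype(np.float32), b64.astype(np.float32)
    x = np.linalg.solve(A32, b32).astype(np.float64)

    # Iterative refinement: residuals in float64, corrections solved in float32.
    for _ in range(5):
        r = b64 - A64 @ x                                 # high-precision residual
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d

    print("final residual norm:", np.linalg.norm(b64 - A64 @ x))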

These challenges give rise to new unconventional HPC architectures which, through some form of specialisation, achieve more efficient and faster computation. This paper covers a range of such architectures that are currently emerging. While these architectures and their underlying requirements are naturally diverse, AI emerges as a technology that drives the development of both novel architectures and computational techniques, due to its widespread adoption and its computing requirements. The ever-increasing demands of AI applications require more efficient computation of data-centric algorithms. Furthermore, minimising data movement and improving data locality play an important role in achieving high performance while limiting power dissipation, and this is reflected in both architectures and programming models. We also cover models of computation that differ from those used in conventional CPU architectures, or models that are purely focused on achieving performance through parallelisation. Finally, we address the challenges of developing, porting and maintaining applications for these new architectures.

Read online below or download the PDF.

 

Modular Supercomputing Architecture

The European Community and its member states regularly invest large volumes of funding and effort in the development of HPC technologies in Europe. However, some observers criticise these investments as unfocused, as lacking long-term perspective, or as yielding results that are not mature enough to be adopted by mainstream developments, which limits their benefit for the European HPC community, industry and society. This paper is intended as a counterexample to this pessimistic view. It describes the success story of the Modular Supercomputing Architecture, which started in 2011 with the EU-funded R&D project “DEEP” and is now being adopted by large-scale supercomputing centres across the old continent and worldwide. The main hardware and software characteristics of the architecture and some of the systems using it are described, complemented by a historical view of its development, the lessons learned in the process and future prospects.

Read online below or download the PDF.
