Abstracts

Decentralized machine learning and the compute continuum


Marco Aldinucci

University of Torino, Italy


Decentralized machine learning (DML) enables collaborative machine learning without centralizing input data. Federated learning (FL) and edge inference (EI) are examples of DML. Collaboration naturally happens at the edge of a distributed system with inherently distributed data. While tools for DML are starting to flourish, much remains to be done to obtain more flexible and portable tools for experimenting with novel techniques, non-fully-connected topologies, multiple data domains, and asynchronous collaboration schemes. We will present recent advances in DML aiming to improve usability in data centers and at the edge, to widen the class of models by extending FL to non-DNN paradigms, to improve model accuracy by controlling normalization and the frequency of communications, and to boost data privacy through generative adversarial networks.
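
For readers unfamiliar with the collaboration pattern, the sketch below shows federated averaging (FedAvg), the canonical FL aggregation step, in which clients share model parameters rather than data. It is a minimal, generic illustration; the function names and weighting scheme are assumptions and do not represent the speaker's specific methods.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg-style).

    client_weights: one list of numpy arrays (layer parameters) per client.
    client_sizes:   number of local samples per client, used as weights.
    """
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        acc = np.zeros_like(client_weights[0][layer])
        for w, n in zip(client_weights, client_sizes):
            acc += (n / total) * w[layer]
        averaged.append(acc)
    return averaged

# Toy round: three clients contribute updates for a two-layer linear model.
clients = [[np.random.randn(4, 2), np.random.randn(2)] for _ in range(3)]
global_model = federated_average(clients, client_sizes=[100, 50, 25])
```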


Chatting about transformers


Gianfranco Bilardi

University of Padova, Italy


The talk will present the Transformer Architecture for Large Language Models and outline some research questions.


Fast and Accurate Triangle Counting in Graph Streams Using Predictions


Cristian Boldrin

University of Padova, Italy


Counting the number of triangles is a fundamental primitive in the analysis of graphs, with applications ranging from community detection and anomaly detection to molecular biology. In most applications, the exact computation of the number of triangles is unfeasible due to the massive size of today's datasets. For this reason, one often has to resort to efficient algorithms that provide high-quality approximations while using a limited amount of memory. This talk will introduce TONIC, an efficient and practical algorithm for estimating the number of triangles in a graph stream. TONIC carefully couples sampling techniques with predictions of the heaviness of edges, that is, the number of triangles in which an edge is involved. As a result, our algorithm is fast, provides guarantees on the amount of memory used, and exploits the additional information provided by the predictor to produce highly accurate estimates, outperforming state-of-the-art methods, especially when considering sequences of hundreds of graph streams.

Joint work with Fabio Vandin.
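
To make the estimation idea concrete, the sketch below shows only the simpler ingredient TONIC builds on: a fixed-memory, reservoir-sampling triangle estimator over an edge stream (in the spirit of TRIEST). It omits the edge-heaviness predictor entirely; names and details are illustrative, not the authors' code.

```python
import random
from collections import defaultdict

def estimate_triangles(edge_stream, memory_budget):
    """Fixed-memory triangle estimate over a stream of undirected edges."""
    sample = set()           # sampled edges, stored as sorted tuples
    adj = defaultdict(set)   # adjacency of the sampled subgraph
    estimate, t = 0.0, 0
    for u, v in edge_stream:
        t += 1
        # Inverse probability that both edges of a wedge survived sampling.
        if t <= memory_budget:
            weight = 1.0
        else:
            weight = max(1.0, ((t - 1) * (t - 2)) /
                              (memory_budget * (memory_budget - 1)))
        # Count triangles the new edge closes within the current sample.
        estimate += weight * len(adj[u] & adj[v])
        # Reservoir sampling: keep each edge with probability M / t.
        if t <= memory_budget:
            keep = True
        elif random.random() < memory_budget / t:
            x, y = random.choice(tuple(sample))   # evict a random edge
            sample.discard((x, y))
            adj[x].discard(y); adj[y].discard(x)
            keep = True
        else:
            keep = False
        if keep:
            sample.add((min(u, v), max(u, v)))
            adj[u].add(v); adj[v].add(u)
    return estimate
```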


The Role of Compilers in Deep Learning


Barbara Chapman

Hewlett Packard Enterprise and Stony Brook University, USA


As AI evolves, it not only devours significant computational power but also demands increasing efficiency. Many elements of the software stack employ HPC principles, but not necessarily in familiar ways, given the specific nature of this workload. This includes compilers. As deep neural architectures have evolved from simple models to complex systems, such as Megatron-Turing Natural Language Generation (MT-NLG), the search space of potential compiler optimizations has grown significantly. Thus, a number of approaches have been developed to reduce the search space or accelerate the identification of a suitable execution schedule for the tensor operations, ranging from feedback-based approaches to transfer learning. We discuss the demands of this space and the evolution of compilers to facilitate scalable performance of large and complex DNNs.
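
As a toy illustration of the search problem: enumerate candidate tilings for a tensor operation and rank them with a cost model. Here the cost model is a crude analytical proxy; real DL compilers replace it with learned or feedback-driven models. All names and numbers are assumptions for illustration.

```python
import itertools

def candidate_schedules(M, N, K, tile_options=(8, 16, 32, 64, 128)):
    """Enumerate tile-size choices for a tiled matrix multiply C = A @ B."""
    for tm, tn, tk in itertools.product(tile_options, repeat=3):
        if tm <= M and tn <= N and tk <= K:
            yield (tm, tn, tk)

def proxy_cost(schedule, M, N, K, cache_bytes=1 << 20, elem=4):
    """Stand-in for a learned cost model: penalize tile footprints that
    overflow the cache and schedules with many tiny tiles."""
    tm, tn, tk = schedule
    footprint = elem * (tm * tk + tk * tn + tm * tn)
    overflow_penalty = max(1.0, footprint / cache_bytes)
    tile_count = (M / tm) * (N / tn) * (K / tk)
    return overflow_penalty * tile_count

M, N, K = 1024, 1024, 1024
best = min(candidate_schedules(M, N, K), key=lambda s: proxy_cost(s, M, N, K))
print("chosen tiling (tm, tn, tk):", best)
```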


Computing Systems Research and AI Solutions: How to Wash Each Other’s Hands


Tiziano De Matteis

Vrije Universiteit Amsterdam, Netherlands


The rising computational demands of emerging workloads like Generative AI, alongside the decline of Moore’s Law and Dennard scaling, challenge ICT researchers and companies to continue supporting the digitalization of our society. Besides performance, energy efficiency and sustainability are also now crucial objectives: this is not just a research direction but also a pressing need for the ICT community.

Leveraging its experience in distributed computing systems, our group is actively addressing these challenges by integrating insights from traditional systems research with the pressing need for more efficient computing and the latest advancements in hardware and system architecture driven by machine learning (ML) workloads.

In the first part of the talk, we will focus on how we can improve the ICT infrastructure that runs (among others) AI/ML workloads. We will present our ongoing work on FootPrinter, a simulation-based tool that assists data center designers and operators in evaluating the environmental impact of their facilities and in making informed decisions about operational changes or design improvements.

In the second part of the talk, we will discuss how we plan to leverage ML-tailored hardware solutions for the scientific and HPC community. Despite the massive parallelism offered by ML-specialized accelerators, the scientific community has yet to fully explore their use in areas other than ML, such as computational science or graph processing. The steep learning curve, low productivity of existing tools, and lack of reusable software make them inaccessible to non-experts. We argue that, like GPUs 15 years ago, there is a need for new abstractions, models, and open-source tools to make these devices more accessible and effective for non-experts.


ClusterCockpit - A job-specific performance and energy monitoring framework (also includes our experiences as a Tier 2 HPC center with users doing AI)


Jan Eitzinger (Treibig)

University of Erlangen, Germany


ClusterCockpit is a job-specific performance and energy monitoring framework with a focus on easy installation and maintenance. ClusterCockpit consists of cc-backend (an API backend and web frontend), cc-metric-collector (a node agent collecting various metrics, including HPM metrics), and cc-metric-store (an in-memory, zero-maintenance metric cache). The ClusterCockpit project proposes JSON schemas to enable interoperability between different HPC monitoring stacks. It gives HPC users as well as support personnel access to various metrics and job statistics, enabling them to identify faulty jobs or jobs with large optimization potential. As part of the BMBF EE-HPC project, ClusterCockpit will be accompanied by a cluster-wide energy management framework, which can limit the total power and automatically optimize the energy distribution among jobs for optimal energy usage. Finally, the talk will present the current status of AI at NHR@FAU, an academic Tier 2 HPC center in Germany, and will give an overview of common problems with this new application area from an HPC center perspective.


HPC for AI and Geophysical Forecasting


Anne Elster

NTNU: Norwegian University of Science and Technology, Norway


With the help of HPC and access to large datasets, AI has evolved from expert systems and hand-tuned model building to machine-learned, auto-generated models and generative AI technologies such as OpenAI's ChatGPT and Google's Bard. At the same time, due to their extreme need for computing power, these technologies are now driving forces in hardware design. Modern GPUs thus cater not only to graphics but also to AI algorithms by providing features such as an increasing number of lower-precision tensor cores. In this talk, we will discuss some of our work related to utilizing these features in the context of HPC, and how HPC techniques such as autotuning and GPU-assisted compression can impact AI. This talk will also highlight some of the ongoing work my group is involved in at The Center for Geophysics Forecasting at NTNU.

Geophysical forecasting offers the opportunity to leverage some of the cutting-edge technologies from the oil and gas sector to improve, for instance, geohazard monitoring and the forecasting of sudden events along roads and railways. This also includes the use of new methods for monitoring and mapping life and geophysical events at sea and near the seabed. Modern seismic sensors and DAS (Distributed Acoustic Sensing) systems also generate vast datasets that will require both AI and HPC techniques to exploit fully. These tasks offer many research challenges over the next several years.


Interactive Program and Performance Visualization for High-level Heterogeneous and Parallel Computing


August Ernstsson

Linköping University, Sweden


This talk presents recent contributions relating to the SkePU skeleton programming framework, in particular the design and implementation of a performance visualization system for high-level programming of heterogeneous parallel systems. This system consists of an execution trace extension to SkePU and a separate, interactive tool that visualizes the execution trace as a graph with a connected source code browser. The main contribution of the work lies in addressing the gap in performance analysis and debugging for heterogeneous parallel programming at high abstraction levels, where the tools ought to present information at the same level of abstraction as the programming model itself. In addition, the talk will touch on ongoing work with domain-specific extension libraries for SkePU, including our efforts to provide an interface and implementation for convolutional neural networks on top of SkePU.


Collaborative State Machines: A Novel Programming Model for the Cloud-Edge-IoT Continuum


Thomas Fahringer

University of Innsbruck, Austria


Existing programming models face challenges in addressing the dynamic and stateful aspects of modern Cloud-Edge applications. To overcome this weakness, we introduce Collaborative State Machines (CSM), a novel approach to facilitate the development of reactive, event-driven, and stateful applications for the Cloud-Edge-IoT continuum. CSM enables the development of applications as a collection of state machines that collaborate autonomously and can be distributed across the layers of the continuum. Key features of CSM include (i) a collaboration mechanism among state machines using events and persistent data; (ii) encapsulation of state, encompassing both the inherent state of state machines and persistent data; (iii) actions and service invocations as an integral part of states and state transitions, decoupling complex application logic from compute and data processing services; and (iv) an advanced data model supporting the processing of local, static, and persistent data with scope and lifetime. We evaluate CSM through two realistic use cases. Our evaluation with a light control use case demonstrates that CSM provides a significant improvement in productivity and costs over a state-of-the-art application development platform (AWS Step Functions). Preliminary experiments with a video surveillance use case, incorporating ML-based object detection and face analysis models, demonstrate that CSM also effectively accommodates the Edge layer, resulting in a 10% lower reaction time and a 6x cost reduction compared to an equivalent implementation using Cloud-IoT rather than Cloud-Edge-IoT.
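
The toy sketch below is not the CSM language or runtime; it only illustrates the underlying idea of feature (i) and (ii): two state machines that collaborate solely through events and shared persistent data, with actions attached to transitions. The light-control scenario, names, and event bus are assumptions.

```python
from collections import deque

events = deque()                  # shared event queue (collaboration channel)
persistent = {"light_on": False}  # shared persistent data

class MotionSensor:
    state = "idle"
    def step(self, motion_detected):
        if self.state == "idle" and motion_detected:
            self.state = "reporting"
            events.append(("motion", None))    # raise an event on transition
        elif self.state == "reporting" and not motion_detected:
            self.state = "idle"

class LightController:
    state = "off"
    def step(self):
        while events:
            name, _ = events.popleft()
            if name == "motion" and self.state == "off":
                self.state = "on"
                persistent["light_on"] = True  # action attached to transition

sensor, controller = MotionSensor(), LightController()
sensor.step(motion_detected=True)
controller.step()
print(persistent)  # {'light_on': True}
```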


Adaptive-Precision SpMV on modern multicore CPUs and GPUs: A Roofline Perspective


Dane Lacey

University of Erlangen, Germany


Data movement remains a critical bottleneck in many scientific applications, necessitating the optimization of memory-bound kernels to achieve scalable software design. The Sparse Matrix-Vector Product (SpMV) is a kernel that appears frequently and is notorious for the heavy strain it puts on the main memory interface, both from loading the sparse matrix and from its possibly irregular memory access pattern. We introduce the Roofline model and show that, to estimate the performance of SpMV, it suffices to measure only the data volume transferred and the bandwidth of the main memory interface. When designing software for high scalability, mixed-precision techniques are increasingly being adopted. Furthermore, the choice of sparse matrix storage format has a considerable performance impact on SpMV; this choice is highly sensitive to the underlying hardware, which is an obstacle when using heterogeneous systems. Using a data-driven "adaptive-precision" technique to accelerate SpMV (AP-SpMV), we compare the performance of AP-SpMV with a variety of storage formats on modern hardware from Intel and Nvidia, drawing on insights from the Roofline model to understand the results we observe. Due to the rise in popularity of pruning and compression techniques in neural network training, handling sparsity efficiently is quickly becoming one of today's most important engineering challenges. While the content presented is focused on classical HPC applications, the themes are pertinent to contemporary challenges in AI and machine learning.
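
A minimal sketch of the Roofline argument for CRS SpMV: estimate the data traffic per nonzero and divide the measured memory bandwidth by the resulting code balance. The traffic formula below assumes double-precision values, 4-byte column indices, a load and store of the result vector per row, and a tunable amount of right-hand-side traffic; it is a generic textbook-style model, not the exact model used in the talk.

```python
def spmv_roofline_gflops(mem_bw_gbs, nnz_per_row,
                         bytes_val=8, bytes_idx=4, alpha=0.0):
    """Upper performance bound for CRS SpMV from main-memory traffic alone.

    mem_bw_gbs:  measured main-memory bandwidth in GB/s
    nnz_per_row: average number of nonzeros per matrix row
    alpha:       extra right-hand-side traffic per nonzero (0 = perfect reuse)
    """
    bytes_per_nnz = (bytes_val + bytes_idx          # matrix value + column index
                     + (8 + 8 + 4) / nnz_per_row    # y load/store, row pointer
                     + alpha * 8)                   # rhs vector traffic
    flops_per_byte = 2.0 / bytes_per_nnz            # 2 flops per nonzero
    return mem_bw_gbs * flops_per_byte              # GFlop/s bound

# Example: a 350 GB/s socket, 20 nonzeros per row, perfect x reuse.
print(round(spmv_roofline_gflops(350, 20), 1), "GFlop/s upper bound")
```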


Compilers, profiling and deep learning: labeling instructions to feed microarchitectural predictors


Paul Kelly

Imperial College London, UK


Processor microarchitecture includes many predictors – not just for branches and prefetching but also for memory dependence, hit/miss, and more. Classically, predictors collect and use dynamic history. This talk explores the idea of labeling instructions with additional information, extending the reach and scope of information on which such predictions are made. We stick to just labels - no “real” change to the instruction set.

So where might we get the labels from? We look at using things the compiler can infer. We naturally also look at profile-guided optimization (PGO) on test input data. We also examine the question of what information the labels actually carry. We discuss some work-in-progress, and work by others, on using deep learning – both from the code and from PGO-time traces.


Steering Large Language Models (LLMs) using Sparse Autoencoders (SAEs)


Lalith Manjunath

TU Dresden, Germany


This talk gives a brief introduction to how Sparse Autoencoders (SAEs) can be leveraged to extract interpretable, monosemantic features from the opaque intermediate activations of LLMs, providing a window into their internal representations. We hope to initiate discussions on the methodology of training SAEs on LLM activations, the resulting sparse and high-dimensional representations, and how these can be utilized for model steering tasks.
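
A minimal PyTorch sketch of the kind of SAE described here: a single overcomplete hidden layer trained on cached LLM activations with an L1 sparsity penalty. The dimensions, loss weights, and data are placeholders, not the configuration used in the talk.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU hidden code."""
    def __init__(self, d_model=768, d_hidden=8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations):
        code = torch.relu(self.encoder(activations))   # sparse features
        return self.decoder(code), code

def sae_loss(model, activations, l1_coeff=1e-3):
    recon, code = model(activations)
    reconstruction = torch.mean((recon - activations) ** 2)
    sparsity = l1_coeff * code.abs().mean()            # L1 penalty drives sparsity
    return reconstruction + sparsity

# One training step on a batch of (here synthetic) residual-stream activations.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(64, 768)       # stand-in for cached LLM activations
opt.zero_grad()
loss = sae_loss(sae, batch)
loss.backward()
opt.step()
```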

We’ll examine a case study demonstrating the effectiveness of this approach in changing the level of model “proficiency”. This discussion aims to highlight the potential of SAEs as a scalable, unsupervised method for disentangling LLM behaviors, contributing to the broader goals of AI interpretability and alignment.


AgrUNet: a multi-GPU UNet-based model for crop classification


Andrea Miola

University of Ferrara, Italy


Agriculture acts as a catalyst for comprehensive economic growth, boosting income levels, mitigating poverty, and combating hunger. These reasons make it important to monitor agricultural practices and the use of parcels carefully and automatically, to support the sustainable use of natural resources.

The deployment of high-resolution satellite missions, like Landsat and Copernicus Sentinel, combined with AI Deep Learning (DL) methodologies, has revolutionized Earth Observation science, enabling studies on yield prediction, soil classification, and crop mapping over large areas, as well as the analysis and processing of Big Data using innovative approaches. This approach requires the significant computational power provided by modern High-Performance Computing (HPC) systems, since we deal with large amounts of data and DL algorithms are known to be very compute-heavy.

Additionally, recent multi-GPU HPC systems can boost the processing power of classical CPU-only computing systems by one or two orders of magnitude.

In this work, we developed AgrUNet, a scalable, fast, and reliable UNet-based architecture to perform crop segmentation on multispectral, multitemporal satellite data, implemented and optimized to run on single and multi-GPU environments.

Our model achieves a multiclass Dice score of approximately 0.90 and a peak throughput of 59 and 605 img/s for the training and inference steps, respectively, improving on the best results reported in the literature by approximately a factor of 7.
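
For reference, the multiclass Dice score cited above can be computed as in the generic sketch below (class-averaged Dice over integer-labeled segmentation maps); the exact averaging and masking used for AgrUNet may differ.

```python
import numpy as np

def multiclass_dice(pred, target, num_classes, eps=1e-7):
    """Mean Dice coefficient over classes for integer-labeled label maps."""
    scores = []
    for c in range(num_classes):
        p = (pred == c)
        t = (target == c)
        intersection = np.logical_and(p, t).sum()
        denom = p.sum() + t.sum()
        scores.append((2.0 * intersection + eps) / (denom + eps))
    return float(np.mean(scores))

# Example on two tiny 4x4 label maps with 3 crop classes.
pred = np.random.randint(0, 3, (4, 4))
target = np.random.randint(0, 3, (4, 4))
print(multiclass_dice(pred, target, num_classes=3))
```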

Testing it on different multi-GPU HPC systems and numerical precisions, we observe nearly ideal speedup for both the training and inference phases on 4x V100 and 8x A100 GPU systems.


The nature of computations inside large language models


José Moreira

IBM, T. J. Watson Research Lab., USA


It would be an understatement to say that large language models (LLMs) have taken the world by storm. Their capability to solve complex problems can be described as "surprising". Nevertheless, the underlying numerical computations supporting LLMs are remarkably simple. We can almost say that they are "just a bunch of matrix-vector multiplications", with a few other operations sprinkled in. But the "bunch" is actually a "very big bunch", and significant computational resources are required to process them efficiently and quickly. In this talk, we will review and quantify the basic math kernels in a large language model and discuss how they can be computed efficiently. We will review when and how matrix-vector multiplications can be promoted to the more efficient matrix-matrix multiplications and when they cannot. We will also discuss additional optimizations to the basic matrix operations, describe some of the "other operations sprinkled in", and examine whether and how they can be optimized.
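
A generic illustration of the promotion mentioned above (not IBM's implementation): when several independent input vectors share the same weight matrix, the individual matrix-vector products can be stacked into a single matrix-matrix product, so each weight element is reused many times once loaded instead of being streamed from memory per vector.

```python
import numpy as np

d_model, d_ff, batch = 1024, 4096, 32
W = np.random.randn(d_ff, d_model).astype(np.float32)   # shared weight matrix
xs = [np.random.randn(d_model).astype(np.float32) for _ in range(batch)]

# One matrix-vector product per input: W is streamed `batch` times,
# so the computation is memory-bandwidth-bound.
ys_matvec = [W @ x for x in xs]

# Promotion: stack the vectors and perform a single matrix-matrix product,
# reusing each element of W `batch` times from cache.
X = np.stack(xs, axis=1)                                 # (d_model, batch)
Y = W @ X                                                # (d_ff, batch)

# Same result up to floating-point rounding differences.
assert np.allclose(np.stack(ys_matvec, axis=1), Y, atol=1e-2)
```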


Backpropagation made easy


Keshav Pingali

University of Texas Austin, USA


Backpropagation plays a central role in training deep neural networks. However, existing algorithms for backpropagation are quite complex and are difficult to generalize to complex neural networks such as those with irregular connections between layers. In this talk, we reframe backpropagation in terms of backward dataflow analysis, which is a framework used in optimizing compilers, and show that this leads to a simple, compositional approach to backpropagation.
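
For context, the sketch below shows the conventional reverse-mode view that the talk aims to simplify: adjoints propagated backward over a tiny computation graph, each scaled by a local derivative. It is a generic illustration, not the backward-dataflow-analysis formulation presented in the talk.

```python
class Node:
    """Scalar value in a computation graph with reverse-mode gradients."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        out = Node(self.value * other.value)
        out.parents = ((self, other.value), (other, self.value))
        return out

    def __add__(self, other):
        out = Node(self.value + other.value)
        out.parents = ((self, 1.0), (other, 1.0))
        return out

    def backward(self, seed=1.0):
        # Accumulate the adjoint, then pass it to each parent
        # scaled by the local partial derivative.
        self.grad += seed
        for parent, local_deriv in self.parents:
            parent.backward(seed * local_deriv)

x, w, b = Node(2.0), Node(3.0), Node(1.0)
y = x * w + b          # forward pass
y.backward()           # backward pass
print(w.grad, x.grad)  # 2.0 3.0
```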


Always-on Introspection for Large HPC Systems: Benefits and Progress


Amir Raoofy

Technical University of Munich, Germany


Having insight into the characteristics of the applications running on their HPC systems is a huge benefit for HPC centers. Introspection of user applications via always-on, system-wide monitoring tools can provide this insight without bothering the users. At the Leibniz Supercomputing Centre, we integrate a lightweight sampling method into our in-house monitoring tool DCDB. This scheme leverages eBPF (extended Berkeley Packet Filter) from modern Linux kernels. In this talk, we showcase the benefits of enabling introspection in DCDB and discuss the latest developments and progress.


Machine Learning and Compiler Optimization


Saday Sadayappan

University of Utah, USA


A fundamental challenge in various aspects of performance optimization by compilers is the development of effective performance models. Often, the space of alternate transformed code versions is explosively large, and determining which of those would perform best is very challenging because of the difficulty of developing accurate analytical performance models. Therefore, there has been interest in using Machine Learning for performance modeling in optimizing compilers. This talk will discuss some current directions being pursued.
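
A toy version of the idea, with made-up features and runtimes for illustration only: benchmark a handful of transformed code variants, fit a model mapping transformation parameters to runtime, and use it to rank untried variants without running them. Real systems use much richer features and learners than the linear fit below.

```python
import numpy as np

# Features of measured code variants: (tile size, unroll factor, vector width).
measured_variants = np.array([
    [16, 1, 4], [16, 4, 8], [32, 2, 4], [32, 4, 8],
    [64, 1, 8], [64, 8, 4], [128, 2, 8], [128, 8, 8],
], dtype=float)
measured_runtimes = np.array(   # seconds; illustrative numbers, not real data
    [1.9, 1.2, 1.5, 0.9, 1.4, 1.1, 1.3, 1.0])

# Fit a simple linear performance model by least squares.
X = np.hstack([measured_variants, np.ones((len(measured_variants), 1))])
coeffs, *_ = np.linalg.lstsq(X, measured_runtimes, rcond=None)

def predict_runtime(tile, unroll, vec):
    return np.array([tile, unroll, vec, 1.0]) @ coeffs

# Rank untried variants with the learned model.
candidates = [(32, 8, 8), (64, 4, 8), (128, 4, 4)]
print(sorted(candidates, key=lambda c: predict_runtime(*c)))
```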


Fast matrix multiplication for AI


Oded Schwartz

Hebrew University of Jerusalem, Israel


AI technologies like ChatGPT have transformed human-computer interaction, boosting productivity across sectors. However, they demand enormous computational and energy resources. Training advanced language models can incur costs in the hundreds of millions, with AI projected to consume 7.5% of global electricity by 2025, up from 1%. Matrix multiplication dominates neural network operations, accounting for 45-95% of the computational workload. Despite industry advancements in software (e.g., Intel's MKL, NVIDIA's CUDA) and hardware (e.g., NVIDIA's H100, Google's TPU), most solutions still rely on inefficient cubic-time algorithms. Theoretically faster sub-cubic algorithms face practical hurdles: large hidden constants in their complexity, large input size requirements, communication cost overheads, potential numerical instability, and hardware-software mismatches. These challenges hinder their real-world adoption despite theoretical benefits. The talk will explore these obstacles and discuss potential solutions for more efficient AI computation.


Inference Segmentation of Aortic Valve Calcium Lesions using FPGA Accelerators


Valentina Sisini

University of Ferrara, Italy


This paper presents the porting of a Deep Neural Network (DNN) application to a Field-Programmable Gate Array (FPGA), focusing on the task of calcium lesion segmentation within the cardiac aortic valve. The application is implemented using a convolutional neural network, specifically a U-Net, which is widely recognized for its effectiveness in segmentation tasks. The primary goal of this work is to evaluate the feasibility of deploying this application on an FPGA without compromising accuracy. Additionally, we assess the performance in terms of throughput and energy efficiency, comparing these metrics with the performance achieved on GPUs, which are nowadays the most commonly used accelerators for these tasks. FPGAs are emerging as promising alternatives, offering the potential for high performance-per-watt when operating with reduced numerical precision and the capability to support arbitrary-precision computations. Therefore, this study investigates the possibility of quantizing a DNN and executing it on an FPGA at low precision while maintaining accuracy, providing a comprehensive analysis of the resulting performance.
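
A minimal sketch of the kind of quantization at issue: symmetric per-tensor int8 quantization of a weight tensor. The actual FPGA flow uses toolchain-specific, possibly arbitrary-precision quantization; this generic example only makes the precision/accuracy trade-off concrete.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns codes and a scale."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical stand-in for a U-Net convolution kernel's weights.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()
print(f"max abs quantization error: {err:.4f}")
```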


Prompt Engineering for Engineers - Trying to understand existing HPC code through the use of AI


Carsten Trinitis

Technical University of Munich, Germany


Within a project at TUM Heilbronn, two student assistants were assigned the task of integrating existing software packages into one open-source-based tool. These software packages, which are used within Electrical Engineering for simulating electrostatic fields, have existed for many years and are hard to understand for those who were not involved in their design and development. Very often, they are barely documented. In order to integrate these packages, which use their own data formats to represent geometries and boundary conditions, these formats need to be well understood and converted. The students at TUM Heilbronn have developed guidelines on how to efficiently use ChatGPT to assist them in understanding the above-mentioned problems. While the ChatGPT hype is still ongoing and a lot of useless information is generated through it, the talk will outline how it can be used in a more meaningful way where applicable.


Pebbling game and alternative basis for fast and stable matrix multiplication


Noa Vaknin

Hebrew University of Jerusalem, Israel


Matrix multiplication is a fundamental computation kernel in many fields, including artificial intelligence (AI) and a large number of scientific computing applications. Consequently, many efforts have been invested in improving its performance. Although sub-cubic algorithms exist, most high-performance implementations are based on the classical Θ(n³) matrix multiplication algorithm. Designing an algorithm that obtains even modest improvements in performance over existing implementations requires carefully addressing challenges such as reducing computation costs, communication costs, and memory footprint. We have provided the first high-performance general matrix-matrix multiplication on CPUs that utilizes the alternative basis method on Strassen's algorithm, and it outperforms Intel's MKL DGEMM on feasible matrix dimensions. The numerical stability of this algorithm does not fall short of that of Strassen's algorithm, beating the speed-stability tradeoff of Bini and Lotti.
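
For context, the classical Strassen recursion (seven half-size products instead of eight) is sketched below for square power-of-two matrices. The alternative-basis variant discussed in the talk additionally changes the basis in which the blocks are encoded to reduce the number of additions; it is not shown here.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Classical Strassen multiply for square power-of-two matrices."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                     # fall back to BLAS below the cutoff
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```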


System-side Tooling for Energy-Aware Supercomputers


Josef Weidendorfer

Technical University of Munich, Germany


In this talk, I will give an overview of various efforts towards energy-aware system-side tooling for supercomputers, including the upcoming EU project SEANERGYS. I will also focus on the role of AI models for this use case.