Abstract : Today, leadership High Performance Computing (HPC) systems achieve exaflop capability through massive parallelism across thousands of heterogeneous compute nodes. This poses a huge challenge for math library developers seeking scalable performance. The heterogeneity trend is likely to continue, and it is anticipated that future computing systems could have multiple accelerators or special processing options to accommodate a variety of application needs. This poses a new challenge of handling multiple types of compute-node architectures. In this session, we will discuss the latest research on implementing linear algebra libraries for extreme-scale heterogeneous computing systems.
Organizer(s) : Keita Teranishi and Pedro Valero Lara
[03921] Using the StarPU task-based runtime system for heterogeneous platforms as the core engine for a linear algebra software stack.
Format : Talk at Waseda University
Author(s) :
Olivier Aumage (Inria)
Abstract : StarPU is a runtime system developed by Team STORM at Inria in Bordeaux, France, to support computing platforms based on heterogeneous architectures combining CPUs, GPUs, and FPGAs. This talk will present how the Sequential Task Flow programming model offered by StarPU is being used to build a scalable, comprehensive linear algebra software stack for heterogeneous supercomputers.
[04285] A Look at the Future of High-Performance Linear Algebra with DPLASMA and PaRSEC
Format : Online Talk on Zoom
Author(s) :
George Bosilca (The University of Tennessee)
Abstract : This talk will focus on dataflow programming as a way to address some of the challenges of linear algebra. I will concentrate in particular on two open-source projects, the PaRSEC runtime and the DPLASMA dense linear algebra library, and discuss the programming approach, the handling of heterogeneity, and the opportunities to cover a large spectrum of linear algebra needs.
[04413] Multiple- and Mixed-Precision BLAS with C++ Template
Format : Talk at Waseda University
Author(s) :
Toshiyuki Imamura (RIKEN Center for Computational Science)
Daichi Mukunoki (RIKEN Center for Computational Science)
Atsushi Suzuki (RIKEN Center for Computational Science)
Abstract : We propose a new design for BLAS that can handle multiple- and mixed-precision computations. Our templated mixed-precision BLAS (tmBLAS) addresses weaknesses in existing BLAS by decoupling the data types of each operand and operator using C++ generic programming, with explicit descriptions of operators and type-castings. We demonstrate a prototype implementation that instantiates routines with FP16, FP32, FP64, FP128, and double-double (DD) data types, carrying out operations at one precision level higher than the data precision.
[05194] MatRIS: A Scalable and Performance Portable Math Library for Heterogeneous and Multi-Device Systems based on the IRIS Runtime
Format : Talk at Waseda University
Author(s) :
Keita Teranishi (Oak Ridge National Laboratory)
Pedro Valero-Lara (Oak Ridge National Laboratory)
Abstract : Vendor libraries are tuned for one architecture and are not portable to others. Moreover, they lack support for heterogeneity and orchestration of multi-device computation. We introduce MatRIS, a scalable and performance-portable library of sparse/dense BLAS/LAPACK operations, to address these challenges. MatRIS separates linear algebra algorithms from vendor libraries by using the IRIS runtime. This abstraction makes the implementation completely agnostic to the vendor libraries and architectures, providing high programming productivity. We demonstrate that MatRIS can fully utilize different multi-device heterogeneous systems, achieving high performance and scalability on three heterogeneous systems: Summit (#5 TOP500), Frontier (#1 TOP500), and CADES with four NVIDIA A100 GPUs and four AMD MI100 GPUs. A detailed performance study is presented for sparse and dense LU factorization, where MatRIS provides a speedup of up to 8× over the previous version of the library (LaRIS). Along with better scalability, MatRIS provides competitive and even better performance than vendor libraries.
[05190] Responsibly Reckless Matrix Algorithms for HPC Scientific Applications
Format : Talk at Waseda University
Author(s) :
Hatem Ltaief (KAUST)
Abstract : We highlight the implications of mixed-precision (MP) computations for HPC applications, using what Jack Dongarra, the 2021 ACM Turing Award Laureate, calls “responsibly reckless” matrix algorithms. Reducing precision comes at the price of trading away some accuracy for performance (reckless), but only in noncritical segments of the workflow (responsible), so that the accuracy requirements of the application can still be satisfied. We illustrate the MP impact on seismic imaging, climate/environment geospatial predictions, and computational astronomy.
[04598] A scalable multi-GPU approach for solving H2-approximated dense linear systems
Format : Talk at Waseda University
Author(s) :
Qianxiang Ma (Tokyo Institute of Technology)
Rio Yokota (Global Scientific Information and Computing Center, Tokyo Institute of Technology)
Abstract : In this talk, we present a novel approach for directly solving a dense linear system arising from a 3-D geometry approximated with $\mathcal{H}^2$-matrices. By pre-compressing the fill-ins, we are able to perform the ULV factorization and the forward and backward substitutions in an entirely parallel manner using batched BLAS/LAPACK operations on GPUs. Using 512 NVIDIA V100 GPUs, we factorize a matrix of N=29,242,368 in under 1 second, achieving 0.808 PFLOP/s of performance.
[05233] Towards a Unified Micro-kernel Abstraction for GPU Linear Algebra
Format : Talk at Waseda University
Author(s) :
Vijay Thakkar (NVIDIA | Georgia Tech)
Richard Vuduc (Georgia Tech)
Abstract : We have created a micro-kernel abstraction for GPUs robust enough to represent the tensor core and data movement operations of NVIDIA GPU architectures spanning Maxwell all the way to Hopper. In this talk, we discuss how CuTe layouts and layout algebra allow us to uniformly represent GPU architecture-specific operations in a consistent programming model, regardless of the threads and data they operate upon, to build CUTLASS 3.x’s core abstractions.