# Registered Data

## [02600] Applied and computational discrete algorithms

**Session Date & Time**:
- 02600 (1/2): 2C (Aug. 22, 13:20-15:00)
- 02600 (2/2): 2D (Aug. 22, 15:30-17:10)

**Type**: Proposal of Minisymposium

**Abstract**: Combinatorial and discrete mathematical problems arise in many real-life applications. Solving them requires designing, analyzing, and implementing discrete algorithms. Research in this area therefore brings together mathematicians, computer scientists, statisticians, domain scientists, and engineers to solve problems of applied and computational combinatorics. The proposed two-part minisymposium covers scientific computing, machine learning, and graph algorithms. The objective is to summarize the latest discrete algorithmic developments and their applications, with computational studies on the applications. This minisymposium follows similar ones held at ICIAM 2019, ICIAM 2015, and ICIAM 2011 on combinatorial aspects of sparse matrix computations.

**Organizer(s)**: Alex Pothen, Bora Ucar

**Classification**: __05C85__, __65F50__, __68W05__

**Speakers Info**:
- Cevdet Aykanat (Bilkent University)
- Xiaoye Sherry Li (Lawrence Berkeley National Laboratory)
- Christian Schulz (Heidelberg University)
- Oded Schwartz (Hebrew University)
- Julian Shun (MIT)
- Alex Pothen (Purdue University)
- Bora Ucar (CNRS and LIP, ENS de Lyon)
- Helen Xu (Lawrence Berkeley National Laboratory)

**Talks in Minisymposium**:

**[02991] Nested dissection ordering on GPUs**

**Author(s)**:
- **Xiaoye Sherry Li** (Lawrence Berkeley National Lab)

**Abstract**: Nested dissection ordering is an important preprocessing step in sparse matrix factorizations. It is effective in reducing the amount of fill in the factored matrices. As of now, there is no good algorithm or software to compute such an ordering on GPUs. We have made some progress in this direction. We will present our new algorithms designed for GPUs, including multilevel graph coarsening and finding small separators at each level, and the results of parallel runtime and ordering quality.
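
The recursive idea behind nested dissection can be sketched on the simplest possible case, a path graph, where the middle vertex is a perfect separator. This is an illustrative toy only; real orderings for general sparse matrices find small separators with graph partitioners (e.g., METIS), as the talk's multilevel approach does.

```python
# A minimal sketch of nested dissection ordering on a path graph
# with vertices lo..hi: split at the middle vertex (the separator),
# order both halves recursively, and eliminate the separator last.

def nested_dissection_path(lo, hi, order):
    """Append vertices of the path lo..hi in nested dissection order."""
    if lo > hi:
        return
    mid = (lo + hi) // 2              # middle vertex separates the path
    nested_dissection_path(lo, mid - 1, order)
    nested_dissection_path(mid + 1, hi, order)
    order.append(mid)                 # separator is eliminated last

order = []
nested_dissection_path(0, 6, order)   # 7-vertex path
```

Eliminating separators last keeps fill confined to the small separator blocks, which is why the ordering reduces fill in the factors.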

**[03099] On heuristics for the Birkhoff--von Neumann decomposition**

**Author(s)**:
- **Bora Ucar** (CNRS and ENS de Lyon)
- Jeremy Cohen (CREATIS, CNRS)

**Abstract**: The Birkhoff--von Neumann decomposition expresses a doubly stochastic matrix as a convex combination of permutation matrices. This talk will be an introduction to this decomposition. We will discuss algorithmic and combinatorial problems associated with this decomposition.
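
A common greedy heuristic for this decomposition repeatedly extracts a permutation supported on the positive entries and subtracts it with the largest feasible weight. The sketch below (an assumption about one such heuristic, not the specific algorithms of the talk) uses SciPy's assignment solver to pick each permutation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decomposition(A, tol=1e-12):
    """Greedy Birkhoff--von Neumann decomposition of a doubly
    stochastic matrix A: returns (weight, permutation matrix) pairs
    with A ~= sum of w * P.  The choice of permutation is heuristic."""
    A = A.astype(float).copy()
    terms = []
    while A.max() > tol:
        # Max-weight assignment tends to stay on the positive entries.
        row, col = linear_sum_assignment(-A)
        w = A[row, col].min()          # largest weight we can subtract
        if w <= tol:
            break
        P = np.zeros_like(A)
        P[row, col] = 1.0
        terms.append((w, P))
        A -= w * P                     # residual stays doubly substochastic
    return terms

# A 3x3 doubly stochastic example.
A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
terms = birkhoff_decomposition(A)
recon = sum(w * P for w, P in terms)
```

The number of terms produced (and hence the quality of the heuristic) depends strongly on how each permutation is chosen, which is exactly the kind of combinatorial question the talk addresses.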

**[03109] Approximation: a Paradigm for Designing Parallel Graph Algorithms**

**Author(s)**:
- **Alex Pothen** (Purdue University)
- S M Ferdous (Pacific Northwest National Lab)

**Abstract**: We describe a paradigm for designing parallel algorithms by approximation techniques. Instead of solving a problem exactly, for which parallel algorithms may not exist, we seek a solution with provable approximation guarantees. Furthermore, we design these algorithms to be concurrent. We discuss linear and submodular matching and edge cover problems for which such algorithms have been designed, and describe their use in solving problems in sparse matrix computations and load balancing problems in quantum chemistry.
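
As a concrete instance of the paradigm, the classical greedy algorithm for weighted matching gives a 1/2-approximation guarantee; parallel variants match locally dominant edges concurrently instead of scanning a global sorted order. The sequential sketch below shows the guarantee-bearing baseline (illustrative, not the talk's specific algorithms).

```python
# A minimal sketch of the 1/2-approximation greedy algorithm for
# maximum-weight matching: scan edges by decreasing weight and take
# every edge whose endpoints are both still free.

def greedy_matching(edges):
    """edges: list of (weight, u, v) tuples.
    Returns a matching whose weight is >= half the optimum."""
    matched = set()
    matching = []
    for w, u, v in sorted(edges, reverse=True):   # heaviest first
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.add(u)
            matched.add(v)
    return matching

# A path a-b-c-d-e with weights 4, 3, 2, 5.
edges = [(4, 'a', 'b'), (3, 'b', 'c'), (2, 'c', 'd'), (5, 'd', 'e')]
matching = greedy_matching(edges)
```

Here the greedy solution happens to be optimal (weight 9); in general it is guaranteed to be within a factor of two, which is the kind of provable bound the approximation paradigm trades for parallelism.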

**[03435] Recent Advances in Streaming (Hyper)Graph Partitioning**

**Author(s)**:
- **Christian Schulz** (Heidelberg University)

**Abstract**: Partitioning a (hyper)graph into balanced blocks such that few edges run between blocks is a key problem for large-scale distributed processing. There is currently a gap in the space of available partitioning algorithms. On the one hand, there are streaming algorithms that have been adopted to partition massive graph data on small machines. In the streaming model, vertices arrive one at a time, including their neighborhood, and have to be assigned directly to a block. These algorithms can partition huge graphs quickly with little memory, but they produce partitions of low solution quality. On the other hand, there are offline (shared-memory) multilevel algorithms that produce partitions of high quality but need a machine with enough memory to partition huge networks.

In this talk, we present recent advances in the area of streaming algorithms for the problem. First, we present a buffered streaming approach: this model allows reading more than one node and its neighborhood at a time. This enables our algorithm to leverage multilevel techniques and thus significantly improve solution quality, while surprisingly also improving the overall complexity of the algorithm. Second, we present a shared-memory streaming multi-recursive partitioning scheme that performs recursive multi-sections on the fly, without knowing the overall input graph, to compute hierarchical partitionings. If the topology of a distributed system is known, it is possible to further optimize the communication costs by mapping partitions onto processing elements.
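
The one-pass streaming model described above can be sketched with a simple greedy rule: place each arriving vertex in the block holding most of its already-placed neighbors, subject to a balance cap. This is an illustrative toy in the spirit of greedy streaming heuristics, not the algorithms of the talk, and the scoring function is an assumption (production partitioners use more refined scores).

```python
# A minimal sketch of one-pass streaming graph partitioning: each
# vertex arrives with its neighborhood and is assigned immediately.

def stream_partition(stream, k, max_size):
    """stream: iterable of (vertex, neighbors).  Returns vertex -> block."""
    block_of = {}
    sizes = [0] * k
    for v, neighbors in stream:
        # Count already-assigned neighbors per block.
        gain = [0] * k
        for u in neighbors:
            if u in block_of:
                gain[block_of[u]] += 1
        # Highest gain wins; ties go to the smaller block.
        best = min(range(k), key=lambda b: (-gain[b], sizes[b]))
        if sizes[best] >= max_size:          # enforce balance
            best = min(range(k), key=lambda b: sizes[b])
        block_of[v] = best
        sizes[best] += 1
    return block_of

# Two triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1, 3]),
          (3, [2, 4, 5]), (4, [3, 5]), (5, [3, 4])]
blocks = stream_partition(stream, k=2, max_size=3)
```

On this stream the greedy rule recovers the natural two-triangle partition with a cut of one edge; on adversarial arrival orders it can do much worse, which is the quality gap that buffered and multilevel streaming techniques aim to close.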

**[04306] Parallel Batch-Dynamic Graph Algorithms**

**Author(s)**:
- **Julian Shun** (MIT)

**Abstract**: There has been significant interest in graph analytics due to their applications in many domains, including social network and Web analytics, machine learning, biology, and physical simulations. Real-world graphs today are massive and also dynamic. As many real-world graphs change rapidly, it is crucial to design dynamic algorithms that efficiently maintain graph statistics upon updates, since the cost of re-computation from scratch can be prohibitive. Furthermore, due to the high frequency of updates, we can improve performance by using parallelism to process batches of updates at a time.

This talk presents new graph algorithms in this parallel batch-dynamic setting. Specifically, we present the first parallel batch-dynamic algorithm for approximate k-core decomposition that is efficient in both theory and practice. Our algorithm is based on our novel parallel level data structure, inspired by the sequential level data structures of Bhattacharya et al. and Henzinger et al. Given a graph with n vertices and a batch of B updates, our algorithm maintains a (2 + epsilon)-approximation of the coreness values of all vertices (for any constant epsilon > 0) in O(B log^2(n)) amortized work and O(log^2(n) loglog(n)) span (parallel time) with high probability.

We implement and experimentally evaluate our algorithm, and demonstrate significant speedups over state-of-the-art serial and parallel implementations for dynamic k-core decomposition. We have also designed new parallel batch-dynamic algorithms for low out-degree orientation, maximal matching, clique counting, graph coloring, minimum spanning forest, and single-linkage clustering, some of which use our parallel level data structure.
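
The coreness values being maintained can be defined by the classical static peeling algorithm, sketched below: repeatedly delete a minimum-degree vertex, and a vertex's coreness is the largest minimum degree seen up to its deletion. This is the baseline computation (illustrative only) that the talk's batch-dynamic data structure avoids re-running from scratch after every update.

```python
# A minimal sketch of static k-core decomposition by peeling.

def core_numbers(adj):
    """adj: dict vertex -> set of neighbors.  Returns vertex -> coreness."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=lambda u: deg[u])   # peel a min-degree vertex
        k = max(k, deg[v])                     # current peeling threshold
        core[v] = k
        alive.remove(v)
        for u in adj[v]:                       # update remaining degrees
            if u in alive:
                deg[u] -= 1
    return core

# A triangle (coreness 2) with one pendant vertex (coreness 1).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
core = core_numbers(adj)
```

Peeling costs time proportional to the whole graph, which is exactly why a (2 + epsilon)-approximate structure with polylogarithmic work per update is attractive for rapidly changing graphs.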

**[04669] Communication Efficient Stratified SGD on Distributed-Memory Systems**

**Author(s)**:
- Nabil Abubaker (Bilkent University)
- Ozan Karsavuran (Lawrence Berkeley National Lab)
- **Cevdet Aykanat** (Bilkent University)

**Abstract**: Stratified Stochastic Gradient Descent (SSGD) enables stale-free low-rank approximation of sparse matrices on distributed-memory systems. We present a framework to efficiently scale SSGD on HPC systems. The framework leverages point-to-point communications to reduce the bandwidth overhead, and a novel combinatorial algorithm to reduce the upper bound on the number of messages from O(K^2) to O(K log K), where K is the number of processors. Our experiments show an order of magnitude performance gap in favor of our framework-enabled SSGD compared to the state-of-the-art.
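
The per-entry update that SSGD parallelizes can be sketched serially: for each observed entry (i, j) of the matrix, nudge row factor W[i] and column factor H[j] toward W[i] . H[j] = a_ij. The dimensions, learning rate, and synthetic rank-2 data below are illustrative assumptions; stratification (not shown) groups entries into row- and column-disjoint blocks so processors can apply these updates concurrently without conflicts or staleness.

```python
import random
import numpy as np

# A minimal serial sketch of the SGD update underlying stratified SGD
# for low-rank approximation: A ~= W @ H.T with rank r factors.

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
# Synthetic rank-2 data: a_ij = i + j (exactly representable).
entries = [(i, j, float(i + j)) for i in range(m) for j in range(n)]

W = 0.1 * rng.standard_normal((m, r))
H = 0.1 * rng.standard_normal((n, r))
lr = 0.05
random.seed(0)
for epoch in range(200):
    random.shuffle(entries)
    for i, j, a in entries:
        err = W[i] @ H[j] - a                  # error on entry (i, j)
        # Simultaneous update of the two touched factor rows.
        W[i], H[j] = W[i] - lr * err * H[j], H[j] - lr * err * W[i]

rmse = np.sqrt(np.mean([(W[i] @ H[j] - a) ** 2 for i, j, a in entries]))
```

Because each update touches only one row of W and one row of H, entries from row/column-disjoint strata commute, which is what makes the distributed-memory parallelization stale-free.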

**[04746] Efficient Data Structures for Sparse Dynamic Graphs**

**Author(s)**:
- **Helen Xu** (Lawrence Berkeley National Laboratory)

**Abstract**: Sparse graphs underlie many key applications such as social network analysis and scientific computing. This talk discusses dynamic graph data structures and how to choose data structures that optimize for locality, which is key to performance in graph computations. The focus is on how different use cases, which can cause different data patterns and access patterns, can influence the design of these structures and how to exploit these differences to maximize performance.
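
The locality tradeoff can be illustrated with the simplest contiguous representation: storing each neighbor list as a sorted array gives cache-friendly scans (good for analytics-heavy workloads) at the cost of shifting elements on insert (bad under heavy updates). This toy structure is an illustration of the tradeoff, not the structures of the talk, which use more sophisticated designs such as packed memory arrays to get both.

```python
import bisect

# A minimal sketch of a dynamic graph with sorted-array neighbor lists.

class SortedAdjacencyGraph:
    def __init__(self):
        self.adj = {}                 # vertex -> sorted list of neighbors

    def add_edge(self, u, v):
        for a, b in ((u, v), (v, u)):
            nbrs = self.adj.setdefault(a, [])
            i = bisect.bisect_left(nbrs, b)     # O(log degree) search
            if i == len(nbrs) or nbrs[i] != b:  # ignore duplicates
                nbrs.insert(i, b)               # O(degree) shift, contiguous

    def degree(self, v):
        return len(self.adj.get(v, []))

g = SortedAdjacencyGraph()
for u, v in [(0, 1), (0, 2), (1, 2), (0, 1)]:   # last edge is a duplicate
    g.add_edge(u, v)
```

A pointer-based adjacency list makes each insert O(1) but scatters neighbors across memory, so scans lose locality; which side of the tradeoff wins depends on the update-to-query ratio of the workload.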

**[05454] Accelerating AI using Fast and Feasible Matrix Multiplication**

**Author(s)**:
- **Oded Schwartz** (The Hebrew University)

**Abstract**: AI requires large resources, both for training and for inference. It involves significant time spent on matrix multiplication, typically between 45% and 95%. Most current math libraries (for CPU and GPU) and all state-of-the-art hardware accelerators (such as Google's TPU and Intel Habana Labs' Gaudi) are based on the cubic-time classic matrix multiplication algorithm, despite more than five decades of research on sub-cubic time algorithms. In this talk I will review several of the challenges in utilizing fast matrix multiplication algorithms, and recent solutions that can allow faster application in practice.
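
The best-known sub-cubic algorithm is Strassen's, which multiplies 2^k x 2^k matrices with 7 recursive products instead of 8, giving O(n^2.81) arithmetic. The sketch below recurses all the way to 1x1 blocks for clarity; practical implementations (one of the feasibility issues the talk addresses) switch to the classic kernel below a cutoff and must manage the extra additions and memory traffic.

```python
import numpy as np

# A minimal sketch of Strassen's algorithm for n x n matrices, n a
# power of two: 7 half-size products M1..M7 replace the 8 of the
# classic algorithm.

def strassen(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B                       # base case: scalar product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = np.empty_like(A)                   # reassemble the four blocks
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
C = strassen(A, B)
```

The 18 block additions per level are the price of saving one multiplication, which is one reason hardware accelerators tuned for dense classic kernels have been slow to adopt fast algorithms.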