
Abstract:

"We say that an imperative data structure is snapshottable or supports snapshots if we can efficiently capture its current state, and restore a previously captured state to become the current state again. This is useful, for example, to implement backtracking search processes that update the data structure during search.

Inspired by a data structure proposed in 1978 by Baker, we present a snapshottable store, a bag of mutable references that supports snapshots. Instead of capturing and restoring an array, we can capture an arbitrary set of references (of any type) and restore all of them at once. This snapshottable store can be used as a building block to support snapshots for arbitrary data structures, by simply replacing all mutable references in the data structure by our store references. We present use-cases of a snapshottable store when implementing type-checkers and automated theorem provers.

Our implementation is designed to provide a very low overhead over normal references, in the common case where the capture/restore operations are infrequent. Read and write in store references are essentially as fast as in plain references in most situations, thanks to a key optimisation we call record elision. In comparison, the common approach of replacing references by integer indices into a persistent map incurs a logarithmic overhead on reads and writes, and sophisticated algorithms typically impose much larger constant factors.

The implementation, which is inspired by Baker’s and the OCaml implementation of persistent arrays by Conchon and Filliâtre, is both fairly short and very hard to understand: it relies on shared mutable state in subtle ways. We provide a mechanized proof of correctness of its core using the Iris framework for the Coq proof assistant."

Motivating example:

"Union-Find is a central data structure in several algorithms. For example, it is at the core of ML type inference, which proceeds by repeated unification between type variables. Union-Find can also be used to track equalities between type constructors, as introduced in the typing environment when type-checking Guarded Algebraic Data Types (GADTs) for example.

When using a Union-Find data structure to implement a type system, it is common to need backtracking, which requires the inference state to be snapshottable. For example:

(1) A single unification between two types during ML type inference translates into several unifications between type variables, traversing the structure of the two types. If we discover that the two types are in fact incompatible, we fail with a type error. However, we may want to revert the unifications that were already performed, so that the error message shown to the user does not include confusing signs of being halfway through the unification, or so that the interactive toplevel session can continue in a clean environment.

(2) Production languages unfortunately have to consider backtracking to implement certain less principled typing rules: try A, and if it fails revert to a clean state and try B instead.

(3) GADT equations are only added to the typing environment in the context of a given match clause, and must then be rolled back before checking the other clauses.

We have encountered requirements (1) and (2) in the implementation of the OCaml type-checker, and (1) and (3) in the development of Inferno [Pottier, 2014], a prototype type-inference library implemented in OCaml that aims to be efficient.

Now a question for the reader: how would you change the Union-Find implementation above to support snapshots?"
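
One classic answer, sketched below, is trailing as done in Prolog engines: log every union so it can be undone, and make a snapshot just a mark on that trail. This is not the paper's solution (the paper gets snapshots generically by routing the structure's mutable state through store references); path compression is omitted here precisely because it mutates on reads, which is what makes the naive approach unsatisfying.

```c++
// Hypothetical sketch: Union-Find with union-by-rank and an undo trail.
// Path compression is omitted so that rollback only has to undo unions.
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent_(n), rank_(n, 0) {
        std::iota(parent_.begin(), parent_.end(), 0);
    }

    std::size_t find(std::size_t x) const {
        while (parent_[x] != x) x = parent_[x];
        return x;
    }

    void unite(std::size_t a, std::size_t b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);
        trail_.push_back({b, rank_[a]});   // remember what is about to change
        parent_[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }

    using Snapshot = std::size_t;          // position in the trail
    Snapshot snapshot() const { return trail_.size(); }

    void rollback(Snapshot snap) {
        // Undo unions in reverse order back to the snapshot.
        while (trail_.size() > snap) {
            auto [child, old_rank] = trail_.back();
            trail_.pop_back();
            std::size_t root = parent_[child];
            parent_[child] = child;        // detach the absorbed root again
            rank_[root] = old_rank;        // restore the surviving root's rank
        }
    }

private:
    struct Undo { std::size_t child; std::size_t old_rank; };
    std::vector<std::size_t> parent_;
    std::vector<std::size_t> rank_;
    std::vector<Undo> trail_;
};
```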


Abstract:

"Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization."

Observations:

> Observation 1: Achieving good performance on multi-GPU systems requires non-trivial tuning, which depends on the system, message size, communication library, and number of nodes. The default choices made by *CCL and GPU-Aware MPI are not always optimal, and manual tuning can improve performance up to an order of magnitude.

> Observation 2: GPU-Aware MPI provides the highest goodput for intra-node point-to-point transfers on all the analyzed systems. For small transfers, the optimal solution changes across the systems, depending on architectural features and specific optimization implemented by MPI.

> Observation 3: On LUMI, RCCL point-to-point communication primitives do not correctly determine the bandwidth available between GPUs on the same node, thus underutilizing the available bandwidth.

> Observation 4: For single node collectives, *CCL outperforms GPU-Aware MPI in most cases, except for small collectives on LUMI. Indeed, unlike MPI, *CCL collectives are optimized for the specific GPU models. Nevertheless, there is still room for collective algorithms optimization.

> Observation 5: On inter-node point-to-point communications, MPI outperforms *CCL by up to one order of magnitude on small transfers, and by up to 3x on larger transfers.

> Observation 6: On Alps and LUMI, GPU’s network location has a marginal impact on average performance (below 30% for latency and 1% for goodput). On the other hand, on Leonardo, the average latency increases by up to 2x when the GPUs are in different groups rather than under the same switch. Similarly, the average goodput decreases by 17%. This is mainly due to network performance variability caused by network noise.

> Observation 7: *CCL exploits the intra-node GPU-GPU interconnect more effectively than MPI, being specifically optimized for the target devices. Those advantages are more evident at smaller node counts and for larger transfers, for which the performance of intra-node communications has a higher weight on the overall performance. However, we experienced instability at large node counts for the alltoall on both NCCL and RCCL.

> Observation 8: Network noise decreases the goodput of allreduce and alltoall up to 50%.

*CCL refers to the NVIDIA Collective Communications Library (NCCL) and the AMD ROCm Communication Collectives Library (RCCL).
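
For context on what is being measured: "GPU-aware" means the MPI library accepts device pointers directly, so a benchmark can hand a CUDA (or HIP) buffer straight to MPI_Send/MPI_Recv with no staging copy through host memory. A minimal hypothetical ping in CUDA + MPI (buffer size and tag are arbitrary, error checking omitted):

```c++
// Minimal GPU-aware MPI point-to-point sketch: device pointers are passed
// straight to MPI_Send/MPI_Recv; a non-GPU-aware build would need explicit
// cudaMemcpy staging through host buffers.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                 // 1 Mi floats (arbitrary size)
    float* dev_buf = nullptr;
    cudaMalloc(&dev_buf, count * sizeof(float));

    if (rank == 0) {
        MPI_Send(dev_buf, count, MPI_FLOAT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(dev_buf, count, MPI_FLOAT, /*source=*/0, /*tag=*/0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}
```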


> On the other hand, on Leonardo, the average latency increases by up to 2x when the GPUs are in different groups rather than under the same switch.

I'm in the process of deploying a Dell 128 GPU/NIC cluster with AMD MI300x and we're going entirely into a single Dell Z9864F @ 400G. This is a brand new product and we are one of the first to receive it. Looking forward to doing some benchmarking to see how everything performs.


For the particularly useful subgraph of topics, see "Notes on Graph Algorithms Used in Optimizing Compilers" by Carl Offner: http://www.cs.umb.edu/~offner/files/flow_graph.pdf

That, and pretty much anything (co)authored by Robert E. Tarjan. I'm serious: https://github.com/search?q=repo%3Allvm%2Fllvm-project%20tar...

There's a good recent survey of the three workhorses: "We survey three algorithms that use depth-first search to find the strong components of a directed graph in linear time: (1) Tarjan's algorithm; (2) a cycle-finding algorithm; and (3) a bidirectional search algorithm."

"Finding Strong Components Using Depth-First Search" by Robert E. Tarjan and Uri Zwick: https://arxiv.org/abs/2201.07197
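
As a reference point, a compact sketch of the first of the three, Tarjan's one-pass algorithm (recursive for brevity; a production version would use an explicit stack to avoid overflowing on deep graphs). Usage is just TarjanSCC scc(adj); followed by scc.components().

```c++
// Sketch of Tarjan's strongly connected components algorithm.
#include <algorithm>
#include <vector>

class TarjanSCC {
public:
    explicit TarjanSCC(const std::vector<std::vector<int>>& adj)
        : adj_(adj), index_(adj.size(), -1), low_(adj.size(), 0),
          on_stack_(adj.size(), false) {
        for (int v = 0; v < static_cast<int>(adj_.size()); ++v)
            if (index_[v] < 0) dfs(v);
    }

    const std::vector<std::vector<int>>& components() const { return sccs_; }

private:
    void dfs(int v) {
        index_[v] = low_[v] = next_index_++;
        stack_.push_back(v);
        on_stack_[v] = true;
        for (int w : adj_[v]) {
            if (index_[w] < 0) {            // tree edge: recurse
                dfs(w);
                low_[v] = std::min(low_[v], low_[w]);
            } else if (on_stack_[w]) {      // edge back into the current SCC
                low_[v] = std::min(low_[v], index_[w]);
            }
        }
        if (low_[v] == index_[v]) {         // v is the root of an SCC: pop it
            std::vector<int> scc;
            int w;
            do {
                w = stack_.back();
                stack_.pop_back();
                on_stack_[w] = false;
                scc.push_back(w);
            } while (w != v);
            sccs_.push_back(std::move(scc));
        }
    }

    const std::vector<std::vector<int>>& adj_;
    std::vector<int> index_, low_;
    std::vector<char> on_stack_;
    std::vector<int> stack_;
    std::vector<std::vector<int>> sccs_;
    int next_index_ = 0;
};
```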


Abstract:

"After more than 30 years of research, there is a solid understanding of the consistency guarantees given by CPU systems. Unfortunately, the same is not yet true for GPUs. The growing popularity of general purpose GPU programming has been a call for action which industry players like Nvidia and Khronos have answered by formalizing their Ptx and Vulkan consistency models. These models give precise answers to questions about program’s correctness. However, interpreting them still requires a level of expertise that escapes most developers, and the current tool support is insufficient.

To remedy this, we translated and integrated the Ptx and Vulkan models into the Dartagnan verification tool. This makes Dartagnan the first analysis tool for multiple GPU consistency models that can analyze real GPU code. During the validation of the translated models, we discovered two bugs in the original Ptx and Vulkan consistency models."

Dat3M: Memory Model Aware Verification https://github.com/hernanponcedeleon/Dat3M


In case this is of interest, here's an attempt at gathering the references, including random blog posts (as well as articles, documentation, papers, software, and talks), on these topics:

- assembly & ISA (instruction set architecture):

- Arm: https://github.com/MattPD/cpplinks/blob/master/assembly.arm....

- RISC-V: https://github.com/MattPD/cpplinks/blob/master/assembly.risc...

- x86: https://github.com/MattPD/cpplinks/blob/master/assembly.x86....

- debugging: https://github.com/MattPD/cpplinks/blob/master/debugging.md

- debugging: tracing: https://github.com/MattPD/cpplinks/blob/master/debugging.tra...

- executable and object file formats (ELF, Mach-O, PE); debugging data formats (DWARF, PDB): https://github.com/MattPD/cpplinks/blob/master/executables.m...

- linking and loading: https://github.com/MattPD/cpplinks/blob/master/executables.l...

- compilers: https://github.com/MattPD/cpplinks/blob/master/compilers.md

- compilers correctness: https://github.com/MattPD/cpplinks/blob/master/compilers.cor...


freecompilercamp.org has been hijacked.


Removed, thanks!


The Future of Weak Memory (FOWM) 2024 talks: https://www.youtube.com/playlist?list=PLyrlk8Xaylp6u1S3R6gH0...

Abstract: https://popl24.sigplan.org/details/fowm-2024-papers/16/What-...

The C++ memory model was first included, along with thread support, in C++11, and then incrementally updated in later revisions. I plan to summarize what I learned, both as a C++ standards committee member, and more recently as a frequent user of this model, mentioning as many of these as I have time for:

The C++ committee began with a view that higher level synchronization facilities like mutexes and barriers should constitute perhaps 90% of thread synchronization, sequentially consistent atomics, maybe another 9%, and weakly ordered atomics the other 1%. What I’ve observed in C++ code is often very far from that. I see roughly as much atomics as mutex use, in spite of some official encouragement to the contrary. Much of that uses weakly ordered atomics. I see essentially no clever lock-free data structures, along the lines of lock-free linked lists in the code I work with. I do see a lot of atomic flags, counters, fixed-size caches implemented with atomics, and the like. Code bases vary, but I think this is not atypical.

In spite of their frequent use, the pay-off from weakly ordered atomics is decreasing, and is much less than it was in Pentium 4 times. The perceived benefit on most modern mainstream CPUs seems to significantly exceed the actual benefit, though probably not so on GPUs. In my mind this casts a bit of doubt on the need to expose dependency-based ordering, as in the unsuccessful memory_order_consume, to the programmer, in spite of an abundance of use cases. Even memory_order_seq_cst is often not significantly slower. I’ll illustrate with a microbenchmark.

We initially knew way too little about implementability on various architectures. This came back to bite us recently [Lahav et al.] This remains scary in places. Hardware constraints forced us into a change that makes the interaction between acquire/release and seq_cst hard to explain, and far less intuitive than I would like. It seems to be generally believed that this is hard or impossible to avoid with very high levels of concurrency, as with GPUs.

We knew at the start that the out-of-thin-air problem would be an issue. We initially tried to side-step it, which was a worse disaster than the current hand-waving. This has not stopped memory_order_relaxed from being widely used. Practical code seems to work, but it is not provably correct given the C++ spec, and I will argue that the line between this and non-working code will inherently remain too fuzzy for working programmers. [P1217]

Unsurprisingly, programmers very rarely read the memory model in the standard. We learned that commonly compiler writers do not either. The real audience for language memory models mostly consists of researchers who generate instruction mapping tables for particular architectures. The translation from a mathematical model to standardese is both error prone, and largely pointless. We need to find a way to avoid the standardese.

Atomics mappings are part of the platform application binary interface, and need to be standardized. They often include arbitrary conventions that need to be consistently followed by all compilers on a system for all programming languages. Later evolution of these conventions is not always practical. I’ll give a recent RISC-V example of such a problem.
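
To make the "atomic flags, counters" point concrete, here is a small example of that style (not from the talk): a counter bumped with relaxed ordering plus a completion flag published with release/acquire. Replacing every explicit memory_order below with the seq_cst default stays correct; how much the weaker orderings actually save on a given CPU is exactly what a microbenchmark like the one Boehm mentions would measure.

```c++
// Atomic counter + completion flag, using weakly ordered atomics where
// that is sufficient. Everything here is also correct with the defaults.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> items_processed{0};
std::atomic<bool> done{false};

void worker() {
    for (int i = 0; i < 1'000'000; ++i) {
        // No ordering needed between increments: relaxed is enough.
        items_processed.fetch_add(1, std::memory_order_relaxed);
    }
    // Publish: everything above happens-before a matching acquire load.
    done.store(true, std::memory_order_release);
}

int main() {
    std::thread t(worker);
    while (!done.load(std::memory_order_acquire)) {
        // spin (a real program would block instead of busy-waiting)
    }
    std::printf("%ld\n", items_processed.load(std::memory_order_relaxed));
    t.join();
    return 0;
}
```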


I'd recommend starting with the talk by Daniel Marshall (one of the authors) from Lambda Days 2023, "A Hitchhiker's Guide to Linearity", https://www.youtube.com/watch?v=QtlkqJGdnuM

The Granule Project's website, https://granule-project.github.io/, links relevant resources, too, in particular https://granule-project.github.io/granule.html and https://github.com/granule-project/granule/blob/main/example...

The latter tutorial is based on "Quantitative program reasoning with graded modal types" (ICFP 2019), with the talk and paper available at https://dl.acm.org/doi/10.1145/3341714

See also "Linearity and Uniqueness: An Entente Cordiale" (ESOP 2022), https://granule-project.github.io/papers/esop22-paper.pdf


Paper: https://2023.splashcon.org/details/iwaco-2023-papers/5/Borro...

> Hylo is a language for high-level systems programming that promises safety without loss of efficiency. It is based on mutable value semantics, a discipline that emphasizes the independence of values to support local reasoning. The result—in contrast with approaches based on sophisticated aliasing restrictions—is an efficient, expressive language with a simple type system and no need for lifetime annotations.

> Safety guarantees in Hylo programs are verified by an abstract interpreter processing an intermediate representation, Hylo IR, that models lifetime properties with ghost instructions. Further, lifetime constraints are used to eliminate unnecessary memory allocations predictably.

https://www.hylo-lang.org/

https://github.com/Hylo-lang/Hylo


Abstract:

"Compiler research and development has treated computation as the primary driver of performance improvements in C/C++ programs, leaving memory optimizations as a secondary consideration. Developers are currently handed the arduous task of describing both the semantics and layout of their data in memory, either manually or via libraries, prematurely lowering high-level data collections to a low-level view of memory for the compiler. Thus, the compiler can only glean conservative information about the memory in a program, e.g., alias analysis, and is further hampered by heavy memory optimizations. This paper proposes the Memory Object Intermediate Representation (MEMOIR), a language-agnostic SSA form for sequential and associative data collections, objects, and the fields contained therein. At the core of MEMOIR is a decoupling of the memory used to store data from that used to logically organize data. Through its SSA form, MEMOIR compilers can perform element-level analysis on data collections, enabling static analysis on the state of a collection or object at any given program point. To illustrate the power of this analysis, we perform dead element elimination, resulting in a 26.6% speedup on mcf from SPECINT 2017. With the degree of freedom to mutate memory layout, our MEMOIR compiler performs field elision and dead field elimination, reducing peak memory usage of mcf by 20.8%."

Programming Languages Amenable to MEMOIR

"At its core, MEMOIR proposes collections as value types. In this paper, we implement a library in C/C++ to provide this functionality, however many languages exist which provide the guarantees needed for a MEMOIR compiler. Languages with mutable value semantics [48], which degrades references to second-class citizens, are amenable to SSA construction, as they are analogous to our MUT library. Such languages include Swift’s struct types [69] and Hylo [70].

Languages with single-ownership, i.e., “borrowing”, which guarantee that only one mutable reference will exist at a time can be used to construct a MEMOIR program. An example of this is Rust [71], which is steadily entering the programming zeitgeist. Similarly to Rust, newer languages such as Mojo [72] and Vale [73] have similar ownership models. Of note, use ϕ’s cannot be constructed for these languages, as multiple immutable references may exist at once. While the aforementioned languages are promising directions of future work, the lack of accepted benchmark suites implemented in them, unlike C/C++, was deemed too large a barrier to adoption in our research at present.

Collection-oriented languages [74–77] have existed for many years now. APL [74] and SETL [77] serve as prime examples of their philosophy, focusing on arrays and sets, respectively, as prime concepts of the language. As such they are interesting source languages for compilation and, furthermore, their implementations provide a wealth of resources on optimizing collection-oriented programs [78]. A recent example of these concepts being exploited outside of their original languages is parallel block-delayed sequences [79], which implements loop-fusion techniques on sequences as a library for Parallel ML and C++. Investigating the extent these optimizations could be performed statically with MEMOIR provides an interesting starting point for this line of research."
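
To illustrate what "collections as value types" buys the compiler, here is a plain C++ by-value example (just the general idea, not the MEMOIR MUT library's actual API): the callee receives its own value, so its mutation can never alias the caller's collection, which is the kind of independence that makes element-level static reasoning tractable.

```c++
// Illustration of collections as values (plain C++, not the MEMOIR library):
// `with_sentinel` gets its own copy, so its mutation cannot disturb `data`.
#include <cstdio>
#include <vector>

std::vector<int> with_sentinel(std::vector<int> xs) {  // pass by value
    xs.push_back(-1);
    return xs;                                         // return the new value
}

int main() {
    std::vector<int> data = {1, 2, 3};
    std::vector<int> extended = with_sentinel(data);
    std::printf("%zu %zu\n", data.size(), extended.size());  // prints: 3 4
    return 0;
}
```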


Likely worth mentioning that Andy (the author) has been organizing an ML⇄DB Seminar Series (Machine Learning for Databases + Databases for Machine Learning) for the past few months (Fall 2023); materials & lectures: https://db.cs.cmu.edu/seminar2023/, https://www.youtube.com/playlist?list=PLSE8ODhjZXjYVdJKka5g3...

