Probably drawing an analogy to how causal pretrained models go through stages of understanding language: words -> grammar -> meaning. Gwern mentions this experience when training character-level RNNs. https://gwern.net/scaling-hypothesis#why-does-pretraining-wo...
Totally. Also basic temporal scale or cyclic properties. It’s kind of mind blowing that the shape of most recorded human patterns is reducible in this way.
Watching geohot code a general matrix multiply algorithm from 0.9 GFLOPS and optimise it to 100 GFLOPS by only tinkering with cache locality makes me wonder how much effort should be put into single-threaded performance before ever thinking about multithreading.
I've seen stuff like that before with a game called Factorio, the only game I've ever seen that is optimized so hard that your RAM speed can affect large bases rather quickly, same with faster L2 cache. Their entire blog series[1] covers a large part of how they did this, but for a game written mostly in Lua they sure did a good job on it.
Yes, the Factorio devs had an approach where they optimised everything happening in the game in the original single-threaded environment before moving on to multithreaded support. That's where the game is now, and as far as I understand it, the multithreading operates on each independent set of conveyor belts or belt lanes. There's some info on that in this blog post[0] for anyone interested.
It makes sense that a simulation game like Factorio would be memory-bandwidth limited: each tick it needs to update the state of a large number of entities using relatively simple operations. The main trick to making it fast is reducing the amount of data you need to update each tick and arranging that data in memory so you can load it fast (i.e. predictably and sequentially, so the prefetcher can do its job) and only need to load and modify it once per tick (at least in terms of loading and eviction from cache). The complexity is in how best to do that, especially both at once.

For example, in the blog post they use linked lists for active entities. This makes sense from the point of view of not loading data you don't need to process, but it limits how fast you can load data, because you can only prefetch one item ahead (compared to an array, where you can prefetch as far ahead as your buffers will allow).
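Roughly the trade-off being described, as a toy C sketch (not Factorio's actual code):

    /* Iterating a linked list of active entities vs. a dense array of them.
       The array walk has a predictable address pattern, so the hardware
       prefetcher can stream it; the list walk can only start fetching a node
       once the previous node's `next` pointer has arrived from memory. */
    #include <stddef.h>

    struct entity { int state; struct entity *next; };

    /* Pointer chasing: each load depends on the previous one. */
    void update_list(struct entity *head) {
        for (struct entity *e = head; e != NULL; e = e->next)
            e->state += 1;
    }

    /* Sequential access: the address of element i+1 is known in advance,
       so the prefetcher can run well ahead of the loop. */
    void update_array(struct entity *entities, size_t n) {
        for (size_t i = 0; i < n; i++)
            entities[i].state += 1;
    }

The list still wins when most entities are idle and can be skipped entirely, which is exactly the tension described above.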
Note: since this post, they've multithreaded a fair bit of the simulation as well. It runs insanely well; the entire game is a marvel of software engineering in my book.
Fintech has mostly determined that 1 thread can get the job done. See LMAX disruptor and related ideas.
What problems exist that generate events or commands faster than 500 million per second? This is potentially the upper bar for 1 thread if you are clever enough.
Latency is the real thing you want to get away from. Adding more than one CPU into the mix screws up the hottest possible path by ~2 orders of magnitude. God forbid you have to wait on the GPU or network. If you have to talk to those targets, it had better be worth the trip.
> What problems exist that generate events or commands faster than 500 million per second?
AAA games, Google search, weather simulation, etc.? I mean, it depends on what level of granularity you're talking about, but many problems have a great deal going on under the hood and need to be multithreaded.
I would add a qualifier of "serializable" to those events or commands. This is the crux of why fintech goes this path. Every order affects subsequent orders and you must deal with everything in the exact sequence received.
The cases you noted are great examples of things that do justify going across the PCIe bus or to another datacenter.
That's half of it; the other half is that each of those events is extremely simple, so the amount of computation is viable on a single thread.
If individual threads were dramatically slower, the architecture would get unpleasant by necessity. Consider the abomination that is out-of-order execution on a modern CPU.
Sorry, I should have been more specific. I meant DAWs and computer music environments, not simple audio players. Modern DAWs are heavily multi-threaded.
I am curious about that. 500 million events per second sounds high, even for games. That many calculations? Sure. But I take "events" to mean user-generated, and that sounds high.
Same for searches. The difficulty there is the size of the search space, not the rate of searches coming in, right?
Google search isn't a good example. AAA games are a great example when you think about graphics. However, most of that is trivially parallelizable, so "all you need to do" is assign vertices/pixels to different threads (in quotation marks because that's of course not trivial by itself, but it's a different kind of engineering problem).
However, once you get into simulations you have billions (or multiple orders of magnitude more) of elements interacting with each other. When you simulate a wave, every element depends on its neighbors, and the finer the granularity, the more accurate your simulation (in theory at least).
Thinking of graphics, though, I would assume most of that is on the GPU side. Simulations do make sense, but I see games like Factorio still focusing on single-threaded performance first, and then looking for naturally parallel segments.
That is all to say that millions of events still feels like a lot. I am not shocked to know it can and does happen.
There are no good solutions for something like Factorio. There are solutions that work, but they aren't worth the trouble. My personal recommendation is to split the world into independent chunks. A big interconnected Factorio map is a nightmare scenario because there is hardly anywhere you can neatly split things up. Just one conveyor belt across the cut and you lose. In other words, parallelize disconnected subgraphs.
So the game would have to be programmed so that conveyor belts and train tracks can be placed at region boundaries, with a hidden buffer to teleport things between regions. Then you need an algorithm that divides the graph to minimize both the imbalance in node counts between subgraphs and the number of edges crossing between them.
Just dreaming over here, but if someone had the opportunity to rebuild Factorio from the ground up, I bet they could design something massively scalable. Something based on cellular automata, like how water flow works in Minecraft: current cell state = f(prev cell state, neighboring cell states).
It would take some careful work to ensure that items didn't get duplicated or lost at junctions, and a back-pressure system for conveyor belt queues. Electrical signals would be transmitted at some speed limit.
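For what it's worth, the core update loop of such a design is tiny. A sketch of a generic double-buffered tick (nothing Factorio-specific, and the rule function is a made-up placeholder):

    #include <stddef.h>

    #define W 512
    #define H 512

    typedef int cell;

    /* Placeholder local rule: the next state depends only on a cell and its
       four neighbours (think items/pressure diffusing outwards). */
    static cell rule(cell self, cell n, cell s, cell e, cell w) {
        return (self + n + s + e + w) / 5;
    }

    /* Double-buffered tick: reads cur, writes next. Because every output
       cell depends only on the previous generation, rows (or chunks) can be
       updated by different threads with no locking. */
    void tick(cell cur[H][W], cell next[H][W]) {
        for (size_t y = 1; y + 1 < H; y++)        /* skip the border */
            for (size_t x = 1; x + 1 < W; x++)
                next[y][x] = rule(cur[y][x],
                                  cur[y - 1][x], cur[y + 1][x],
                                  cur[y][x + 1], cur[y][x - 1]);
    }

The hard part is what you said: expressing belts, junctions and back-pressure as purely local rules without duplicating or losing items.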
This is a wrong view of the problem. Oftentimes your application has to be distributed for reasons other than speed: there are only so many PCIe devices you can connect to a single CPU, only so many CPU sockets you can put on a single PCB, and so on.
In large systems, parallel/concurrent applications are the baseline. If you have to replicate your data as it's being generated into a geographically separate location, there's no way you can do it in a single thread...
As far as I know, the LMAX Disruptor is a kind of queue/buffer used to send data from one thread/task to another.
Typically, some of the tasks run on different cores. The Disruptor is designed so that there is no huge delay due to cache coherency: it is slow to sync the cache of one core with the cache of another when both cores write to the same address in RAM, so the Disruptor is arranged such that each memory location is (mostly) written to by at most one thread/core.
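For anyone curious, the single-writer idea can be sketched in a few lines. A minimal single-producer/single-consumer ring buffer in C11 (just the principle; the real Disruptor is a Java library and considerably more general):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 1024  /* power of two */

    struct ring {
        uint64_t slots[RING_SIZE];
        /* A real implementation would also pad these onto separate cache
           lines to avoid false sharing. */
        _Atomic uint64_t head;  /* written only by the producer */
        _Atomic uint64_t tail;  /* written only by the consumer */
    };

    bool ring_push(struct ring *r, uint64_t value) {
        uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                             /* full */
        r->slots[head % RING_SIZE] = value;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    bool ring_pop(struct ring *r, uint64_t *out) {
        uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head == tail)
            return false;                             /* empty */
        *out = r->slots[tail % RING_SIZE];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

Each counter and each slot has exactly one writer, so cores never fight over write ownership of a cache line; the Disruptor builds on the same property with per-consumer sequence counters.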
How is the LMAX Disruptor relevant for programs with 1 core?
> How is the LMAX Disruptor relevant for programs with 1 core?
It is not relevant outside the problem area of needing to communicate between threads. The #1 case I use it for is MPSC (multi-producer, single-consumer), where I have something like an AspNetCore/TCP frontend and a custom database / event processor / rules engine that it needs to talk to.
It's quite rare for problems to be dominated by hot loops in the same way that matrix multiplication is.
Think about something like speeding up a compiler or a web server or a spreadsheet. There's no 50-line function that you can spend a few hours optimising and speed up the whole thing.
That's part of the reason why Python programs (except maths-heavy stuff like ML) tend to be so slow despite everyone saying "just write your hot code in C". You can't, because there is no hot code.
This advice dates back to when Python was primarily used by scientists to drive simulations. I remember hearing it in the early 2000s as a physics student using vpython to drive N-body simulations. They told us to do the physics in C but everything else in Python, because the simulation math took too long in raw Python; we couldn't make the visualization run at a reasonable frame rate without those optimizations.
These days Python is being used for everything, even things without hot loops as you note. Yet the advice persists.
That appears to be one of the major design goals of the Mojo programming language. It allows the developer to code at a high level of abstraction and indicate where parallel execution should be used; the execution environment then automatically optimizes that at runtime based on the actual hardware available. That hardware may change drastically over the full lifecycle of the application code, so in the future it will automatically take advantage of hardware advances such as the introduction of new types of specialized co-processors.
Cache locality is certainly an important aspect, and especially when it comes to matrix multiplication, even just changing the loop order (and thus the access pattern) has a huge performance impact. However, an O(n^3) algorithm will always lose out to an O(n^2*log(n)) algorithm when the input gets big enough. And the difference between these two might be as simple as sorting your input data first.
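On the loop-order point, a rough illustration (row-major arrays, untuned; the two functions do the same O(n^3) work, but the second walks B and C contiguously and is usually several times faster purely from cache behaviour):

    #include <stddef.h>

    void matmul_ijk(size_t n, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];  /* strided walk of B */
                C[i * n + j] = sum;
            }
    }

    void matmul_ikj(size_t n, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < n * n; i++)
            C[i] = 0.0;
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++) {
                double a = A[i * n + k];
                for (size_t j = 0; j < n; j++)
                    C[i * n + j] += a * B[k * n + j];    /* sequential walk of B and C */
            }
    }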
I teach parallel programming to graduate students, and the first exercise we give them is a sequential optimization for exactly that reason: think about whether your algorithm is efficient before taking on all the challenges that come with parallelization.
> [re: matrix multiplication] However, an O(n^3) algorithm will always lose out to an O(n^2*log(n)) algorithm, when the input gets big enough.
You have to be very careful about what 'big enough' means. In practice, Strassen multiplication is not faster than the naive algorithm until you get to the point where you're multiplying matrices with hundreds of rows/columns. Additionally, naive matrix multiplication is well suited to GPUs, while Strassen multiplication on the GPU requires temporary buffers and multiple jobs and sequencing and whatnot.
As a general rule, matrix multiplication algorithms with better-than-naive complexity should probably not be used. Do naive matrix multiplication on the CPU. If you need it to be faster, do naive matrix multiplication on the GPU. If you need it to be faster still, the numerical stability of your problem has probably already come a gutser and will only get worse if you switch to Strassen or any of the other asymptotically faster algorithms.
And the algorithms faster than Strassen? Forget about it. After Strassen multiplication was invented, a dozen or so other algorithms came along, slowly reducing that O(n^2.8) to about O(n^2.37188) (most recently in 2022; this is still an area of active research). The problem is that for any of these algorithms to be faster than Strassen, you need matrices larger than what you can keep in memory. There is no big-enough input that will fit in the RAM of a modern computer. One estimate I've heard is that if you convert every atom in the observable universe into one bit of RAM, and you use that RAM to multiply two 10^38 by 10^38 matrices to get a third 10^38 by 10^38 matrix, you're still better off using the O(n^2.8) Strassen multiplication instead of the state-of-the-art O(n^2.37188) algorithm. The constant slowdowns in the other algorithms really are that bad.
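The back-of-the-envelope version of that argument: the fast algorithms hide a constant factor c relative to Strassen, so they only win once

    c * n^2.372 < n^2.807   =>   n > c^(1 / 0.435)

With a purely hypothetical c of around 10^17, for instance, the crossover is already at n on the order of 10^39, i.e. matrices that no physical RAM can hold.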
>However, an O(n^3) algorithm will always lose out to an O(n^2*log(n)) algorithm, when the input gets big enough.
Sure, this might be the case from a theoretical point of view (by definition), but it completely disregards the hidden constants that come to light when actually implementing an algorithm. There's a reason why, for instance, a state-of-the-art matrix multiplication algorithm [0] can be completely useless in practice: the input data will never become large enough to amortize the introduced overhead.
I think it's because you want to be able to predict the next token using anywhere from 1 token up to the whole context window (and any size in between). So you end up getting n different losses for each text snippet (where n is the size of the context window).
If I'm wrong, can someone correct me here? It would be useful to know.
Why would you train the model on a shorter context than you can provide? Why not provide all the context you have? Sure, the model has to learn to handle short contexts, but that occurs naturally at the beginning of a document.
Anyway, this still involves only left-side masking. Why mask future tokens when a sliding window can do that (without wasting a single token of context)?
> There is more to collections than lists. You can have instances of empty collections, some of which have literal support ([], {}, and ()). Thus there can be no sentinel empty collection value.
Probably my favorite feature over Common Lisp. Combined with the comma (,) being treated as whitespace (i.e. use commas if you want to, but you don't have to), it makes typing out data structures in Clojure a much more enjoyable process than in other languages.
This ease of typing out data structures also extends to edn (Clojure's version of JSON).
This isn’t really a thing, though: Common Lisp has built-in literal syntax for n-dimensional arrays and structs and libraries like fset provide literal syntax for Clojure-style immutable data structures.
I've used fset and cl21. The problem is that once you start using those, you're no longer able to use libraries. Data structures have to be built into the core.
The Common Lisp community has rejected reader macros for interop reasons, the last time I checked (in the case of cl21).
Even the docs for fset don't show the use of data literals; they do things like (set) and (isetq s2 (set 'c 1 "foo" 'a 'c)).
This pretty much proves my point: it's not fun typing (set 'c 1 "foo" 'a 'c) instead of #{'c 1 "foo" 'a 'c}. It's also not easy to put those on the wire for communication between Common Lisp processes, which breaks the whole point of homoiconicity.
> once you start using those you no longer are able to use libraries
This isn't true, provided the libraries you use are designed around generic functions for their internals rather than normal functions. I've written code that uses custom data structures heavily, with little or no impact on which libraries I could use.
> The common lisp community has rejected reader macros
cl21 is just confusing matters here. Plenty of code uses named-readtables to use arbitrary reader macros.
I don't understand the "interop reasons" you're talking about: as long as you use named-readtables, interop basically Just Works.
> It's also not possible to put those on the wire (easily) for communication between common lisp processes. Breaking the whole point of homoiconicity.
It looks like you're right for the most part. Literals are supported if you use https://github.com/vseloved/rutils. If your cl-edn actually works, then a combo of rutils and cl-edn could bring most of the value I found in Clojure to Common Lisp.
Immutability by default is mostly overrated: I've written code in many languages (Haskell, Clojure, JS, CL, etc.), and I've found I care more about whether a library (Immer, Ramda, etc.) provides referentially transparent APIs than about whether the language's data structures come with immutability guarantees.
But, additionally, my point was that CL has literal representations for arbitrary data types, not that you can get immutable data structures from a library.