If you'd like to read something that is less detailed, I4R (Institute for Replication) posted threads on X (formerly known as Twitter) and Bluesky about their investigation into this paper.
It seems crazy to me that vulnerabilities like this still exist today. I understand that not everything will be written in Rust, but this could have been caught by basic testing and/or fuzzing, as Nick (the person who found the vulnerability) points out: https://www.openwall.com/lists/musl/2025/02/14/1. So far, this has earned a CVSS score of 8.1: https://nvd.nist.gov/vuln/detail/CVE-2025-26519, although I'm not sure if that's just a function of it being a bug in a library that so many applications use.
This also goes to show how every line matters; the fix was a one-line change.
To explain the bug: EUC-KR is Extended Unix Code - Korean. Characters outside the ASCII range (0-127) are encoded as multiple bytes. The logic adjusts to look up from a base-0 index, then checks that both bytes (variables c and d) are in the normalized range. If not, the code adjusts the characters and checks the bounds again. This second check is what contained the bug.
The upper bound of this check should be 0xc6 - 0x81 (i.e. 69), not 93. The irony is that the second part of the check gets this right. I wonder what a `git blame` would reveal about this particular line.
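For concreteness, here is a small, self-contained illustration of the bound mismatch (based purely on the description above; this is not musl's iconv.c, and it ignores the trail-byte part of the check for brevity):

    #include <stdio.h>

    int main(void)
    {
        /* The extended lookup table only has rows for lead-byte offsets
           0 .. 0xc6-0x81, i.e. 0 .. 69. */
        const unsigned last_valid_row = 0xc6 - 0x81;         /* 69 */

        for (unsigned c = 0; c < 100; c++) {
            int passes_buggy_bound = (c < 93);                /* what the check allowed   */
            int has_table_entry    = (c <= last_valid_row);   /* what the table can serve */
            if (passes_buggy_bound && !has_table_entry)
                printf("row %u passes the check but has no table entry\n", c);
        }
        /* Prints rows 70..92: the window behind the out-of-bounds access. */
        return 0;
    }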
Fun to see someone decide to just write a faster JSON parser because they believed it was possible. The author mentioned last year that libjansson went through an intermediate layer to talk to Emacs, which led to too many memory allocations: https://lists.gnu.org/archive/html/emacs-devel/2024-03/msg00...
Thanks for linking to that thread. It was fun to peruse it, reading the feedback and seeing how things went. Huge kudos to Géza Herman for getting this in, and similar kudos to the team for the review! Genuinely pleasant to read how that whole process went.
Very interesting. I wonder if this is the result of some "swiss cheese" effect due to constraints around UEFI and NVRAM themselves when updating EFI variables.
NVRAM must maintain atomicity of memory transactions across power failures; its whole purpose is to store data when you turn your computer off. As a result, when deleting an EFI variable, you can't manage individual bytes - you have to delete a whole entry (which can be rather large, based on the EFI specification and the code used for edk2, e.g. https://github.com/tianocore/edk2/blob/83a86f465ccb1a792f5c5...). Deleting these entries might become a problem when you start running up against memory constraints and which slots in memory are actually available; hence a possible fragmentation issue.
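As a rough sketch of why deletion is all-or-nothing (hypothetical layout and constants, not the real edk2 structures; see the linked header for the actual definitions): variables are packed back to back as header + name + data, "deleting" one just flips a state flag, and the space only comes back when a reclaim/garbage-collection pass rewrites the store.

    #include <stdint.h>

    /* Hypothetical on-flash entry layout, NOT the real edk2 VARIABLE_HEADER. */
    struct var_entry {
        uint16_t start_id;   /* marks the start of an entry                 */
        uint8_t  state;      /* added / in-deleted-transition / deleted     */
        uint8_t  reserved;
        uint32_t name_size;  /* bytes of the variable name that follow      */
        uint32_t data_size;  /* bytes of payload that follow the name       */
        /* name[] and data[] are stored immediately after this header */
    };

    #define VAR_DELETED_BIT 0x02u   /* illustrative value only */

    /* "Deleting" clears a bit in the state byte (flash can clear bits without
       an erase); the entry's bytes stay where they are until a reclaim pass
       copies the still-live entries into a freshly erased region. */
    void delete_entry(struct var_entry *e)
    {
        e->state &= (uint8_t)~VAR_DELETED_BIT;
    }

    /* Entries are variable-length and packed back to back, which is why you
       cannot free individual bytes in the middle of the store. */
    struct var_entry *next_entry(struct var_entry *e)
    {
        uint8_t *p = (uint8_t *)e;
        return (struct var_entry *)(p + sizeof *e + e->name_size + e->data_size);
    }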
Additionally, I appreciated how short and specific this blog post was. I enjoy this style of post, where someone encounters a problem and solves it.
There's also the fact that an NVRAM variable is never overwritten in place; the new value is written elsewhere, and the pointer is updated to use the address of the new value. This is probably mainly for wear leveling, but I guess it could also introduce fragmentation?
This is just an observation from when I was debugging a board that self-destructed when booting a particular EFI file (I had to dig into the flash contents to figure out why), but I think this particular code was taken straight from tianocore.
Probably for atomicity. It's likely that only a pointer-sized block can be updated atomically, so in order to safely update a value that may be larger, you write it somewhere else and atomically update the pointer. That way you can only ever observe the old or the new value, and never some intermediate result if power was lost partway through writing the new value. The same techniques are used in journaling file systems.
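A minimal sketch of that pattern in C11 (generic, not tied to any particular firmware): the new value is written out fully first, and only then does a single atomic pointer store publish it, so an interrupted update leaves the old value intact.

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    struct value { size_t len; unsigned char bytes[256]; };

    /* The current value is only reachable through this one pointer. */
    static _Atomic(struct value *) current;

    /* Write-elsewhere, then swing the pointer. If power is lost before the
       atomic store, 'current' still points at the complete old value. */
    void update(const unsigned char *data, size_t len)
    {
        struct value *nv = malloc(sizeof *nv);
        if (!nv || len > sizeof nv->bytes) { free(nv); return; }
        nv->len = len;
        memcpy(nv->bytes, data, len);                        /* 1. write new copy elsewhere */
        struct value *old = atomic_exchange(&current, nv);   /* 2. publish atomically       */
        free(old);   /* 3. in NVRAM terms, mark the old entry stale for later reclamation   */
    }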
I'm curious how the NVRAM is actually stored. In embedded systems on microcontrollers, the constraints of the HW make you do what we called "Emulated EEPROM". You are able to erase data in increments of, for example, 4 KB and write it in increments of 16 bytes (though it varies with the implementation). So on a write, you just say "this is block foo and it stores value bar" and append that record after the latest written data in the block. When you recover data, you look for the latest valid value of foo and say "the value of foo is bar". You might have multiple instances of foo written, but only the latest one is valid. Once the active block is full, you swap all the current values of all the NvM blocks over to the next erasable HW block.
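A toy version of that scheme (record format, sizes, and names all made up for illustration) could look like this: writes append a fresh (id, value) record to the active block, and reads scan for the newest record with that id.

    #include <stdint.h>
    #include <string.h>

    /* One erasable flash block, emulated in RAM for the sketch.
       A real implementation would also store a CRC per record to detect torn writes. */
    #define BLOCK_SIZE 4096u
    enum { REC_DATA_MAX = 12 };

    struct record { uint16_t id; uint16_t len; uint8_t data[REC_DATA_MAX]; };

    static uint8_t  block[BLOCK_SIZE];
    static uint32_t write_off;                  /* next free offset in the active block */

    /* Write = append a new copy of (id, value); older copies stay behind. */
    int nvm_write(uint16_t id, const void *val, uint16_t len)
    {
        if (len > REC_DATA_MAX || write_off + sizeof(struct record) > BLOCK_SIZE)
            return -1;                          /* block full: time to swap and erase */
        struct record r = { .id = id, .len = len };
        memcpy(r.data, val, len);
        memcpy(block + write_off, &r, sizeof r);
        write_off += sizeof r;
        return 0;
    }

    /* Read = scan the block and keep the last (newest) copy of the id. */
    int nvm_read(uint16_t id, void *out, uint16_t len)
    {
        int found = -1;
        for (uint32_t off = 0; off + sizeof(struct record) <= write_off;
             off += sizeof(struct record)) {
            struct record r;
            memcpy(&r, block + off, sizeof r);
            if (r.id == id && r.len == len) {
                memcpy(out, r.data, len);
                found = 0;                      /* keep scanning: later copies win */
            }
        }
        return found;
    }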
Yes, this achieves atomicity, and yes, this gets you wear leveling (with the caveat that the more data you store, the worse the lifetime gets, because you need to do more swaps), but it is also a consequence of HW constraints, and the approach flows directly from them. It might be the result of HW/SW co-design at some point in the past as well, but I have no idea whether that is true.
This information is based on my experience in automotive.
Automotive has done 'roll your own' flash handling since almost forever...
I have a Toyota whose odometer resets back to 299,995 miles on every power cycle, because whoever designed the wear levelling didn't plan that far ahead...
True, I was trying to find the variable storage requirements in the UEFI specification but couldn't (is it Section 3? 8?), so I resorted to linking to the struct definition in the EFI shell package that the author used.
I imagine it's not fragmentation in the strictest sense; more than likely it's just the result of a bug. Perhaps garbage collection wasn't being triggered and space wasn't getting freed up. It could also be that the author caused the problem themselves through how they were managing their NVRAM.
While this didn't get much attention on History Stack Exchange, see ConeOfArc's YouTube video (which has 963k views as of today): https://www.youtube.com/watch?v=HaRO_dTqO1E
See also ConeOfArc's video from a month later, https://www.youtube.com/watch?v=RO58B6LcTfM (1M views). The video above is about the initial search and the problem itself; this one came after many Internet strangers worked together to solve it.
"The more skilled you are, the less you need advanced tools..."
I wonder how well this claim holds up under strict scrutiny. At a high level, it makes sense: someone who is a master of their craft will understand their tools and how to use them much better than a beginner starting out. However, I think this is a different claim than the one Paolini makes.
One concrete argument against this is the history of technological progress. Each improvement brought a step change in the tools available to builders, creators, etc., and those able to master them reached a level of ability and progress not seen before.
There is evidence for this everywhere. In sports, athletes today use modern technologies and knowledge to push their bodies and abilities to new levels (low-oxygen training, equipment specialized for each sport, etc.). If you don't upgrade your golf clubs from wood to titanium, you are going to be left far behind. The same applies in other industries: a specific example is welding, where using a multi-process welder will significantly improve both your productivity and the quality of your work. [1]
In tech, this distinction is also apparent. Consider the dual cases of a programmer who is completely ineffective even with a modern IDE versus Jeff Dean (or insert your favorite programming legend here) with Emacs (or another plain text editor). However, this does not mean that a great programmer does not benefit from advanced tools. Great toolchains (compilers, linkers, interpreters, debuggers, etc.), advanced computing capabilities (GPUs for LLMs), and high-speed Internet access will dramatically change the quality of your work.
I think Paolini's point is more that a master craftsman, when compared with a beginner, can still handily beat the beginner while using simple tools, because they understand both the tools and what they are trying to accomplish far better. I agree with this point, but I think this relative comparison is distinct from the absolute comparison of a master using simple tools vs. the same master using advanced tools.
I appreciate his larger point that you want high-quality (even if simple) tools that you understand and that will last you a lifetime. This is one of the main benefits of mastering free/open source software [2]: freedom 0 of free software is the freedom to run a program as you wish, for any purpose, which is not guaranteed when using proprietary software.
Is the main idea to convert the model implementation from Python into C, then hardcode all possible values? Do you do this yourself in the generator code, or could you let the C preprocessor/compiler handle something like this by using macros? (might help with compile time/memory)
"NOTE: Ensure the device you are running on has no form of hardware acceleration like GPU or the results will be skewed"
How much does adding a GPU affect your performance gains? I understand that the point of this optimization is CPU-only machines, but it would be interesting to see the effect your optimizations have when running on GPUs as well.
We let the generator code hardcode the weights into the generated source.
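For illustration, the generated source is along these lines (hypothetical layer name, shapes, and numbers, not this project's actual output): the weights become const arrays with all sizes known at compile time, so there is no file loading or allocation for them at runtime.

    #include <stddef.h>

    /* Hypothetical generator output: weights baked into the source as constant arrays. */
    static const float layer0_weights[4][3] = {
        {  0.12f, -0.48f,  0.91f },
        { -0.07f,  0.33f,  0.25f },
        {  0.58f, -0.14f, -0.66f },
        {  0.41f,  0.09f,  0.77f },
    };
    static const float layer0_bias[4] = { 0.01f, -0.02f, 0.03f, 0.00f };

    /* Dense layer followed by ReLU, with loop bounds fixed at compile time
       so the compiler is free to unroll and vectorize. */
    void layer0_forward(const float in[3], float out[4])
    {
        for (size_t i = 0; i < 4; i++) {
            float acc = layer0_bias[i];
            for (size_t j = 0; j < 3; j++)
                acc += layer0_weights[i][j] * in[j];
            out[i] = acc > 0.0f ? acc : 0.0f;   /* ReLU */
        }
    }

Presumably, whether the constants are emitted directly by the generator or wrapped in preprocessor macros, the compiler ends up seeing the same constant data; emitting them directly just keeps the generated C simple.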
A GPU significantly affects performance, by as much as 20X.
This project is only intended for cases where a GPU is not available, or not desired due to cost or other constraints.
Fil-C adds capabilities and garbage collection, though it mentions performance costs of 1.5x - 5x. I would think the capabilities aspect of this language extension would be better addressed by hardware proposals (like CHERI), to avoid adding too much overhead to performance-critical systems.
The comments on this fast range reduction post are interesting; I recommend reading them (they're much longer than the post itself). Some highlights, with a minimal sketch of the trick itself after them:
Versions of this existed as early as 1997, in CMU's Common Lisp implementation. Their GitLab is not loading quickly on my computer, but the commit is supposed to be from December, by Douglas Crosher (https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-...).
The "Further reading" section (and comments) include implementations of the trick in the wild. Stockfish, the chess engine, used this optimization to reduce memory for storing precomputations of moves in their transposition tables (https://github.com/official-stockfish/Stockfish/commit/2198c...). The commit switches from using bitwise AND (&) and power-of-2 table sizes to using the modulo hash trick, which allows for full use of RAM (even if it's not an exact power of 2). It's also used in Hashtron's modular hash: https://github.com/neurlang/classifier/blob/master/hash/hash..., a straightforward application of the optimization to avoid using modulo.
A write-up on GitHub with more information: https://github.com/sipa/writeups/tree/main/uniform-range-ext.... It walks through a couple of intuitive explanations of why this technique keeps the output distribution as uniform as possible, in contrast to the formal proof in the post.
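For reference, the trick all of these links are about fits in a couple of lines (a minimal sketch, not code from any of the projects above): reduce a 32-bit hash x into [0, n) with a widening multiply and a shift instead of a modulo.

    #include <stdint.h>
    #include <stdio.h>

    /* Map x (uniform over [0, 2^32)) to [0, n) by computing (x * n) / 2^32
       with a 64-bit multiply and a shift; no division or modulo needed,
       and n does not have to be a power of two. */
    static uint32_t fast_range(uint32_t x, uint32_t n)
    {
        return (uint32_t)(((uint64_t)x * (uint64_t)n) >> 32);
    }

    int main(void)
    {
        const uint32_t table_size = 1000;                       /* not a power of two  */
        printf("%u\n", fast_range(0x9e3779b9u, table_size));    /* bucket index < 1000 */
        return 0;
    }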
It's somewhat funny to hear "minimum viable product" in the context of a language standard/specification. I'd never thought of adding a feature to a language in that way before.
The idea of design by contract (DbC) in C++ appears to have been around for a while too. The authors link to a "forthcoming companion paper" (currently a 404 error), but you can find proposals from as early as 2004 with the idea: https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2004/n16...
(and potentially earlier; I just haven't seen one yet)
This is the MVP in the sense that it is the subset of all the various contract proposals that stands a chance of being approved, since the feature has been extremely contentious, with irreconcilable positions on some details.
X (Twitter): https://x.com/I4Replication/status/1893842119594258913
Bluesky: https://bsky.app/profile/i4replication.bsky.social/post/3liv...