An electron is just an excitation of the electron field, a vast probabilistic something that's ever-present everywhere in the universe, with a different complex-number amplitude (not observable directly) at each and every point. Sometimes the vagaries of its ever-changing amplitudes cause it to locally coalesce into something that looks like an individual electron if you squint at it in the right manner, and that vision persists until a strong enough interaction with another vast and nebulous field shatters the illusion, at least in the tiny corner of the probabilistic multiverse your local version of your consciousness brazenly imagines to be the objective reality. What kind of existence is that?
An existence that mistakenly took fundamental reality to consist of point particles instead of fields. But we still found out about fields and quantum mechanics.
We don't know that fields are real. We just know that the mathematical language of fields is the most convenient and concise to describe the experimental results we see.
As a starting point, the magnetic field can be shown to be real with metal filings. Whether fields are the fundamental building block is another matter.
The vast majority come from PG. When they don't it's another public domain transcription source, like Wikisource, Faded Page, or Project Gutenberg Australia. We usually don't create our own transcriptions.
Our producers can and do contribute back upstream! It's up to the individual producer.
One thing the OP didn't mention that I really liked was runtime checks on array/slice access and integer under/overflow. Because dealing with heap allocation is a bit of a hassle, I was incentivized to use static buffers a lot. I quickly figured out that I didn't have to worry about their sizes much, because if they're overrun by unexpectedly large input or other behavior in my algorithms, I get a nice runtime error with the right line indicated, rather than corrupt memory or a crash. Same thing with choosing which integer type to use: it's not a problem if I made the wrong choice, I'll get a nice error message and fix it easily. This made for a lot of peace of mind during coding. Obviously in a real production system I'd be more careful and use dynamic sizes appropriately, but for one-off programs like these it was excellent.
Overall, I really enjoyed using Zig while starting out at AoC problem 1 with zero knowledge of the language. To my mind, it's "C with as much convenience as could be wrung out of it w/o betraying the low-level core behavior". That is, no code execution hidden behind constructors or overloads, no garbage collection, straight imperative code, but with so much done right (type system, generics, errors, optionals, slices) that it feels much more pleasant and incomparably safer than C.
(you can still get a segmentation fault, and I did a few times - by erroneously holding on to pointers inside a container while it resized. Still, incomparably safer)
> (you can still get a segmentation fault, and I did a few times - by erroneously holding on to pointers inside a container while it resized. Still, incomparably safer)
This is a severe problem, and I predict that this is going to cause real security issues that will hurt real people if Zig gets used in production before it gets production-ready memory safety. This exact pattern (pointers into a container that resized, invalidating those pointers) has caused zero-days exploited in the wild in browsers.
> This is a severe problem, and I predict that this is going to cause real security issues
That is a nasty problem, particularly in larger projects with different subsystems interacting (like, say, an XML parser and another subsystem).
I suspect it's worse in some ways, as Zig has good marketing as a "safer" language despite still having the same fundamental memory flaws as C/C++. In the worst case that could lull programmers into complacency. I mean, it looks "modern" so it's safe, right? Just do some testing and it's all good.
Currently I'm skeptical Zig will get production-ready memory safety. Right now the only known approaches are GCs or linear/affine types, and Zig doesn't appear to be pursuing either. Aliased pointers aren't something that's properly handled by ad hoc testing IMHO.
FWIW, "safe" doesn't appear anywhere on the Zig homepage. I've been trying out Zig for the past couple weeks, and while I love it so far, it gives anything but a feeling of safety. I would say there are guardrails, but those can optionally be disabled in the compiler for faster execution.
It seems to me that Zig is really not trying to be a replacement for all programming, but to fill its niche as best it can. If your niche requires memory safety as a top priority because it accepts untrusted input, Rust would probably be a better choice than Zig.
Reminds me of modern C++, where the language and standard library features all work together to increase safety. And then it backfires because you feel like it's safe, but in reality it's not, and bugs still happen, they just catch you off-guard.
Some sort of pointer tagging system, like 128-bit pointers where the upper word is a unique generation ID, might be the simplest approach to eliminate security problems from use-after-free, but it's going to have some amount of runtime overhead (though new hardware features may help to reduce it).
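The software analogue of that scheme can be sketched in C; the 128-bit-pointer version would do the equivalent generation check in hardware on every dereference. All names here are hypothetical, not any existing library:

  #include <stdint.h>
  #include <stddef.h>

  // A fat handle: slot index plus the generation the handle was created with.
  typedef struct { uint32_t index; uint32_t generation; } handle;

  #define MAX_OBJS 1024
  static int      objs[MAX_OBJS];         // the payload
  static uint32_t generations[MAX_OBJS];  // bumped every time a slot is freed/reused

  // Returns NULL if the handle is stale, i.e. the slot was freed since it was handed out.
  static int *resolve(handle h) {
      if (h.index >= MAX_OBJS) return NULL;
      if (generations[h.index] != h.generation) return NULL;  // use-after-free caught here
      return &objs[h.index];
  }

  static void release(handle h) {
      generations[h.index]++;  // invalidates every outstanding handle to this slot
  }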
I think you don't get it. This isn't "write tests to make sure the vulnerability doesn't exist" this is "as you're testing, all of your code is automatically scanned for these vulnerabilities".
For a big project like a browser, I would imagine the tests would include property tests, fuzzing, etc.
This is obviously strictly less powerful than a proof assistant, which, yes, Rust has, but we don't empirically know what the delta and the risk factor is between something like what Zig gives you and something like what Rust gives you... Moreover, I think it's likely that something like a proof assistant will be developed to track resources based on Zig's AIR. This is something that would be costly, but you could write it as a "linter" that blocks commits as part of CI.
Yes, when I invariably had to debug the first UAF in Zig I did pause for a bit and pondered my rust. It's definitely an argument against Zig that is unlikely to go away anytime soon.
A better way to put it is "valgrind integration". It is enabled by default (when compiling for a target that has Valgrind support). Mainly it integrates with the `undefined` language feature which helps catch branching on undefined values. The nice thing you get beyond what C gives you, is that in Zig you can set things to undefined when you are done with them. Meanwhile in C, Valgrind is only aware of undefined for uninitialized variables.
But as others have pointed out, although Valgrind is a nice debugging tool, you would not run your application in it under normal circumstances. It's also not available on some important targets, such as macOS and Windows.
I don't think Zig has any particular Valgrind support, it's just a binary after all. In order to properly utilize valgrind though you're going to have to change from the GPA or whatever allocator you're using to the libc one so that Valgrind can trace memory allocations correctly via preloading.
"runtime checks on array/slice access and integer under/overflow"
I'm probably missing something. I feel like you'd get this and a lot of the other benefits you list if you just compile C/C++ with Debug options - or run with Valgrind or something. Are you saying you get automatic checks that can't be disabled in Zig? (that doesn't sound like a good thing.. hence I feel I'm missing something :) )
You're correct: you do get virtually all of the safety benefits of Zig by using sanitizers in C++. (Not speaking to language features in general, obviously.) In fact, C++ with sanitizers gives you more safety, because ASan/TSan/MSan have a lot of features for detecting UB.
The runtime safety checks are enabled in Debug and ReleaseSafe modes, but disabled in ReleaseFast and ReleaseSmall modes. They can be enabled (or disabled) on a per-scope basis using the `@setRuntimeSafety` builtin.
What "Debug options" are you imagining will provide runtime checks for overflow and underflow in C and C++ - languages where this behaviour is deliberately allowed as an optimisation?
In C it's simply a fact that incrementing the unsigned 8-bit integer 255 gets you 0. Even though this defies what your arithmetic teacher taught you about the number line, it's just how C works, so a "Debug Option" that says "no, now that's an error" isn't so much a "Debug Option" as a different programming language.
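A tiny C illustration of exactly that:

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint8_t x = 255;
      x = x + 1;          // defined in C: the result wraps modulo 256
      printf("%d\n", x);  // prints 0, with no diagnostic in any build mode
      return 0;
  }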
> What "Debug options" are you imagining will provide runtime checks for overflow and underflow in C and C++ - languages where this behaviour is deliberately allowed as an optimisation?
-fsanitize=undefined.
> In C it's simply a fact that incrementing the unsigned 8-bit integer 255 gets you 0. Even though this defies what your arithmetic teacher taught you about the number line, it's just how C works, so a "Debug Option" that says "no, now that's an error" isn't so much a "Debug Option" as a different programming language.
Yes, but this happens to be defined behavior, even if it’s what you don’t want most of the time. (Amusingly, a lot of so-called “safe” languages adopt this behavior in their release builds, and sometimes even their debug builds. You’re not getting direct memory corruption out of it, sure, but it’s a great way to create bugs.)
That’s a distinction without a difference. Yes it’s defined behavior. No, there isn’t a strictness check in C++ nor a debug option that will catch it if it causes a buffer overwrite or similar bug. Your comment is basically “no need to watch out for these bugs, they are caused by a feature”.
Did you read the same comment that I wrote? The very first thing I mentioned is a flag to turn on checking for this. And I mentioned the behavior for unsigned arithmetic is defined, but then I immediately mentioned that this behavior is probably not what you want and that other languages are adopting it is kind of sad.
People read the comment that you wrote, in which you, in typical "real programmer" fashion, redefined the question so that it matched your preferred answer, by mentioning a flag that does not, in fact, check for overflow and then clarifying that you've decided to check for undefined behaviour, not for overflow.
[ saagarjha has since explained that in fact the UBSan does sanitize unsigned integer overflow (and several other things that aren't Undefined Behaviour) so this was wrong, left here for posterity ]
Machines are fine with the behaviour being whatever it is. But humans aren't and so the distant ancestor post says they liked the fact Zig has overflow checks in debug builds. So does Rust.
If you'd prefer to reject overflow entirely, it's prohibited in WUFFS. WUFFS doesn't need any runtime checks, since it is making all these decisions at compile time, but unlike Zig or indeed C it is not a general purpose language.
I would personally prefer a stronger invariant–overflows checked in release builds as well. Compile time checks are nice in the scenarios where you can make them work, of course, but not feasible for many applications.
Sorry, perhaps I assumed a bit too much with my response. Are you familiar with -fsanitize=unsigned-integer-overflow? Your response makes me think you might not be aware of it and I wanted you to be on the same footing in this discussion.
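For concreteness, here's roughly what using it looks like in C (clang; the exact wording of the diagnostic varies by version):

  // overflow.c
  #include <stdint.h>

  int main(void) {
      uint32_t x = 0xFFFFFFFFu;
      x += 1;   // defined wrap in plain C; reported when this sanitizer is enabled
      return (int)x;
  }

  // clang -fsanitize=unsigned-integer-overflow overflow.c && ./a.out
  // expected: a runtime report pointing at the x += 1 line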
I was not. So, UBSan also "sanitizes" defined but undesirable behaviour from the language under the label "undefined". Great nomenclature there.
It also, by the looks of things, does not provide a way to say you want wrapping if that's what you did intend, you can only disable the sanitizer for the component that gets false positives. I don't know whether Zig has this, but Rust does (e.g. functions like wrapping_add() which of course inline to a single CPU instruction, and the Wrapping<> generic that implies all operations on that type are wrapping)
But you are then correct that this catches such overflows. Thanks for pointing to -fsanitize=unsigned-integer-overflow.
Since we're on the topic of sanitizers. These are great for AoC where I always run my real input under Debug anyway, but not much use in real systems where of course the edge case will inevitably happen in the uninstrumented production system and not in your unit tests...
> It also, by the looks of things, does not provide a way to say you want wrapping if that's what you did intend
This would be something for C/C++ to add, which they (for reasons unknown to me) failed to make progress on. I applaud Rust for having them; they're table stakes at this point.
> Since we're on the topic of sanitizers. These are great for AoC where I always run my real input under Debug anyway, but not much use in real systems where of course the edge case will inevitably happen in the uninstrumented production system and not in your unit tests...
Right, they are not perfect. They're a bandaid; a valiant effort but even then not a particularly great bandaid. As I've described elsewhere, I don't actually think this situation is going to get any better :(
Runtime checks for signed overflow can be enabled with -ftrapv in GCC and clang. Having this option open is why some people prefer to use signed integers over unsigned.
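For example (signed types only; unsigned wrap-around is untouched by this flag):

  // trapv.c
  #include <limits.h>

  int main(void) {
      int x = INT_MAX;
      x += 1;   // undefined behaviour in C; with -ftrapv the program aborts here
      return x;
  }

  // gcc -ftrapv trapv.c && ./a.out   (clang accepts the flag too)
  // expected: the process traps/aborts at runtime instead of silently wrapping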
C unsigned integers are completely well behaved: they do arithmetic modulo 2^n, and I hope you had a teacher that exposed you to that. C has many problems but that isn't one of them: overflow of unsigned is designed and documented to wrap around.
> C unsigned integers are completely well behaved: they do arithmetic modulo 2^n
Sadly, one rarely finds an excuse to work in the ring Z_(2^32) or Z_(2^64), so while that behavior is well-defined, it's rarely correct for whatever your purpose is.
It is usually correct for my purposes (electronic design automation). When it isn't I need to figure out how to handle overflow. There is no automatic solution that is right in all cases, and a trap certainly isn't.
I used to argue this, until I learned that Ada not only allows enum-indexing into arrays (compiler handled), but also allows non-zero-based indexing.
Example: #1
-- You just index into this using 100 .. 200 and let the compiler handle it.
type Offsetted_Array is array (Positive range 100 .. 200) of Integer;
Example: #2
-- Indexing using an enumeration (it's really just a statically sized map)
-- An enumeration.
type c_lflag_t is (ISIG, ICANON, XCase, ...);  -- etc.
-- Create an array which maps into a single 32-bit integer.
type Local_Flags is array (c_lflag_t) of Boolean
with Pack, Size => 32;
It's actually super useful, especially since you effectively get a statically sized map. Also, you can iterate over enums, and move forward ('Succ) or backwards ('Pred) or to 'First or 'Last. You can also return VLA arrays, which means fewer "allocate just to return" problems (GNAT uses a second stack per thread allocated ahead of time).
I've only used it a few times but IIRC it was contiguous value ranges of grouped values (I think it was error codes coming from C code) anchored to the middle of a range. e.g. an enum which goes from 0 .. N, but values 10-30 were some specific set of logical values and I didn't care about the rest. It was nice that Ada automatically did all range checks for me and I didn't have to remember to subtract to check the correct array index.
The most common thing I've seen it used for is that most arrays (and containers) in Ada are written as 1 .. N, but if you're sharing index information with C code, you want 0 .. N-1 indexing.
And you can raise the error if that index is actually out of bounds. I don't see why the wrapping specifically is the problem here, the only unsafety is indexing operation itself.
Sure, Rust for example will let you do that (although in debug builds it will panic unless you explicitly said this is what you intended). However from a correctness point of view, it is extremely unlikely that things[n] is doing what you intended if n wrapped.
Most likely you thought you were harmlessly increasing n, after all it's an unsigned integer, and you added something to it. But when it wrapped, adding something to it made it decrease dramatically and you probably didn't consider that.
This can be exploited by bad guys: where you expected a value like 10 or maybe 420, the bad guys instead provide a huge number, you do some arithmetic with their huge number, and you wrap the offset back to the very start of your data structure. Now it's inside the bounds, but not where you expected at all.
This is why people talk about "hoping you get a segfault" in languages like C++ because the alternative is much worse.
If you need to care about this (fiddling with files somebody provided e.g. by uploading or emailing them to you is an obvious place this comes up in web services) you should use WUFFS to do that. You can't make this mistake in WUFFS.
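A minimal C sketch of the scenario described above (sizes and names are hypothetical):

  #include <stdint.h>

  #define N 1024
  static uint8_t table[N];

  // 'base' is trusted; 'user_offset' comes straight from untrusted input.
  uint8_t lookup(uint32_t base, uint32_t user_offset) {
      uint32_t i = base + user_offset;  // silently wraps if user_offset is huge
      if (i < N)                        // the wrapped value can still pass the bounds check...
          return table[i];              // ...so we read a valid slot, just not the one intended
      return 0;
  }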
I agree that domain-specific ranged types as found in Ada are close to ideal. Unbounded integers or naturals are second best. Wrapping and checked arithmetic are distant thirds, but I don't think either is intrinsically superior to the other in terms of safety. It depends on the program's specific design IMO, but if we're talking about a C-like language where checked arithmetic is not common, I still think it's clear that indexing should be unsigned. Not the approach I'd take in a new language of course.
The pointer arithmetic you describe is the real source of most unsafety. The reason most C/C++ programmers prefer segfaults is because such arithmetic lacks bounds checking.
Thanks for the reference to WUFFS though, looks cool.
Perhaps you'd like Rust, where all the choices are offered, as functions on integers such as:
carrying_add (separate carry flag on input and output)
checked_add (result is None if it would overflow)
unchecked_add (explicitly unsafe, assumes overflow will never occur)
overflowing_add (like carrying_add but does not provide carry flag input)
saturating_add (the integer "saturates" at its maximum or, in the opposite direction, minimum - useful for low-level audio code)
wrapping_add (what C does for unsigned integers)
Rust also has variants that handle potentially confusing interactions e.g. "I have a signed integer, and I want to add this unsigned integer to it". With 8-bit integers, adding 200 to -100 should be 100, and Rust's provided function does exactly what you expected, whereas in C you might end up casting the unsigned integer to signed and maybe it works or maybe it doesn't. Likewise for "What's the magnitude of the difference between these two unsigned integers?" Rust provides a function that gets this right, without needing to consult a textbook for the correct way to tell the compiler what you want.
If you can't afford to ever get it wrong, WUFFS simply forbids overflow (and underflow) entirely, WUFFS programs that could overflow aren't valid WUFFS programs and won't compile.
Right, but in almost all languages one of the possible options is chosen by default because people want "+" to do something instead of having to specify each time. My personal opinion is that "+" should trap by default, with the various other behaviors (which 'tialaramex lists below as examples of what Rust provides) available via some other mechanism. Some languages (C, C++) do it yet another wrong way in that "+" does a thing and there is no other way to do addition, and it's even worse because they picked one of the bad ones to use as a default.
Defaults matter a lot. Just because something is possible doesn't mean it is likely to happen.
Are most people going to enable asan, run their programs through valgrind extensively, or just do the easy thing and not do any of that?
This is also why neovim is being actively developed and successful and vim is slowly decaying. The path of least resistance is the path most well travelled.
Any project with a decent test coverage and CI can easily set up an ASAN / Valgrind run for their tests. I know I've had this on the last few C++ codebases I've worked with.
I would say that keeping the checks in runtime for release builds is the smart default. For most usages, removing the checks in release builds only adds security holes without measurable impact on performance.
For what it's worth, I find a lot of Zig code benefits from switching to u32/u64 indexes into an array instead of using pointers. This is only really doable if your container doesn't delete entries (you can tombstone them instead), but the immediate benefit is that you don't have pointers, which eliminates the use-after-free errors you mentioned.
The other benefit is that you can start to use your ID across multiple containers to represent an entity that has data stored in multiple places.
See [1] for a semi-popular blog post on this and [2] for a talk by Andrew Kelley (Zig creator) on how he's rebuilding the Zig compiler and it uses this technique.
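Sketched in C rather than Zig, since the idea is language-agnostic (all names hypothetical):

  #include <stdint.h>
  #include <stdbool.h>

  #define MAX_ENTITIES 4096
  typedef uint32_t entity_id;   // an index, not a pointer

  // Several containers share the same id as their key.
  static float position_x[MAX_ENTITIES];
  static float position_y[MAX_ENTITIES];
  static bool  alive[MAX_ENTITIES];   // tombstone flag instead of real deletion

  static void despawn(entity_id id) {
      alive[id] = false;              // the slot is tombstoned; existing ids stay meaningful
  }

  static void move_right(entity_id id, float dx) {
      // An id survives the backing storage growing or moving; a raw pointer would not.
      if (alive[id])
          position_x[id] += dx;
  }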
I'd love to understand virtualization/emulation better, on the technical level. If I understand correctly, in modern CPUs it's all done with the help of specialized support in the instruction set, but I've never learned in depth
- how that works
- how it differs between Intel/AMD/ARM
- whether there are competing approaches
- what are the main software project "players" in the field
- which of them are open-source with humanly understandable source code
Is there a summary/tutorial about all this stuff someone could recommend (or write)?
Well, yes, you can do that, but Intel's Software Developer's Manuals are terribly written. They are good as references for details, but only after you understand the grand picture. Learning something new from them is quite a torturous process.
Emulation is pretty much literally just mapping instructions between processors. So there may be an instruction in my custom chipset called "Add4", which adds 4 inputs. I would emulate ADD4 RAX, 0x1234, 0x2345, 0x3456 by translating it into instructions the host CPU actually has, e.g. a handful of ordinary adds (see the sketch below).
It gets a bit more complicated with architecture differences like memory configurations. But that's all emulation is.
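To make that concrete, a toy C sketch of the mapping for the hypothetical ADD4 above (the encoding and register file are invented for illustration):

  #include <stdint.h>

  // Invented encoding for one guest instruction.
  typedef struct {
      uint8_t  opcode;      // e.g. OP_ADD4
      uint8_t  dst;         // destination register number (RAX above)
      uint64_t src[3];      // the three immediates
  } guest_insn;

  enum { OP_ADD4 = 0x42 };  // made-up opcode value

  static uint64_t regs[16]; // emulated guest register file

  // The emulator's inner loop: each guest instruction maps to host operations.
  static void execute(const guest_insn *ins) {
      switch (ins->opcode) {
      case OP_ADD4:
          // One guest ADD4 becomes three ordinary host additions.
          regs[ins->dst] += ins->src[0];
          regs[ins->dst] += ins->src[1];
          regs[ins->dst] += ins->src[2];
          break;
      // ... a case like this for every other guest instruction ...
      }
  }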
When you're virtualizing, you pretty much just need to manage hardware. The hypervisor does this for you by managing which resources go where. You could virtualize by just running the guest like a program, but that's really painful and tedious, so you rely on the CPU to support it. Each chip has its differences, but it's effectively just like a syscall: the guest makes an explicit VMCALL, or triggers a VMEXIT when it does something the hypervisor needs to handle. And then you have a handler in your vmexit table, which is exactly like a syscall table. So if(exitreason == CPUID_EXIT) cpuid_handler();
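A slightly fuller C sketch of that dispatch (exit reason values and handler names are made up for illustration, not any real hypervisor's API):

  #include <stdint.h>

  // Made-up exit reason codes, analogous to syscall numbers.
  enum { CPUID_EXIT = 0, HLT_EXIT = 1, IO_EXIT = 2, EXIT_REASON_MAX = 3 };

  struct vcpu;                             // guest CPU state (registers, etc.)
  typedef void (*exit_handler)(struct vcpu *);

  static void cpuid_handler(struct vcpu *v) { /* emulate CPUID, advance guest RIP */ }
  static void hlt_handler(struct vcpu *v)   { /* park the vCPU until an interrupt */ }
  static void io_handler(struct vcpu *v)    { /* emulate the port or MMIO access */ }

  // The vmexit table: indexed by exit reason, exactly like a syscall table.
  static exit_handler vmexit_table[EXIT_REASON_MAX] = {
      [CPUID_EXIT] = cpuid_handler,
      [HLT_EXIT]   = hlt_handler,
      [IO_EXIT]    = io_handler,
  };

  static void on_vmexit(uint32_t exitreason, struct vcpu *v) {
      if (exitreason < EXIT_REASON_MAX && vmexit_table[exitreason])
          vmexit_table[exitreason](v);
  }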
I know next to nothing about hardware / instruction set support from the CPU side, but there are several hypervisors that are open source and thus available for study: Qemu itself, Xen, Linux KVM, FreeBSD's bhyve, OpenBSD's vmm. Oh, and there's VirtualBox.
I vaguely recall someone telling me that the qemu source code is fairly readable, but I have not looked at any of these myself, so consider it hearsay.
> I vaguely recall someone telling me that the qemu source code is fairly readable [...]
That someone was either pranking you or knows a mysterious part of the qemu codebase that's unlike the rest of the criminally underdocumented qemu soup. Or my standards are too high.
The source is okay, the problem is that there's really no docs/examples for even the basic stuff. Do you want to boot from a virtual block device and PXE from a network drive ... well, maybe these 300 command line arguments in some random order will work. Eventually. Have fun. So reading the source is a big help in that. But it's not ideal. And libvirt/oVirt/etc are helpful, but not for just that quick and dirty "let's try something out with the linux kernel in qemu" thing.
a quick bird's eye view summary to get your bearings:
(caveat, I'm taking shortcuts so things may not survive scrutiny but it should be good enough for an overall understanding and getting down to the truth if you so desire)
emulation: software that implements and simulates hardware, mimicking it as expected by the software that runs inside the emulator
this is slow, because you have to really pretend to be the hardware down to the details (simulate clocks, irqs, video output, etc). often you can get away with not being 100% exact and accurate, depending on the emulated software's assumptions, or when you don't care if some specific output is incorrect (e.g. a graphical difference or oddity in games), which gives you some nice speedups. 100% accurate emulators can and do exist, but you can see the difference in hardware requirements between them (see mesen, higan) and inaccurate emulators.
some people quickly figured out that if you are running software compiled for a specific target (e.g a x86_64 CPU) that is the same target as the host then it kinda doesn't make sense to emulate the CPU... and that you can kinda obviously pass through cpu instructions instead of simulating the very hardware you're running on... this is:
virtualization: instead of emulating, expose a non-emulated device from the host that the software is known or expected to be compatible with, thus saving a lot of cycles.
for the CPUs this is typically done through an instruction that creates an isolated context in which the software will run. reminder: a CPU already has multiple "rings" under which software runs, for security and stability reasons (memory protection, etc...), e.g. the kernel runs under ring 0 and userland processes under ring 3, each in a different context preventing one from accessing the other in the same ring, or from climbing up rings, all checked at the hardware level. the virtualization instructions roughly do the same thing and nest things so that a virtual cpu runs code in a separate ring+context, in a way that the guest OS thinks it's alone. see also type 1 (xen, hyper-v) vs type 2 (qemu, vmware, parallels) hypervisors.
other devices can be implemented in lightweight, efficient ways, and exposed as virtual devices to the virtual machine for the nested OS to pick up. e.g instead of emulating a full PCI network card complete with pseudophysical hardware matching a real one, one can have a simpler device exposed that merely takes ethernet packets and inject them into the host kernel's network stack. or same for IO, instead of emulating a sata device complete with commands and all, one could just expose a block device and ask the host kernel to read its data from some host file directly. this requires dedicated drivers for these virtual devices in the guest OS though.
virtualization is as good as native because often it's literally native, only with hardware sandboxes and maybe a few added indirect wrappers; so much so that Windows on PCs, and apps and games on Xboxes, actually run in VMs under Hyper-V all the time!
so back to the CPU side of things, you can notably see how it is impossible with virtualisation for a CPU to execute anything other than its own instruction set.
now, executing foreign instructions more efficiently than literally interpreting them can be achieved through various means such as binary translation or dynamic recompilation, but it still has to take some guest assumptions into account, e.g Rosetta 2 (which is neither virtualisation nor emulation) translates x64 code to arm64 but x64 assumes a certain ordering of operations between cpu and memory that just cannot work as is on arm64, so there's a special Apple-specific instruction to switch a CPU thread to a different memory model, still running arm64 instructions (translated from x64) but in a way that matches the x64 behaviour.
there are many more gritty details, and things are not always as clear cut as I described (or flat out lied about, the Feynman way) but it should be a good enough model to start with.
Didn’t have access to my archive at the time, so got some of the details wrong it seems (e.g. CbV not CbN, result for time is older than I remembered). Main thrust should remain valid, but be careful. In any case, here you go:
Dal Lago, Martini (2008). “The weak lambda calculus as a reasonable machine”. DOI:10.1016/j.tcs.2008.01.044, apparently not on arXiv.
Accattoli (2019). “A fresh look at the lambda-calculus”. DOI: 10.4230/LIPIcs.FSCD.2019.1, apparently not on arXiv.
Accattoli, Dal Lago (2014). “Beta reduction is invariant, indeed”. DOI: 10.1145/2603088.2603105, arXiv: 1405.3311 [cs.LO].
Accattoli, Dal Lago (2016). “(Leftmost-outermost) beta reduction is invariant, indeed”. DOI: 10.2168/LMCS-12(1:4)2016, arXiv: 1601.01233 [cs.PL].
Forster, Kunze, Roth (2020). “The weak call-by-value lambda-calculus is reasonable for both time and space”. DOI: 10.1145/3371095, arXiv: 1902.07515 [cs.CC].
Now that I’m looking through bibliographies, there are apparently other relevant intervening papers in the vicinity, even by some of the same authors, but these are what I’ve looked at personally.
Bonus: the paper
Hackett, Hutton (2019). “Call-by-need is clairvoyant call-by-value”. DOI: 10.1145/3341718, apparently not on arXiv.
is unrelated to questions of complexity but is just so absolutely lovely.