The in-place update work in Koka [1] is super impressive. One of my co-workers, Daan Leijen, leads the Koka project, and hearing his talks about it has been such a privilege. The work around Koka is really convincing me that functional languages will eventually lead the pack in the effort-to-performance trade-off.
Something that came out of the Koka project that everyone should know about is mimalloc [2]: if your built-in malloc is not doing it for you, this is the alternative allocator you should try first. Mimalloc is tiny and it has consistently great performance in concurrent environments.
Ooooh… Koka is a really neat language. I first encountered it in a seminar when we talked about algebraic effect handlers. Koka has these row types which effectively allow you to statically ensure that all your effects get handled at some point, modulo certain ordering constraints on when those effects get handled.
Essentially, you get nice purity around effects but they're way easier to compose (and imo grok) than monads.
If anyone is interested in learning more, definitely take a look at this [1] paper by the aforementioned Leijen. (OP, very cool that you get to work with this guy.) One of the best-written papers I've seen.
I'm really excited for Koka. I still need to dig in and actually try writing some code, but it looks so promising. At a minimum I hope that it inspires other languages to look at these features.
There's definitely bits I don't love, but they're nothing I couldn't get over.
I actually attempted some of 2021's Advent of Code in Koka.
It was fun for the first couple of questions, but then got really painful for a couple of reasons. None of this is a complaint about the project, as it is still a research language and has not had a lot of effort invested into getting others to use it.
What I really liked was how easy it was to write parsers due to the effect system. That is, you'd write a parser that looked like a direct imperative translation of the BNF grammar or the problem statement, and then the effect system would propagate the actual input through the call graph, and each parser would only have to deal with its small part.
However, several things frustrated me enough to stop, which wouldn't have happened with better documentation and more source code to look at.
As someone with no theory background in effect systems, I found the syntax very oblique, e.g. when do I need to mark that my function needs a `| e` effect versus a `, e` effect? I'm not a fan of the Haskell style of introducing single-letter free variables without any syntactic convention. For example, in Rust, the tick clearly marks that something is a lifetime annotation, and the rules about how lifetimes propagate, and therefore when you need to introduce two or more lifetimes, are explained clearly (if not in the Rust docs, then somewhere else on the internet). That wasn't the case for Koka's effect syntax.
In addition, I would sometimes stumble onto effect annotations that should've worked according to my limited understanding, but didn't. The compiler errors in such cases were really meaningless to me. Other times, the errors almost seemed to indicate that what I was doing was semantically correct but not yet implemented in the compiler. Also, apparently `_` has special meaning at the beginning of an effect annotation, but I only found that out through trial and error. Eventually I just gave up on annotating any functions until the compiler complained, but this made it difficult for me to understand my own code (it was like writing Python without type annotations).
The API docs are mostly just type signatures, which means that unless it is some simple function, or it is used in the Koka implementation sources, you are left scratching your head about how to use the standard library. Things like `ref()` types could also use much more elaboration.
Also, at the end of the day, Advent of Code solutions are generally pretty small scripts, where there isn't a need for effect systems. It was often easier to just use Python and its solid standard library.
Overall, it is an interesting language and I'll definitely keep an eye on it.
This. Said much better than my other posts, but this is exactly what I was trying to say. Naive ARC vs. GC: a good GC will kill it, so don't make the ARC naive. Cheat and find ways to skip most refcount bumps. It became amazingly clear when I started coding Rust: you only have to bump the ref count when you get a new owner. Most RC objects die with just a few counter bumps. Most objects don't need RC at all (most objects can be single owner). Now imagine a lang (Lobster was given as an example above) that does this automatically. Sweet spot.
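For example, in Rust that looks roughly like this (a minimal sketch, nothing fancy): borrowing never touches the count, and the bump happens only at the one place a new owner is created.
    use std::rc::Rc;

    // Borrowing never touches the count; only creating a new owner does.
    fn read_len(data: &[u8]) -> usize {
        data.len()
    }

    fn main() {
        let local = vec![1u8, 2, 3];       // plain single owner: no RC at all
        println!("{}", read_len(&local));

        let shared = Rc::new(local);       // opt into RC only where sharing is needed
        println!("{}", read_len(&shared)); // borrow via deref: still no count bump
        let second_owner = Rc::clone(&shared); // the one place the count is bumped
        println!("{}", Rc::strong_count(&second_owner)); // prints 2
    }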
However non-naive your automatic counting scheme can be, it still has to touch the object at least twice. However non-naive your automatic counting scheme can be, it still has twice as much overhead per object compared to GC: structure and count vs. just structure. Sometimes you can factor the structure out, by saying "the allocator page contains a pointer to the structure at the very start", but this can be done for GC too; I guess in that case GC wins even more. And you need the structure (virtual method table, actual object layout, etc.) because releasing an object should decrement the counts of the objects it references.
Finally, reference counting can and will fragment the heap.
To be clear, I don't hate tracing GC, and in fact find all forms of GC fascinating. It is definitely "trade offs". I just happen to program in a space where I like the trade offs of ARC better than those of a tracing GC. This could change if I start doing different forms of coding, and in higher level languages I often do like a tracing GC, but Rust (and these new upcoming languages) are making me rethink this a bit atm.
> However non-naive your automatic counting scheme can be, it still has to touch the object at least twice.
Only if the object is actually ARC'd. The whole goal is to omit that entirely. Most objects live and die on the stack via proving single ownership. To be fair, I'm not sure why the same optimization couldn't be done for a tracing GC (many will say GC allocation and frees are already mostly 'free' because you compact live objects, not dead ones, but by putting everything on the GC path, you force the need for a collection sooner than if it wasn't heap allocated at all).
> Finally, reference counting can and will fragment the heap.
This is true, but in return we get deterministic destruction and easy FFI. If we can avoid a solid number of our allocations entirely, this becomes less of an issue. When building things like trees which hit the allocator hard, we use arenas, which are more like a traditional "tracing GC" minus the tracing (bump allocator + deterministic free when done, all at once).
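A toy sketch of the arena pattern in Rust (hand-rolled here just to show the shape; real code would reach for a crate like typed_arena or bumpalo):
    // A toy arena: nodes are allocated by pushing into a Vec and referenced by
    // index, and the whole tree is freed in one shot when the arena is dropped.
    struct Arena<T> {
        items: Vec<T>,
    }

    impl<T> Arena<T> {
        fn new() -> Self { Arena { items: Vec::new() } }
        fn alloc(&mut self, value: T) -> usize {
            self.items.push(value);
            self.items.len() - 1 // handle into the arena
        }
        fn get(&self, id: usize) -> &T { &self.items[id] }
    }

    struct Node {
        value: i32,
        children: Vec<usize>, // indices into the same arena; cycles are fine
    }

    fn main() {
        let mut arena = Arena::new();
        let leaf = arena.alloc(Node { value: 2, children: vec![] });
        let root = arena.alloc(Node { value: 1, children: vec![leaf] });
        println!("{}", arena.get(arena.get(root).children[0]).value);
    } // everything is freed here, deterministically and all at once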
> I just happen to program in a space where I like the trade offs of ARC better than those of a tracing GC.
Even ARC has to be augmented with some tracing in order to deterministically collect cycles. Ideally, of course, we would want to describe things like "non-cyclic ownership graph" as part of the type system, so that the compiler itself can decide where the addition of tracing infrastructure is necessary. (Delayed deallocation, as often seen in tracing, concurrent etc. GC, might be expressed as moving the ownership of finalized objects to some sort of cleanup task. But if you code carefully, the need for this ought to be very rare.)
RC might not fragment the heap for long! Someone built a defragmenting malloc called Mesh [0] which uses virtual page remapping to effectively merge two non-overlapping pages. Pretty brilliant stuff!
Heap fragmentation is a real issue, but "GC is fast if you throw physical memory at it" is actually a very bad argument for it on personal computers, especially if they have to fit on your wrist, and especially given that since 1986 we've invented technologies, like swap, that make scanning memory more expensive.
ObjC/Swift have some tricks to avoid allocating refcounts, mainly relying on storing them in unused pointer bits.
Software based on this intra-program message bus usually follows "throw memory at it". That software is written in Java (hence GC) and processes millions of transactions per second. The memory amount is so high that GC is not triggered at all during operating hours. Nevertheless, this is not achievable with refcounting because, well, refcounting has to keep counts.
Refcounts, having non-local access patterns, do not play well with caches, let alone something that has much higher latency, like swap.
I guess these tricks of refcount allocation avoidance will work fine unless there's a scale-invariant graph with several tens of thousands of nodes. In scale-invariant graphs there is some number of nodes that have links from almost all other nodes.
Actually, I've had exactly that experience: a typical weighted finite-state transducer HCLG graph for automatic speech recognition is scale-invariant. Analysis over it (in C++ using OpenFST) was usually done reasonably quickly, but there was a substantial delay between the final printf and actual program exit, when all these pesky refcounted, smart-pointered structures finally got freed.
I write all that so other readers can have a more unbiased view of the problem.
As for me, I greatly prefer arenas - much like the "automatic regions with reference counting" mentioned above. Arenas allow for cyclic structures, are easy to implement even in C, and still allow for garbage collection algorithms.
> I guess these tricks of refcount allocation avoidance will work fine unless there's a scale-invariant graph with several tens of thousands of nodes. In scale-invariant graphs there is some number of nodes that have links from almost all other nodes.
This wouldn't come up in ARC, though mostly because it doesn't detect cycles so you wouldn't use strong references between graph nodes. (Overly large refcounts can still happen other ways but usually get optimized out.)
As a quick solution, I think you'd keep a list of valid nodes and have it be the only strong references to them. This is like an arena, except it doesn't have the possibility of references escaping the arena and being deallocated.
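A rough Rust sketch of that shape (illustrative names only): the node list holds the only strong references, and the edges are weak, so no cycle of strong counts can form.
    use std::cell::RefCell;
    use std::rc::{Rc, Weak};

    struct Node {
        id: u32,
        edges: RefCell<Vec<Weak<Node>>>, // weak edges: no strong-count cycles
    }

    struct Graph {
        nodes: Vec<Rc<Node>>, // the only strong references to the nodes
    }

    fn main() {
        let a = Rc::new(Node { id: 0, edges: RefCell::new(vec![]) });
        let b = Rc::new(Node { id: 1, edges: RefCell::new(vec![]) });
        // Densely linked pair: both directions are weak, so dropping the
        // graph's node list is enough to free everything.
        a.edges.borrow_mut().push(Rc::downgrade(&b));
        b.edges.borrow_mut().push(Rc::downgrade(&a));
        let g = Graph { nodes: vec![a, b] };
        println!("node {} has {} edges", g.nodes[0].id, g.nodes[0].edges.borrow().len());
    } // `g` drops its node list, and with it the whole graph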
Reference counts don't need to be a full size_t. Particularly when you're using a mark-sweep or mark-sweep-compact collector to collect reference cycles, you can use a saturating counter, and just give up on reference counting that object if the reference count ever hits its maximum value.
There are tons of applications where lots of objects only ever have one reference, so in addition to the single mark bit in your object header, you can have a single-bit reference count that keeps track of whether there are "one" or "more than one" references to the object. If a reference passes out of scope and it refers to an object with the reference count bit unset, then it can be immediately collected. Otherwise, you need to leave the object for the tracing collector to handle. This is useful in cases where you have a function that allocates a buffer, passes it to a function that populates the buffer (without storing a reference to the buffer into any objects), passes that buffer to a function that reads out the data (again, without storing the reference into any objects), and then returns.
In the common case of objects where a C++ unique_ptr would do, this single bit reference count never gets modified. There is more overhead in checking the bit when the reference goes out of scope, but there is never a decrement of the reference count, and the common case doesn't have an increment.
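A hedged sketch of that header scheme (my own toy encoding, not taken from any particular runtime): one "shared" bit next to the mark bit, with a fall-through to the tracing collector once it saturates.
    // Toy object header: bit 0 is reserved for the GC mark bit,
    // bit 1 means "more than one reference has existed".
    #[derive(Default)]
    struct Header {
        bits: u8,
    }

    const SHARED: u8 = 0b10;

    impl Header {
        // Called when a second reference is created: the one-bit count saturates.
        fn on_new_reference(&mut self) {
            self.bits |= SHARED;
        }

        // Called when a reference goes out of scope. Returns true if the object
        // can be reclaimed immediately; otherwise it is left for the tracing
        // collector to handle.
        fn on_reference_dropped(&self) -> bool {
            self.bits & SHARED == 0
        }
    }

    fn main() {
        // Single-owner case (e.g. a scratch buffer handed down a call chain and
        // back): the bit is never set, so the last drop reclaims it immediately.
        let single = Header::default();
        assert!(single.on_reference_dropped());

        // Shared case: the bit saturates and the tracing collector takes over.
        let mut shared = Header::default();
        shared.on_new_reference();
        assert!(!shared.on_reference_dropped());
    }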
> Finally, reference counting can and will fragment heap.
Is there anything that won’t fragment heap and simultaneously doesn’t insist on owning the world? It is really, really painful to allow references to cross the GC heap boundary in both directions (witness the GNOME Shell not-a-memory-leak), and to me that is the principal reason: in a low-level language, you can RC only a bit of stuff, but you cannot really moving-GC only a bit of stuff.
(I don’t expect an affirmative answer to this question as stated, but maybe some more basic abstraction for manual memory management would allow this?..)
- Make threads a "special" tracked API call (for example, like Rust) so that you know when something will or won't be passed between threads. If it never "escapes" the thread, you can use non-atomic ref counts for the lifetime of that object (NOTE: I should add this is something I've been thinking on - I don't know how hard/feasible it is to do in practice)
- Use biased RC - I don't have a link to the paper handy, but the idea is that most multithreaded objects spend most of their life on one thread, and while they are on that thread they can use non-atomic ref counts via a separate counter. When on any other thread they use an atomic counter. When both counts hit zero, the object is destroyed. (Rough sketch below.)
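A rough Rust sketch of the biased-RC idea (heavily simplified: shown with &mut self for brevity, whereas a real implementation gives the owner thread unsynchronized access and also handles ownership handoff and deferred decrements):
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::thread::{self, ThreadId};

    // Two counters: a plain one that only the owning thread touches, and an
    // atomic one for every other thread.
    struct BiasedCount {
        owner: ThreadId,
        local: usize,        // non-atomic: only the owner thread touches this
        shared: AtomicUsize, // atomic: everyone else
    }

    impl BiasedCount {
        fn new() -> Self {
            BiasedCount { owner: thread::current().id(), local: 1, shared: AtomicUsize::new(0) }
        }

        fn increment(&mut self) {
            if thread::current().id() == self.owner {
                self.local += 1;                             // cheap, uncontended path
            } else {
                self.shared.fetch_add(1, Ordering::Relaxed); // slower atomic path
            }
        }

        // Returns true when the object can be freed: both counters are zero.
        fn decrement(&mut self) -> bool {
            if thread::current().id() == self.owner {
                self.local -= 1;
            } else {
                self.shared.fetch_sub(1, Ordering::Release);
            }
            self.local == 0 && self.shared.load(Ordering::Acquire) == 0
        }
    }

    fn main() {
        let mut c = BiasedCount::new();
        c.increment();          // owner thread: no atomics involved
        assert!(!c.decrement());
        assert!(c.decrement()); // last owner reference gone, nothing shared
    }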
Definitely! Vale, Lobster, and Cone all require explicit annotations for anything that could cross thread boundaries.
Vale's design bends the rules a little bit, it blends structured concurrency with region borrow checking to allow things to temporarily escape the current thread. [0]
I believe Lobster spawns a new VM per thread, and has some special abilities to share data between them, but I'm vague on the details.
Cone requires an explicit permission to allow crossing thread boundaries, which I think will work pretty well. [1]
I've read about biased RC before, but have been skeptical because of the branching involved. How has it worked out in practice?
I've just read about it, never actually used it verbatim tbh, but just found it kinda fascinating. I wrote a somewhat modified version (I don't track local thread ID, I use Rust's type system and Send/Sync properties instead) of the algorithm for a very different purpose: I use it to allow safe moves back and forth between (non-atomic) Rc and (atomic) Arc objects. For this purpose, it is very useful. The actual Rc/Arc operations are no more expensive than regular Rust Rc/Arc (except for the last decrement on Rc - slightly more expensive)
I got the idea from this person (I didn't use their algorithm though - they probably do it similar to how I do it though): https://crates.io/crates/hybrid-rc
Nim's ARC memory management uses the first strategy. Non-atomic refs combined with moves really speed things up! Moving data across threads requires the data be of a special type `Isolate`. It can be a bit tricky to prove to the compiler that the data is isolated, e.g. single ownership.
If you work on software that has hard real-time constraints, e.g. live audio processing, do any of the improvements to GC matter?
GC reminds me of satellite internet: no matter how fast it gets in terms of throughput, the latency will always be unacceptable — but because the latency issue can't be solved, advocates will keep trying to tell you that latency doesn't matter as much as you think it does and your problem isn't really a problem. (All sorts of bad experiences in modern software come down to someone trading away latency to get throughput.)
Rather than the framing of "GC vs. ARC", I'm mostly interested in improvements to the ownership model.
Ironically, even though I believe RC will make great strides, I actually agree with you. Single ownership has the most potential, especially if we decouple it from the borrow checker as we are in Vale. [0] Being able to control when something is deallocated is quite powerful, especially for real-time systems.
> GC reminds me of satellite internet: no matter how fast it gets in terms of throughput, the latency will always be unacceptable — but because the latency issue can't be solved, advocates will keep trying to tell you that latency doesn't matter as much as you think it does and your problem isn't really a problem.
LOL, I like this statement and will have to borrow it for future discussions/debates.
Note that Rust will most likely be getting some support for these features too, at some point. It's what much of the academic PL research on borrow-checked languages is pointing towards. You can already see some of the practical effect with "GhostCell" and similar proposals, even though these are arguably much too clunky and unintuitive for practical use (notably, they use obscure features of the Rust lifetime system to implement their equivalent of regions). Language-level support is likely to be rather easier to work with.
Fascinating. I'll be honest, I'm moderately skeptical! Various flavors of GC have caused me great pain and suffering. The promise has long been "GC, but less bad" and so far that has not come to fruition.
However, I'm ecstatic at how many languages are trying new and interesting things. The ecosystem of novel languages feels richer and more vibrant now than at any other point in recent memory. Probably due to LLVM, I think?
You should give a talk at Handmade Seattle in the fall! Would be a great audience for it I think.
Ehh. Who cares, though? And I mean that in the least dismissive way.
Why do you care? Sell me on it. GCs have gotten to the point where the slowdown is almost the butt of jokes. LuaJIT proves that it can get damn close to C speeds.
What you’ve said sounds like a whole lot of complexity for … what?
Give me the automatic threading speed improvements with no extra programmer effort and I’ll change my stance 270 degrees.
To get the automatic threading speed improvements, a (region-aware) language might add a `parallel` keyword in front of loops, to temporarily freeze pre-existing memory, thus completely eliminating the overhead to access it. Vale is building this for its generational references [0] model but I hope that more RC-based languages attempt something like it. Can't get much easier than that ;)
Perceus and HVM require immutability, which does require programmer effort, I admit. Pure functional programming just isn't as convenient as imperative programming for most. However, languages are learning to blend mutable and immutable styles, such as Pony [1], Vale [2], and Cone [3]. When a language allows you to create an object imperatively and then freeze it, it's wonderfully easy.
I think the next decade in languages will be about this exact topic: making it easy to have fast and safe code. I think RC will soon bring some very compelling factors to that equation.
I think the real point of their question was along the lines of: GC works great and is pretty fast these days, so why research cleverer reference counting?
Will these developments lead to appreciable performance improvements for 'typical' non-embedded, non-real-time programs?
I think the consensus is that GC works great _if_ you throw enough memory at it (about three times what you really need)
On mobile, the problem with that isn’t as much that memory costs money, but that it costs Joules. Peak power usage of RAM may not be large compared to that of your screen, CPU, GPU, GPS, etc. but AFAIK, with current tech, memory keeps eating energy even if your device is sleeping.
Being able to run with less memory, I think, is why iPhones can do with relatively small batteries and yet have decent stand by times.
The fastest system currently out there is iOS, which (as this article says) doesn't use GC and does use clever reference counting, so yes?
Note, most people here arguing that GC is actually always faster are forgetting that reading memory itself has a cost, especially in systems with swap or other reasons it might have a /really/ large cost.
If you won't believe me, you could believe Chris in the article.
> Chris Lattner: Here’s the way I look at it, and as you said, the ship has somewhat sailed: I am totally convinced that ARC is the right way to go upfront.
For me the biggest benefit is deterministic destruction, specifically for the non-memory resources it is holding (files, locks, etc.) the same as if it was stack allocated. It is very freeing knowing that is just being taken care of. No 'defer', 'with' or other special handling needed.
But possibly the bigger benefit: no heap allocation for many objects at all. If you can prove single ownership, you can stack allocate it (or "inline it" if in a struct/object).
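For example, in Rust the single-owner case looks like this: no refcount, no GC, and the file handle is released at a known point (the path here is just illustrative).
    use std::fs::File;
    use std::io::Write;

    fn write_report(path: &str) -> std::io::Result<()> {
        let mut file = File::create(path)?; // single owner, lives on the stack
        file.write_all(b"done\n")?;
        Ok(())
    } // `file` is dropped here: the descriptor is closed deterministically,
      // with no defer/with block and no wait for a collector

    fn main() -> std::io::Result<()> {
        write_report("/tmp/report.txt")
    }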
> No 'defer', 'with' or other special handling needed.
Just as an aside - in Python, it surprised me that 'with' is _not_ a destructor. The object created will remain in-scope after the 'with' body, but the '__exit__' method will be called. Which for the canonical example of file-handling, only really closes the file.
The surprise is less about resource usage and more about the scope lifetime. 'with' introduces a new binding that outlives its body. It looks, at first glance, to be a temporary variable. But it isn't.
I don't know the _motivation_ for why it leaks, but it leaks because the method calls of a context manager are __enter__ and __exit__, but not __del__. [0]
The motivation is especially opaque, because though CPython does run __exit__ immediately, it isn't guaranteed to do so by the standard. So __exit__ will happen at some point, like __del__ with the GC, but it won't actually call __del__, which makes GC'ing the value more complicated as __del__ can't safely run until __exit__ does.
I _think_ the motivation is something to do with __exit__ being allowed to throw an exception that can be caught, whereas __del__ isn't allowed to do that, but I don't find it that convincing. And I've never seen an example where the introduced variable is ever referenced afterwards in Python's documentation (including in the PEP itself).
Thinking about it, my guess is that properly scoping the variable does not really prevent the user from leaking the value anyway:
    with context() as foo:
        leak = foo
This is unlike list comprehensions, where it is syntactically impossible to bind the comprehension variable to a new name accessible outside the comprehension.
So maybe the reason is that it is not really worth the trouble.
    with open(path) as file:
        result = func(file.read())
    another_func(result)
If `with` used a separate scope then it would be a lot less ergonomic. How might we bring data outside that scope and be confident that `with` isn't forcibly cleaning things up? Would be a lot more verbose.
Python doesn't have block scope, but most people don't seem to notice:
    def foo():
        if False:
            x = 123
        print("x is visible out here:", x)
That will generate an error because `x` is "referenced before assignment", but the variable exists for the entire function. This can bite you in subtle ways:
    x = "is global out here"
    def foo():
        print("all good:", x)
    def bar():
        print("not good:", x)
        if False:
            x = "now x is local, even before this declaration"
I would argue that resources always need explicit handling, no matter whether you do reference counting or GC. So whether I write a one-liner to dispose of an object or call delete on it, it's the same beauty. Waiting until the variable goes out of scope together with many others, or waiting for the GC cleanup, is just not the right way to clean up external resources properly.
Java's "try with resources" is effectively the same as Go's 'defer' and Python's 'with', so the advantage is that you can't forget to do it. No matter what you do, when the object goes out of scope, the resource will be closed.
Everywhere that a clippy lint is there to catch bugs, I wish it wasn't a clippy lint because it won't always get run and those bugs will get missed.
An example of a clippy lint that I like:
If you try to transmute() a &[u8] into a &str, clippy will point out that you should probably use from_utf8_unchecked() -- this lint results in code that does the same thing but is clearer for maintainers about what we intended. If our reaction to the lint is "But it isn't UTF-8" then we just got an important red flag, but that's not why the lint is there.
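For example (assuming the bytes have already been validated as UTF-8 somewhere else, which is the whole contract here):
    fn label(bytes: &[u8]) -> &str {
        // What clippy steers you away from: a transmute that hides the intent.
        // unsafe { std::mem::transmute::<&[u8], &str>(bytes) }

        // What it suggests instead: same behavior, but the name documents the
        // contract we are claiming to uphold (the bytes really are UTF-8).
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }

    fn main() {
        println!("{}", label(b"ascii is valid UTF-8"));
    }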
On the other hand, clippy lints for truncating casts, and I want those casts gone from the language, so the warnings ought to at least live in the compiler, not clippy. I would like to see a future Edition outlaw these casts entirely, so that you need to write a truncating cast explicitly if that's what you needed.
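Roughly the difference between the silent truncation clippy flags and the explicit alternatives:
    fn main() {
        let big: u32 = 300;

        // Silently truncating cast that the clippy lint flags: yields 44.
        let lossy = big as u8;

        // Explicit alternatives: fail loudly, or truncate on purpose.
        let checked = u8::try_from(big); // Err(...) because 300 doesn't fit
        let masked = (big & 0xFF) as u8; // truncation spelled out

        println!("{lossy} {checked:?} {masked}");
    }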
Is Java escape analysis done at compile time or run time? I don't know.
FYI: As I understand, escape analysis is frequently used to either (a) avoid allocating on the heap, or (b) auto-magically free-ing a malloc'd object from a tight scope.
Personally I wish they would move some escape analysis to compile time just to help with memory consumption. I’m assuming it would require byte code changes though.
>the non-memory resources it is holding (files, locks, etc.)
Locks are entirely a memory construct. You can trivially write one, e.g. Java locks are written entirely in java. The files descriptors (incl. socket) - they should not be relied to be GC'd by default, they do require their own lifecycle. For instance: c# Disposable/ java's try with resource.
>no heap allocation for many objects at all.
This has already been the case for many a year - reference escape analysis, ref-escape elision, trace compilation.
Locks are, in fact, one of many resources your operating system can give you.
Historically, in concurrent software your only option was to either use these OS-allocated locks or build a spin lock from atomic operations (and thus burn CPU every time you're waiting for a lock). The OS-allocated locks, such as those from sem_open() or create_sem(), must later be properly released or you're leaking a finite OS resource that will only be reclaimed on process exit or, in some cases, only at reboot.
Today your operating system probably offers a lightweight "futex"-style locking primitive, in which, so long as it's not locked, no OS resources are reserved; it's just memory. But you're not guaranteed that all your locks are these lightweight structures, and even if you were, you need to clean up properly, because when they are locked the OS is using resources, so you need to unlock properly or you'll blow up.
The only parts needed to build a lock are the ability to 'sleep' the current thread and later wake it up, plus compare-and-swap or load-linked/store-conditional. There is nothing really special about locks as long as the current thread can be made dormant. Technically a single OS lock (e.g. a futex) associated with the current thread would suffice to implement as many locks as one desires. That per-thread lock would be released in the cleanup code when the thread dies.
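In Rust terms, a minimal (deliberately simplified, not production-quality) sketch of that idea: all the locking logic lives in user space, and the OS is only involved through compare-and-swap and the per-thread park/unpark primitive. (The waiter list here uses a std Mutex purely for bookkeeping; a real implementation, like j.u.c's queue, would use a lock-free list.)
    use std::collections::VecDeque;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Mutex;
    use std::thread::{self, Thread};

    pub struct ParkLock {
        locked: AtomicBool,
        waiters: Mutex<VecDeque<Thread>>, // bookkeeping only
    }

    impl ParkLock {
        pub fn new() -> Self {
            ParkLock { locked: AtomicBool::new(false), waiters: Mutex::new(VecDeque::new()) }
        }

        pub fn lock(&self) {
            // Fast, uncontended path: one compare-and-swap, no kernel involvement.
            while self
                .locked
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                // Contended path: enqueue ourselves and sleep until unparked.
                self.waiters.lock().unwrap().push_back(thread::current());
                if self.locked.load(Ordering::Acquire) {
                    thread::park(); // spurious wakeups are fine: we just retry the CAS
                }
            }
        }

        pub fn unlock(&self) {
            self.locked.store(false, Ordering::Release);
            if let Some(t) = self.waiters.lock().unwrap().pop_front() {
                t.unpark();
            }
        }
    }

    fn main() {
        let lock = ParkLock::new();
        lock.lock();
        lock.unlock();
    }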
The locks that result in a thread waiting are still in-memory, of course. They are just OS-provided primitives. Yet again, a well-behaved/designed lock should be uncontended for most calls (easily 99.9%+). In that case there is no mode switch to get to the kernel. A contended lock that goes to the kernel is a context switch.
In the end there is no need to do anything special about locks, as they can be implemented in user space, unlike file descriptors.
> In the end there is no need to do anything special about locks, as they can be implemented in user space, unlike file descriptors.
If you get semaphores from create_sem() you will need to give them back, you can't just say "This shouldn't be an OS resource" and throw it away, exactly like a file descriptor it's your responsibility to dispose of it properly.
Speaking of which, depending on the OS, your file descriptors might also just be implemented in user space. They're just integers, it's no big deal, of course you must still properly open() and close() files, the fact it's "implemented in user space" does not magically mean it isn't your problem to clean up properly.
Likewise, if you opened a sqlite database, you must close it properly, you can't just say "Eh, it's only userspace, it will be fine" and throw away the memory without closing properly.
>If you get semaphores from create_sem() you will need to give them back...
I don't quite get the continued argument - locks created in user space have 'zero' extra dependency on the kernel. In a GC setup they would be an object no different from a mere string (likely less memory as well). They can be GC'd at any time as long as there are no references to them.
If you wish to see an example: java.util.concurrent package. ReentrantLock[0] is a classic lock, and there is even a Semaphore in the same package (technically not in the very same). Originated in 2004 with JMM and the availability of CAS. You can create as many as you choose - no kernel mode switch would occur.
For the record, I do not object that proper resource management is needed - incl. closing the SQL database; the latter does use a file descriptor any time you establish a connection. However, locks need no special support from the kernel. Also, in a GC'd setup file descriptors would be GC'd pre- (or post-) mortem via some finalization mechanism; however, relying on the GC to perform resource management is a terrible practice.
I keep telling you that, in fact, these primitives can literally be OS resources such as the semaphores from the BeOS system call create_sem() or the POSIX sem_open or Windows CreateSemaphore.
Your insistence that these don't have a dependency on the kernel, that they're just memory to be garbage collected is wrong, they aren't, this won't work.
You propose the Java ReentrantLock class as an example, but, that's just a thin wrapper for the underlying Java runtime parking mechanism, so, sure it's just a Java object no big deal, but the actual implementation (which is not in the file you just linked) requires an OS primitive that you've not examined at all, that OS primitive is being managed by the runtime (via the "Unsafe" interface) and is excluded from your garbage collection entirely.
In essence you get to "just" garbage collect these locks because they're only a superficial Java language shadow of the real mechanism, which you aren't allowed to touch and can't garbage collect.
>requires an OS primitive that you've not examined at all, that OS primitive is being managed by the runtime (via the "Unsafe" interface) and is excluded from your garbage collection entirely.
This is most definitely false. The methods needed are compare-and-set plus LockSupport.park (parkNanos) and LockSupport.unpark. The former is a standard CPU instruction w/o any special magic about it, and the latter is setting the current thread to sleep and being awaken.
As for examine: I know how j.u.c works down to the OS calls and generated assembly. Unsafe is a class (not an interface), albeit mostly resulting in compiler magic.
The use of unsafe is solely for: LockSupport::park/unpark and AbstractQueuedSynchronizer::compareAndSet. ReentrantLock is NOT any thing wrapper.
> As for examine: I know how j.u.c works down to the OS calls and generated assembly.
And here's the big reveal. It's actually calling the OS for "setting the current thread to sleep and being awaken", i.e. it's just using the OS locks, and those are hidden from you so that you can't garbage collect them.
> Unsafe is a class (not an interface), albeit mostly resulting in compiler magic.
Unfortunate choice of language, I didn't mean a Java interface, but an interface in the broader sense. I assure you that "setting the current thread to sleep" is not "compiler magic", it's just using the OS locks, on a modern system it will presumably use a lightweight lock but on older systems it'll be something more heavyweight like the semaphores we saw earlier.
The claim that LockSupport, which is calling the OS is not a wrapper is bogus.
But all this is a distraction, even on systems where old semaphore APIs are now rarely needed they still exist and if you hold one you are still responsible for cleaning it up properly even if you think you know better.
There is a single OS call to put the current thread to sleep; I mentioned it much earlier - a single pthread_mutex_t per Thread, NOT per lock.
Those are not part of what the developer manages, aside from starting the thread (and locking on it). The mutex is released with the thread itself; technically the HotSpot Java implementation never releases them back to the OS but keeps a free list. The thread is a much larger OS resource as well. No new OS resources are exhausted b/c of the extra locking.
>The claim that LockSupport, which is calling the OS is not a wrapper is bogus.
Of course it does call the OS; that happens on the dormant part (w/o an OS call it'd be a busy spin), or if a thread needs to be awakened. I said not a 'thin' wrapper. All the locking and queueing mechanism is outside the OS. The major distinction is that there is only a single OS primitive per Thread, NOT per lock. (Edit: now I see I have spelled thin as 'thing' which is a definite mistake on my part)
/*
* This is the platform-specific implementation underpinning
* the ParkEvent class, which itself underpins Java-level monitor
* operations. See park.hpp for details.
* These event objects are type-stable and immortal - we never delete them.
* Events are associated with a thread for the lifetime of the thread.
*/
// ParkEvents are type-stable and immortal.
//
// Lifecycle: Once a ParkEvent is associated with a thread that ParkEvent remains
// associated with the thread for the thread's entire lifetime - the relationship is
// stable. A thread will be associated at most one ParkEvent. When the thread
// expires, the ParkEvent moves to the EventFreeList.
Edit:
My point has always been that locks/monitors/semaphores/etc. don't need any special care in a GC setup, as they can be implemented purely in the language itself, without special care to release them or manage them the way file descriptors need. The need to block (become dormant) obviously requires OS calls and support, but that part doesn't have to scale with the number of locks/semaphores/latches/etc. created.
> now I see I have spelled thin as 'thing' which is a definite mistake on my part
That does make a lot more sense. I agree that what's going on here is not merely a thin wrapper.
> My point has always been that locks/monitors/semaphores/etc. don't need any special care in a GC setup, as they can be implemented purely in the language itself
And my point was always that you can only depend upon this if you in fact are always using the "implemented purely in the language itself" mechanisms, while plenty of others exist and you might need them - but you can't just treat them as garbage safely.
So overall I think we were mostly talking at cross purposes.
Actually looking at the Java stuff reminded me of m_ou_se's thread about how big boost::shared_mutex is (it's 312 bytes). Most people who need one of these synchronization primitives only really need the one that costs 4 bytes, but it's hard to keep saying "No" to people who want to add features. They're always confident that everybody will love their feature, at the cost of a little more memory, and it spirals out of control.
On most systems the pthread_mutex_t you mentioned is already 40 bytes. A "nightly" Rust can do a 32-byte array with a mutex to protect it, all inside 40 bytes.
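That claim is easy to check on a current toolchain; the exact number is platform-dependent, but on a futex-based Linux std I'd expect something close to the 40 bytes mentioned:
    use std::sync::Mutex;

    fn main() {
        // The protected 32-byte array plus the futex word and poison flag.
        println!("{}", std::mem::size_of::<Mutex<[u8; 32]>>());
    }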
Like Chris said in the article, it is difficult for GC languages to interop with a C/C++ codebase.
In addition, if your system requires precise control over when the CPU is and isn't used, as in resource-intensive games, browsers, or OSes, then GC may not be a good choice.
Lastly, if you are using the GPU via a graphics API: I do not know of any API that can garbage collect GPU memory, but RC can naturally extend memory cleanup code to clean up GPU memory quite easily as well.
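i.e. you wrap the GPU handle in an owning type whose destructor releases it. The names below (`gpu_alloc`/`gpu_free`) are hypothetical stand-ins for whatever the graphics API actually provides:
    // Hypothetical raw API, standing in for Vulkan/Metal/OpenGL calls.
    fn gpu_alloc(bytes: usize) -> u64 { bytes as u64 /* pretend handle */ }
    fn gpu_free(handle: u64) { println!("freed GPU buffer {handle}"); }

    struct GpuBuffer {
        handle: u64,
    }

    impl GpuBuffer {
        fn new(bytes: usize) -> Self {
            GpuBuffer { handle: gpu_alloc(bytes) }
        }
    }

    impl Drop for GpuBuffer {
        // RC / single ownership gives a deterministic point to release GPU
        // memory, which a tracing GC's finalizers cannot promise.
        fn drop(&mut self) {
            gpu_free(self.handle);
        }
    }

    fn main() {
        let _buf = GpuBuffer::new(4096);
    } // `_buf` dropped here: the GPU allocation is released immediately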
No it isn't. Chris put his foot in his mouth and went on about how an advantage of garbage collection is that it moves objects. Not all garbage collection moves objects. Garbage collection that does not move objects does not have any such interop problem. Furthermore, even GC that does move objects can avoid that particular interop problem if it simply avoids moving foreign objects. E.g. suppose a foreign object is managed by some gc-heap object of type "foreign pointer". That managing object can be moved during GC, but the actual foreign pointer doesn't change. If that pointer is all that the foreign code knows about, then there is no problem. Don't give foreign code any pointers to your heap objects and you're good.
To be clear, in enterprise (non-embedded / non-mobile) programming, this is fine. Money (RAM) isn't the issue. Programmers with domain knowledge are much more expensive than adding more hosts or RAM. I am talking about the general case; I know HN loves to focus on "high performance / low latency", but that is less than 1% in "Enterprise Real World".
My TXR Lisp is garbage collected. When I blow away all the compiled files of the standard library and run a rebuild, in a separate window running top, I do not ever see a snapshot of any txr process using more than 27 megs of RAM:
This is close to the worst-case observed:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19467 kaz 20 0 27364 24924 3112 R 100.0 1.2 0:30.55 txr
When this is happening, the library code is being interpreted as it compiles itself, bit by bit. E.g. when stdlib/optimize.tl is being compiled, I see this 27 megs usage. But after the entire library is compiled, if I just "touch stdlib/optimize.tl" to force that to rebuild, it's a lot leaner; I see it hit about 16 megs.
Sitting at a REPL prompt, compared to some bash processes, using ps -aux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
kaz 2433 0.0 0.0 8904 1900 pts/0 Ss Feb22 0:00 -bash
kaz 6179 0.0 0.2 9072 4212 pts/1 Ss+ Apr22 0:01 -bash
kaz 18551 0.0 0.1 11152 3932 pts/2 Ss+ Feb22 1:32 -bash
kaz 18836 0.0 0.1 9064 2196 pts/5 Ss+ Apr13 0:02 -bash
kaz 19079 0.0 0.2 13464 4880 pts/3 Ss Mar01 0:40 -bash
kaz 19849 0.1 0.2 9360 6024 pts/4 S+ 23:45 0:00 txr
Oh, and that's the regular build of TXR. If that's too bloated, you can try CONFIG_SMALL_MEM.
Large is as large does.
Garbage collection was once used on systems in which most of today's software would not fit.
Garbage collection does have waste. You allocate more than you need, and keep it.
Precise, manual memory management does the same thing. Firstly, the technique of keeping memory around is practiced under manual memory management: e.g. C programs keep objects on recycle lists instead of calling free, and free itself will not release memory to the OS except in certain circumstances. Then there is internal fragmentation. Generally, programs that "get fat" during some peak workload will stay that way.
If you see some garbage collected system need hundreds of megabytes of RAM just for a base image, it may not be the GC that's to blame, or not entirely: it's all the bloated crap they are pulling into the image. Or it could also be that the GC was tuned by someone under thirty who doesn't think it's weird for a browser with three tabs open to be taking 500 megs. GC can be tuned as badly as you want; you can program it not to collect anything until the VM footprint is past the physical RAM available.
Hi, I am a bootcamp grad currently working in full-stack RoR development. I have started to get bored of this work and want to venture into other core CS fields. Compiler design and machine learning seem to have caught my interest the most. What would your advice be for someone who wants to break into these fields by self-study? Does compiler design involve any math that is also used in machine learning and deep learning? And do they share any similar concepts?
That's great that you're interested in venturing into other fields! Switching fields every once in a while is a great way to grow and become a stellar engineer.
If you have equal passion for those two choices, I would suggest leaning towards machine learning for now. An ML engineer can make $300k-500k/yr at the big companies after a few years.
I left that world and started doing language development as an open source endeavor, and make only $700/yr [0] which isn't that much, though I'm oddly more proud of that than the salary I got from my years in Mountain View. I'm happier, being able to push the state of the art forward and show people what incredible potential there lies ahead in the PL realm. However, I'd recommend graduates to focus on finances for a few years before looking at what lies beyond, lest the financial stress overshadow the dream.
I'm only about 110,000 lines into my journey with Vale, but so far I'd say there's little math. There's a lot of data structures knowledge needed to make a compiler fast, and some interesting algorithms, but not much overlap with ML. But who knows, maybe there's some cool things languages can do specifically for ML, that haven't been discovered yet!
ML utilizes continuous mathematics, while compiler design, including static analysis, usually favors discrete mathematics. I wouldn't say the concepts are interchangeable.
The classic textbook is the Dragon Book. Another good start on the compiler side is a project like https://buildyourownlisp.com/.
A lot of people start with lexing, parsing, and so on before moving on, but you can start from the static analysis side too. I would look at the LLVM framework and the Clang Static Analyzer, and use one of the tutorials for that to write a small analyzer for C.
I'd encourage you to join the following subreddits that might be interesting to you: r/compilers, r/programminglanguages, r/EmuDev, r/databasedevelopment.
You'll find more people like yourself interested in core CS.
My company also runs a Discord community for those kinds of people: discord.multiprocess.io.
Maybe start with a good book or two to see if you're actually interested. Crafting Interpreters (https://craftinginterpreters.com/contents.html) might be a good intro--yes, it's about interpreters, but there's a lot of overlap between creating an interpreter and creating a compiler.
As very much a non-expert in compiler design, I found two books to be helpful introductory reads (neither are free though). I'm not sure whether someone proficient in the field would recommend them as my knowledge in the area is small, but I enjoyed them.
I briefly scanned the Lobster reference, but I didn't see how they're dealing with the undecidable cases. Presumably it punts out to the garbage collector for objects where lifetime isn't statically decidable, right?
My understanding is the answer for "Why doesn't Rust use something like type inference for lifetimes?" is "In the general case, that reduces to the halting problem."
There is also a lot still to do on the GC side. Azul GC with hardware support allowed for amazing results. Even without hardware support the results are quite good.
A little off topic, but may I ask how you managed to work on this full time? I am also developing a programming language, (methodscript.com if you’re curious) but I’ve never been anywhere near being able to work on it full time.
Their site mentions they worked for Google for 6 years. The stock alone from doing that would be enough to self fund for a while. They may also be getting grants or sponsors for their language or game development. Or they could be doing contract work of any kind to pay the bills.
These are all common approaches to working on your own thing full time.
Yep, I worked at Google until they silently started sunsetting all the small endeavors and funneling everyone into Ads/Cloud/Search. Back in 2014, the internal position listings were full of fascinating projects that could bring a lot of good, and looking at it now, it's pretty clear the writing is on the wall for the culture. I left late last year.
Luckily, I've saved up enough to be able to work on Vale full-time for a year or two, hoping I'll have enough sponsors by then to be able to continue. If I play my cards right, I'll be able to make Vale into a nonprofit, a la ZSF. A worthy endeavor!
He gets something subtly wrong though: GC barriers are waaaaay cheaper than ARC, in the limit when you optimize them. Libauto had expensive barriers, but as Chris said, it’s not super interesting since it was overconstrained. But the production high-perf GCs tend to have barriers that are way cheaper than executing an atomic instruction. JSC’s GC has barriers that basically cost nothing, and that’s not unusual. It’s true that low latency GCs tend to have barrier overheads that are higher than high throughput GCs, but even then, the cost isn’t anything like atomic inc/dec. The barrier overhead in RTGCs is often around 5-15% except for the esoteric algorithms nobody uses.
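To make the comparison concrete, here's a rough sketch of my own (a simplification, not JSC's actual barrier) of why a generational/card-marking-style write barrier is so much cheaper than an atomic count update:
    use std::sync::atomic::{AtomicUsize, Ordering};

    // ARC-style pointer store: the new referent's count is bumped with an atomic
    // RMW, which contends on the cache line holding the count.
    fn arc_store(field: &mut usize, new_target_refcount: &AtomicUsize, new_target_id: usize) {
        new_target_refcount.fetch_add(1, Ordering::AcqRel);
        *field = new_target_id;
    }

    // Card-marking barrier: the store itself, plus one plain byte write into a
    // card table. No atomics, no contention in the common case.
    fn gc_store(field: &mut usize, new_target_id: usize, card_table: &mut [u8], card: usize) {
        *field = new_target_id;
        card_table[card] = 1; // tells the collector to rescan this region later
    }

    fn main() {
        let mut field = 0usize;
        let count = AtomicUsize::new(1);
        arc_store(&mut field, &count, 42);

        let mut cards = vec![0u8; 64];
        gc_store(&mut field, 7, &mut cards, 3);
        println!("{field} {} {}", count.load(Ordering::Relaxed), cards[3]);
    }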
Also, GCs trivially give you safety even if you race to store to a pointer field. ARC doesn’t, unless you introduce additional overheads.
Does what I say mean that ARC is bad? Heck no, ARC is super impressive. But it does mean that it’s a good idea to weigh ARC versus GC as a determinism versus speed decision, and accept that if you chose ARC, you lost some speed.
Oh, but ARC also lets you do CoW better than what a GC could give you, so for some languages ARC is a speed win over any possible GC because it'll let you do fewer copies. I think that's true in Swift. But if you don't have CoW then GCs are def faster.
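Rust's Rc shows the same refcount-driven CoW trick, for what it's worth: `Rc::make_mut` only copies when the value is actually shared.
    use std::rc::Rc;

    fn main() {
        let mut a = Rc::new(vec![1, 2, 3]);

        // Uniquely owned: make_mut sees refcount == 1 and mutates in place, no copy.
        Rc::make_mut(&mut a).push(4);

        let b = Rc::clone(&a); // now shared
        // Shared: make_mut clones the vector first, so `b` keeps seeing [1, 2, 3, 4].
        Rc::make_mut(&mut a).push(5);

        println!("{:?} {:?}", a, b); // [1, 2, 3, 4, 5] [1, 2, 3, 4]
    }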
I think his point about the amount of research done on ARC vs GC was interesting. Only recently have we seen things like biased reference counting [1] that can significantly reduce the overhead of ARC.
Reference counting is the oldest garbage collection algorithm, it has seen plenty of research.
What keeps being overseen is the amount of engineering needed to make reference counting perform on modern hardware, which effectively overlaps with other garbage collection optimizations.
> What keeps being overseen is the amount of engineering ...
You probably wanted "overlooked" here. Even though "seen" and "looked" are similar in meaning, "overseen" and "overlooked" are not. This is a subtle but important distinction in English. https://grammarist.com/usage/oversee-vs-overlook/
Historically you could choose "oversee" when you would today use "overlook" or vice versa, but these usages went from archaic to actively disapproved. One remnant though is you would still use "overlook" in a sentence like "Atop the hill, the old castle overlooks the whole town".
Good point. Example showing the two contrasting meanings of the word "oversight" in modern English:
At m.d.s.policy it was our view that Symantec management had inadequate oversight of Symantec's Certificate Authority, and as a result, on several occasions over an extended period, an important oversight relating to the name validation procedures at CrossCert - a Korean business authorised by Symantec to issue on their behalf - occurred without being detected by Symantec. This definitely resulted in undetected misissuance.
As a result of our efforts to fix this problem, Symantec sold their entire CA business to DigiCert, and CrossCert became merely a reseller of DigiCert's products (including "Symantec"-branded certificates).
This meant that now DigiCert's trustworthy management exercised oversight over all issuance of certificates from the new CA hierarchy even when the certificates had "Symantec" branding - while an oversight by CrossCert, such as failing to verify the identity of a customer, would no longer cause misissuance because DigiCert would independently use the Ten Blessed Methods.
Lol biased ref counting. I tried that years ago. It’s super fast for serial workloads or very simple parallel ones. I would avoid it for anything real.
> The performance side of things I think is still up in the air because ARC certainly does introduce overhead.
As you said, they can both introduce overhead, but the ARC overhead is way worse. Really the only reason to use garbage collection (which is really complex) over ARC (which is simple) is the performance and control you get. Otherwise it would not be worth the implementation cost. The question is not "up in the air" on performance; production-grade GCs are a clear winner. I would even be doubtful that CoW could make up the difference. But I would be interested to see data suggesting otherwise.
No doubt straight-up ARC with refcounting bumps on every function call, etc. vs. a good GC with the same allocation paths... a good GC will smoke it. No question.
But that isn't how I read what Lattner said here, and I've been thinking similar things. Essentially with ARC the goal is to cheat (to be fair with tracing GC it is as well, just in different ways). And you cheat by implementing an ownership/borrow collector type system behind the scenes for getting rid of the need for RC on most of your objects. Once you do that, you have GC competing with stack allocation + some malloc/free scattered in. It becomes a much more interesting fight.
"No doubt straight up ARC with refcounting bumps on every function call,etc..."
Yeah, but in Swift you don't do that. From what I'm told, the compiler's static analysis has gotten a lot smarter in determining when a ref-count bump is needed, and the increased use of struct-based value types in the language leads more and more to situations where a refcount bump isn't needed at all since there's no object reference in the first place.
This “cheating” has been going on in GC’d languages for decades. Java, JavaScript, and probably other language implementations have what they call “escape analysis”.
Note that GCs are starting to cheat too. Julia (for example) is in the process of implementing compile-time escape analysis, which will be used to (among other things) remove allocations as long as the compiler can prove you won't notice.
I find this whole conversation frustrating, or somewhat confusing, because it's a weird distinction: reference counting is garbage collection. It's not a mark & sweep or generational tracing collector, but it's a collector. And in fact it even traces, though the tracing is explicit (in the form of the declaration or management of references).
My copy of the Jones/Lins Garbage Collection book I'm looking at on my shelf makes this quite explicit, the chapter on reference counting makes it clear it's one form of garbage collection.
When I was messing with this stuff 20 years ago, that was my understanding of it, anyways.
There are advantages and disadvantages to reference counting. I think some of the confusion in terminology and some of the frustrations (or opposite) here comes from the rather awkward way that Objective-C implements reference counting as something you have to be explicitly aware of and careful about, even with ARC. My brief stint doing iOS dev made this clear to me.
But I've used (and written) programming languages that had transparent automatic garbage collection via reference counting, and the programmer need not be aware of it at all, esp if you have cycle detection & collection (or alternatively, no cycles at all). And, there are compelling RC implementation that are capable of collecting cycles just fine (albeit through a limited local trace.)
Anyways, he's right that RC can be better for things like database handles etc. Back in the day I/we used reference counting for persistent MUDs/MOOs, where the reference counts were tracked to disk and back, 'twas nice and slick.
You might as well classify allocating without freeing as a form of garbage collection. If the implementation can fail to collect cycles and the language can create them, thus leaking memory, lumping in reference counting with real garbage collection is a bad definition because it smears together a crucial distinction.
I get that some people, or many people, or some book, have defined it that way, but others have used it in the meaning as distinguished from plain refcounting. Many developers have always understood GC and refcounting (in languages with mutation) to be mutually exclusive concepts. That definition is used in practice, and when people say, "I'm going to implement garbage collection for my programming language," what they mean is one which doesn't leak memory.
A reason people use this other definition may be that plain reference counting and "having a garbage collector" are mutually disjoint, even if you define "garbage collection" as including reference-counted implementations. Making garbage collection a distinct concept from the set of acts a garbage collector might do will always be fighting against how language usually works.
This definition is very elegant in that it allows one to reason about program correctness across all garbage collectors without having to know any of the implementation details, be it reference counting or the many flavors of mark and sweep. Basically garbage collection is not an implementation detail, nor is it a specific algorithm or even class of algorithms, garbage collection is a goal, namely the goal of achieving an infinite amount of memory. Different algorithms will reach that goal to different degrees and with different trade-offs.
They get lumped together because when some people say "garbage collection" they mean "whatever Java specifically does", and when other people say "garbage collection" they mean "automatic dynamic lifetime determination" (as a contrast to C-style "manual static lifetime determination", or Rust-style "automatic static lifetime determination").
RC need not leak memory. If it's done right, it doesn't.
That Objective-C exposed the innards of the system badly enough that it was possible to leak, that's not intrinsic to RC. It's just yet another piece of the awkward legacy puzzle that is Objective-C.
As for cycles... Cycle detection in RC has been a solved problem for over two decades. I implemented it once, based on a paper from some IBM researchers, if I recall.
Cycles in RC are a fundamentally unsolvable problem because there are situations where you in fact want an unreferenced cycle to exist in memory, at least for a while.
Imagine a slow async operation such as a network request, where a GUI object is locked in a cycle with a self reference in a callback. Before the network response arrives the GUI object may be removed from the view hierarchy (e.g. the user has navigated away), however, whether you want the cycle to be broken or not depends on your needs. You may want to wait for the results, update stuff and break the cycle manually, or you may choose to completely discard the results by declaring weak self.
Therefore, solving the problem of cycles with RC I think means giving more manual control. Explicit weak pointers solve part of the problem, but the "I know what I'm doing" scenario requires great care and is obviously error prone. To my knowledge this hasn't been solved in any more or less safe manner.
I think in the scenario you describe, you always want to hang onto a reference to the unfinished network activity because you will need to be able to interrupt it and clean it up. You never(?) want to leak active running code in the background, especially not code with side effects, which isn't described by some kind of handle somewhere, in part because there's probably a bug and in part because people can't understand how your code works if there isn't a handle to the task stored in a field somewhere, even if you don't use it.
Of course there's a reference to the network resource, and that's the reference that creates a cycle via its callback defined in a bigger object (e.g. a GUI view) that awaits the results. There may be reasons why you want to keep the cross-referencing pair of objects alive until the request completes. Not saying there aren't alternative approaches, but it may be easier to do it that way.
I came across this while coding in Java, when I realized a network request may be cancelled prematurely just because the GC thinks you don't need the network resource anymore, whereas in fact you do. There's no easy solution to this other than reorganizing your whole request cycle and object ownership. Classical abstraction leak.
I mean, the GUI object with its reference to the network request should never be detached like that. The network request handle should be moved to something else that takes care of it, if the GUI object can't, and it should not be calling a callback tied in a strong way to the GUI object's existence. Granted, I wasn't there when you had this, and I'm not saying there's an easy solution. What I am basically saying is there should be a chain of ownership to some local variable on the stack somewhere, instead of a detached GUI view. I'd view the resource leak you describe as hacky, and a class of design decision that has caused me a lot of pain in gui software. Well, maybe I had to be there.
Yes, you can move your references around to find a suitable place, but how does that solve the problem of lifetime control? How do you guarantee a network resource is alive for as long as you need it under GC, and under ARC?
By holding a reference to it that is grounded to a GC root, i.e. an ownership chain to a local variable that will dispose of it. That reference can also be used to interrupt the background task. So the network connection, callback, etc., are not in a detached cycle.
With the design you're describing -- without the ability to interrupt the background task -- I would bet you tend to get race conditions of one kind or another.
Can you link to some resources/papers detailing how to solve this problem? I’m only aware of the “RC+GC” solution (i.e. adding a full tracing GC in addition to RC to find and collect cycles).
That's the one I was thinking of, and I once implemented that one myself in C++ back in, uh, 2001, 2002, during a bout of unemployment in the .com crash.
I guess I just assumed that there's been some advancement in the world since then.
I was taught that the "garbage collector" term only refers to tracing garbage collectors. It sounds weird and wrong to me to hear the term GC referring to things like C/Zig's manual memory management (malloc and free) or Rust's borrowck. And it sounds like it's the same for you, in reverse.
In my school, the whole class of systems is called "memory management systems", and "garbage collector" is one such class of approach (what you would call tracing GCs).
But I don't think either of us is canonically right. It's just a "pop" vs "soda" thing. You and Jones/Lins say "pop"; me and Chris Lattner say "soda". It's no big deal.
Some people refer to only tracing algorithms (mark-sweep, copying, etc) as "garbage collection". The best reason I can think of is marketing.
There are generational reference counting systems e.g. ulterior reference counting [1] and age-oriented collection [2]. "Tracing" usually refers to scanning a heap to work out liveness all at once, which reference counting doesn't do; though deferred reference counting scans the roots to collect, and ulterior reference counting traces a nursery. Deferring these (more common) updates to reference counts improves throughput considerably.
In the modern sense of the word, a GC is an algorithm that manages memory automatically, including collecting cycles. A GC must maintain the property that a leak is only possible if an object is reachable from a root. RC on its own does not obey this property.
Reference counting is not GC in the modern sense unless it also includes a cycle collector. So Swift is not GC’d (RC only) yet Python is GC’d (RC+CC).
> A GC must maintain the property that a leak is only possible if an object is reachable from a root. RC on its own does not obey this property.
This is certainly a false claim, then. What you're describing is a precise garbage collector, but there also exist conservative garbage collectors, which will fail to reclaim objects even when no cycle exists, so long as some word of currently allocated memory happens to have the same bit pattern as an object's address.
GNU Java and the Mono implementation of .NET both use the Boehm conservative garbage collector, and they are most certainly still considered garbage-collected systems.
This isn’t formal science. We are practitioners and so definitions have to be practical. If Boehm GCs really leak as much as you claim then nobody would use them.
The issue with cycles is that they can easily lead to significant memory leaks if a large subgraph gets caught in a cycle. A definition of GC that excludes this capability does not come with the contract of “you only have to worry about reachability from roots”.
Computer science is absolutely a formal science in every sense of the word.
>If Boehm GCs really leak as much as you claim then nobody would use them.
I mean, if anything this argument applies more to your characterization that reference counting easily leads to memory exhaustion due to cycles than it does to conservative garbage collectors. If RC-based garbage collectors leaked as much as your post implies then nobody would use them. And yet Swift, a language that is almost exclusively used with large graphs (that's how most UIs are represented in memory) and that exhibits all kinds of non-deterministic shared ownership, uses RC-based garbage collection. Objects involved in UI code are shared in wildly unpredictable ways with no clear hierarchy among layouts, event handlers, observables, IO systems, etc...
>A definition of GC that excludes this capability does not have the contract of “You only have to worry about reachability from roots”.
Exactly, because that's not part of the contract for a GC. That's part of the contract for a tracing garbage collector.
It is sufficient in a functional language that lacks Haskell's lazy capacity to "tie the knot". Erlang's data structures, for example, are fully immutable, and it is not possible to create a cyclic data structure. Since there cannot be a cyclic data structure, reference counting would be sufficient to manage its memory automatically.
It would still suck since it has to constantly write to all objects being referenced, but it would work.
Not necessarily. The issue is that releasing a reference at the top of a tree of objects has the potential to walk a large graph of objects (doing downcounts at each) that are not in cache, causing some thrash.
In an ideal world and in a program running with a not-huge # of objects you could keep the entire basket of reference counts in L1 cache, separate from the objects. Then decrements would be potentially cheap. A bit utopian though.
Nice, will put this paper in my bookmarks for later read.
I feel like once you stop and think about it, the distinction between a 'tracing' GC and a 'non-tracing' GC is specious: to do its work at acquire and release, an RC collector must in fact trace a tree of objects. It's just that the semantics of how this is declared might differ, or, more specifically, the time(s) at which it does this work.
By "trace" I mean the act of consulting the state of an object, then recursively tracing pointers in the object. A tracing GC propagates a "known to be live" state through an object graph while doing a collection cycle, and a refcounting GC propagates a "known to be dead" state through an object graph, when RC drops to zero.
Umm... but isn't the refcounting GC more deterministic? If I create a chain of 10 objects and then release the head, I run through 10 allocations and then 10 release cycles on those 10 objects. Period.
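A tiny Swift illustration of that determinism (the Node class is just my example): dropping the head runs the ten deinits immediately and in the same order on every run.

```swift
final class Node {
    let id: Int
    var next: Node?
    init(id: Int, next: Node?) { self.id = id; self.next = next }
    deinit { print("deinit node \(id)") }
}

// Build the chain 1 -> 2 -> ... -> 10.
var head: Node? = nil
for id in (1...10).reversed() {
    head = Node(id: id, next: head)
}

print("releasing head")
head = nil    // the ten deinits run right here, in order, every run
print("all ten nodes are gone by this line")
```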
ARC makes ref counting a part of semantics in the sense that the deref that reaches zero immediately runs the destructor, and destructors can have any effects they like.
Therefore, it’s not really valid to say that ARC “is” garbage collection.
Reference counting is only a form of garbage collection if it is totally transparent to the programmer, and it’s not like that in ARC.
Which is why when Apple introduced new ARC optimizations last year, they had a talk about possible gotchas that might crash the application due to them.
I think ref-counting is interesting, but I look at it as sort of like Level 3 self-driving cars where you almost don't have to pay attention but sometimes you really do. That's almost worse than Level 0 because humans are very bad at partial attention.
With GC, assuming your program can accept the slight level of non-determinism and latency, you don't think about deallocation at all. There are just objects and references, and that is the extent of your mental model. That simplicity frees up a lot of mental energy to focus on your problem domain.
With ARC, you get deterministic deallocation and (maybe) better latency. Those are nice, and critical for some kinds of programs. And you mostly don't have to think about deallocation. Except that, thanks to cycles, you can create actual memory leaks where you have completely inaccessible objects that live in the heap forever. And, in order to avoid this, you have to know when and where to use weak or unowned references. Once you throw closures into the mix, it becomes very easy to accidentally create a cycle.
So ARC gives you more language complexity (strong, weak, and unowned references, closure capture lists, etc.). And there is a subtle property that you must always maintain in your program that is not easily found through static analysis. If you fail to maintain this property, your program may slowly leak memory in ways that rarely show up in tests but will cause real problems for programs running in the wild.
Most of the programs I write don't need finalizers much and can afford a little latency. I relish being able to use closures freely even inside classes. For that kind of code, I strongly prefer tracing GC so that I don't have to worry about cycles at all.
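For concreteness, here's a small Swift sketch (my own example, nothing from the comments above) of how a stored closure quietly creates the kind of cycle being discussed, and how a capture list breaks it:

```swift
final class Counter {
    var value = 0
    var onChange: (() -> Void)?          // Counter -> closure (strong)

    deinit { print("Counter deallocated") }

    func startLeaky() {
        // closure -> Counter (strong): completes the cycle, so this object
        // stays in the heap forever even when nothing else refers to it.
        onChange = { print("value is now \(self.value)") }
    }

    func startFixed() {
        // The capture list breaks the back edge; the object dies normally.
        onChange = { [weak self] in
            guard let self = self else { return }
            print("value is now \(self.value)")
        }
    }
}

var leaky: Counter? = Counter()
leaky?.startLeaky()
leaky = nil     // nothing printed: the cycle leaks

var fixed: Counter? = Counter()
fixed?.startFixed()
fixed = nil     // prints "Counter deallocated"
```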
Objective-C GC failed to work reliably due to everything C allows you to do, and people mixing frameworks compiled without GC being enabled.
Automating Cocoa's retain/release calls was more reliable, and makes much more sense.
Swift naturally needed to follow the same path to easily interoperate with the Objective-C runtime.
A tracing GC would mean having something like .NET COM Callable Wrapper and Runtime Callable Wrappers, for interoperability with COM, which is a much greater engineering effort.
So naturally reference counting as the GC algorithm for Swift makes sense.
This benchmark is not good enough to make a decision about how you should build an entire OS.
They did make this decision when designing ARC and Swift, and they did have data for it, and first-order effects of how fast you can make a single task can't be the determining factor there. In particular you need to know how disruptive running the task will be to other more important stuff.
How is the amount of RAM relevant here? Different GC algorithms have different tradeoffs, tracing GCs absolutely trash ARC in throughput, but they usually use more memory for that.
This has nothing to do with custom hardware, it's one of the few differences that's actually due to ARM vs x86. Though, part of it is from being a unified memory SoC.
I heard in an interview with Craig Federighi, that in designing the Apple Silicon chips they had looked at specific operations that were frequently used by their OS and software and optimized those on their chips. Reference counting was one of those operations and that was why the speed of that operation was so much faster on Apple Silicon. Sure those other things about unified memory contribute speed to all kinds of operations, but they really focused on some of them to let their hardware and software work well together.
Choosing to use ARM and unified memory is part of designing the SoC, though. There is no magic in the refcount machinery; you can disassemble it and look at it.
I agree with the sentiment, but does Swift actually see any usage outside of Apple’s own hardware? I think Swift’s reliance on such an optimization is justified since the hardware this optimization is applied to is the main target.
While Swift’s approach may be slower in theory, is it actually slower in practice? Just a thought.
Hardly, but having to design hardware to fix ARC performance rather proves the point about how much "better" it is.
Depends how much one cares about performance beyond iOS GUI apps, especially when Swift's original goal was to become the main language across the whole Apple stack.
Swift has been used to write parts of MacOS that are used on Intel today.
The good faith efforts to write parts of a consumer OS in a garbage collected language that I can remember are Microsoft's effort to write Longhorn in C# and Google's effort to write part of Fuchsia in Go.
Likewise, Midori was powering an Asian Bing cluster for a while, and even that wasn't good enough for WinDev.
And in Fuchsia, the same thing happened: the parts in Go were only taken away when the Go advocates who had driven those modules were no longer around to argue for their implementations.
Politics are more relevant than technical limitations.
Finally, Azure Cloud OS and Windows together surely have more lines of C# powering them than Apple has of Swift.
Naturally you would answer that, from an OS that never had a certified GA build, great proof indeed.
Yet the same group that sabotaged Longhorn was quite keen on doing exactly the same stuff using a mix of .NET Native and C++/CX, which only failed due to the lack of a migration path from Win32, and not much love for the sandboxing.
It's one of the internal emails made public as part of the Microsoft antitrust trial.
>LH is a pig and I don’t see any solution to this problem. If we are to rise to the challenge of Linux and Apple, we need to start taking the lessons of “scenario, simple, fast” to heart.
Sorry, but the person who was in charge of Windows development at the time did not agree with you. Longhorn failed due to performance issues that Jim Allchin believed were unsolvable.
Go has a bad GC. Go relies on the compiler to do escape analysis and allocate on the stack. Once you actually start heap allocating you hit massive pauses. It’s why Go does so bad on the binary tree benchmark.
> Once you actually start heap allocating you hit massive pauses.
citation needed. go afaik has millisecond pauses at most. maybe you're thinking of throughput losses due to forcing more CPU to be spent on collecting?
I always took GC for granted until I started coding in Zig more and had to worry about managing my own memory.
One thing that Zig made pretty easy that's been sort of my go-to is using Arena allocations a lot more. It's pretty frequent that for a chunk of code I know I'll be allocating some chunk of memory and working with it, but then once that's finished I won't need it anymore. So just throwing everything into the arena and then freeing the arena afterwards works great, is fast, and is simple.
Are there any garbage collectors that enable this sort of thing? I think it might be vaguely like "generations", but I think even those are inferred and kept track of in greater detail to know when to move between the generations. I'm thinking of somehow earmarking a scope to use its own bit of memory and to not worry about collecting within the scope, and then when you leave the scope being able to free it all.
I'm sure there's more complexities that I'm missing. It's just that dealing with memory manually really makes you realize that the programmer probably easily knows expected lifetimes better than can be inferred. The only GC's I've used have been all "don't worry about memory we'll figure out everything". But maybe you could provide hints or something.
Pony has a separate GC for every actor. I suspect that in practice, these work a lot like arenas; since they're very short lived, I'd imagine that most actors just allocate a lot and then blast it away. [0]
Odin (not GC'd) has an implicit "context" that has an allocator, [1] effectively decoupling the allocator from the code. One could use an arena allocator for any function they'd like.
Vale (not GC'd) is taking this context system a bit further and making it memory safe. [2]
Sounds more like ownership annotation than generations.
Generations basically take advantage of the fact that recently allocated memory is much more likely to be deallocated soon - i.e. it's runtime/age based.
What you're describing with arena is more like ownership where siblings can't outlive ancestors - ie. deallocate when parent/root (arena container) is deallocated.
Sadly they didn't discuss the main selling point of GCs (in my opinion): a GC can move objects in memory, and often does. That can improve cache locality, so fewer cache misses. Moreover, no more heap fragmentation. There are different strategies to do that, so theoretically one can choose the best one after the code has been profiled. Though practically it may not be an option, because the chosen language gives you a fixed, unchangeable implementation of a GC.
I bet these benefits of GC can be great in some tasks, though I know of no real-world examples of software explicitly relying on them. Some servers on Java using terabytes of RAM, I think? But no one talks about it. My curiosity has nothing to consume, and it is hungry.
Database engine implementations sometimes work similar to this, being locality optimizing machines. Objects in memory can be moved and reorganized dynamically. Ensuring that references to those objects remain valid is not inexpensive because it requires indirection to dynamically resolve object locations and may require patching state referencing the object.
The performance impact is only tolerable for database engines because the context and scope is sufficiently constrained such that clever software design elements can ensure that the cost is applied lazily and expensive effects are avoided except in rare cases. Even in database engines, thrashing your CPU cache is a common side effect of not getting the design right. I can't imagine trying to make this perform well for general purpose software.
Take a look at Mesh [0] which brings compaction to a malloc-like system. I could see an RC-based or single-ownership-based language using it to avoid fragmentation.
Personally I find ARC really easy to reason about and worth it. Essentially, a strong reference goes to “child” objects, a weak reference goes to “parents” or “siblings”.
You can think of your app’s data structures as a forest. The tree roots are your application and any globals, and the children are their fields, which have more fields, and so on. Sometimes a node stores a reference to something which is not its child/nested field but another node’s: e.g. a view might have a reference to its controller (its parent), another view which is not a subview (its sibling), or even a subview of a subview (a descendant). The point is, this other node has a different parent, so it already has a strong reference; therefore inside the child/sibling it’s a weak reference.
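Here's roughly what that forest rule looks like in Swift (class names are mine): strong references flow down to children, weak references point back up or sideways.

```swift
final class ViewController {}

final class View {
    var subviews: [View] = []            // parent -> child: strong
    weak var superview: View?            // child -> parent: weak (back edge)
    weak var controller: ViewController? // reference "sideways": weak

    func add(_ child: View) {
        subviews.append(child)
        child.superview = self
    }
    deinit { print("view deallocated") }
}

var root: View? = View()
root?.add(View())
root?.add(View())
root = nil   // the whole subtree goes away; no cycles, because every
             // back/side edge in the forest is weak
```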
Another perspective, if you're familiar with functional programming, especially if you’ve worked in a language like Haskell that “doesn’t have” references: weak references are conveniences of the object-oriented paradigm which, in the functional paradigm, you would just pass around as function arguments instead. It can be really annoying to update an object without mutation when the object is referenced by other objects - in mutating, object-oriented code, those would be weak references.
> Personally I find ARC really easy to reason about and worth it
How do you reason about a closure that captures some stuff and gets passed to a library function? @escaping doesn’t give you enough information to reason about all lifetimes that could possibly matter.
Uhh... because I can see what was captured? If I'm overly concerned that a given callback might be held somewhere for all time I'll simply capture a weak reference and be done with it.
I'm basically "whatever" on this. I develop in both Swift and Kotlin. I like the Swift language better. Definitely like the libraries better. But I hate the memory story in Swift. ARC is a half-GC. It's definitely different than malloc/free (which I do a lot of in embedded C land). But the stressing about ownership loops sucks. I like structs and what they enable, but I find myself choosing them over objects sometimes, not because they're the best solution for the problem in front of me, but because ARC sucks. And then after trying to avoid ARC using a struct, I end up biting the bullet and using a class, because sometimes a real object is the better solution. Kotlin's "almost struct" data classes are a better answer IME.
Being a geek who is somewhat into PLT and hobbyist languages, I go back and forth a bit on whether I want a GC or RC in my own hobby lang. For the longest time I was going to go with per-thread heaps that use a simple compacting GC, but only collect on allocation when heap starts running low, so collection wouldn't have to stop threads, the local thread would already be stopped. The plan was to use RC most likely for multithreaded shared objects from a shared heap as a secondary.
Then I started writing Rust and I realized what I think is the same point Lattner has here: Ownership changes everything. If I can declare a single owner most of the time and track that owner, I only have to bump the counter when I add a 2nd (and 3rd, etc.) owner, which isn't nearly as common as single ownership. This gets rid of most of the issues with RC performance (which in naive schemes bumps the counter every time it calls a function and more). RC still can't use bump allocation unfortunately, so any tree building would likely need arenas in addition which is a bummer, but deterministic destruction is a huge bonus, so this is the way I'm leaning atm (if I ever get back to writing it that is!).
In the case of C#, there are very lengthy docs and StackOverflow threads on how to properly use IDisposable -- what you need to use to clean up non-GC-able resources. A great many classes in C# libraries use IDisposable, so outside of pure business logic you end up with a lot of 'using' blocks. I've written a lot of C# code that needs to clean up external resources using Dispose(), and I've had to fix a lot of C# code written by people with a poor understanding of IDisposable and how GC actually works (eg. useless GC.Collect() calls everywhere).
So I think the benefits of deterministic object lifetimes, and it being easy to explain when each object will get cleaned up (all of it, not just the memory), outweighs the marginal benefits of heavyweight Garbage Collection.
Chris makes some points which are interesting in their own right, particularly the one comparing the amount of research which has gone into tracing/copying gc vs rc. But I think there is an unmentioned bias which is worth pointing out. Tracing gc has found massive success in enterprise java applications, which:
- Need to scale in terms of code size and contributor count, and therefore need the modularity afforded by tracing gc
- Need to scale in terms of heap size, and are therefore more likely to run into the pathological fragmentation issues that come from non-compacting gc
- Need the higher throughput afforded by tracing gc
- Generally run on massive servers that have the cores, memory size, memory speed, power budget, and heat dissipation required to support high-quality tracing gc
Contrariwise, swift is a product of apple, a laptop and phone company, and is used to make laptop and phone apps which:
- Are unlikely to ever get really huge (people balk at large downloads)
- May have limited memory, and need to share it with other apps
- Run on battery-powered hardware with little or no active cooling, and so need to attempt to conserve power
No silver bullet, as they say. I implement APL, which absolutely cannot do without reference counting, except perhaps in some novel configurations. I think that for the more usual case of a pointer-chasing language, tracing gc is almost certainly the right choice.
Lastly, however, I would be remiss not to bring up Bacon et al., 'A Unified Theory of Garbage Collection' [0], which notes that tracing GC and reference counting are not really opposed to one another, and that a sufficiently advanced implementation of either will approach the other. Lattner mentions barriers (though, as a sibling mentions, barriers are an order of magnitude cheaper than reference-count twiddling); on the other side of the coin, consider deferred reference counting, one-bit reference counts, ...
The bottom line being that memory management is _hard_, in all cases; a sufficiently robust and general solution will end up being rather complicated no matter what it does; and, again, there is no silver bullet.
I think Chris makes an excellent point about what is the mortal flaw of GCs: anyone who argues about overhead or pauses or throughput is getting it wrong - the problem is determinism, and interaction with native objects.
You cannot reason about when the computer can/will get rid of native objects that are referenced by the GC, and doing so raises a billion of hairy questions, many of them discussed at length in the article.
Why is this important?
This implies that the model of SPA frameworks, which rely on this exact thing, fundamentally cannot be made to work well, since they rely on referencing native objects from JS.
This invalidates the entire way we build GUIs/websites nowadays. I'd say that's a pretty darn important issue.
Could you elaborate how this is connected to js and SPA frameworks? I totally get how reasoning and determinism is critical for sound processing, embedded low level, operating systems etc. But why would this be relevant for normal application or web development?
SPA frameworks manipulate the DOM through JS, which means they need to take a reference to DOM elements.
In a static website, once you navigate away from a page, the browser can deallocate all the DOM resources related to the old page, which can be quite heavy (images etc.).
In an SPA-driven website, the DOM is created by JS. Once you 'navigate' away (you tell the framework to render a different set of DOM components), the old ones still linger around in memory until the next GC cycle, meaning the browser has to wait for the GC to run before it can free them.
Even worse, GC's tend to be tuned for GC memory pressure, and if you have a ton of GC objects holding onto native resources, the GC has no way of telling it needs to run, because from its perspective, the GC heap is small and it has no idea about the native objects.
Still don't get it. Obviously a DOM implementation that is tied to a garbage-collected scripting language has to be tied to the garbage collection in some way to free connected resources; this fact only shows that it is not a trivial task, which is obvious. I still do not see how this is an argument for why a garbage-collected language is a bad idea for that use case.
Yes. And if you read their papers, when they were able to add explicit hardware support they were able to do even better.
I really think that had vertical companies from the 90s not all died, seriously developing chips for personal computers (phones, laptops) with hardware support for the most important GC features would have been a huge winner.
Apple could have done this, but they were already bound to Objective-C because of NeXT.
Sun had the amazing Self VM in the early 90s already. And they made their own chips as well. But unfortunately they went with Java and instead of making a chip to support GC they tried to make a chip that implemented Java.
I know its more complicated, I was not trying to give a full history.
As I pointed out in my original comment:
> But unfortunately they went with Java and instead of making a chip to support GC they tried to make a chip that implemented Java.
I think in the 2000s Azul hit on how to combine a high-performance JIT and a high-performance GC with a number of GC-specific updates to the hardware and the OS.
> A ton of energy’s been poured into research for garbage collection, particularly since Java has come up. There’s been hundreds of papers written in the academic circles, tons of work in HotSpot and other Java [2:03:30] implementations to do different tweaks and different tunings and different new kinds of algorithms in garbage collecting. That work really hasn’t been done for ARC yet, so really, I think there’s still a a big future ahead.
I disagree strongly; there has been plenty of work in optimising reference counting systems. A quick look through Jones, Moss and Hosking's Garbage Collection Handbook provides many descriptions of high performance reference counting, but they all use deferred reference counting [1], rather than the immediate reference counting that Swift/Objective-C/Apple uses. That Apple is stuck in 1975 does not say much about other reference counting users.
(Though Lattner only says "ARC", and I believe only Apple calls their refcounting "ARC"; I guess Rust has "Arc" as well, but that is a different thing completely. Else I haven't seen the acronym used elsewhere.)
> But to do that, they use this tricolor algorithm which dramatically lowers throughput, because they’re doing almost exactly the same kinds of operations that ARC is doing.
Concurrent tracing algorithms often log into a local buffer, without synchronisation. Whereas concurrent immediate reference counting performs atomic updates - it's almost exactly the same except for the parts where they really aren't the same.
High performance concurrent RC systems also use local buffers [2] to avoid atomic updates too. Go figure.
> I am totally convinced that ARC is the right way to go upfront. [...] It gives you deterministic behavior[...]
Does it? In a meaningful way? For ARC with strong refcount > 1, which constitutes the main use case in my experience (weaks are unergonomic, for one), the deterministic destruction is not meaningful locally, as opposed to the modern manual alternative, RAII, for instance. Am I missing something?
I don't think it's quite deterministic, because a simple variable going out of scope can trigger deallocation of a random object tree (which can be large), which you cannot always predict locally (as a programmer) and which can affect performance on a critical path or introduce side effects at unpredictable times; plus there's the overall possibility of uncollected cycles. So in the end, for a programmer who's not willing to manually trace ownership themselves to understand the behavior in every detail, it has much the same disadvantages as a tracing GC (deallocation may kick in at "unpredictable" times), with additional disadvantages such as: fragmentation and poorer memory locality in the long term (no compaction), possible memory leaks (due to object islands), temporary young-generation objects being as expensive as old-generation objects (no bump-pointer allocation in a nursery), and excessive RC increments/decrements that can trash performance (and if they're atomic they can IIRC force cache invalidation, which also makes the behavior less deterministic).
So I don't think it's a silver bullet, it's just a different kind of GC with its own set of disadvantages. I think most of the time RC is chosen because it's simple to implement or because there's a legacy system where RC is the only choice, and all the other "benefits" are just afterthoughts to further justify the choice.
Determinism in ARC vs. GC discussions is probably more of a reproducibility thing. With ARC, a program will always free memory in the same spots in the same way when running on the same input. In most GCs, this is not the case. Their behavior depends very much on external factors like elapsed wall-clock time or system-wide memory pressure (e.g. if the collector runs when requesting more memory from the OS failed).
This kind of GC behavior can make bugs related to external resources much harder to find. And benchmarking an allocation-heavy program with a nondeterministic GC can be hellish due to the increased spread of measured runtimes.
>With ARC, a program will always free memory in the same spots in the same way when running on the same input
It's true for pure single-threaded functions, but I doubt it's 100% the case for asynchronous code which depends on side effects (user input, system events, async I/O): a slightly different timing and your reference count is different from the previous run, and as a consequence your object trees are deallocated differently, and in different configurations, each time as well. It's only a problem if objects are shared between threads, though.
Even with single-threaded code, a random change to your codebase which increments a reference here and there can invalidate all your prior assumptions and have a ripple effect on the entire system.
> Even with single-threaded code, a random change to your codebase which increments a reference here and there can invalidate all your prior assumptions and have a ripple effect on the entire system.
This is true. The standard library for the language we use at work (Delphi) had a bug in its thread pool class which was of this nature. The threads in the pool would wait for work to be available, pop a work item, do work and wait for more work again.
The issue was that the reference count of the previous work item would not decrement until the local variable holding the work item was overwritten by the new work item (increasing the reference of the new work item also). In particular, if there was no more work, this reference would be held until program termination.
This caught me, as a library user, by surprise as I expected the work item to go out of scope and be destroyed when it had been executed, since no more references should be held at that point.
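For illustration, here is the same shape of bug sketched in Swift rather than Delphi (WorkItem, Worker and the queue are hypothetical stand-ins, not the actual library code): the variable holding the current item keeps the last one retained until it happens to be overwritten.

```swift
final class WorkItem {
    let work: () -> Void
    init(_ work: @escaping () -> Void) { self.work = work }
    deinit { print("work item released") }
}

final class Worker {
    private var pending: [WorkItem] = []
    // Plays the role of the Delphi thread's local variable.
    private var current: WorkItem?

    func submit(_ item: WorkItem) { pending.append(item) }

    func runPending() {
        while !pending.isEmpty {
            current = pending.removeFirst()  // previous item released only here
            current?.work()
        }
        // Without the line below, the last item stays retained until the
        // next batch arrives -- or until the Worker itself goes away.
        // current = nil
    }
}

let worker = Worker()
worker.submit(WorkItem { print("task 1") })
worker.submit(WorkItem { print("task 2") })
worker.runPending()
print("queue drained, but the last work item is still alive here")
```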
That said, spotting these bugs is quite easy with ARC once you dig into the code, given that the points where the references can be increased or decreased are deterministic. So as long as you have access to the source code of dependencies it's fairly easy to find the reference counting points.
GC on the other hand is completely async and opaque for the most part.
Keeping dangling references for too long is a problem that is common to all automated memory management schemes. You would have had the same issue with the work item's lifetime under a GC, except that the destruction of the work item would have happened at an undetermined time after the work item reference was replaced.
ARC is deterministic: you get the same results on multiple runs (minus unsynchronized async fun, of course), you can profile it, etc. You don't have this luxury with GC.
Of course, what is deterministic is deterministic; if your allocations are driven by nondeterministic input, then you won't see deterministic allocations - I think this is obvious and doesn't have to be spelled out.
With gc deterministic runtime creates nondeterministic deallocations.
With arc deterministic runtime creates deterministic deallocations - reproducible behavior that can be profiled and allows you to work on optimizing it.
I'm not sure most people use this technique, but when I was using RC-based systems, I would often assert that an object's refcount was 1 just before I let go of that last reference. Suddenly, we have RAII-like predictable destruction.
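Swift can approximate that discipline with the standard library's isKnownUniquelyReferenced; the sketch below is my own, not something from the comment above: assert uniqueness just before dropping the last reference and you get RAII-like certainty about when deinit runs.

```swift
final class Buffer {
    var bytes: [UInt8] = []
    deinit { print("buffer freed, right here, right now") }
}

var buffer: Buffer? = Buffer()
buffer?.bytes = [1, 2, 3]

// Approximate "assert refcount == 1" just before dropping the reference,
// so we know deinit runs on the very next line.
assert(isKnownUniquelyReferenced(&buffer),
       "someone else is still holding this buffer")
buffer = nil   // deterministic, RAII-like destruction
```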
I even built it into Vale as one of its memory management options called "constraint references" [0] though later switched to generational references which gave us RAII without the halts. I sometimes wonder how far I could have taken that RC + assert model.
In my experience, the vast majority of objects even in an RC'd language do have strong refcount = 1. Perhaps we were in different domains though, I was mostly in game dev and app dev.
What it means to be a “systems” programming language is a matter of some debate, but there is at least one definition that’s only coherent in a resource management model isomorphic to RAII/ARC.
This sounds very nice, but just to make sure I understand your meaning: if you have a properly designed "systems" programming language (even if manually managed), you would release resources exactly (or nearly so) at the time a proper ARC/RAII implementation would. Is that correct? If so I fully agree.
It is more subtle than this. Systems programming sometimes involves a resource life cycle that has semantics that map neither to a GC nor RAII/ARC. This is because systems programming often involves the direct management of hardware resources, like storage.
For example, while resource acquisition and release may be deterministic, there is no implication that a released resource may be immediately reused or reacquired. This is not a pattern most software engineers are used to. The reason for this is that hardware may hold a lingering reference to a software resource even if the software does not -- and hardware doesn't use reference counters or destructors. An arbitrary amount of time may pass between a resource being released in software and that same resource being available for reuse again in software. Many people new to systems engineering make the mistake of reusing a resource that was released but not yet reusable.
Systems programming languages don't just ensure deterministic release of resources, they also need to provide a way to express that the hardware may hold references to objects the software has fully released such that it isn't really released. This is not something the average programmer has to deal with, but this is part of what makes systems engineering different.
My point was that if by “systems” programming, you mean prioritizing efficient use of limited system resources (CPU cycles, bytes of memory, … ) then there’s no superior alternative to the precise control & reliability afforded by ARC/RAII.
At this late date, a language inadequate to express ARC/RAII is better described as a toy. There is nothing wrong with toys, but they are not to be confused with professional tools. A systems language is, first and foremost, a professional tool.
One data point from the Prodfiler folks was that even if the GC cost to throughput is relatively low, some programs will still spend up to 25% of their time doing garbage collection. This will show up on your cloud bill even if it doesn’t show up to that extent in your performance analysis.
I expect you can find papers giving numbers in the same ballpark for malloc/free, too. An MVP of JavaScript, for example, that boxes everything, combined with a benchmark that creates and destroys lots of objects.
The main thing you need to know about ARC is that you hardly ever need it, so it rarely matters what its performance is. If you find yourself using shared_ptr much, stop.
Also, its susceptibility to cycles doesn't matter, because it is a terrible way to manage graph nodes: just keep them in a vector, with indices instead of pointers.
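A minimal Swift sketch of that index-based style (entirely my own example): the array owns every node, edges are plain integers, and the whole graph goes away in one shot when the array does, cycles and all.

```swift
struct Node {
    var value: String
    var edges: [Int] = []        // indices into Graph.nodes, not pointers
}

struct Graph {
    private(set) var nodes: [Node] = []

    mutating func add(_ value: String) -> Int {
        nodes.append(Node(value: value))
        return nodes.count - 1
    }

    mutating func connect(_ from: Int, _ to: Int) {
        nodes[from].edges.append(to)
    }
}

var g = Graph()
let a = g.add("a")
let b = g.add("b")
g.connect(a, b)
g.connect(b, a)   // a "cycle" only in the index sense: nothing to leak,
                  // no refcounts to bump, and the whole graph dies with `g`
```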
I’ve never done this personally, but I know in some of the languages I’ve used you can turn off garbage collection. If you really cared about the latency, couldn’t you just manually collect at certain points in the program?
Or you could use a language like D, which lets you write a program without the garbage collector when needed. You definitely have fewer tools when you do that, but if you are in a situation where you know you want to manage memory manually you will probably evaluate that trade-off.
What most people seem to forget about this is that if you actually need speed neither of these should be a factor. Heap allocations should be minimal and lifetimes should almost always be known ahead of time.
In C++ you basically never need reference counting, unless you are handing off heap allocations to multiple different threads. In that case you don't know what will finish first and they need to communicate when the allocation isn't being used anymore.
Perhaps the most interesting thing is that it's all moot for the vast majority of applications. If it's done reasonably well, it just won't matter, most of the time. And for the times it does matter, you'd probably be better off fiddling with things manually.
Delphi has legendary ARC not just with its VCL and object hierarchies with TOwner model but also with its wrappers of COM's IUnknown and QueryInterfaces. Yes of course, developers can still make mistakes about sequences and memory management but a well constructed class hierarchy and discipline to manage the freeing of resources immediately after they're no longer needed is the cornerstone of basic Delphi applications.
edit: The objects that are IUnknown linked don't even need to be explicitly freed! There are several clever use cases for this for example changing TCursor from hourglass back to normal when a procedure/function is complete. It just works.
So this is a lot of gibberish, but let me sum up RC vs. GC for you:
GC advantages:
* Scales better when done right - cheap allocations, simplified code, etc.
* Can be heavily concurrent and use multicore
RC advantages:
* Predictable performance
* Uses less RAM
For GUIs, RC's smaller memory footprint and predictable performance are great. Hence Swift uses ARC, which is also faster than most RC implementations (which typically have a lot of overhead).
For servers where overall performance matters more than 100% smooth consistency and RAM is cheaper... GCs have some advantages. A GC can run on idle and clean a huge amount of garbage without a problem. It can do so concurrently and leverage the multicore nature of modern CPUs.
This is one of those, right tool for the right case.
> For servers where overall performance matters more than 100% smooth consistency and RAM is cheaper... GCs have some advantages.
I am having a difficult time parsing this in a way that makes sense.
Generally speaking, overall performance on servers is a direct product of being able to take strict control of when things happen and minimizing RAM footprint. That "smooth consistency" is how performance happens because there are important classes of software optimization that don't work if this is not a property of your runtime. Similarly, high-performance server architectures are usually not heavily concurrent even on massively multicore systems for good reasons that a GC doesn't address.
The only advantage a GC has in this context that I can think of is that it might reduce some types of bugs in some types of code.
I come from the world of the JVM, so here we do have multicore servers, and you might see a peak where a particular request takes longer, but overall the performance will be amazing. This isn't OK for a GUI but is great if you have a HUGE server that can chug through tons of data with a huge amount of RAM/cache.
I used to write a lot of performance-optimized server code in Java. Even with heroic effort, it is difficult to make the JVM perform even half as fast as not-that-optimized C or C++ -- other languages I wrote the same kinds of code in. And highly optimized C++ is many times higher throughput than the JVM equivalent on the same hardware, pretty much always. The reasons why are well-understood technically.
This cannot be blamed entirely on the GC. The JVM makes many other choices which make optimization difficult, and its original theories about what is important for optimization haven't stood up over time. It was not designed to be performant, just good enough for most applications that are not that performance sensitive. Programming languages often have many objectives and priorities aside performance. I use Python quite a lot and it is usually much slower than the JVM.
When you say "used" how long ago are you talking about and for what types of loads?
I saw Java rewrites running circles around the C++ code they replaced. It's a matter of tuning and picking the right configurations/tasks. Some things Java does really well, others it sucks at. If you need near memory access and have constant loops over arrays of objects, then Java will be slower than C++ because of the memory layout (Valhalla will improve that).
Network throughput is another weak point which amazingly NodeJS does better. Loom resolves that problem.
There's also a lot of choices when it comes to GCs nowadays. Java used to be a dog on large memory heaps. This is no longer the case with Z, G1 etc. but it's very possible you hit a problematic spot.
One of the nice things about GC performance is that you can connect a monitoring tool to a production server and optimize memory performance by viewing actual usage. Monitoring and observability in Java is at another level entirely.
> If you need near memory access and have constant loops over arrays of objects, then Java will be slower than C++ because of the memory layout (Valhalla will improve that).
In a former life we used Scala magic to pretend we had an array of structs. Otherwise, yeah, this is a huge issue facing JVM apps.
Valhalla will solve this. I disagree that this is a huge issue. There are some edge cases where this is significant but for most web/enterprise applications I don't think this is huge.
But this problem has technically mostly been fixed by Azul and is making its way into the open-source world as well; we have ZGC and so on now. Pauses are rather short even with large heaps.
Computers are constrained in bus bandwidth as well, and it’s harder to add more. RC tends to require strongly consistent counter bumps on the bus even for objects whose exact lifetimes you might not care about so much. The costs of GC are amortized over more work, though the pauses can hurt.
I’m personally exhausted by these conversations and you should be too.
ARC was/is a reasonable compromise: no other choice was obviously better. Legacy code is of supreme importance.
But we have a new benchmark for best-in-class now: for all their insufferable Jehova’s witness Jihad approach, and yeah it’s annoying, the Rust people proved linear/affine in production.
I find Carl and Yehuda’s flagrant profiteering in the early days as unpleasant as the next thinking person, but affine/linear is game changing, and it’s time to stop apologizing for why your pet thing can’t do it.
"I'm now a coding monkey but I'm close to becoming a manager".
I really wish managers and aspiring managers would be snapped out of existence. They should be sent to the mines. If a developer does a job interview and they say they actually WANT to become a manager some day, that's the biggest red flag currently known to me. I'd rather hire a drug addict than an aspiring manager.
1. First-class regions allow us to skip a surprising amount of RC overhead. [0]
2. There are entire new fields of static analysis coming out, such as in Lobster which does something like an automatic borrow checker. [1]
3. For immutable data, a lot of options open up, such as the static lazy cloning done by HVM [2] and the in-place updates in Perceus/Koka [3]
Buckle up yall, it's gonna be a wild ride!
[0] https://verdagon.dev/blog/zero-cost-refs-regions
[1] https://aardappel.github.io/lobster/memory_management.html
[2] https://github.com/Kindelia/HVM
[3] https://www.microsoft.com/en-us/research/publication/perceus...