The in-place update work in Koka [1] is super impressive. One of my co-workers, Daan Leijen, leads the Koka project, and hearing his talks about it has been such a privilege. The work around Koka is really convincing me that functional languages will eventually lead the pack in the effort-to-performance trade-off.
Something that came out of the Koka project that everyone should know about is mimalloc [2]: if your built-in malloc is not doing it for you, this is the alternative allocator you should try first. Mimalloc is tiny and it has consistently great performance in concurrent environments.
Ooooh… Koka is a really neat language. I first encountered it in a seminar when we talked about algebraic effect handlers. Koka has these row types which effectively allow you to statically ensure that all your effects get handled at some point, modulo certain ordering constraints on when those effects get handled.
Essentially, you get nice purity around effects but they're way easier to compose (and imo grok) than monads.
If anyone is interested in learning more, definitely take a look at this [1] paper by the aforementioned Leijen. (OP, very cool that you get to work with this guy.) One of the best-written papers I've seen.
I'm really excited for Koka. I still need to dig in and actually try writing some code, but it looks so promising. At a minimum I hope that it inspires other languages to look at these features.
There's definitely bits I don't love, but they're nothing I couldn't get over.
I actually attempted some of 2021's Advent of Code in Koka.
It was fun for the first couple of questions, but then got really painful for a couple of reasons. None of this is a complaint about the project, as it is still a research language and has not had a lot of effort invested into getting others to use it.
What I really liked was how easy it was to write parsers due to the effect system. That is, you'd write a parser that looked like a direct imperative translation of the BNF grammar or the problem statement, and then the effect system would propagate the actual input through the call graph, and each parser would only have to deal with its small part.
However, several things frustrated me enough to stop, which wouldn't have happened with better documentation and more source code to look at.
As someone with no theory background in effect systems, I found the syntax very oblique, e.g. when do I need to mark that my function needs a `| e` effect versus a `, e` effect? I'm not a fan of the Haskell style of introducing single-letter free variables without any syntactic convention. For example, in Rust, the tick clearly marks that something is a lifetime annotation, and the rules about how lifetimes propagate, and therefore when you need to introduce two or more lifetimes, are explained clearly (if not in the Rust docs, then somewhere else on the internet). That wasn't the case for Koka's effect syntax.
In addition, I would sometimes stumble onto effect annotations that should've worked according to my limited understanding, but didn't. The compiler errors in such cases were really meaningless to me. Other times, the errors almost seemed to indicate that what I was doing was semantically correct but not yet implemented in the compiler. Also, apparently `_` has special meaning at the beginning of an effect annotation, but I only found that out through trial and error. Eventually I just gave up on annotating any functions until the compiler complained, but this made it difficult for me to understand my own code (it was like writing Python without type annotations).
The API docs are mostly just type signatures, which means that unless it is some simple function, or it is used in the Koka implementation sources, you are left scratching your head about how to use the standard library. Things like `ref()` types could also use much more elaboration.
Also, at the end of the day, Advent of Code solutions are generally pretty small scripts, where there isn't a need for effect systems. It was often easier to just use Python and its solid standard library.
Overall, it is an interesting language and I'll definitely keep an eye on it.
This. Said much better than my other posts, but this is exactly what I was trying to say. Naive ARC vs. GC: a good GC will kill it, so don't make the ARC naive. Cheat and find ways to skip most refcount bumps. It became amazingly clear when I started coding Rust: you only have to bump the ref count when you get a new owner. Most RC objects die with just a few counter bumps. Most objects don't need RC at all (most objects can be single owner). Now imagine a lang (Lobster was given as an example above) that does this automatically. Sweet spot.
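For example, in Rust that looks roughly like this (a minimal sketch, nothing fancy): borrowing never touches the count, and the bump happens only at the one place a new owner is created.
    use std::rc::Rc;

    // Borrowing never touches the count; only creating a new owner does.
    fn read_len(data: &[u8]) -> usize {
        data.len()
    }

    fn main() {
        let local = vec![1u8, 2, 3];       // plain single owner: no RC at all
        println!("{}", read_len(&local));

        let shared = Rc::new(local);       // opt into RC only where sharing is needed
        println!("{}", read_len(&shared)); // borrow via deref: still no count bump
        let second_owner = Rc::clone(&shared); // the one place the count is bumped
        println!("{}", Rc::strong_count(&second_owner)); // prints 2
    }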
However non-naive your automatic counting scheme can be, it still has to touch the object at least twice. However non-naive your automatic counting scheme can be, it still has twice as much overhead per object compared to GC: structure and count vs. just structure. Sometimes you can factor the structure out, by saying "the allocator page contains a pointer to the structure at the very start", but this can be done for GC too; I guess in that case GC wins even more. And you need the structure (virtual method table, actual object layout, etc.) because releasing an object should decrement the counts of the objects it references.
Finally, reference counting can and will fragment the heap.
To be clear, I don't hate tracing GC, and in fact find all forms of GC fascinating. It is definitely "trade offs". I just happen to program in a space where I like the trade offs of ARC better than those of a tracing GC. This could change if I start doing different forms of coding, and in higher level languages I often do like a tracing GC, but Rust (and these new upcoming languages) are making me rethink this a bit atm.
> However non-naive your automatic counting scheme can be, it still has to touch the object at least twice.
Only if the object is actually ARC'd. The whole goal is to omit that entirely. Most objects live and die on the stack via proving single ownership. To be fair, I'm not sure why the same optimization couldn't be done for a tracing GC (many will say GC allocation and frees are already mostly 'free' because you compact live objects, not dead ones, but by putting everything on the GC path, you force the need for a collection sooner than if it wasn't heap allocated at all).
> Finally, reference counting can and will fragment the heap.
This is true, but in return we get deterministic destruction and easy FFI. If we can avoid a solid number of our allocations entirely, this becomes less of an issue. When building things like trees which hit the allocator hard, we use arenas, which are more like a traditional "tracing GC" minus the tracing (bump allocator + deterministic free when done, all at once).
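A toy sketch of the arena pattern in Rust (hand-rolled here just to show the shape; real code would reach for a crate like typed_arena or bumpalo):
    // A toy arena: nodes are allocated by pushing into a Vec and referenced by
    // index, and the whole tree is freed in one shot when the arena is dropped.
    struct Arena<T> {
        items: Vec<T>,
    }

    impl<T> Arena<T> {
        fn new() -> Self { Arena { items: Vec::new() } }
        fn alloc(&mut self, value: T) -> usize {
            self.items.push(value);
            self.items.len() - 1 // handle into the arena
        }
        fn get(&self, id: usize) -> &T { &self.items[id] }
    }

    struct Node {
        value: i32,
        children: Vec<usize>, // indices into the same arena; cycles are fine
    }

    fn main() {
        let mut arena = Arena::new();
        let leaf = arena.alloc(Node { value: 2, children: vec![] });
        let root = arena.alloc(Node { value: 1, children: vec![leaf] });
        println!("{}", arena.get(arena.get(root).children[0]).value);
    } // everything is freed here, deterministically and all at once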
> I just happen to program in a space where I like the trade offs of ARC better than those of a tracing GC.
Even ARC has to be augmented with some tracing in order to deterministically collect cycles. Ideally, of course, we would want to describe things like "non-cyclic ownership graph" as part of the type system, so that the compiler itself can decide where the addition of tracing infrastructure is necessary. (Delayed deallocation, as often seen in tracing, concurrent etc. GC, might be expressed as moving the ownership of finalized objects to some sort of cleanup task. But if you code carefully, the need for this ought to be very rare.)
RC might not fragment the heap for long! Someone built a defragmenting malloc called Mesh [0] which uses virtual page remapping to effectively merge two non-overlapping pages. Pretty brilliant stuff!
Heap fragmentation is a real issue, but "GC is fast if you throw physical memory at it" is actually a very bad argument for it on personal computers, especially if they have to fit on your wrist, and especially given that since 1986 we've invented technologies, like swap, that make scanning memory more expensive.
ObjC/Swift have some tricks to avoid allocating refcounts, mainly relying on storing them in unused pointer bits.
Software based on this intra-program message bus usually follows "throw memory at it". That software is written in Java (hence GC) and processes millions of transactions per second. The memory amount is so high that GC is not triggered at all during operating hours. Nevertheless, this is not achievable with refcounting because, well, refcounting has to keep counts.
Refcounts, having non-local access patterns, do not play well with caches, let alone something that has much higher latency, like swap.
I guess these tricks of refcount allocation avoidance will work fine unless there's a scale-invariant graph with several tens of thousands of nodes. In scale-invariant graphs there is some number of nodes that have links from almost all other nodes.
Actually, I've had exactly that experience: a typical weighted finite-state transducer HCLG graph for automatic speech recognition is scale-invariant. Analysis over it (in C++ using OpenFST) was usually done reasonably quickly, but there was a substantial delay between the final printf and actual program exit, when all these pesky refcounted, smart-pointered structures finally got freed.
I write all that so other readers can have a more unbiased view of the problem.
As for me, I greatly prefer arenas - much like the "automatic regions with reference counting" mentioned above. Arenas allow for cyclic structures, are easy to implement even in C, and still allow for garbage collection algorithms.
> I guess these tricks of refcount allocation avoidance will work fine unless there's a scale-invariant graph with several tens of thousands of nodes. In scale-invariant graphs there is some number of nodes that have links from almost all other nodes.
This wouldn't come up in ARC, though mostly because it doesn't detect cycles so you wouldn't use strong references between graph nodes. (Overly large refcounts can still happen other ways but usually get optimized out.)
As a quick solution, I think you'd keep a list of valid nodes and have it be the only strong references to them. This is like an arena, except it doesn't have the possibility of references escaping the arena and being deallocated.
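A rough Rust sketch of that shape (illustrative names only): the node list holds the only strong references, and the edges are weak, so no cycle of strong counts can form.
    use std::cell::RefCell;
    use std::rc::{Rc, Weak};

    struct Node {
        id: u32,
        edges: RefCell<Vec<Weak<Node>>>, // weak edges: no strong-count cycles
    }

    struct Graph {
        nodes: Vec<Rc<Node>>, // the only strong references to the nodes
    }

    fn main() {
        let a = Rc::new(Node { id: 0, edges: RefCell::new(vec![]) });
        let b = Rc::new(Node { id: 1, edges: RefCell::new(vec![]) });
        // Densely linked pair: both directions are weak, so dropping the
        // graph's node list is enough to free everything.
        a.edges.borrow_mut().push(Rc::downgrade(&b));
        b.edges.borrow_mut().push(Rc::downgrade(&a));
        let g = Graph { nodes: vec![a, b] };
        println!("node {} has {} edges", g.nodes[0].id, g.nodes[0].edges.borrow().len());
    } // `g` drops its node list, and with it the whole graph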
Reference counts don't need to be a full size_t. Particularly when you're using a mark-sweep or mark-sweep-compact collector to collect reference cycles, you can use a saturating counter, and just give up on reference counting that object if the reference count ever hits its maximum value.
There are tons of applications where lots of objects only ever have one reference, so in addition to the single mark bit in your object header, you can have a single-bit reference count that keeps track of whether there are "one" or "more than one" references to the object. If a reference passes out of scope and it refers to an object with the reference count bit unset, then it can be immediately collected. Otherwise, you need to leave the object for the tracing collector to handle. This is useful in cases where you have a function that allocates a buffer, passes it to a function that populates the buffer (without storing a reference to the buffer into any objects), passes that buffer to a function that reads out the data (again, without storing the reference into any objects), and then returns.
In the common case of objects where a C++ unique_ptr would do, this single bit reference count never gets modified. There is more overhead in checking the bit when the reference goes out of scope, but there is never a decrement of the reference count, and the common case doesn't have an increment.
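A hedged sketch of that header scheme (my own toy encoding, not taken from any particular runtime): one "shared" bit next to the mark bit, with a fall-through to the tracing collector once it saturates.
    // Toy object header: bit 0 is reserved for the GC mark bit,
    // bit 1 means "more than one reference has existed".
    #[derive(Default)]
    struct Header {
        bits: u8,
    }

    const SHARED: u8 = 0b10;

    impl Header {
        // Called when a second reference is created: the one-bit count saturates.
        fn on_new_reference(&mut self) {
            self.bits |= SHARED;
        }

        // Called when a reference goes out of scope. Returns true if the object
        // can be reclaimed immediately; otherwise it is left for the tracing
        // collector to handle.
        fn on_reference_dropped(&self) -> bool {
            self.bits & SHARED == 0
        }
    }

    fn main() {
        // Single-owner case (e.g. a scratch buffer handed down a call chain and
        // back): the bit is never set, so the last drop reclaims it immediately.
        let single = Header::default();
        assert!(single.on_reference_dropped());

        // Shared case: the bit saturates and the tracing collector takes over.
        let mut shared = Header::default();
        shared.on_new_reference();
        assert!(!shared.on_reference_dropped());
    }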
> Finally, reference counting can and will fragment heap.
Is there anything that won’t fragment heap and simultaneously doesn’t insist on owning the world? It is really, really painful to allow references to cross the GC heap boundary in both directions (witness the GNOME Shell not-a-memory-leak), and to me that is the principal reason: in a low-level language, you can RC only a bit of stuff, but you cannot really moving-GC only a bit of stuff.
(I don’t expect an affirmative answer to this question as stated, but maybe some more basic abstraction for manual memory management would allow this?..)
- Make threads a "special" tracked API call (for example, like Rust) so that you know when something will or won't be passed between threads. If it never "escapes" the thread, you can use non-atomic ref counts for the lifetime of that object (NOTE: I should add this is something I've been thinking on - I don't know how hard/feasible it is to do in practice)
- Use biased RC - I don't have a link to the paper handy, but the idea is that most multithreaded objects spend most of their life on one thread, and while they are on that thread they can use non-atomic ref counts via a separate counter. When on any other thread they use an atomic counter. When both counts hit zero, the object is destroyed. (Rough sketch below.)
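A rough Rust sketch of the biased-RC idea (heavily simplified: shown with &mut self for brevity, whereas a real implementation gives the owner thread unsynchronized access and also handles ownership handoff and deferred decrements):
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::thread::{self, ThreadId};

    // Two counters: a plain one that only the owning thread touches, and an
    // atomic one for every other thread.
    struct BiasedCount {
        owner: ThreadId,
        local: usize,        // non-atomic: only the owner thread touches this
        shared: AtomicUsize, // atomic: everyone else
    }

    impl BiasedCount {
        fn new() -> Self {
            BiasedCount { owner: thread::current().id(), local: 1, shared: AtomicUsize::new(0) }
        }

        fn increment(&mut self) {
            if thread::current().id() == self.owner {
                self.local += 1;                             // cheap, uncontended path
            } else {
                self.shared.fetch_add(1, Ordering::Relaxed); // slower atomic path
            }
        }

        // Returns true when the object can be freed: both counters are zero.
        fn decrement(&mut self) -> bool {
            if thread::current().id() == self.owner {
                self.local -= 1;
            } else {
                self.shared.fetch_sub(1, Ordering::Release);
            }
            self.local == 0 && self.shared.load(Ordering::Acquire) == 0
        }
    }

    fn main() {
        let mut c = BiasedCount::new();
        c.increment();          // owner thread: no atomics involved
        assert!(!c.decrement());
        assert!(c.decrement()); // last owner reference gone, nothing shared
    }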
Definitely! Vale, Lobster, and Cone all require explicit annotations for anything that could cross thread boundaries.
Vale's design bends the rules a little bit, it blends structured concurrency with region borrow checking to allow things to temporarily escape the current thread. [0]
I believe Lobster spawns a new VM per thread, and has some special abilities to share data between them, but I'm vague on the details.
Cone requires an explicit permission to allow crossing thread boundaries, which I think will work pretty well. [1]
I've read about biased RC before, but have been skeptical because of the branching involved. How has it worked out in practice?
I've just read about it, never actually used it verbatim tbh, but just found it kinda fascinating. I wrote a somewhat modified version (I don't track local thread ID, I use Rust's type system and Send/Sync properties instead) of the algorithm for a very different purpose: I use it to allow safe moves back and forth between (non-atomic) Rc and (atomic) Arc objects. For this purpose, it is very useful. The actual Rc/Arc operations are no more expensive than regular Rust Rc/Arc (except for the last decrement on Rc - slightly more expensive)
I got the idea from this person (I didn't use their algorithm though - they probably do it similar to how I do it though): https://crates.io/crates/hybrid-rc
Nim's ARC memory management uses the first strategy. Non-atomic refs combined with moves really speed things up! Moving data across threads requires the data be of a special type `Isolate`. It can be a bit tricky to prove to the compiler that the data is isolated, e.g. single ownership.
If you work on software that has hard real-time constraints, e.g. live audio processing, do any of the improvements to GC matter?
GC reminds me of satellite internet: no matter how fast it gets in terms of throughput, the latency will always be unacceptable — but because the latency issue can't be solved, advocates will keep trying to tell you that latency doesn't matter as much as you think it does and your problem isn't really a problem. (All sorts of bad experiences in modern software come down to someone trading away latency to get throughput.)
Rather than the framing of "GC vs. ARC", I'm mostly interested in improvements to the ownership model.
Ironically, even though I believe RC will make great strides, I actually agree with you. Single ownership has the most potential, especially if we decouple it from the borrow checker as we are in Vale. [0] Being able to control when something is deallocated is quite powerful, especially for real-time systems.
> GC reminds me of satellite internet: no matter how fast it gets in terms of throughput, the latency will always be unacceptable — but because the latency issue can't be solved, advocates will keep trying to tell you that latency doesn't matter as much as you think it does and your problem isn't really a problem.
LOL, I like this statement and will have to borrow it for future discussions/debates.
Note that Rust will most likely be getting some support for these features too, at some point. It's what much of the academic PL research on borrow-checked languages is pointing towards. You can already see some of the practical effect with "GhostCell" and similar proposals, even though these are arguably much too clunky and unintuitive for practical use (notably, they use obscure features of the Rust lifetime system to implement their equivalent of regions). Language-level support is likely to be rather easier to work with.
Fascinating. I'll be honest, I'm moderately skeptical! Various flavors of GC have caused me great pain and suffering. The promise has long been "GC, but less bad" and so far that has not come to fruition.
However, I'm ecstatic at how many languages are trying new and interesting things. The ecosystem of novel languages feels richer and more vibrant now than at any other point in recent memory. Probably due to LLVM, I think?
You should give a talk at Handmade Seattle in the fall! Would be a great audience for it I think.
Ehh. Who cares, though? And I mean that in the least dismissive way.
Why do you care? Sell me on it. GCs have gotten to the point where the slowdown is almost the butt of jokes. LuaJIT proves that it can get damn close to C speeds.
What you’ve said sounds like a whole lot of complexity for … what?
Give me the automatic threading speed improvements with no extra programmer effort and I’ll change my stance 270 degrees.
To get the automatic threading speed improvements, a (region-aware) language might add a `parallel` keyword in front of loops, to temporarily freeze pre-existing memory, thus completely eliminating the overhead to access it. Vale is building this for its generational references [0] model but I hope that more RC-based languages attempt something like it. Can't get much easier than that ;)
Perceus and HVM require immutability, which does require programmer effort, I admit. Pure functional programming just isn't as convenient as imperative programming for most. However, languages are learning to blend mutable and immutable styles, such as Pony [1], Vale [2], and Cone [3]. When a language allows you to create an object imperatively and then freeze it, it's wonderfully easy.
I think the next decade in languages will be about this exact topic: making it easy to have fast and safe code. I think RC will soon bring some very compelling factors to that equation.
I think the real point of their question was along the lines of: GC works great and is pretty fast these days, so why research cleverer reference counting?
Will these developments lead to appreciable performance improvements for 'typical' non-embedded, non-real-time programs?
I think the consensus is that GC works great _if_ you throw enough memory at it (about three times what you really need)
On mobile, the problem with that isn’t as much that memory costs money, but that it costs Joules. Peak power usage of RAM may not be large compared to that of your screen, CPU, GPU, GPS, etc. but AFAIK, with current tech, memory keeps eating energy even if your device is sleeping.
Being able to run with less memory, I think, is why iPhones can do with relatively small batteries and yet have decent stand by times.
The fastest system currently out there is iOS, which (as this article says) doesn't use GC and does use clever reference counting, so yes?
Note, most people here arguing that GC is actually always faster are forgetting that reading memory itself has a cost, especially in systems with swap or other reasons it might have a /really/ large cost.
If you won't believe me, you could believe Chris in the article.
> Chris Lattner: Here’s the way I look at it, and as you said, the ship has somewhat sailed: I am totally convinced that ARC is the right way to go upfront.
For me the biggest benefit is deterministic destruction, specifically for the non-memory resources it is holding (files, locks, etc.) the same as if it was stack allocated. It is very freeing knowing that is just being taken care of. No 'defer', 'with' or other special handling needed.
But possibly the bigger benefit: no heap allocation for many objects at all. If you can prove single ownership, you can stack allocate it (or "inline it" if in a struct/object).
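For example, in Rust the single-owner case looks like this: no refcount, no GC, and the file handle is released at a known point (the path here is just illustrative).
    use std::fs::File;
    use std::io::Write;

    fn write_report(path: &str) -> std::io::Result<()> {
        let mut file = File::create(path)?; // single owner, lives on the stack
        file.write_all(b"done\n")?;
        Ok(())
    } // `file` is dropped here: the descriptor is closed deterministically,
      // with no defer/with block and no wait for a collector

    fn main() -> std::io::Result<()> {
        write_report("/tmp/report.txt")
    }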
> No 'defer', 'with' or other special handling needed.
Just as an aside - in Python, it surprised me that 'with' is _not_ a destructor. The object created will remain in-scope after the 'with' body, but the '__exit__' method will be called. Which for the canonical example of file-handling, only really closes the file.
The surprise is less about resource usage and more about the scope lifetime. 'with' introduces a new binding that outlives its body. It looks, at first glance, to be a temporary variable. But it isn't.
I don't know the _motivation_ for why it leaks, but it leaks because the method calls of a context manager are __enter__ and __exit__, but not __del__. [0]
The motivation is especially opaque, because though CPython does run __exit__ immediately, it isn't guaranteed to do so by the standard. So __exit__ will happen at some point, like __del__ with the GC, but it won't actually call __del__, which makes GC'ing the value more complicated as __del__ can't safely run until __exit__ does.
I _think_ the motivation is something to do with __exit__ being allowed to throw an exception that can be caught, whereas __del__ isn't allowed to do that, but I don't find it that convincing. And I've never seen an example where the introduced variable is ever referenced afterwards in Python's documentation (including in the PEP itself).
Thinking about it, my guess is that properly scoping the variable does not really prevent the user from leaking the value anyway:
    with context() as foo:
        leak = foo
This is unlike list comprehensions, where it is syntactically impossible to bind the comprehension variable to a new name accessible outside the comprehension.
So maybe the reason is that it is not really worth the trouble.
    with open(path) as file:
        result = func(file.read())
    another_func(result)
If `with` used a separate scope then it would be a lot less ergonomic. How might we bring data outside that scope and be confident that `with` isn't forcibly cleaning things up? Would be a lot more verbose.
Python doesn't have block scope, but most people don't seem to notice:
    def foo():
        if False:
            x = 123
        print("x is visible out here:", x)
That will generate an error because `x` is "referenced before assignment", but the variable exists for the entire function. This can bite you in subtle ways:
    x = "is global out here"
    def foo():
        print("all good:", x)
    def bar():
        print("not good:", x)
        if False:
            x = "now x is local, even before this declaration"
I would argue that resources always need explicit handling, no matter whether you do reference counting or GC. So whether I write a one-liner to dispose of an object or call delete on it, it's the same beauty. Waiting until the variable goes out of scope together with many others, or waiting for the GC cleanup, is just not the right way to clean up external resources properly.
Java's "try with resources" is effectively the same as Go's 'defer' and Python's 'with', so the advantage is that you can't forget to do it. No matter what you do, when the object goes out of scope, the resource will be closed.
Everywhere that a clippy lint is there to catch bugs, I wish it wasn't a clippy lint because it won't always get run and those bugs will get missed.
An example of a clippy lint that I like:
If you try to transmute() a &[u8] into a &str, clippy will point out that you should probably use from_utf8_unchecked() -- this lint results in code that does the same thing but is clearer for maintainers about what we intended. If our reaction to the lint is "But it isn't UTF-8" then we just got an important red flag, but that's not why the lint is there.
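For example (assuming the bytes have already been validated as UTF-8 somewhere else, which is the whole contract here):
    fn label(bytes: &[u8]) -> &str {
        // What clippy steers you away from: a transmute that hides the intent.
        // unsafe { std::mem::transmute::<&[u8], &str>(bytes) }

        // What it suggests instead: same behavior, but the name documents the
        // contract we are claiming to uphold (the bytes really are UTF-8).
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }

    fn main() {
        println!("{}", label(b"ascii is valid UTF-8"));
    }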
On the other hand, clippy lints for truncating casts, and I want those casts gone from the language, so the warnings ought to at least live in the compiler, not clippy. I would like to see a future Edition outlaw these casts entirely, so that you need to write a truncating cast explicitly if that's what you needed.
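Roughly the difference between the silent truncation clippy flags and the explicit alternatives:
    fn main() {
        let big: u32 = 300;

        // Silently truncating cast that the clippy lint flags: yields 44.
        let lossy = big as u8;

        // Explicit alternatives: fail loudly, or truncate on purpose.
        let checked = u8::try_from(big); // Err(...) because 300 doesn't fit
        let masked = (big & 0xFF) as u8; // truncation spelled out

        println!("{lossy} {checked:?} {masked}");
    }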
Is Java escape analysis done at compile time or run time? I don't know.
FYI: As I understand, escape analysis is frequently used to either (a) avoid allocating on the heap, or (b) auto-magically free-ing a malloc'd object from a tight scope.
Personally I wish they would move some escape analysis to compile time just to help with memory consumption. I’m assuming it would require byte code changes though.
>the non-memory resources it is holding (files, locks, etc.)
Locks are entirely a memory construct. You can trivially write one, e.g. Java locks are written entirely in java. The files descriptors (incl. socket) - they should not be relied to be GC'd by default, they do require their own lifecycle. For instance: c# Disposable/ java's try with resource.
>no heap allocation for many objects at all.
This has already been the case for many a year - reference escape analysis, ref-escape elision, trace compilation.
Locks are, in fact, one of many resources your operating system can give you.
Historically, in concurrent software your only option was to either use these OS-allocated locks or build a spin lock from atomic operations (and thus burn CPU every time you're waiting for a lock). The OS-allocated locks, such as those from sem_open() or create_sem(), must later be properly released or you're leaking a finite OS resource that will only be reclaimed on process exit or, in some cases, only at reboot.
Today your operating system probably offers a lightweight "futex"-style locking primitive, in which, so long as it's not locked, no OS resources are reserved; it's just memory. But you're not guaranteed that all your locks are these lightweight structures, and even if you were, you need to clean up properly, because when they are locked the OS is using resources, so you need to unlock properly or you'll blow up.
The only parts needed to build a lock are the ability to 'sleep' the current thread and later wake it up, plus compare-and-swap or load-linked/store-conditional. There is nothing really special about locks as long as the current thread can be made dormant. Technically a single OS lock (e.g. a futex) associated with the current thread would suffice to implement as many locks as one desires. That per-thread lock would be released in the cleanup code when the thread dies.
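In Rust terms, a minimal (deliberately simplified, not production-quality) sketch of that idea: all the locking logic lives in user space, and the OS is only involved through compare-and-swap and the per-thread park/unpark primitive. (The waiter list here uses a std Mutex purely for bookkeeping; a real implementation, like j.u.c's queue, would use a lock-free list.)
    use std::collections::VecDeque;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Mutex;
    use std::thread::{self, Thread};

    pub struct ParkLock {
        locked: AtomicBool,
        waiters: Mutex<VecDeque<Thread>>, // bookkeeping only
    }

    impl ParkLock {
        pub fn new() -> Self {
            ParkLock { locked: AtomicBool::new(false), waiters: Mutex::new(VecDeque::new()) }
        }

        pub fn lock(&self) {
            // Fast, uncontended path: one compare-and-swap, no kernel involvement.
            while self
                .locked
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                // Contended path: enqueue ourselves and sleep until unparked.
                self.waiters.lock().unwrap().push_back(thread::current());
                if self.locked.load(Ordering::Acquire) {
                    thread::park(); // spurious wakeups are fine: we just retry the CAS
                }
            }
        }

        pub fn unlock(&self) {
            self.locked.store(false, Ordering::Release);
            if let Some(t) = self.waiters.lock().unwrap().pop_front() {
                t.unpark();
            }
        }
    }

    fn main() {
        let lock = ParkLock::new();
        lock.lock();
        lock.unlock();
    }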
The locks that result in a thread waiting are still in-memory, of course. They are just OS-provided primitives. Yet again, a well-behaved/designed lock should be uncontended for most calls (easily 99.9%+). In that case there is no mode switch to get to the kernel. A contended lock that goes to the kernel is a context switch.
In the end there is no need to do anything special about locks, as they can be implemented in user space, unlike file descriptors.
> In the end there is no need to do anything special about locks, as they can be implemented in user space, unlike file descriptors.
If you get semaphores from create_sem() you will need to give them back, you can't just say "This shouldn't be an OS resource" and throw it away, exactly like a file descriptor it's your responsibility to dispose of it properly.
Speaking of which, depending on the OS, your file descriptors might also just be implemented in user space. They're just integers, it's no big deal, of course you must still properly open() and close() files, the fact it's "implemented in user space" does not magically mean it isn't your problem to clean up properly.
Likewise, if you opened a sqlite database, you must close it properly, you can't just say "Eh, it's only userspace, it will be fine" and throw away the memory without closing properly.
>If you get semaphores from create_sem() you will need to give them back...
I don't quite get the continued argument - locks created in user space have 'zero' extra dependency on the kernel. In a GC setup they would be an object no different from a mere string (likely less memory as well). They can be GC'd at any time as long as there are no references to them.
If you wish to see an example: java.util.concurrent package. ReentrantLock[0] is a classic lock, and there is even a Semaphore in the same package (technically not in the very same). Originated in 2004 with JMM and the availability of CAS. You can create as many as you choose - no kernel mode switch would occur.
For the record, I do not object that proper resource management is needed - incl. closing the SQL database; the latter does use a file descriptor any time you establish a connection. However, locks need no special support from the kernel. Also, in a GC'd setup file descriptors would be GC'd pre- (or post-) mortem via some finalization mechanism; however, relying on the GC to perform resource management is a terrible practice.
I keep telling you that, in fact, these primitives can literally be OS resources such as the semaphores from the BeOS system call create_sem() or the POSIX sem_open or Windows CreateSemaphore.
Your insistence that these don't have a dependency on the kernel, that they're just memory to be garbage collected is wrong, they aren't, this won't work.
You propose the Java ReentrantLock class as an example, but, that's just a thin wrapper for the underlying Java runtime parking mechanism, so, sure it's just a Java object no big deal, but the actual implementation (which is not in the file you just linked) requires an OS primitive that you've not examined at all, that OS primitive is being managed by the runtime (via the "Unsafe" interface) and is excluded from your garbage collection entirely.
In essence you get to "just" garbage collect these locks because they're only a superficial Java language shadow of the real mechanism, which you aren't allowed to touch and can't garbage collect.
>requires an OS primitive that you've not examined at all, that OS primitive is being managed by the runtime (via the "Unsafe" interface) and is excluded from your garbage collection entirely.
This is most definitely false. The methods needed are compare-and-set plus LockSupport.park (parkNanos) and LockSupport.unpark. The former is a standard CPU instruction w/o any special magic about it, and the latter is setting the current thread to sleep and being awaken.
As for examine: I know how j.u.c works down to the OS calls and generated assembly. Unsafe is a class (not an interface), albeit mostly resulting in compiler magic.
The use of unsafe is solely for: LockSupport::park/unpark and AbstractQueuedSynchronizer::compareAndSet. ReentrantLock is NOT any thing wrapper.
> As for examine: I know how j.u.c works down to the OS calls and generated assembly.
And here's the big reveal. It's actually calling the OS for "setting the current thread to sleep and being awaken", i.e. it's just using the OS locks, and those are hidden from you so that you can't garbage collect them.
> Unsafe is a class (not an interface), albeit mostly resulting in compiler magic.
Unfortunate choice of language, I didn't mean a Java interface, but an interface in the broader sense. I assure you that "setting the current thread to sleep" is not "compiler magic", it's just using the OS locks, on a modern system it will presumably use a lightweight lock but on older systems it'll be something more heavyweight like the semaphores we saw earlier.
The claim that LockSupport, which is calling the OS is not a wrapper is bogus.
But all this is a distraction, even on systems where old semaphore APIs are now rarely needed they still exist and if you hold one you are still responsible for cleaning it up properly even if you think you know better.
There is a single OS call to put the current thread to sleep; I mentioned it much earlier - a single pthread_mutex_t per Thread, NOT per lock.
Those are not part of what the developer manages, aside from starting the thread (and locking on it). The mutex is released with the thread itself; technically the HotSpot Java implementation never releases them back to the OS but keeps a free list. The thread is a much larger OS resource as well. No new OS resources are exhausted b/c of the extra locking.
>The claim that LockSupport, which is calling the OS is not a wrapper is bogus.
Of course it does call the OS; that happens on the dormant part (w/o an OS call it'd be a busy spin), or if a thread needs to be awakened. I said not a 'thin' wrapper. All the locking and queueing mechanism is outside the OS. The major distinction is that there is only a single OS primitive per Thread, NOT per lock. (Edit: now I see I have spelled thin as 'thing' which is a definite mistake on my part)
/*
* This is the platform-specific implementation underpinning
* the ParkEvent class, which itself underpins Java-level monitor
* operations. See park.hpp for details.
* These event objects are type-stable and immortal - we never delete them.
* Events are associated with a thread for the lifetime of the thread.
*/
// ParkEvents are type-stable and immortal.
//
// Lifecycle: Once a ParkEvent is associated with a thread that ParkEvent remains
// associated with the thread for the thread's entire lifetime - the relationship is
// stable. A thread will be associated at most one ParkEvent. When the thread
// expires, the ParkEvent moves to the EventFreeList.
Edit:
My point has always been that locks/monitors/semaphores/etc. don't need any special care in a GC setup, as they can be implemented purely in the language itself, without special care to release them or manage them the way file descriptors need. The need to block (become dormant) obviously requires OS calls and support, but that part doesn't have to scale with the number of locks/semaphores/latches/etc. created.
> now I see I have spelled thin as 'thing' which is a definite mistake on my part
That does make a lot more sense. I agree that what's going on here is not merely a thin wrapper.
> My point has always been that locks/monitors/semaphores/etc. don't need any special care in a GC setup, as they can be implemented purely in the language itself
And my point was always that you can only depend upon this if you in fact are always using the "implemented purely in the language itself" mechanisms, while plenty of others exist and you might need them - but you can't just treat them as garbage safely.
So overall I think we were mostly talking at cross purposes.
Actually looking at the Java stuff reminded me of m_ou_se's thread about how big boost::shared_mutex is (it's 312 bytes). Most people who need one of these synchronization primitives only really need the one that costs 4 bytes, but it's hard to keep saying "No" to people who want to add features. They're always confident that everybody will love their feature, at the cost of a little more memory, and it spirals out of control.
On most systems the pthread_mutex_t you mentioned is already 40 bytes. A "nightly" Rust can do a 32-byte array with a mutex to protect it, all inside 40 bytes.
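That claim is easy to check on a current toolchain; the exact number is platform-dependent, but on a futex-based Linux std I'd expect something close to the 40 bytes mentioned:
    use std::sync::Mutex;

    fn main() {
        // The protected 32-byte array plus the futex word and poison flag.
        println!("{}", std::mem::size_of::<Mutex<[u8; 32]>>());
    }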
Like Chris said in the article, it is difficult for GC languages to interop with a C/C++ codebase.
In addition, if your system requires precise control over when the CPU is and isn't used, as in resource-intensive games, browsers, or OSes, then GC may not be a good choice.
Lastly, if you are using the GPU via a graphics API: I do not know of any API that can garbage collect GPU memory, but RC can naturally extend memory cleanup code to clean up GPU memory quite easily as well.
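i.e. you wrap the GPU handle in an owning type whose destructor releases it. The names below (`gpu_alloc`/`gpu_free`) are hypothetical stand-ins for whatever the graphics API actually provides:
    // Hypothetical raw API, standing in for Vulkan/Metal/OpenGL calls.
    fn gpu_alloc(bytes: usize) -> u64 { bytes as u64 /* pretend handle */ }
    fn gpu_free(handle: u64) { println!("freed GPU buffer {handle}"); }

    struct GpuBuffer {
        handle: u64,
    }

    impl GpuBuffer {
        fn new(bytes: usize) -> Self {
            GpuBuffer { handle: gpu_alloc(bytes) }
        }
    }

    impl Drop for GpuBuffer {
        // RC / single ownership gives a deterministic point to release GPU
        // memory, which a tracing GC's finalizers cannot promise.
        fn drop(&mut self) {
            gpu_free(self.handle);
        }
    }

    fn main() {
        let _buf = GpuBuffer::new(4096);
    } // `_buf` dropped here: the GPU allocation is released immediately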
No it isn't. Chris put his foot in his mouth and went on about how an advantage of garbage collection is that it moves objects. Not all garbage collection moves objects. Garbage collection that does not move objects does not have any such interop problem. Furthermore, even GC that does move objects can avoid that particular interop problem if it simply avoids moving foreign objects. E.g. suppose a foreign object is managed by some gc-heap object of type "foreign pointer". That managing object can be moved during GC, but the actual foreign pointer doesn't change. If that pointer is all that the foreign code knows about, then there is no problem. Don't give foreign code any pointers to your heap objects and you're good.
To be clear, in enterprise (non-embedded / non-mobile) programming, this is fine. Money (RAM) isn't the issue. Programmers with domain knowledge are much more expensive than adding more hosts or RAM. I am talking about the general case; I know HN loves to focus on "high performance / low latency", but that is less than 1% in "Enterprise Real World".
My TXR Lisp is garbage collected. When I blow away all the compiled files of the standard library and run a rebuild, in a separate window running top, I do not ever see a snapshot of any txr process using more than 27 megs of RAM:
This is close to the worst-case observed:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19467 kaz 20 0 27364 24924 3112 R 100.0 1.2 0:30.55 txr
When this is happening, the library code is being interpreted as it compiles itself, bit by bit. E.g. when stdlib/optimize.tl is being compiled, I see this 27 megs usage. But after the entire library is compiled, if I just "touch stdlib/optimize.tl" to force that to rebuild, it's a lot leaner; I see it hit about 16 megs.
Sitting at a REPL prompt, compared to some bash processes, using ps -aux:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
kaz 2433 0.0 0.0 8904 1900 pts/0 Ss Feb22 0:00 -bash
kaz 6179 0.0 0.2 9072 4212 pts/1 Ss+ Apr22 0:01 -bash
kaz 18551 0.0 0.1 11152 3932 pts/2 Ss+ Feb22 1:32 -bash
kaz 18836 0.0 0.1 9064 2196 pts/5 Ss+ Apr13 0:02 -bash
kaz 19079 0.0 0.2 13464 4880 pts/3 Ss Mar01 0:40 -bash
kaz 19849 0.1 0.2 9360 6024 pts/4 S+ 23:45 0:00 txr
Oh, and that's the regular build of TXR. If that's too bloated, you can try CONFIG_SMALL_MEM.
Large is as large does.
Garbage collection was once used on systems in which most of today's software would not fit.
Garbage collection does have waste. You allocate more than you need, and keep it.
Precise, manual memory management does the same thing. Firstly, the technique of keeping memory around is practiced under manual memory management: e.g. C programs keep objects on recycle lists instead of calling free, and free itself will not release memory to the OS except in certain circumstances. Then there is internal fragmentation. Generally, programs that "get fat" during some peak workload will stay that way.
If you see some garbage collected system need hundreds of megabytes of RAM just for a base image, it may not be the GC that's to blame, or not entirely: it's all the bloated crap they are pulling into the image. Or it could also be that the GC was tuned by someone under thirty who doesn't think it's weird for a browser with three tabs open to be taking 500 megs. GC can be tuned as badly as you want; you can program it not to collect anything until the VM footprint is past the physical RAM available.
Hi, I am a bootcamp grad currently working in full-stack RoR development. I have started to get bored of this work and want to venture into other core CS fields. Compiler design and machine learning seem to have caught my interest the most. What would your advice be for someone who wants to break into these fields by self-study? Does compiler design involve any math that is also used in machine learning and deep learning? And do they share any similar concepts?
That's great that you're interested in venturing into other fields! Switching fields every once in a while is a great way to grow and become a stellar engineer.
If you have equal passion for those two choices, I would suggest leaning towards machine learning for now. An ML engineer can make $300k-500k/yr at the big companies after a few years.
I left that world and started doing language development as an open source endeavor, and make only $700/yr [0] which isn't that much, though I'm oddly more proud of that than the salary I got from my years in Mountain View. I'm happier, being able to push the state of the art forward and show people what incredible potential there lies ahead in the PL realm. However, I'd recommend graduates to focus on finances for a few years before looking at what lies beyond, lest the financial stress overshadow the dream.
I'm only about 110,000 lines into my journey with Vale, but so far I'd say there's little math. There's a lot of data structures knowledge needed to make a compiler fast, and some interesting algorithms, but not much overlap with ML. But who knows, maybe there's some cool things languages can do specifically for ML, that haven't been discovered yet!
ML utilizes continuous mathematics, while compiler design, including static analysis, usually favors discrete mathematics. I wouldn't say the concepts are interchangeable.
The classic textbook is the Dragon Book. Another good start on the compiler side is a project like https://buildyourownlisp.com/.
A lot of people start with lexing, parsing, and so on before moving on, but you can start from the static analysis side too. I would look at the LLVM framework and the Clang Static Analyzer, and use one of the tutorials for that to write a small analyzer for C.
I'd encourage you to join the following subreddits that might be interesting to you: r/compilers, r/programminglanguages, r/EmuDev, r/databasedevelopment.
You'll find more people like yourself interested in core CS.
My company also runs a Discord community for those kinds of people: discord.multiprocess.io.
Maybe start with a good book or two to see if you're actually interested. Crafting Interpreters (https://craftinginterpreters.com/contents.html) might be a good intro--yes, it's about interpreters, but there's a lot of overlap between creating an interpreter and creating a compiler.
As very much a non-expert in compiler design, I found two books to be helpful introductory reads (neither are free though). I'm not sure whether someone proficient in the field would recommend them as my knowledge in the area is small, but I enjoyed them.
I briefly scanned the Lobster reference, but I didn't see how they're dealing with the undecidable cases. Presumably it punts out to the garbage collector for objects where lifetime isn't statically decidable, right?
My understanding is the answer for "Why doesn't Rust use something like type inference for lifetimes?" is "In the general case, that reduces to the halting problem."
There is also a lot still to do on the GC side. Azul GC with hardware support allowed for amazing results. Even without hardware support the results are quite good.
A little off topic, but may I ask how you managed to work on this full time? I am also developing a programming language, (methodscript.com if you’re curious) but I’ve never been anywhere near being able to work on it full time.
Their site mentions they worked for Google for 6 years. The stock alone from doing that would be enough to self fund for a while. They may also be getting grants or sponsors for their language or game development. Or they could be doing contract work of any kind to pay the bills.
These are all common approaches to working on your own thing full time.
Yep, I worked at Google until they silently started sunsetting all the small endeavors and funneling everyone into Ads/Cloud/Search. Back in 2014, the internal position listings were full of fascinating projects that could bring a lot of good, and looking at it now, it's pretty clear the writing is on the wall for the culture. I left late last year.
Luckily, I've saved up enough to be able to work on Vale full-time for a year or two, hoping I'll have enough sponsors by then to be able to continue. If I play my cards right, I'll be able to make Vale into a nonprofit, a la ZSF. A worthy endeavor!
He gets something subtly wrong though: GC barriers are waaaaay cheaper than ARC, in the limit when you optimize them. Libauto had expensive barriers, but as Chris said, it’s not super interesting since it was overconstrained. But the production high-perf GCs tend to have barriers that are way cheaper than executing an atomic instruction. JSC’s GC has barriers that basically cost nothing, and that’s not unusual. It’s true that low latency GCs tend to have barrier overheads that are higher than high throughput GCs, but even then, the cost isn’t anything like atomic inc/dec. The barrier overhead in RTGCs is often around 5-15% except for the esoteric algorithms nobody uses.
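To make the comparison concrete, here's a rough sketch of my own (a simplification, not JSC's actual barrier) of why a generational/card-marking-style write barrier is so much cheaper than an atomic count update:
    use std::sync::atomic::{AtomicUsize, Ordering};

    // ARC-style pointer store: the new referent's count is bumped with an atomic
    // RMW, which contends on the cache line holding the count.
    fn arc_store(field: &mut usize, new_target_refcount: &AtomicUsize, new_target_id: usize) {
        new_target_refcount.fetch_add(1, Ordering::AcqRel);
        *field = new_target_id;
    }

    // Card-marking barrier: the store itself, plus one plain byte write into a
    // card table. No atomics, no contention in the common case.
    fn gc_store(field: &mut usize, new_target_id: usize, card_table: &mut [u8], card: usize) {
        *field = new_target_id;
        card_table[card] = 1; // tells the collector to rescan this region later
    }

    fn main() {
        let mut field = 0usize;
        let count = AtomicUsize::new(1);
        arc_store(&mut field, &count, 42);

        let mut cards = vec![0u8; 64];
        gc_store(&mut field, 7, &mut cards, 3);
        println!("{field} {} {}", count.load(Ordering::Relaxed), cards[3]);
    }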
Also, GCs trivially give you safety even if you race to store to a pointer field. ARC doesn’t, unless you introduce additional overheads.
Does what I say mean that ARC is bad? Heck no, ARC is super impressive. But it does mean that it’s a good idea to weigh ARC versus GC as a determinism versus speed decision, and accept that if you chose ARC, you lost some speed.
Oh, but ARC also lets you do CoW better than what a GC could give you, so for some languages ARC is a speed win over any possible GC because it'll let you do fewer copies. I think that's true in Swift. But if you don't have CoW then GCs are def faster.
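Rust's Rc shows the same refcount-driven CoW trick, for what it's worth: `Rc::make_mut` only copies when the value is actually shared.
    use std::rc::Rc;

    fn main() {
        let mut a = Rc::new(vec![1, 2, 3]);

        // Uniquely owned: make_mut sees refcount == 1 and mutates in place, no copy.
        Rc::make_mut(&mut a).push(4);

        let b = Rc::clone(&a); // now shared
        // Shared: make_mut clones the vector first, so `b` keeps seeing [1, 2, 3, 4].
        Rc::make_mut(&mut a).push(5);

        println!("{:?} {:?}", a, b); // [1, 2, 3, 4, 5] [1, 2, 3, 4]
    }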
I think his point about the amount of research done on ARC vs GC was interesting. Only recently have we seen things like biased reference counting [1] that can significantly reduce the overhead of ARC.
Reference counting is the oldest garbage collection algorithm, it has seen plenty of research.
What keeps being overseen is the amount of engineering needed to make reference counting perform on modern hardware, which effectively overlaps with other garbage collection optimizations.
> What keeps being overseen is the amount of engineering ...
You probably wanted "overlooked" here. Even though "seen" and "looked" are similar in meaning, "overseen" and "overlooked" are not. This is a subtle but important distinction in English. https://grammarist.com/usage/oversee-vs-overlook/
Historically you could choose "oversee" when you would today use "overlook" or vice versa, but these usages went from archaic to actively disapproved. One remnant though is you would still use "overlook" in a sentence like "Atop the hill, the old castle overlooks the whole town".
Good point. Example showing the two contrasting meanings of the word "oversight" in modern English:
At m.d.s.policy it was our view that Symantec management had inadequate oversight of Symantec's Certificate Authority, and as a result, on several occasions over an extended period, an important oversight relating to the name validation procedures at CrossCert - a Korean business authorised by Symantec to issue on their behalf - occurred without being detected by Symantec. This definitely resulted in undetected misissuance.
As a result of our efforts to fix this problem, Symantec sold their entire CA business to DigiCert, and CrossCert became merely a reseller of DigiCert's products (including "Symantec"-branded certificates).
This meant that now DigiCert's trustworthy management exercised oversight over all issuance of certificates from the new CA hierarchy even when the certificates had "Symantec" branding - while an oversight by CrossCert, such as failing to verify the identity of a customer, would no longer cause misissuance because DigiCert would independently use the Ten Blessed Methods.
Lol biased ref counting. I tried that years ago. It’s super fast for serial workloads or very simple parallel ones. I would avoid it for anything real.
> The performance side of things I think is still up in the air because ARC certainly does introduce overhead.
As you said, they can both introduce overhead, but the ARC overhead is way worse. Really the only reason to use garbage collection (which is really complex) over ARC (which is simple) is the performance and control you get. Otherwise it would not be worth the implementation cost. The question is not "up in the air" on performance; production-grade GCs are a clear winner. I would even be doubtful that CoW could make up the difference. But I would be interested to see data suggesting otherwise.
No doubt straight-up ARC with refcounting bumps on every function call, etc. vs. a good GC with the same allocation paths... a good GC will smoke it. No question.
But that isn't how I read what Lattner said here, and I've been thinking similar things. Essentially with ARC the goal is to cheat (to be fair with tracing GC it is as well, just in different ways). And you cheat by implementing an ownership/borrow collector type system behind the scenes for getting rid of the need for RC on most of your objects. Once you do that, you have GC competing with stack allocation + some malloc/free scattered in. It becomes a much more interesting fight.
"No doubt straight up ARC with refcounting bumps on every function call,etc..."
Yeah, but in Swift you don't do that. From what I'm told, the compiler's static analysis has gotten a lot smarter in determining when a ref-count bump is needed, and the increased use of struct-based value types in the language leads more and more to situations where a refcount bump isn't needed at all since there's no object reference in the first place.
This “cheating” has been going on in GC’d languages for decades. Java, JavaScript, and probably other language implementations have what they call “escape analysis”.
Note that GCs are starting to cheat too. Julia (for example) is in the process of implementing compile-time escape analysis, which will be used to (among other things) remove allocations as long as the compiler can prove you won't notice.
I find this whole conversation frustrating, or somewhat confusing, because it's a weird distinction: reference counting is garbage collection. It's not a mark & sweep or generational tracing collector, but it's a collector. And in fact it even traces, though the tracing is explicit (in the form of the declaration or management of references).
My copy of the Jones/Lins Garbage Collection book I'm looking at on my shelf makes this quite explicit, the chapter on reference counting makes it clear it's one form of garbage collection.
When I was messing with this stuff 20 years ago, that was my understanding of it, anyways.
There are advantages and disadvantages to reference counting. I think some of the confusion in terminology and some of the frustrations (or opposite) here comes from the rather awkward way that Objective-C implements reference counting as something you have to be explicitly aware of and careful about, even with ARC. My brief stint doing iOS dev made this clear to me.
But I've used (and written) programming languages that had transparent automatic garbage collection via reference counting, and the programmer need not be aware of it at all, esp if you have cycle detection & collection (or alternatively, no cycles at all). And, there are compelling RC implementation that are capable of collecting cycles just fine (albeit through a limited local trace.)
Anyways, he's right that RC can be better for things like database handles etc. Back in the day I/we used reference counting for persistent MUDs/MOOs, where the reference counts were tracked to disk and back, 'twas nice and slick.
You might as well classify allocating without freeing as a form of garbage collection. If the implementation can fail to collect cycles and the language can create them, thus leaking memory, lumping in reference counting with real garbage collection is a bad definition because it smears together a crucial distinction.
I get that some people, or many people, or some book, have defined it that way, but others have used it in the meaning as distinguished from plain refcounting. Many developers have always understood GC and refcounting (in languages with mutation) to be mutually exclusive concepts. That definition is used in practice, and when people say, "I'm going to implement garbage collection for my programming language," what they mean is one which doesn't leak memory.
A reason people use this other definition may be that plain reference counting and "having a garbage collector" are mutually disjoint, even if you define "garbage collection" as including reference-counted implementations. Making garbage collection a distinct concept from the set of acts a garbage collector might do will always be fighting against how language usually works.
This definition is very elegant in that it allows one to reason about program correctness across all garbage collectors without having to know any of the implementation details, be it reference counting or the many flavors of mark and sweep. Basically garbage collection is not an implementation detail, nor is it a specific algorithm or even class of algorithms, garbage collection is a goal, namely the goal of achieving an infinite amount of memory. Different algorithms will reach that goal to different degrees and with different trade-offs.
They get lumped together because when some people say "garbage collection" they mean "whatever Java specifically does", and when other people say "garbage collection" they mean "automatic dynamic lifetime determination" (as a contrast to C-style "manual static lifetime determination", or Rust-style "automatic static lifetime determination").
RC need not leak memory. If it's done right, it doesn't.
That Objective-C exposed the innards of the system badly enough that it was possible to leak, that's not intrinsic to RC. It's just yet another piece of the awkward legacy puzzle that is Objective-C.
As for cycles... Cycle detection in RC has been a solved problem for over two decades. I implemented it once, based on a paper from some IBM researchers, if I recall.
Cycles in RC are a fundamentally unsolvable problem because there are situations where you in fact want an unreferenced cycle to exist in memory, at least for a while.
Imagine a slow async operation such as a network request, where a GUI object is locked in a cycle with a self reference in a callback. Before the network response arrives the GUI object may be removed from the view hierarchy (e.g. the user has navigated away), however, whether you want the cycle to be broken or not depends on your needs. You may want to wait for the results, update stuff and break the cycle manually, or you may choose to completely discard the results by declaring weak self.
Therefore, solving the problem of cycles with RC I think means giving more manual control. Explicit weak pointers solve part of the problem, but the "I know what I'm doing" scenario requires great care and is obviously error prone. To my knowledge this hasn't been solved in any more or less safe manner.
I think in the scenario you describe, you always want to hang onto a reference to the unfinished network activity because you will need to be able to interrupt it and clean it up. You never(?) want to leak active running code in the background, especially not code with side effects, which isn't described by some kind of handle somewhere, in part because there's probably a bug and in part because people can't understand how your code works if there isn't a handle to the task stored in a field somewhere, even if you don't use it.
Of course there's a reference to the network resource, and that's the reference that creates a cycle via its callback defined in a bigger object (e.g. a GUI view) that awaits the results. There may be reasons why you want to keep the cross-referencing pair of objects alive until the request completes. Not saying there aren't alternative approaches, but it may be easier to do it that way.
I came across this while coding in Java, when I realized a network request may be cancelled prematurely just because the GC thinks you don't need the network resource anymore, whereas in fact you do. There's no easy solution to this other than reorganizing your whole request cycle and object ownership. Classical abstraction leak.
I mean, the GUI object with its reference to the network request should never be detached like that. The network request handle should be moved to something else that takes care of it, if the GUI object can't, and it should not be calling a callback tied in a strong way to the GUI object's existence. Granted, I wasn't there when you had this, and I'm not saying there's an easy solution. What I am basically saying is there should be a chain of ownership to some local variable on the stack somewhere, instead of a detached GUI view. I'd view the resource leak you describe as hacky, and a class of design decision that has caused me a lot of pain in gui software. Well, maybe I had to be there.
Yes, you can move your references around to find a suitable place, but how does that solve the problem of lifetime control? How do you guarantee a network resource is alive for as long as you need it under GC, and under ARC?
By holding a reference to it that is grounded to a GC root, i.e. an ownership chain to a local variable that will dispose of it. That reference can also be used to interrupt the background task. So the network connection, callback, etc., are not in a detached cycle.
With the design you're describing -- without the ability to interrupt the background task -- I would bet you tend to get race conditions of one kind or another.
Can you link to some resources/papers detailing how to solve this problem? I’m only aware of the “RC+GC” solution (i.e. adding a full tracing GC in addition to RC to find and collect cycles).
That's the one I was thinking of, and I once implemented that one myself in C++ back in, uh, 2001, 2002, during a bout of unemployment in the .com crash.
I guess I just assumed that there's been some advancement in the world since then.
I was taught that the "garbage collector" term only refers to tracing garbage collectors. It sounds weird and wrong to me to hear the term GC referring to things like C/Zig's manual memory management (malloc and free) or Rust's borrowck. And it sounds like it's the same for you, in reverse.
In my school, the whole class of systems is called "memory management systems", and "garbage collector" is one such class of approach (what you would call tracing GCs).
But I don't think either of us is canonically right. It's just a "pop" vs "soda" thing. You and Jones/Lins say "pop"; me and Chris Lattner say "soda". It's no big deal.
Some people refer to only tracing algorithms (mark-sweep, copying, etc) as "garbage collection". The best reason I can think of is marketing.
There are generational reference counting systems e.g. ulterior reference counting [1] and age-oriented collection [2]. "Tracing" usually refers to scanning a heap to work out liveness all at once, which reference counting doesn't do; though deferred reference counting scans the roots to collect, and ulterior reference counting traces a nursery. Deferring these (more common) updates to reference counts improves throughput considerably.
In the modern sense of the word, a GC is an algorithm that manages memory automatically, including collecting cycles. A GC must maintain the property that a leak is only possible if an object is reachable from a root. RC on its own does not obey this property.
Reference counting is not GC in the modern sense unless it also includes a cycle collector. So Swift is not GC’d (RC only) yet Python is GC’d (RC+CC).
> A GC must maintain the property that a leak is only possible if an object is reachable from a root. RC on its own does not obey this property.
This is certainly a false claim, then. What you're describing is a precise garbage collector, but there also exist conservative garbage collectors, which will fail to reclaim objects even when no cycle exists, so long as some word of currently allocated memory happens to have the same bit pattern as an object's address.
GNU Java and the Mono implementation of .NET both use the Boehm conservative garbage collector, and they are most certainly still considered garbage-collected systems.
This isn’t formal science. We are practitioners and so definitions have to be practical. If Boehm GCs really leak as much as you claim then nobody would use them.
The issue with cycles is that they can easily lead to significant memory leaks if a large subgraph gets caught in a cycle. A definition of GC that excludes this capability does not come with the contract of “you only have to worry about reachability from roots”.
Computer science is absolutely a formal science in every sense of the word.
>If Boehm GCs really leak as much as you claim then nobody would use them.
I mean, if anything this argument applies more to your characterization that reference counting easily leads to memory exhaustion due to cycles than it does to conservative garbage collectors. If RC-based garbage collectors leaked as much as your post implies then nobody would use them. And yet Swift, a language that is almost exclusively used with large graphs (that's how most UIs are represented in memory) and that exhibits all kinds of non-deterministic shared ownership, uses RC-based garbage collection. Objects involved in UI code are shared in wildly unpredictable ways with no clear hierarchy among layouts, event handlers, observables, IO systems, etc...
>A definition of GC that excludes this capability does not have the contract of “You only have to worry about reachability from roots”.
Exactly, because that's not part of the contract for a GC. That's part of the contract for a tracing garbage collector.
It is sufficient in a functional language that lacks Haskell's lazy capacity to "tie the knot". Erlang's data structures, for example, are fully immutable, and it is not possible to create a cyclic data structure. Since there cannot be a cyclic data structure, reference counting would be sufficient to manage its memory automatically.
It would still suck since it has to constantly write to all objects being referenced, but it would work.
Not necessarily. The issue is that releasing a reference at the top of a tree of objects has the potential to walk a large graph of objects (doing downcounts at each) that are not in cache, causing some thrash.
In an ideal world and in a program running with a not-huge # of objects you could keep the entire basket of reference counts in L1 cache, separate from the objects. Then decrements would be potentially cheap. A bit utopian though.
Nice, will put this paper in my bookmarks for later read.
I feel like once you stop and think about it, the distinction between a 'tracing' GC and a 'non-tracing' GC is specious: to do its work at acquire and release, an RC collector must in fact trace a tree of objects. It's just that the semantics of how this is declared might differ, or, more specifically, the time(s) at which it does this work.
By "trace" I mean the act of consulting the state of an object, then recursively tracing pointers in the object. A tracing GC propagates a "known to be live" state through an object graph while doing a collection cycle, and a refcounting GC propagates a "known to be dead" state through an object graph, when RC drops to zero.
Umm... but isn't the refcounting GC more deterministic? If I create a chain of 10 objects and then release the head, I run through 10 allocations and then 10 release cycles on those 10 objects. Period.
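A tiny Swift illustration of that determinism (the Node class is just my example): dropping the head runs the ten deinits immediately and in the same order on every run.

```swift
final class Node {
    let id: Int
    var next: Node?
    init(id: Int, next: Node?) { self.id = id; self.next = next }
    deinit { print("deinit node \(id)") }
}

// Build the chain 1 -> 2 -> ... -> 10.
var head: Node? = nil
for id in (1...10).reversed() {
    head = Node(id: id, next: head)
}

print("releasing head")
head = nil    // the ten deinits run right here, in order, every run
print("all ten nodes are gone by this line")
```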
ARC makes ref counting a part of semantics in the sense that the deref that reaches zero immediately runs the destructor, and destructors can have any effects they like.
Therefore, it’s not really valid to say that ARC “is” garbage collection.
Reference counting is only a form of garbage collection if it is totally transparent to the programmer, and it’s not like that in ARC.
Which is why when Apple introduced new ARC optimizations last year, they had a talk about possible gotchas that might crash the application due to them.
I think ref-counting is interesting, but I look at it as sort of like Level 3 self-driving cars where you almost don't have to pay attention but sometimes you really do. That's almost worse than Level 0 because humans are very bad at partial attention.
With GC, assuming your program can accept the slight level of non-determinism and latency, you don't think about deallocation at all. There are just objects and references, and that is the extent of your mental model. That simplicity frees up a lot of mental energy to focus on your problem domain.
With ARC, you get deterministic deallocation and (maybe) better latency. Those are nice, and critical for some kinds of programs. And you mostly don't have to think about deallocation. Except that, thanks to cycles, you can create actual memory leaks where you have completely inaccessible objects that live in the heap forever. And, in order to avoid this, you have to know when and where to use weak or unowned references. Once you throw closures into the mix, it becomes very easy to accidentally create a cycle.
So ARC gives you more language complexity (strong, weak, and unowned references, closure capture lists, etc.). And there is a subtle property that you must always maintain in your program that is not easily found through static analysis. If you fail to maintain this property, your program may slowly leak memory in ways that rarely show up in tests but will cause real problems for programs running in the wild.
Most of the programs I write don't need finalizers much and can afford a little latency. I relish being able to use closures freely even inside classes. For that kind of code, I strongly prefer tracing GC so that I don't have to worry about cycles at all.
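For concreteness, here's a small Swift sketch (my own example, nothing from the comments above) of how a stored closure quietly creates the kind of cycle being discussed, and how a capture list breaks it:

```swift
final class Counter {
    var value = 0
    var onChange: (() -> Void)?          // Counter -> closure (strong)

    deinit { print("Counter deallocated") }

    func startLeaky() {
        // closure -> Counter (strong): completes the cycle, so this object
        // stays in the heap forever even when nothing else refers to it.
        onChange = { print("value is now \(self.value)") }
    }

    func startFixed() {
        // The capture list breaks the back edge; the object dies normally.
        onChange = { [weak self] in
            guard let self = self else { return }
            print("value is now \(self.value)")
        }
    }
}

var leaky: Counter? = Counter()
leaky?.startLeaky()
leaky = nil     // nothing printed: the cycle leaks

var fixed: Counter? = Counter()
fixed?.startFixed()
fixed = nil     // prints "Counter deallocated"
```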
Objective-C GC failed to work reliably due to everything C allows you to do, and people mixing frameworks compiled without GC being enabled.
Automating Cocoa's retain/release calls was more reliable, and makes much more sense.
Swift naturally needed to follow the same path to easily interoperate with the Objective-C runtime.
A tracing GC would mean having something like .NET COM Callable Wrapper and Runtime Callable Wrappers, for interoperability with COM, which is a much greater engineering effort.
So naturally reference counting as the GC algorithm for Swift makes sense.
This benchmark is not good enough to make a decision about how you should build an entire OS.
They did make this decision when designing ARC and Swift, and they did have data for it, and first-order effects of how fast you can make a single task can't be the determining factor there. In particular you need to know how disruptive running the task will be to other more important stuff.
How is the amount of RAM relevant here? Different GC algorithms have different tradeoffs, tracing GCs absolutely trash ARC in throughput, but they usually use more memory for that.
This has nothing to do with custom hardware, it's one of the few differences that's actually due to ARM vs x86. Though, part of it is from being a unified memory SoC.
I heard in an interview with Craig Federighi, that in designing the Apple Silicon chips they had looked at specific operations that were frequently used by their OS and software and optimized those on their chips. Reference counting was one of those operations and that was why the speed of that operation was so much faster on Apple Silicon. Sure those other things about unified memory contribute speed to all kinds of operations, but they really focused on some of them to let their hardware and software work well together.
Choosing to use ARM and unified memory is part of designing the SoC, though. There is no magic in the refcount machinery; you can disassemble it and look at it.
I agree with the sentiment, but does Swift actually see any usage outside of Apple’s own hardware? I think Swift’s reliance on such an optimization is justified since the hardware this optimization is applied to is the main target.
While Swift’s approach may be slower in theory, is it actually slower in practice? Just a thought.
Hardly, but having to design hardware to fix ARC performance rather proves the point about how much "better" it is.
Depends how much one cares about performance beyond iOS GUI apps, especially when Swift's original goal was to become the main language across the whole Apple stack.
Swift has been used to write parts of MacOS that are used on Intel today.
The good faith efforts to write parts of a consumer OS in a garbage collected language that I can remember are Microsoft's effort to write Longhorn in C# and Google's effort to write part of Fuchsia in Go.
Likewise, Midori was powering an Asian Bing cluster for a while, and even that wasn't good enough for WinDev.
And in Fuchsia, the same thing happened: the parts in Go were only taken away when the Go advocates who had driven those modules were no longer around to argue for their implementations.
Politics are more relevant than technical limitations.
Finally, Azure Cloud OS and Windows together surely have more lines of C# powering them than Apple has of Swift.
Naturally you would answer that, from an OS that never had a certified GA build, great proof indeed.
Yet the same group that sabotaged Longhorn was quite keen on doing exactly the same stuff using a mix of .NET Native and C++/CX, which only failed due to the lack of a migration path from Win32, and not much love for the sandboxing.
It's one of the internal emails made public as part of the Microsoft antitrust trial.
>LH is a pig and I don’t see any solution to this problem. If we are to rise to the challenge of Linux and Apple, we need to start taking the lessons of “scenario, simple, fast” to heart.
Sorry, but the person who was in charge of Windows development at the time did not agree with you. Longhorn failed due to performance issues that Jim Allchin believed were unsolvable.
Go has a bad GC. Go relies on the compiler to do escape analysis and allocate on the stack. Once you actually start heap allocating you hit massive pauses. It’s why Go does so bad on the binary tree benchmark.
> Once you actually start heap allocating you hit massive pauses.
citation needed. go afaik has millisecond pauses at most. maybe you're thinking of throughput losses due to forcing more CPU to be spent on collecting?
I always took GC for granted until I started coding in Zig more and had to worry about managing my own memory.
One thing that Zig made pretty easy that's been sort of my go-to is using Arena allocations a lot more. It's pretty frequent that for a chunk of code I know I'll be allocating some chunk of memory and working with it, but then once that's finished I won't need it anymore. So just throwing everything into the arena and then freeing the arena afterwards works great, is fast, and is simple.
Are there any garbage collectors that enable this sort of thing? I think it might be vaguely like "generations", but I think even those are inferred and kept track of in greater detail to know when to move between the generations. I'm thinking of somehow earmarking a scope to use its own bit of memory and to not worry about collecting within the scope, and then when you leave the scope being able to free it all.
I'm sure there's more complexities that I'm missing. It's just that dealing with memory manually really makes you realize that the programmer probably easily knows expected lifetimes better than can be inferred. The only GC's I've used have been all "don't worry about memory we'll figure out everything". But maybe you could provide hints or something.
Pony has a separate GC for every actor. I suspect that in practice, these work a lot like arenas; since they're very short lived, I'd imagine that most actors just allocate a lot and then blast it away. [0]
Odin (not GC'd) has an implicit "context" that has an allocator, [1] effectively decoupling the allocator from the code. One could use an arena allocator for any function they'd like.
Vale (not GC'd) is taking this context system a bit further and making it memory safe. [2]
Sounds more like ownership annotation than generations.
Generations basically take advantage of the fact that recently allocated memory is much more likely to be deallocated soon - i.e. it's runtime/age based.
What you're describing with arena is more like ownership where siblings can't outlive ancestors - ie. deallocate when parent/root (arena container) is deallocated.
Sadly they didn't discuss the main selling point of GCs (in my opinion): a GC can move objects in memory, and often does. That can improve cache locality, so fewer cache misses. Moreover, no more heap fragmentation. There are different strategies to do that, so theoretically one can choose the best one after the code has been profiled. Though practically it may not be an option, because the chosen language gives you a fixed, unchangeable implementation of a GC.
I bet these benefits of GC can be great in some tasks, though I know of no real-world examples of software explicitly relying on them. Some servers on Java using terabytes of RAM, I think? But no one talks about it. My curiosity has nothing to consume, and it is hungry.
Database engine implementations sometimes work similar to this, being locality optimizing machines. Objects in memory can be moved and reorganized dynamically. Ensuring that references to those objects remain valid is not inexpensive because it requires indirection to dynamically resolve object locations and may require patching state referencing the object.
The performance impact is only tolerable for database engines because the context and scope is sufficiently constrained such that clever software design elements can ensure that the cost is applied lazily and expensive effects are avoided except in rare cases. Even in database engines, thrashing your CPU cache is a common side effect of not getting the design right. I can't imagine trying to make this perform well for general purpose software.
Take a look at Mesh [0] which brings compaction to a malloc-like system. I could see an RC-based or single-ownership-based language using it to avoid fragmentation.
Personally I find ARC really easy to reason about and worth it. Essentially, a strong reference goes to “child” objects, a weak reference goes to “parents” or “siblings”.
You can think of your app’s data structures as a forest. The tree roots are your application and any globals, and the children are their fields, which have more fields, and so on. Sometimes a node stores a reference to something which is not its child/nested field but another node’s: e.g. a view might have a reference to its controller (its parent), another view which is not a subview (its sibling), or even a subview of a subview (a descendant). The point is, this other node has a different parent, so it already has a strong reference; therefore inside the child/sibling it’s a weak reference.
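Here's roughly what that forest rule looks like in Swift (class names are mine): strong references flow down to children, weak references point back up or sideways.

```swift
final class ViewController {}

final class View {
    var subviews: [View] = []            // parent -> child: strong
    weak var superview: View?            // child -> parent: weak (back edge)
    weak var controller: ViewController? // reference "sideways": weak

    func add(_ child: View) {
        subviews.append(child)
        child.superview = self
    }
    deinit { print("view deallocated") }
}

var root: View? = View()
root?.add(View())
root?.add(View())
root = nil   // the whole subtree goes away; no cycles, because every
             // back/side edge in the forest is weak
```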
Another perspective, if you're familiar with functional programming, especially if you’ve worked in a language like Haskell that “doesn’t have” references: weak references are conveniences of the object-oriented paradigm which, in the functional paradigm, you would just pass around as function arguments instead. It can be really annoying to update an object without mutation when the object is referenced by other objects - in mutating, object-oriented code, those would be weak references.
> Personally I find ARC really easy to reason about and worth it
How do you reason about a closure that captures some stuff and gets passed to a library function? @escaping doesn’t give you enough information to reason about all lifetimes that could possibly matter.
Uhh... because I can see what was captured? If I'm overly concerned that a given callback might be held somewhere for all time I'll simply capture a weak reference and be done with it.
I'm basically "whatever" on this. I develop in both Swift and Kotlin. I like the Swift language better. Definitely like the libraries better. But I hate the memory story in Swift. ARC is a half-GC. It's definitely different than malloc/free (which I do a lot of in embedded C land). But the stressing about ownership loops sucks. I like structs and what they enable, but I find myself choosing them over objects sometimes, not because they're the best solution for the problem in front of me, but because ARC sucks. And then after trying to avoid ARC using a struct, I end up biting the bullet and using a class, because sometimes a real object is the better solution. Kotlin's "almost struct" data classes are a better answer IME.
Being a geek who is somewhat into PLT and hobbyist languages, I go back and forth a bit on whether I want a GC or RC in my own hobby lang. For the longest time I was going to go with per-thread heaps that use a simple compacting GC, but only collect on allocation when heap starts running low, so collection wouldn't have to stop threads, the local thread would already be stopped. The plan was to use RC most likely for multithreaded shared objects from a shared heap as a secondary.
Then I started writing Rust and I realized what I think is the same point Lattner has here: Ownership changes everything. If I can declare a single owner most of the time and track that owner, I only have to bump the counter when I add a 2nd (and 3rd, etc.) owner, which isn't nearly as common as single ownership. This gets rid of most of the issues with RC performance (which in naive schemes bumps the counter every time it calls a function and more). RC still can't use bump allocation unfortunately, so any tree building would likely need arenas in addition which is a bummer, but deterministic destruction is a huge bonus, so this is the way I'm leaning atm (if I ever get back to writing it that is!).
In the case of C#, there are very lengthy docs and StackOverflow threads on how to properly use IDisposable -- what you need to use to clean up non-GC-able resources. A great many classes in C# libraries use IDisposable, so outside of pure business logic you end up with a lot of 'using' blocks. I've written a lot of C# code that needs to clean up external resources using Dispose(), and I've had to fix a lot of C# code written by people with a poor understanding of IDisposable and how GC actually works (eg. useless GC.Collect() calls everywhere).
So I think the benefits of deterministic object lifetimes, and it being easy to explain when each object will get cleaned up (all of it, not just the memory), outweighs the marginal benefits of heavyweight Garbage Collection.
Chris makes some points which are interesting in their own right, particularly the one comparing the amount of research which has gone into tracing/copying gc vs rc. But I think there is an unmentioned bias which is worth pointing out. Tracing gc has found massive success in enterprise java applications, which:
- Need to scale in terms of code size and contributor count, and therefore need the modularity afforded by tracing gc
- Need to scale in terms of heap size, and are therefore more likely to run into the pathological fragmentation issues that come from non-compacting gc
- Need the higher throughput afforded by tracing gc
- Generally run on massive servers that have the cores, memory size, memory speed, power budget, and heat dissipation required to support high-quality tracing gc
Contrariwise, swift is a product of apple, a laptop and phone company, and is used to make laptop and phone apps which:
- Are unlikely to ever get really huge (people balk at large downloads)
- May have limited memory, and need to share it with other apps
- Run on battery-powered hardware with little or no active cooling, and so need to attempt to conserve power
No silver bullet, as they say. I implement APL, which absolutely cannot do without reference counting, except perhaps in some novel configurations. I think that for the more usual case of a pointer-chasing language, tracing gc is almost certainly the right choice.
Lastly, however, I would be remiss not to bring up Bacon et al., 'A Unified Theory of Garbage Collection' [0], which notes that tracing GC and reference counting are not really opposed to one another, and that a sufficiently advanced implementation of either will approach the other. Lattner mentions barriers (though, as a sibling mentions, barriers are an order of magnitude cheaper than reference-count twiddling); on the other side of the coin, consider deferred reference counting, one-bit reference counts, ...
The bottom line being that memory management is _hard_, in all cases; a sufficiently robust and general solution will end up being rather complicated no matter what it does; and, again, there is no silver bullet.
I think Chris makes an excellent point about what is the mortal flaw of GCs: anyone who argues about overhead or pauses or throughput is getting it wrong - the problem is determinism, and interaction with native objects.
You cannot reason about when the computer can/will get rid of native objects that are referenced by the GC, and doing so raises a billion of hairy questions, many of them discussed at length in the article.
Why is this important?
This implies that the model of SPA frameworks, which rely on this exact thing, fundamentally cannot be made to work well, since they rely on referencing native objects from JS.
This invalidates the entire way we build GUIs/websites nowadays. I'd say that's a pretty darn important issue.
Could you elaborate how this is connected to js and SPA frameworks? I totally get how reasoning and determinism is critical for sound processing, embedded low level, operating systems etc. But why would this be relevant for normal application or web development?
SPA frameworks manipulate the DOM through JS, which means they need to take a reference to DOM elements.
In a static website, once you navigate away from a page, the browser can deallocate all the DOM resources related to the old page, which can be quite heavy (images etc.).
In an SPA-driven website, the DOM is created by JS. Once you 'navigate' away (you tell the framework to render a different set of DOM components), the old ones still linger around in memory until the next GC cycle, meaning the browser has to wait for the GC to run before it can free them.
Even worse, GC's tend to be tuned for GC memory pressure, and if you have a ton of GC objects holding onto native resources, the GC has no way of telling it needs to run, because from its perspective, the GC heap is small and it has no idea about the native objects.
Still don't get it. Obviously a DOM implementation that is tied to a garbage-collected scripting language has to be tied to the garbage collection in some way to free connected resources; this fact only shows that it is not a trivial task, which is obvious. I still do not see how this is an argument for why a garbage-collected language is a bad idea for that use case.
Yes. And if you read their papers, when they were able to add explicit hardware support they were able to do even better.
I really think that had vertical companies from the 90s not all died, seriously developing chips for personal computers (phones, laptops) with hardware support for the most important GC features would have been a huge winner.
Apple could have done this, but they were already bound to Objective-C because of NeXT.
Sun had the amazing Self VM in the early 90s already. And they made their own chips as well. But unfortunately they went with Java and instead of making a chip to support GC they tried to make a chip that implemented Java.
I know its more complicated, I was not trying to give a full history.
As I pointed out in my original comment:
> But unfortunately they went with Java and instead of making a chip to support GC they tried to make a chip that implemented Java.
I think in the 2000s Azul hit on how to combine a high-performance JIT and a high-performance GC with a number of GC-specific updates to the hardware and the OS.
> A ton of energy’s been poured into research for garbage collection, particularly since Java has come up. There’s been hundreds of papers written in the academic circles, tons of work in HotSpot and other Java [2:03:30] implementations to do different tweaks and different tunings and different new kinds of algorithms in garbage collecting. That work really hasn’t been done for ARC yet, so really, I think there’s still a a big future ahead.
I disagree strongly; there has been plenty of work in optimising reference counting systems. A quick look through Jones, Moss and Hosking's Garbage Collection Handbook provides many descriptions of high performance reference counting, but they all use deferred reference counting [1], rather than the immediate reference counting that Swift/Objective-C/Apple uses. That Apple is stuck in 1975 does not say much about other reference counting users.
(Though Lattner only says "ARC", and I believe only Apple calls their refcounting "ARC"; I guess Rust has "Arc" as well, but that is a different thing completely. Else I haven't seen the acronym used elsewhere.)
> But to do that, they use this tricolor algorithm which dramatically lowers throughput, because they’re doing almost exactly the same kinds of operations that ARC is doing.
Concurrent tracing algorithms often log into a local buffer, without synchronisation. Whereas concurrent immediate reference counting performs atomic updates - it's almost exactly the same except for the parts where they really aren't the same.
High performance concurrent RC systems also use local buffers [2] to avoid atomic updates too. Go figure.
> I am totally convinced that ARC is the right way to go upfront. [...] It gives you deterministic behavior[...]
Does it? In a meaningful way? For ARC with strong refcount > 1, which constitutes the main use case in my experience (weaks are unergonomic, for one), the deterministic destruction is not meaningful locally, as opposed to the modern manual alternative, RAII, for instance. Am I missing something?
I don't think it's quite deterministic, because a simple variable going out of scope can trigger deallocation of a random object tree (which can be large), which you cannot always predict locally (as a programmer) and which can affect performance on a critical path or introduce side effects at unpredictable times; plus there's the overall possibility of uncollected cycles. So in the end, for a programmer who's not willing to manually trace ownership themselves to understand the behavior in every detail, it has much the same disadvantages as a tracing GC (deallocation may kick in at "unpredictable" times), with additional disadvantages such as: fragmentation and poorer memory locality in the long term (no compaction), possible memory leaks (due to object islands), temporary young-generation objects being as expensive as old-generation objects (no bump-pointer allocation in a nursery), and excessive RC increments/decrements that can trash performance (and if they're atomic they can IIRC force cache invalidation, which also makes the behavior less deterministic).
So I don't think it's a silver bullet, it's just a different kind of GC with its own set of disadvantages. I think most of the time RC is chosen because it's simple to implement or because there's a legacy system where RC is the only choice, and all the other "benefits" are just afterthoughts to further justify the choice.
Determinism in ARC vs. GC discussions is probably more of a reproducibility thing. With ARC, a program will always free memory in the same spots in the same way when running on the same input. In most GCs, this is not the case. Their behavior depends very much on external factors like elapsed wall-clock time or system-wide memory pressure (e.g. if the collector runs when requesting more memory from the OS failed).
This kind of GC behavior can make bugs related to external resources much harder to find. And benchmarking an allocation-heavy program with a nondeterministic GC can be hellish due to the increased spread of measured runtimes.
>With ARC, a program will always free memory in the same spots in the same way when running on the same input
It's true for pure single-threaded functions, but I doubt it's 100% the case for asynchronous code which depends on side effects (user input, system events, async I/O): a slightly different timing and your reference count is different from the previous run, and as a consequence your object trees are deallocated differently, and in different configurations, each time as well. It's only a problem if objects are shared between threads, though.
Even with single-threaded code, a random change to your codebase which increments a reference here and there can invalidate all your prior assumptions and have a ripple effect on the entire system.
> Even with single-threaded code, a random change to your codebase which increments a reference here and there can invalidate all your prior assumptions and have a ripple effect on the entire system.
This is true. The standard library for the language we use at work (Delphi) had a bug in its thread pool class which was of this nature. The threads in the pool would wait for work to be available, pop a work item, do work and wait for more work again.
The issue was that the reference count of the previous work item would not decrement until the local variable holding the work item was overwritten by the new work item (increasing the reference of the new work item also). In particular, if there was no more work, this reference would be held until program termination.
This caught me, as a library user, by surprise as I expected the work item to go out of scope and be destroyed when it had been executed, since no more references should be held at that point.
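For illustration, here is the same shape of bug sketched in Swift rather than Delphi (WorkItem, Worker and the queue are hypothetical stand-ins, not the actual library code): the variable holding the current item keeps the last one retained until it happens to be overwritten.

```swift
final class WorkItem {
    let work: () -> Void
    init(_ work: @escaping () -> Void) { self.work = work }
    deinit { print("work item released") }
}

final class Worker {
    private var pending: [WorkItem] = []
    // Plays the role of the Delphi thread's local variable.
    private var current: WorkItem?

    func submit(_ item: WorkItem) { pending.append(item) }

    func runPending() {
        while !pending.isEmpty {
            current = pending.removeFirst()  // previous item released only here
            current?.work()
        }
        // Without the line below, the last item stays retained until the
        // next batch arrives -- or until the Worker itself goes away.
        // current = nil
    }
}

let worker = Worker()
worker.submit(WorkItem { print("task 1") })
worker.submit(WorkItem { print("task 2") })
worker.runPending()
print("queue drained, but the last work item is still alive here")
```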
That said, spotting these bugs is quite easy with ARC once you dig into the code, given that the points where the references can be increased or decreased are deterministic. So as long as you have access to the source code of dependencies it's fairly easy to find the reference counting points.
GC on the other hand is completely async and opaque for the most part.
Keeping dangling references for too long is a problem that is common to all automated memory management schemes. You would have had the same issue with the work item's lifetime under a GC, except that the destruction of the work item would have happened at an undetermined time after the work item reference was replaced.
ARC is deterministic: you get the same results on multiple runs (minus unsynchronized async fun, of course), you can profile it, etc. You don't have this luxury with GC.
Of course, what is deterministic is deterministic; if your allocations are driven by nondeterministic input, then you won't see deterministic allocations - I think this is obvious and doesn't have to be spelled out.
With gc deterministic runtime creates nondeterministic deallocations.
With arc deterministic runtime creates deterministic deallocations - reproducible behavior that can be profiled and allows you to work on optimizing it.
I'm not sure most people use this technique, but when I was using RC-based systems, I would often assert that an object's refcount was 1 just before I let go of that last reference. Suddenly, we have RAII-like predictable destruction.
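Swift can approximate that discipline with the standard library's isKnownUniquelyReferenced; the sketch below is my own, not something from the comment above: assert uniqueness just before dropping the last reference and you get RAII-like certainty about when deinit runs.

```swift
final class Buffer {
    var bytes: [UInt8] = []
    deinit { print("buffer freed, right here, right now") }
}

var buffer: Buffer? = Buffer()
buffer?.bytes = [1, 2, 3]

// Approximate "assert refcount == 1" just before dropping the reference,
// so we know deinit runs on the very next line.
assert(isKnownUniquelyReferenced(&buffer),
       "someone else is still holding this buffer")
buffer = nil   // deterministic, RAII-like destruction
```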
I even built it into Vale as one of its memory management options called "constraint references" [0] though later switched to generational references which gave us RAII without the halts. I sometimes wonder how far I could have taken that RC + assert model.
In my experience, the vast majority of objects even in an RC'd language do have strong refcount = 1. Perhaps we were in different domains though, I was mostly in game dev and app dev.
What it means to be a “systems” programming language is a matter of some debate, but there is at least one definition that’s only coherent in a resource management model isomorphic to RAII/ARC.
This sounds very nice, but just to make sure I understand your meaning: if you have a properly designed "systems" programming language (even if manually managed), you would release resources exactly (or nearly so) at the time a proper ARC/RAII implementation would. Is that correct? If so I fully agree.
It is more subtle than this. Systems programming sometimes involves a resource life cycle that has semantics that map neither to a GC nor RAII/ARC. This is because systems programming often involves the direct management of hardware resources, like storage.
For example, while resource acquisition and release may be deterministic, there is no implication that a released resource may be immediately reused or reacquired. This is not a pattern most software engineers are used to. The reason for this is that hardware may hold a lingering reference to a software resource even if the software does not -- and hardware doesn't use reference counters or destructors. An arbitrary amount of time may pass between a resource being released in software and that same resource being available for reuse again in software. Many people new to systems engineering make the mistake of reusing a resource that was released but not yet reusable.
Systems programming languages don't just ensure deterministic release of resources, they also need to provide a way to express that the hardware may hold references to objects the software has fully released such that it isn't really released. This is not something the average programmer has to deal with, but this is part of what makes systems engineering different.
My point was that if by “systems” programming, you mean prioritizing efficient use of limited system resources (CPU cycles, bytes of memory, … ) then there’s no superior alternative to the precise control & reliability afforded by ARC/RAII.
At this late date, a language inadequate to express ARC/RAII is better described as a toy. There is nothing wrong with toys, but they are not to be confused with professional tools. A systems language is, first and foremost, a professional tool.
One data point from the Prodfiler folks was that even if the GC cost to throughput is relatively low, some programs will still spend up to 25% of their time doing garbage collection. This will show up on your cloud bill even if it doesn’t show up to that extent in your performance analysis.
I expect you can find papers giving numbers in the same ballpark for malloc/free, too. An MVP of JavaScript, for example, that boxes everything, combined with a benchmark that creates and destroys lots of objects.
The main thing you need to know about ARC is that you hardly ever need it, so it rarely matters what its performance is. If you find yourself using shared_ptr much, stop.
Also, its susceptibility to cycles doesn't matter, because it is a terrible way to manage graph nodes: just keep them in a vector, with indices instead of pointers.
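A minimal Swift sketch of that index-based style (entirely my own example): the array owns every node, edges are plain integers, and the whole graph goes away in one shot when the array does, cycles and all.

```swift
struct Node {
    var value: String
    var edges: [Int] = []        // indices into Graph.nodes, not pointers
}

struct Graph {
    private(set) var nodes: [Node] = []

    mutating func add(_ value: String) -> Int {
        nodes.append(Node(value: value))
        return nodes.count - 1
    }

    mutating func connect(_ from: Int, _ to: Int) {
        nodes[from].edges.append(to)
    }
}

var g = Graph()
let a = g.add("a")
let b = g.add("b")
g.connect(a, b)
g.connect(b, a)   // a "cycle" only in the index sense: nothing to leak,
                  // no refcounts to bump, and the whole graph dies with `g`
```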
I’ve never done this personally, but I know in some of the languages I’ve used you can turn off garbage collection. If you really cared about the latency, couldn’t you just manually collect at certain points in the program?
Or you could use a language like D, which lets you write a program without the garbage collector when needed. You definitely have fewer tools when you do that, but if you are in a situation where you know you want to manage memory manually you will probably evaluate that trade-off.
What most people seem to forget about this is that if you actually need speed neither of these should be a factor. Heap allocations should be minimal and lifetimes should almost always be known ahead of time.
In C++ you basically never need reference counting, unless you are handing off heap allocations to multiple different threads. In that case you don't know what will finish first and they need to communicate when the allocation isn't being used anymore.
Perhaps the most interesting thing is that it's all moot for the vast majority of applications. If it's done reasonably well, it just won't matter, most of the time. And for the times it does matter, you'd probably be better off fiddling with things manually.
Delphi has legendary ARC not just with its VCL and object hierarchies with TOwner model but also with its wrappers of COM's IUnknown and QueryInterfaces. Yes of course, developers can still make mistakes about sequences and memory management but a well constructed class hierarchy and discipline to manage the freeing of resources immediately after they're no longer needed is the cornerstone of basic Delphi applications.
edit: The objects that are IUnknown linked don't even need to be explicitly freed! There are several clever use cases for this for example changing TCursor from hourglass back to normal when a procedure/function is complete. It just works.
So this is a lot of gibberish, but let me sum up RC vs. GC for you:
GC advantages:
* Scales better when done right - cheap allocations, simplified code, etc.
* Can be heavily concurrent and use multicore
RC advantages:
* Predictable performance
* Uses less RAM
For GUIs, RC's smaller memory footprint and predictable performance are great. Hence Swift uses ARC, which is also faster than most RC implementations (which typically have a lot of overhead).
For servers where overall performance matters more than 100% smooth consistency and RAM is cheaper... GCs have some advantages. A GC can run on idle and clean a huge amount of garbage without a problem. It can do so concurrently and leverage the multicore nature of modern CPUs.
This is one of those, right tool for the right case.
> For servers where overall performance matters more than 100% smooth consistency and RAM is cheaper... GCs have some advantages.
I am having a difficult time parsing this in a way that makes sense.
Generally speaking, overall performance on servers is a direct product of being able to take strict control of when things happen and minimizing RAM footprint. That "smooth consistency" is how performance happens because there are important classes of software optimization that don't work if this is not a property of your runtime. Similarly, high-performance server architectures are usually not heavily concurrent even on massively multicore systems for good reasons that a GC doesn't address.
The only advantage a GC has in this context that I can think of is that it might reduce some types of bugs in some types of code.
I come from the world of the JVM, so here we do have multicore servers, and you might see a peak where a particular request takes longer, but overall the performance will be amazing. This isn't OK for a GUI but is great if you have a HUGE server that can chug through tons of data with a huge amount of RAM/cache.
I used to write a lot of performance-optimized server code in Java. Even with heroic effort, it is difficult to make the JVM perform even half as fast as not-that-optimized C or C++ -- other languages I wrote the same kinds of code in. And highly optimized C++ is many times higher throughput than the JVM equivalent on the same hardware, pretty much always. The reasons why are well-understood technically.
This cannot be blamed entirely on the GC. The JVM makes many other choices which make optimization difficult, and its original theories about what is important for optimization haven't stood up over time. It was not designed to be performant, just good enough for most applications that are not that performance sensitive. Programming languages often have many objectives and priorities aside performance. I use Python quite a lot and it is usually much slower than the JVM.
When you say "used" how long ago are you talking about and for what types of loads?
I saw Java rewrites running circles around the C++ code they replaced. It's a matter of tuning and picking the right configurations/tasks. Some things Java does really well, others it sucks at. If you need near memory access and have constant loops over arrays of objects, then Java will be slower than C++ because of the memory layout (Valhalla will improve that).
Network throughput is another weak point which amazingly NodeJS does better. Loom resolves that problem.
There's also a lot of choices when it comes to GCs nowadays. Java used to be a dog on large memory heaps. This is no longer the case with Z, G1 etc. but it's very possible you hit a problematic spot.
One of the nice things about GC performance is that you can connect a monitoring tool to a production server and optimize memory performance by viewing actual usage. Monitoring and observability in Java is at another level entirely.
> If you need near memory access and have constant loops over arrays of objects, then Java will be slower than C++ because of the memory layout (Valhalla will improve that).
In a former life we used Scala magic to pretend we had an array of structs. Otherwise, yeah, this is a huge issue facing JVM apps.
Valhalla will solve this. I disagree that this is a huge issue. There are some edge cases where this is significant but for most web/enterprise applications I don't think this is huge.
But this problem has technically mostly been fixed by Azul and is making its way into the open-source world as well; we have ZGC and so on now. Pauses are rather short even with large heaps.
Computers are constrained in bus bandwidth as well, and it’s harder to add more. RC tends to require strongly consistent counter bumps on the bus even for objects whose exact lifetimes you might not care about so much. The costs of GC are amortized over more work, though the pauses can hurt.
I’m personally exhausted by these conversations and you should be too.
ARC was/is a reasonable compromise: no other choice was obviously better. Legacy code is of supreme importance.
But we have a new benchmark for best-in-class now: for all their insufferable Jehova’s witness Jihad approach, and yeah it’s annoying, the Rust people proved linear/affine in production.
I find Carl and Yehuda’s flagrant profiteering in the early days as unpleasant as the next thinking person, but affine/linear is game changing, and it’s time to stop apologizing for why your pet thing can’t do it.
"I'm now a coding monkey but I'm close to becoming a manager".
I really wish managers and aspiring managers would be snapped out of existence. They should be sent to the mines. If a developer does a job interview and they say they actually WANT to become a manager some day, that's the biggest red flag currently known to me. I'd rather hire a drug addict than an aspiring manager.
1. First-class regions allow us to skip a surprising amount of RC overhead. [0]
2. There are entire new fields of static analysis coming out, such as in Lobster which does something like an automatic borrow checker. [1]
3. For immutable data, a lot of options open up, such as the static lazy cloning done by HVM [2] and the in-place updates in Perceus/Koka [3]
Buckle up yall, it's gonna be a wild ride!
[0] https://verdagon.dev/blog/zero-cost-refs-regions
[1] https://aardappel.github.io/lobster/memory_management.html
[2] https://github.com/Kindelia/HVM
[3] https://www.microsoft.com/en-us/research/publication/perceus...