Hacker News new | past | comments | ask | show | jobs | submit login

I find this whole conversation frustrating or somewhat confusing because it's a weird distinction: Reference counting is garbage collecting. It's not a mark & sweep or generational tracing collector, but it's a collector. And in fact it even traces, though the tracing is explicit (in the form of the declaration or management of references.)

My copy of the Jones/Lins Garbage Collection book I'm looking at on my shelf makes this quite explicit, the chapter on reference counting makes it clear it's one form of garbage collection.

When I was messing with this stuff 20 years ago, that was my understanding of it, anyways.

There are advantages and disadvantages to reference counting. I think some of the confusion in terminology and some of the frustrations (or opposite) here comes from the rather awkward way that Objective-C implements reference counting as something you have to be explicitly aware of and careful about, even with ARC. My brief stint doing iOS dev made this clear to me.

But I've used (and written) programming languages that had transparent automatic garbage collection via reference counting, and the programmer need not be aware of it at all, esp if you have cycle detection & collection (or alternatively, no cycles at all). And, there are compelling RC implementation that are capable of collecting cycles just fine (albeit through a limited local trace.)

Anyways, he's right that RC can be better for things like database handles etc. Back in the day I/we used reference counting for persistent MUDs/MOOs, where the reference counts were tracked to disk and back, 'twas nice and slick.




You might as well classify allocating without freeing as a form of garbage collection. If the implementation can fail to collect cycles and the language can create them, thus leaking memory, lumping in reference counting with real garbage collection is a bad definition because it smears together a crucial distinction.

I get that some people, or many people, or some book, have defined it that way, but others have used it in the meaning as distinguished from plain refcounting. Many developers have always understood GC and refcounting (in languages with mutation) to be mutually exclusive concepts. That definition is used in practice, and when people say, "I'm going to implement garbage collection for my programming language," what they mean is one which doesn't leak memory.

A reason people use this other definition may be that plain reference counting and "having a garbage collector" are mutually disjoint, even if you define "garbage collection" as including reference-counted implementations. Making garbage collection a distinct concept from the set of acts a garbage collector might do will always be fighting against how language usually works.


>You might as well classify allocating without freeing as a form of garbage collection.

Of course, Java even explicitly supports this garbage collection strategy:

https://openjdk.java.net/jeps/318

The best definition I've read comes from Raymond Chen, which is that garbage collection is a system to simulate an infinite amount of memory:

https://devblogs.microsoft.com/oldnewthing/20100809-00/?p=13...

This definition is very elegant in that it allows one to reason about program correctness across all garbage collectors without having to know any of the implementation details, be it reference counting or the many flavors of mark and sweep. Basically garbage collection is not an implementation detail, nor is it a specific algorithm or even class of algorithms, garbage collection is a goal, namely the goal of achieving an infinite amount of memory. Different algorithms will reach that goal to different degrees and with different trade-offs.


They get lumped together because when some people say "garbage collection" they mean "whatever Java specifically does", and when other people say "garbage collection" they mean "automatic dynamic lifetime determination" (as a contrast to C-style "manual static lifetime determination", or Rust-style "automatic static lifetime determination").


RC need not leak memory. If it's done right, it doesn't.

That Objective-C exposed the innards of the system badly enough that it was possible to leak, that's not intrinsic to RC. It's just yet another piece of the awkward legacy puzzle that is Objective-C.

As for cycles.. Cycle detection in RC has been a solved problem for over two decades. I implemented it once, based on a paper from some IBM researchers, if I recall.


Cycles in RC is a fundamentally unsolvable problem because there are situations when you in fact want an unreferenced cycle exist in memory at least for a while.

Imagine a slow async operation such as a network request, where a GUI object is locked in a cycle with a self reference in a callback. Before the network response arrives the GUI object may be removed from the view hierarchy (e.g. the user has navigated away), however, whether you want the cycle to be broken or not depends on your needs. You may want to wait for the results, update stuff and break the cycle manually, or you may choose to completely discard the results by declaring weak self.

Therefore, solving the problem of cycles with RC I think means giving more manual control. Explicit weak pointers solve part of the problem, but the "I know what I'm doing" scenario requires great care and is obviously error prone. To my knowledge this hasn't been solved in any more or less safe manner.


I think in the scenario you describe, you always want to hang onto a reference to the unfinished network activity because you will need to be able to interrupt it and clean it up. You never(?) want to leak active running code in the background, especially not code with side effects, which isn't described by some kind of handle somewhere, in part because there's probably a bug and in part because people can't understand how your code works if there isn't a handle to the task stored in a field somewhere, even if you don't use it.


Of course there's a reference to the network resource and that's the reference that creates a cycle via its callback defined in a bigger object (e.g. GIU view) that awaits the results. There may be reasons why you want to keep the cross-referencing pair of objects alive until the request completes. Not saying there aren't alternative approaches, but it may be easier to do that way.

I came across this while coding in Java, when I realized a network request may be cancelled prematurely just because the GC thinks you don't need the network resource anymore, whereas in fact you do. There's no easy solution to this other than reorganizing your whole request cycle and object ownership. Classical abstraction leak.


I mean, the GUI object with its reference to the network request should never be detached like that. The network request handle should be moved to something else that takes care of it, if the GUI object can't, and it should not be calling a callback tied in a strong way to the GUI object's existence. Granted, I wasn't there when you had this, and I'm not saying there's an easy solution. What I am basically saying is there should be a chain of ownership to some local variable on the stack somewhere, instead of a detached GUI view. I'd view the resource leak you describe as hacky, and a class of design decision that has caused me a lot of pain in gui software. Well, maybe I had to be there.


Yes, you can move your references around to find a suitable place, but how does that solve the problem of lifetime control? How do you guarantee a network resource is alive for as long as you need it under GC, and under ARC?


By holding a reference to it that is grounded to a GC root i.e. an an ownership chain to a local variable that will dispose of it. Also which can be used to interrupt the background task. So the network connection, callback, etc., are not in a detached cycle.

With the design you're describing -- without the ability to interrupt the background task -- I would bet you tend to get race conditions of one kind or another.


How come python works then? It uses rc with cycle detector, doesn't it?


Yes it does.


Can you link to some resources/papers detailing how to solve this problem? I’m only aware of “RC+GC” solution (i.e. adding a full tracing GC in addition to RC to find and collect cycles).


David F. Bacon and V. T. Rajan, Concurrent Cycle Collection in Reference Counted Systems

https://www.cs.purdue.edu/homes/hosking/690M/Bacon01Concurre...


That's the one I was thinking of, and I once implemented that one myself in C++ back in, uh, 2001, 2002, during a bout of unemployment in the .com crash.

I guess I just assumed that there's been some advancement in the world since then.


So many words, so little clarity.


Are you referring to the response by SamReidHughes? I understood it perfectly, and indeed, that was my own understanding as well.

Could you elaborate on what you feel should be explained with more clarity?


> Reference counting is garbage collecting.

I hear you, but this is just a terminology thing.

I was taught that the "garbage collector" term only refers to tracing garbage collectors. It sounds weird and wrong for me to hear the term GC referring to to things like C/Zig's manual memory management (malloc and free) or Rust's borrowck. And it sounds like its the same for you, in reverse.

In my school, the whole class of systems are "memory management systems" and "Garbage Collector" is one such class of approach. (What you would call tracing GCs).

But I don't think either of us is canonically right. Its just a "pop" vs "soda" thing. You and Jones/Lins say "pop". Me and Chris Lattner say "soda". Its no big deal.


Some people refer to only tracing algorithms (mark-sweep, copying, etc) as "garbage collection". The best reason I can think of is marketing.

There are generational reference counting systems e.g. ulterior reference counting [1] and age-oriented collection [2]. "Tracing" usually refers to scanning a heap to work out liveness all at once, which reference counting doesn't do; though deferred reference counting scans the roots to collect, and ulterior reference counting traces a nursery. Deferring these (more common) updates to reference counts improves throughput considerably.

[1] https://people.cs.umass.edu/~emery/classes/cmpsci691s-fall20... [2] https://users.cecs.anu.edu.au/~steveb/pubs/papers/aogc-cc-20...


In the modern sense of the word, a GC is an algorithm that manages memory automatically and without cycles. A GC must maintain the property that a leak is only possible if an object is reachable from a root. RC on its own does not obey this property.

Reference counting is not GC in the modern sense unless it also includes a cycle collector. So Swift is not GC’d (RC only) yet Python is GC’d (RC+CC).


> A GC must maintain the property that a leak is only possible if an object is reachable from a root. RC on its own does not obey this property.

This is certainly a false claim then. What you're describing is a precise garbage collector, but there also exist conservative garbage collectors which will fail to reclaim objects even when no cycle exists so long as any integer interpretation of currently allocated memory shares the same integer interpretation of an object's memory address.

GNU Java, and the Mono implementation of .NET both use the Boehm conservative garbage collector and they are most certainly still considered garbage collected systems:

https://en.wikipedia.org/wiki/Boehm_garbage_collector


This isn’t formal science. We are practitioners and so definitions have to be practical. If Boehm GCs really leak as much as you claim then nobody would use them.

The issue with cycles is that it can easily lead to significant memory leaks if a large subgraph gets caught in a cycle. A definition of GC that excludes this capability does not have the contract of “You only have to worry about reachability from roots”.


>This isn’t formal science.

Computer science is absolutely a formal science in every sense of the word.

>If Boehm GCs really leak as much as you claim then nobody would use them.

I mean if anything this argument applies more to your characterization that reference counting easily leads to memory exhaustion due to cycles than it does to conservative garbage collectors. If RC based garbage collectors leaked as much as your post implies then nobody would use them. And yet Swift, a language that is almost exclusively used with large graphs(that's how most UIs are represented in memory) and that exhibits all kinds of non-deterministic shared ownership uses RC based garbage collection. Objects involved in UI code are shared in wildly unpredictable ways with no clear hierarchy among layouts, event handlers, observables, IO systems, etc...

>A definition of GC that excludes this capability does not have the contract of “You only have to worry about reachability from roots”.

Exactly, because that's not part of the contract for a GC. That's part of the contract for a tracing garbage collector:

https://en.wikipedia.org/wiki/Tracing_garbage_collection

When your definitions don't fit reality then it's time to adjust your definitions, not reality.


>RC on its own does not obey this property.

It is sufficient in a functional language that lacks haskell's lazy capacity to "tie the knot". Erlang's data structures, for example, are fully immutable, and it is not possible to create a cyclic data structure. Since there cannot be a cyclic data structure, reference counting would be sufficient to manage its memory automatically.

It would still suck since it has to constantly write to all objects being referenced, but it would work.


There are RC systems where the counts are held external to the objects, which can be a bit more cache friendly.


You want the opposite behavior in order to be cache friendly, an RC system where the counts are stored next to the object.


Not necessarily. The issue is that releasing a reference at the top of a tree of objects has the potential to walk a large graph of objects (doing downcounts at each) that are not in cache, causing some thrash.

In an ideal world and in a program running with a not-huge # of objects you could keep the entire basket of reference counts in L1 cache, separate from the objects. Then decrements would be potentially cheap. A bit utopian though.


> Reference counting is garbage collecting

I am drawing a blank on google scholar at the moment, but I recall a very popular paper on exactly this topic of RC as the other side of Tracing GC.


"A unified theory of garbage collection"?

https://dl.acm.org/doi/10.1145/1028976.1028982


Nice, will put this paper in my bookmarks for later read.

I feel like once you stop and think about it, the distinction between a 'tracing' GC and a 'non-tracing' GC is specious: To do its work at acquire and release, RC collector must in fact trace a tree of objects. It's just that the semantics of how this is declared might differ, or, more specifically the time(s) in which it does this work.


I think a fair part of the unified theory is that a tracing GC traces live objects, whereas a refcounting GC traces dead objects.


Upcounts trace live objects though. Just more incrementally.


By "trace" I mean the act of consulting the state of an object, then recursively tracing pointers in the object. A tracing GC propagates a "known to be live" state through an object graph while doing a collection cycle, and a refcounting GC propagates a "known to be dead" state through an object graph, when RC drops to zero.


Umm... but isn't the refcounting GC more deterministic? If I create a chain of 10 object and then release the head, I run through 10 allocations and then 10 release cycles on those 10 objects. Period.


Yes! Thank you, I couldn't think of the terms to search for to find this.


ARC makes ref counting a part of semantics in the sense that the deref that reaches zero immediately runs the destructor, and destructors can have any effects they like.

Therefore, it’s not really valid to say that ARC “is” garbage collection.

Reference counting is only a form of garbage collection if it is totally transparent to the programmer, and it’s not like that in ARC.


Which is why when Apple introduced new ARC optimizations last year, they had a talk about possible gotchas that might crash the application due to them.


It would help us all if "gc" term would be used for solutions that require dedicated thread to operate.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: