I develop languages full time, and it's clear to me that RC will make *massive* ...

osaariki · on April 24, 2022

The in-place update work in Koka [1] is super impressive. One of my co-workers, Daan Leijen leads the Koka project and hearing his talks about it have been such a privilege. The work around Koka is really convincing me that functional languages will eventually lead the pack in the effort-to-performance trade-off.

Something that came out of the Koka project that everyone should know about is mimalloc [2]: if your built-in malloc is not doing it for you, this is the alternative allocator you should try first. Mimalloc is tiny and it has consistently great performance in concurrent environments.

[1]: https://koka-lang.github.io/koka/doc/index.html

[2]: https://github.com/microsoft/mimalloc

ashton314 · on April 24, 2022

Ooooh… Koka is a really neat language. I first encountered it in a seminar when we talked about algebraic effect handlers. Koka has these row types which effectively allow you to statically ensure that all your effects get handled at some point, modulo certain ordering constraints on when those effects get handled.

Essentially, you get nice purity around effects but they're way easier to compose (and imo grok) than monads.

If anyone is interested in learning more, definitely take a look at this [1] paper by the aforementioned Leijen. (OP, very cool that you get to work with this guy.) One of the best-written papers I've seen.

[1]: https://www.microsoft.com/en-us/research/publication/algebra...

staticassertion · on April 24, 2022

I'm really excited for koka. I still need to dig in and actually try writing some code but it looks so promising. At a minimum I hope that it inspires other languages to look at these features.

There's definitely bits I don't love, but they're nothing I couldn't get over.

nsm · on April 25, 2022

I actually attempted some of 2021's Advent of Code in Koka.

It was fun for the first couple of questions, but then got really painful for a couple of reasons. None of this is a complaint on the project, as it is still a research language and has not had a lot of effort invested into getting others to use it.

What I really liked was how easy it was to write parsers due to the effect system. i.e. you'd write a parser that looked like a direct imperative translation of the BNF grammar or the problem statement, and then the effect system would propagate the actual input through the call graph and each parser would only have to deal with its small part.

However several things made me get frustrated and stop which wouldn't have happened with better documentation and more source code to look at.

As someone who has no theory background in effect systems, the syntax was very oblique. i.e. when do I need to mark that my function needs to have a `| e` effect or a `, e` effect. I'm not a fan of the Haskell style of introducing single-letter free variables without any syntactic convention. For example, in Rust, the tick is used to clearly mark that something is a lifetime annotation. The rules about how lifetimes propagate, and so when you need to introduce 2 or more lifetimes are explained clearly (if not in the Rust docs, then somewhere else on the internet). That wasn't the case for Koka's effect syntax. In addition, I would sometimes stumble onto effect annotations that should've worked according to my limited understanding, but didn't. The compiler errors in such cases were really meaningless to me. Other times, I feel like the errors almost indicated that what I was doing was semantically correct, but was not yet implemented in the compiler. Also apparently the `_` has special meaning at the beginning of an effect annotation, but I only found this out from trial and error. Eventually I would just give up on annotating any functions until the compiler complained, but this made it difficult for me to understand my own code (it was like writing Python without type annotations).

The API docs are mostly just type signatures, which means unless it is some simple function, or it is used in the koka implementation sources, you are left scratching your head about how to use the standard library. Things like `ref()` types could also use much more elaboration.

Also, at the end of the day, Advent of Code solutions are generally pretty small scripts, where there isn't a need for effect systems. It was often easier to just use Python and its solid standard library.

Overall, it is an interesting language and I'll definitely keep an eye on it.

My very limited code if anybody is interested: https://github.com/nikhilm/AdventOfCode2021

staticassertion · on April 25, 2022

Yes, one of my issues is the syntax as well.

nu11ptr · on April 24, 2022

This. Said much better than my other posts, but this is exactly what I was trying to say. Naive ARC vs GC... a good GC will kill it, so don't make the ARC naive...cheat and find ways to skip most ref. counter bumps. It became amazingly clear when I started coding Rust: you only have to bump ref count when you get a new owner. Most RC objects die with just a few counter bumps. Most objects don't need RC at all (most objects can be single owner). Now imagine a lang (lobster given as an example above) that does this automatically. Sweet spot.

thesz · on April 24, 2022

"Garbage collection can be faster than stack allocation" is actual title of this paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84...

However non-naive your automatic counting scheme can be, it still has to touch object at least twice. However non-naive your automatic counting scheme can be, it still has twice as much overhead per object compared to GC - structure and count vs just structure. Sometimes you can factor structure out - by saying "the allocator page contains pointer to structure at the very start", - but this can be done for GC too. I guess in that case GC wins even more. And you need structure (virtual methods table, actual object structure, etc) because release of object should decrease counts of referenced objects.

Finally, reference counting can and will fragment heap.

nu11ptr · on April 24, 2022

To be clear, I don't hate tracing GC, and in fact find all forms of GC fascinating. It is definitely "trade offs". I just happen to program in a space where I like the trade offs of ARC better than those of a tracing GC. This could change if I start doing different forms of coding, and in higher level languages I often do like a tracing GC, but Rust (and these new upcoming languages) are making me rethink this a bit atm.

> However non-naive your automatic counting scheme can be, it still has to touch object at least twice.

Only if the object is actually ARC. The whole goal is omit that entirely. Most objects live and die on the stack via proving single ownership. To be fair, I'm not sure why the same optimization couldn't be done for a tracing GC (many will say GC allocation and frees are already mostly 'free' because you compact live not dead objects, but by putting everything on the GC path, you force the need for a collection sooner than if it wasn't heap allocated at all).

> Finally, reference counting can and will fragment heap.

This is true, but in return we get deterministic destruction and easy FFI. If we can avoid a solid # of our allocations entirely this becomes less of an issue. When building things like trees which hit the allocator hard we use arenas, which are more like a traditional "tracing GC" minus the tracing (bump allocator + deterministic free when done, all at once).

zozbot234 · on April 24, 2022

> I just happen to program in a space where I like the trade offs of ARC better than those of a tracing GC.

Even ARC has to be augmented with some tracing in order to deterministically collect cycles. Ideally, of course, we would want to describe things like "non-cyclic ownership graph" as part of the type system, so that the compiler itself can decide where the addition of tracing infrastructure is necessary. (Delayed deallocation, as often seen in tracing, concurrent etc. GC, might be expressed as moving the ownership of finalized objects to some sort of cleanup task. But if you code carefully, the need for this ought to be very rare.)

kaba0 · on April 25, 2022

> To be fair, I'm not sure why the same optimization couldn't be done for a tracing GC

I may misunderstand you here, but isn’t it just escape analysis done by basically every big GC language out there?

verdagon · on April 24, 2022

RC might not fragment the heap for long! Someone built a defragmenting malloc called Mesh [0] which uses virtual page remapping to effectively merge two non-overlapping pages. Pretty brilliant stuff!

[0] https://tiba-jrchang.medium.com/mesh-compacting-memory-manag...

astrange · on April 24, 2022

Heap fragmentation is a real issue, but "GC is fast if you throw physical memory at it" is actually a very bad argument for it on personal computers, especially if they have to fit on your wrist, and especially if you've invented technologies that make scanning memory more expensive like swap since 1986.

ObjC/Swift have some tricks to avoid allocating refcounts, mainly relying on storing them in unused pointer bits.

thesz · on April 24, 2022

On the other hand I would like to present to you an LMAX Disruptor: https://lmax-exchange.github.io/disruptor/

Software based on this intra-program message bus usually follows "throw memory at it". That software is written in Java (hence GC) and process millions of transactions per second. The memory amount is so high that GC is not triggered at all during operations hours. Nevertheless, this is not achevable with refcounting because, well, refcounting has to keep counts.

Refcounts, having non-local access patterns, do not play well with caches, let alone something that have much higher latency like swap.

Garbage collection can be used to improve locality: https://safari.ethz.ch/architecture/fall2017/lib/exe/fetch.p...

Refcounts have troubles with cyclic structures.

I guess these tricks of refcount allocation avoidance will work fine unless there's a scale-invariant graph with several tens of thousands nodes. In scale-invariant graphs there is some number of nodes that have links from almost all other nodes.

Actually, I've got exactly that experience: typical weighted finite state transducer's HCLG graph for automatic speech recognition is scale-invariant. Analysis over it (in C++ using OpenFST) usually was done reasonably quickly, but there was a substantial delay between final printf and actual program exit, when all these pescy refcounted smart-pointed structures finally got freed.

I write all that for other readers to have more unbiased view of the problem.

As for me, I greatly prefer arenas - much like "automatic regions with reference counting" mentioned above. Arenas allow for cyclic structures, easy to implement even in C and allow for garbage collection algorithms.

astrange · on April 25, 2022

> I guess these tricks of refcount allocation avoidance will work fine unless there's a scale-invariant graph with several tens of thousands nodes. In scale-invariant graphs there is some number of nodes that have links from almost all other nodes.

This wouldn't come up in ARC, though mostly because it doesn't detect cycles so you wouldn't use strong references between graph nodes. (Overly large refcounts can still happen other ways but usually get optimized out.)

As a quick solution, I think you'd keep a list of valid nodes and have it be the only strong references to them. This is like an arena, except it doesn't have the possibility of references escaping the arena and being deallocated.

KMag · on April 26, 2022

Reference counts don't need to be a full size_t. Particularly when you're using a mark-sweep or mark-sweep-compact collector to collect reference cycles, you can use a saturating counter, and just give up on reference counting that object if the reference count ever hits its maximum value.

There are tons of applications where lots of objects only ever have one reference, so in addition to the single mark bit in your object header, you can have a single bit reference count that keeps track of if there are "one" or "more than one" references to the object. If a reference passes out of scope and it refers to an object with the reference count bit unset, then it can be immediately collected. Otherwise, you need to leave the object for the tracing collector to handle. This is useful in cases where you have a function that allocates a buffer, passes it to a function that populates the buffer (without storing a reference to the buffer into any objects), passes that buffer to a function that reads out the data (again, without storing the reference into any objects), and then returns.

In the common case of objects where a C++ unique_ptr would do, this single bit reference count never gets modified. There is more overhead in checking the bit when the reference goes out of scope, but there is never a decrement of the reference count, and the common case doesn't have an increment.

mananaysiempre · on April 24, 2022

> Finally, reference counting can and will fragment heap.

Is there anything that won’t fragment heap and simultaneously doesn’t insist on owning the world? It is really, really painful to allow references to cross the GC heap boundary in both directions (witness the GNOME Shell not-a-memory-leak), and to me that is the principal reason: in a low-level language, you can RC only a bit of stuff, but you cannot really moving-GC only a bit of stuff.

(I don’t expect an affirmative answer to this question as stated, but maybe some more basic abstraction for manual memory management would allow this?..)

nu11ptr · on April 24, 2022

Two more interesting ways to speed it up:

- Make threads a "special" tracked API call (for example, like Rust) so that you know when something will or won't be passed between threads. If it never "escapes" the thread, you can use non-atomic ref counts for the lifetime of that object (NOTE: I should add this is something I've been thinking on - I don't know how hard/feasible it is to do in practice)

- Use biased RC - don't have link to paper handy, but idea is that most multithreaded objects spend most of their life on one thread and when they are on that thread, can use non-atomic ref counts by using a separate counter. When on any other thread they use an atomic counter. When both counts hit zero, object is destroyed.

verdagon · on April 24, 2022

Definitely! Vale, Lobster, and Cone all require explicit annotations for anything that could cross thread boundaries.

Vale's design bends the rules a little bit, it blends structured concurrency with region borrow checking to allow things to temporarily escape the current thread. [0]

I believe Lobster spawns a new VM per thread, and has some special abilities to share data between them, but I'm vague on the details.

Cone requires an explicit permission to allow crossing thread boundaries, which I think will work pretty well. [1]

I've read about biased RC before, but have been skeptical because of the branching involved. How has it worked out in practice?

[0] https://verdagon.dev/blog/seamless-fearless-structured-concu...

[1] https://cone.jondgoodwin.com/safety.html

nu11ptr · on April 24, 2022

I've just read about it, never actually used it verbatim tbh, but just found it kinda fascinating. I wrote a somewhat modified version (I don't track local thread ID, I use Rust's type system and Send/Sync properties instead) of the algorithm for a very different purpose: I use it to allow safe moves back and forth between (non-atomic) Rc and (atomic) Arc objects. For this purpose, it is very useful. The actual Rc/Arc operations are no more expensive than regular Rust Rc/Arc (except for the last decrement on Rc - slightly more expensive)

I got the idea from this person (I didn't use their algorithm though - they probably do it similar to how I do it though): https://crates.io/crates/hybrid-rc

rfoo · on April 24, 2022

Biased GC is implemented the "nogil" Python fork. It worked okay-ish.

elcritch · on April 24, 2022

Nim's ARC memory management uses the first strategy. Non-atomic refs combined with moves really speed things up! Moving data across threads requires the data be of a special type `Isolate`. It can be a bit tricky to prove to the compiler that the data is isolated, e.g. single ownership.

amelius · on April 24, 2022

> Naive ARC vs GC... a good GC will kill it, so don't make the ARC naive...cheat and find ways to skip most ref.

However, what if you applied that same principle to a GC'd language ...

rectang · on April 24, 2022

If you work on software that has hard real-time constraints, e.g. live audio processing, do any of the improvements to GC matter?

GC reminds me of satellite internet: no matter how fast it gets in terms of throughput, the latency will always be unacceptable — but because the latency issue can't be solved, advocates will keep trying to tell you that latency doesn't matter as much as you think it does and your problem isn't really a problem. (All sorts of bad experiences in modern software come down to somone trading away latency to get throughput.)

Rather than the framing of "GC vs. ARC", I'm mostly interested in improvements to the ownership model.

verdagon · on April 24, 2022

Ironically, even though I believe RC will make great strides, I actually agree with you. Single ownership has the most potential, especially if we decouple it from the borrow checker as we are in Vale. [0] Being able to control when something is deallocated is quite powerful, especially for real-time systems.

[0] https://verdagon.dev/blog/generational-references

Tozen · on April 28, 2022

> "GC reminds me of satellite internet."

> GC reminds me of satellite internet: no matter how fast it gets in terms of throughput, the latency will always be unacceptable — but because the latency issue can't be solved, advocates will keep trying to tell you that latency doesn't matter as much as you think it does and your problem isn't really a problem.

LOL, like this statement, and will have to borrow it for future discussions/debates.

pjmlp · on April 24, 2022

There are real time GCs when one is willing to put the money on the table.

https://www.ptc.com/en/products/developer-tools/perc

https://www.aicas.com/wp/products-services/jamaicavm/

zozbot234 · on April 24, 2022

Note that Rust will most likely be getting some support for these features too, at some point. It's what much of the academic PL research on borrow-checked languages is pointing towards. You can alreasy see some of the practical effect with "GhostCell" and similar proposals, even though these are arguably much too clunky and unintuituve for practical use (notably, they use obscure features of the Rust lifetime system to implement their equivalent of regions). Language-level support is likely to be rather easier to work with.

forrestthewoods · on April 24, 2022

Fascinating. I’ll be honest, I’m moderately skeptical! Various flavors of GC has caused me great pain and suffering. The promise has long been “GC, but less bad” and so far that has not come to fruition.

However, I’m ecstatic at how many languages are trying new and interesting things. The ecosystem of novel languages feel more rich and vibrant now than at any other point in recent memory. Probably due to LLVM I think?

You should give a talk at Handmade Seattle in the fall! Would be a great audience for it I think.

sillysaurusx · on April 24, 2022

Ehh. Who cares, though? And I mean that in the least dismissive way.

Why do you care? Sell me on it. GCs have gotten to the point where the slowdown is almost the butt of jokes. LuaJIT proves that it can get damn close to C speeds.

What you’ve said sounds like a whole lot of complexity for … what?

Give me the automatic threading speed improvements with no extra programmer effort and I’ll change my stance 270 degrees.

iscoelho · on April 24, 2022

1. LuaJIT can be faster than C due to optimizations regarding function calls.

2. GC is LuaJIT's greatest weakness [1] (It is very slow) - Most high performance projects disable it and only use C FFI allocations.

[1] http://wiki.luajit.org/New-Garbage-Collector

LuaJIT 3.0 (which will have the new garbage collector) also has no ETA and is not in development.

verdagon · on April 24, 2022

A valid question, happy to enlighten!

To get the automatic threading speed improvements, a (region-aware) language might add a `parallel` keyword in front of loops, to temporarily freeze pre-existing memory, thus completely eliminating the overhead to access it. Vale is building this for its generational references [0] model but I hope that more RC-based languages attempt something like it. Can't get much easier than that ;)

Lobster's algorithm doesn't actually incur any additional complexity. It's completely automatic.

Perceus and HVM require immutability, which does require programmer effort, I admit. Pure functional programming just isn't as convenient as imperative programming for most. However, languages are learning to blend mutable and immutable styles, such as Pony [1], Vale [2], and Cone [3]. When a language allows you to create an object imperatively and then freeze it, it's wonderfully easy.

I think the next decade in languages will be about this exact topic: making it easy to have fast and safe code. I think RC will soon bring some very compelling factors to that equation.

[0] https://verdagon.dev/blog/seamless-fearless-structured-concu...

[1] https://tutorial.ponylang.io/reference-capabilities/referenc...

[2] https://vale.dev/guide/structs#mutability

[3] https://cone.jondgoodwin.com/coneref/refperm.html

MaxBarraclough · on April 24, 2022

I think the real point of their question was along the lines of GC works great and these days it's pretty fast, why research cleverer reference-counting?

Will these developments lead to appreciable performance improvements for 'typical' non-embedded, non-real-time programs?

Someone · on April 24, 2022

I think the consensus is that GC works great _if_ you throw enough memory at it (about three times what you really need)

On mobile, the problem with that isn’t as much that memory costs money, but that it costs Joules. Peak power usage of RAM may not be large compared to that of your screen, CPU, GPU, GPS, etc. but AFAIK, with current tech, memory keeps eating energy even if your device is sleeping.

Being able to run with less memory, I think, is why iPhones can do with relatively small batteries and yet have decent stand by times.

astrange · on April 24, 2022

The fastest system currently out there is iOS, which (as this article says) doesn't use GC and does use clever reference counting, so yes?

Note, most people here arguing that GC is actually always faster are forgetting that reading memory itself has a cost, especially in systems with swap or other reasons it might have a /really/ large cost.

kaba0 · on April 25, 2022

You shouldn’t attribute the difference to the GC algorithm used when there are generations between ios’s CPUs and android’s.

astrange · on April 25, 2022

If you won't believe me, you could believe Chris in the article.

> Chris Lattner: Here’s the way I look at it, and as you said, the ship has somewhat sailed: I am totally convinced that ARC is the right way to go upfront.

nu11ptr · on April 24, 2022

For me the biggest benefit is deterministic destruction, specifically for the non-memory resources it is holding (files, locks, etc.) the same as if it was stack allocated. It is very freeing knowing that is just being taken care of. No 'defer', 'with' or other special handling needed.

But possibly the bigger benefit: no heap allocation for many objects at all. If you can prove single ownership, you can stack allocate it (or "inline it" if in a struct/object).

shakna · on April 24, 2022

> No 'defer', 'with' or other special handling needed.

Just as an aside - in Python, it surprised me that 'with' is _not_ a destructor. The object created will remain in-scope after the 'with' body, but the '__exit__' method will be called. Which for the canonical example of file-handling, only really closes the file.

jnwatson · on April 24, 2022

The done vs destroyed dichotomy is a cross-cutting concern for programs that need explicit control over resource lifetimes.

RAII is actually an antipattern in Python if you actually care about resources being released.

shakna · on April 24, 2022

The surprise is less about resource usage and more about the scope lifetime. 'with' introduces a new binding that outlives its body. It looks, at first glance, to be a temporary variable. But it isn't.

chombier · on April 24, 2022

I always wondered why does this variable leak outside the `with` scope, because this is really the one place where it should not.

IIRC python3 fixed scope leaks in list comprehensions, does anyone know why they didn't fix this one as well? Or does it have actual uses in practice?

shakna · on April 24, 2022

I don't know the _motivation_ for why it leaks, but it leaks because the method calls of a context manager are __enter__ and __exit__, but not __del__. [0]

The motivation is especially opaque, because though CPython does run __exit__ immediately, it isn't guaranteed to do so by the standard. So __exit__ will happen at some point, like __del__ with the GC, but it won't actually call __del__, which makes GC'ing the value more complicated as __del__ can't safely run until __exit__ does.

I _think_ the motivation is something to do with __exit__ being allowed to throw an exception that can be caught, whereas __del__ isn't allowed to do that, but I don't find it that convincing. And I've never seen an example where the introduced variable is ever referenced afterwards in Python's documentation (including in the PEP itself).

[0] https://peps.python.org/pep-0343/

chombier · on April 24, 2022

Thinking about it, my guess is that properly scoping the variable does not really prevent the user from leaking the value anyway:

    with context() as foo: 
        leak = foo

This is unlike list comprehensions, where it is syntactically impossible to bind the comprehension variable to a new name accessible outside the comprehension.

So maybe the reason is that it is not really worth the trouble.

ptx · on April 24, 2022

Everything is always* scoped to the entire function in Python. Perhaps that's not always ideal, but at least the rule is simple.

* Well, not variables in generator expressions. There might be a few other exceptions.

pphysch · on April 24, 2022

Consider the common pattern:

    with open(path) as file:
        result = func(file.read())

    another_func(result)

If `with` used a separate scope then it would be a lot less ergonomic. How might we bring data outside that scope and be confident that `with` isn't forcibly cleaning things up? Would be a lot more verbose.

shakna · on April 24, 2022

I think you've missed what I've meant by a new binding. That the body executes is entirely intuitive.

However:

    with open(path) as file:
        ...

    if file.mode == 'r':
        print("Huh? file is still a thing?")

xscott · on April 24, 2022

Python doesn't have block scope, but most people don't seem to notice:

    def foo():
        if False:
             x = 123
        print("x is visible out here:", x)

That will generate an error because `x` is "referenced before assignment", but the variable exists for the entire function. This can bite you in subtle ways:

    x = "is global out here"
    def foo():
         print("all good:", x)

    def bar():
         print("not good:", x)
         if False:
              x = "now x is local, even before this declaration"

oaiey · on April 24, 2022

I would argue that resources need always explicit handling, no matter whether you do reference counting or GC. So if I do a one-liner to dispose an object or call delete on an object is the same beauty. Wait till the variable is out of scope together with many more or hit the GC cleanup is just not the right way to clean up externally resources in a proper way.

SemanticStrengh · on April 24, 2022

What does it bring vs Java autocloseable?

nu11ptr · on April 24, 2022

Java's "try with resources" is effectively the same as Go's 'defer' and Python's 'with', so the advantage is that you can't forget to do it. No matter what you do, when the object goes out of scope, the resource will be closed.

pjmlp · on April 24, 2022

If you actually use static analysis, you can't forget it in Java either.

And it isn't as if using languages without static analysis is still considered a best practice in the 21st century.

tialaramex · on April 24, 2022

> And it isn't as if using languages without static analysis is still considered a best practice in the 21st century.

So in summary, "Please just do better". Not helpful as a video game meme, and really no better for programming.

pjmlp · on April 24, 2022

I guess that is why no one bothers with clippy then, it is superfluous.

tialaramex · on April 24, 2022

Everywhere that a clippy lint is there to catch bugs, I wish it wasn't a clippy lint because it won't always get run and those bugs will get missed.

An example of a clippy lint that I like in clippy:

If you try to transmute() &[u8] into a &str clippy will point out that you should probably from_utf8_unchecked() -- this lint results in code that does the same thing but is clearer for maintainers about what we intended. If our reaction to the lint is "But it isn't UTF-8" then we just got an important red flag, but that's not why the lint is there.

On the other hand, clippy lints for truncating casts and I want those gone from the language so the warnings ought to at least live in the compiler not clippy. I would like to see a future Edition outlaw these casts entirely so that you need to write a truncating cast explicitly if that's what you needed.

throwaway2037 · on April 24, 2022

Is Java escape analysis done at compile time or run time? I don't know.

FYI: As I understand, escape analysis is frequently used to either (a) avoid allocating on the heap, or (b) auto-magically free-ing a malloc'd object from a tight scope.

pjmlp · on April 24, 2022

First of all it depends on the implementation, the language specification doesn't state how it works.

https://en.m.wikipedia.org/wiki/List_of_Java_virtual_machine...

As it stands is mostly done at runtime and only for a).

Otherwise you are better with try-resources, and for something like malloced memory there are off-GC heap APIs.

Also Valhalla will eventually land, although it is getting long by now.

vips7L · on April 24, 2022

Personally I wish they would move some escape analysis to compile time just to help with memory consumption. I’m assuming it would require byte code changes though.

SemanticStrengh · on April 27, 2022

Kotlin doesn't allow mistakes https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.io/use.h... For structured concurrency too.

xxs · on April 24, 2022

>the non-memory resources it is holding (files, locks, etc.)

Locks are entirely a memory construct. You can trivially write one, e.g. Java locks are written entirely in java. The files descriptors (incl. socket) - they should not be relied to be GC'd by default, they do require their own lifecycle. For instance: c# Disposable/ java's try with resource.

>no heap allocation for many objects at all.

This has been already the case for many a year - reference escape analysis, ref. escape elission, trace compilation.

tialaramex · on April 24, 2022

Locks are, in fact, one of many resources your operating system can give you.

Historically, in concurrent software your only option was to either use these OS allocated locks or build a spin lock from atomic operations (and thus burn CPU every time you're waiting for a lock). The OS allocated locks, such as from sem_open() or create_sem() must later be properly released or you're leaking a finite OS resource that will only be reclaimed on process exit or in some cases only at reboot.

Today your operating system probably offers a lightweight "futex" style locking primitive, in which so long as it's not locked no OS resources are reserved it's just memory. But, you're not guaranteed that all your locks are these lightweight structures and even if you were you need to clean up properly because when they are locked the OS is using resources so you need to unlock properly or you'll blow up.

xxs · on April 24, 2022

The only part needed to build a lock is to be able to 'sleep' the current thread and the latter to be able to be awaken, plus compare-and-swap or load linked/store conditional. There is nothing really special about the locks as long as the current thread can be made dormant. Technically a single OS lock (e.g. a futex), associated with the current thread would suffice to implement as many locks as one desires. The lock would be released when the thread dies in the clean up code.

The locks that result in thread awaiting are still in-memory, of course. They are just OS provided primitives. Yet again, a well behaved/designed lock should be non-contested for the most calls (easily 99.9%+). In that case there is no mode-switch to get to the kernel. A contested lock that goes to the kernel is a context switch.

In the end there is no need to do anything special about locks as they can implemented in the user-space unlike the file descriptors.

tialaramex · on April 24, 2022

> In the end there is no need to do anything special about locks as they can implemented in the user-space unlike the file descriptors.

If you get semaphores from create_sem() you will need to give them back, you can't just say "This shouldn't be an OS resource" and throw it away, exactly like a file descriptor it's your responsibility to dispose of it properly.

Speaking of which, depending on the OS, your file descriptors might also just be implemented in user space. They're just integers, it's no big deal, of course you must still properly open() and close() files, the fact it's "implemented in user space" does not magically mean it isn't your problem to clean up properly.

Likewise, if you opened a sqlite database, you must close it properly, you can't just say "Eh, it's only userspace, it will be fine" and throw away the memory without closing properly.

xxs · on April 24, 2022

>If you get semaphores from create_sem() you will need to give them back...

I don't quite get the continued argument - the created locks in the user space have 'zero' extra dependency on the kernel. In a GC setup they would be an object no different than a mere string (likely less memory as well). They can be GC'd at any time as long as there are no references to them.

If you wish to see an example: java.util.concurrent package. ReentrantLock[0] is a classic lock, and there is even a Semaphore in the same package (technically not in the very same). Originated in 2004 with JMM and the availability of CAS. You can create as many as you choose - no kernel mode switch would occur.

For the record I do no object that proper resource managed is needed - incl. sqldatabase close. The latter does use a file descriptor any time you establish a connection. However, locks need no special support from the kernel. Also in GC'd setup file descriptors would be GC'd in pre (or post) morterm via some finalization mechanism, however relying on the GC to perform resource management is a terrible practice.

[0]: https://github.com/openjdk/jdk/blob/master/src/java.base/sha...

tialaramex · on April 25, 2022

> I don't quite get the continued argument

I keep telling you that, in fact, these primitives can literally be OS resources such as the semaphores from the BeOS system call create_sem() or the POSIX sem_open or Windows CreateSemaphore.

Your insistence that these don't have a dependency on the kernel, that they're just memory to be garbage collected is wrong, they aren't, this won't work.

You propose the Java ReentrantLock class as an example, but, that's just a thin wrapper for the underlying Java runtime parking mechanism, so, sure it's just a Java object no big deal, but the actual implementation (which is not in the file you just linked) requires an OS primitive that you've not examined at all, that OS primitive is being managed by the runtime (via the "Unsafe" interface) and is excluded from your garbage collection entirely.

In essence you get to "just" garbage collect these locks because they're only a superficial Java language shadow of the real mechanism, which you aren't allowed to touch and can't garbage collect.

xxs · on April 25, 2022

>requires an OS primitive that you've not examined at all, that OS primitive is being managed by the runtime (via the "Unsafe" interface) and is excluded from your garbage collection entirely.

This is most definitely false. The methods needed are the compare-and-set + LockSupport/park (parkNanos) and LockSupport.unpark. The former is a standard CPU instruction w/o any special magic about and the latter is setting the current thread to sleep and being awaken.

As for examine: I know how j.u.c works down the OS calls and generated assembly. Unsafe is a class (not an interface), albeit mostly resulting in compiler magic.

The use of unsafe is solely for: LockSupport::park/unpark and AbstractQueuedSynchronizer::compareAndSet. ReentrantLock is NOT any thing wrapper.

tialaramex · on April 25, 2022

> As for examine: I know how j.u.c works down the OS calls and generated assembly.

And here's the big reveal. It's actually calling the OS for "setting the current thread to sleep and being awaken" ie it's just using the OS locks and those are hidden from you so that you can't garbage collect them.

> Unsafe is a class (not an interface), albeit mostly resulting in compiler magic.

Unfortunate choice of language, I didn't mean a Java interface, but an interface in the broader sense. I assure you that "setting the current thread to sleep" is not "compiler magic", it's just using the OS locks, on a modern system it will presumably use a lightweight lock but on older systems it'll be something more heavyweight like the semaphores we saw earlier.

The claim that LockSupport, which is calling the OS is not a wrapper is bogus.

But all this is a distraction, even on systems where old semaphore APIs are now rarely needed they still exist and if you hold one you are still responsible for cleaning it up properly even if you think you know better.

xxs · on April 25, 2022

There is a single OS call to put the current thread to sleep; I mentioned it much earlier - a single pthread_mutex_t per Thread NOT per lock

Those are not any part of the developer capability to manage, aside starting the thread (and locking on it). The mutex is released with the thread, itself; Technically hotspot impl java does not ever releases them back to the OS but keeps a free-list. The thread is a lot larger OS resource as well. No new OS resources are exhausted by the OS b/c of the extra locking.

>The claim that LockSupport, which is calling the OS is not a wrapper is bogus.

Of course it does call the OS; it happens on the dormant part (w/o an OS call it'd a busy spin), or if a thread needs to be awaken. I said not a 'thin' wrapper. All the locking and queueing mechanism is outside the OS. The major distinction is there is only a single OS primitive per Thread NOT per lock. (Edit: now I see I have spelled thin as 'thing' which is a definite mistake on my part)

  /*
   * This is the platform-specific implementation underpinning
   * the ParkEvent class, which itself underpins Java-level monitor
   * operations. See park.hpp for details.
   * These event objects are type-stable and immortal - we never delete them.
   * Events are associated with a thread for the lifetime of the thread.
   */

https://github.com/openjdk/jdk/blob/c4c76e2f34f7e709f8b3c960...

  // ParkEvents are type-stable and immortal.
  //
  // Lifecycle: Once a ParkEvent is associated with a thread that ParkEvent remains
  // associated with the thread for the thread's entire lifetime - the relationship is
  // stable. A thread will be associated at most one ParkEvent.  When the thread
  // expires, the ParkEvent moves to the EventFreeList.

https://github.com/openjdk/jdk/blob/739769c8fc4b496f08a92225...

Edit: My point has always been that locks/monitors/semaphores/etc. dont need any special care in a GC-setup as they can be implemented purely in the language itself, and without special care to release them, and manage them similar to file descriptors. The need to block (become dormant) obviously requires OS calls and support but that part doesn't have to scale with the amount of locks/semaphores/latches/etc. created.

tialaramex · on April 25, 2022

> now I see I have spelled thin as 'thing' which is a definite mistake on my part

That does make a lot more sense. I agree that what's going on here is not merely a thin wrapper.

> My point has always been that locks/monitors/semaphores/etc. dont need any special care in a GC-setup as they can be implemented purely in the language itself

And my point was always that you can only depend upon this if you in fact are always using the "implemented purely in the language itself" mechanisms, while plenty of others exist and you might need them - but you can't just treat them as garbage safely.

So overall I think we were mostly talking at cross purposes.

Actually looking at the Java stuff reminded me of m_ou_se's thread about how big boost::shared_mutex is (it's 312 bytes). Most people who need one of these synchronization primitives only really need the one that costs 4 bytes, but it's hard to keep saying "No" to people who want to add features. They're always confident that everybody will love their feature, at the cost of a little more memory, and it spirals out of control.

On most systems the pthread_mutex_t you mentioned is already 40 bytes. A "nightly" Rust can do a 32-byte array with a mutex to protect it, all inside 40 bytes.

12thwonder · on April 24, 2022

like Chris said in the article, it is difficult for GC languages to interop with C/C++ codebase.

in addition, if your system requires precise control over when and when not to use CPU, like resource-intensive gaming, browsers, OS, then GC may not be a good choice.

lastly, if you are using GPU via graphics API, I do not know any API that can garbage collect GPU memory but RC can naturally extend memory cleanup code to cleanup GPU memory as well quite easily.

davnn · on April 24, 2022

CUDA.jl apparently uses Julia‘s GC to free up GPU resources, see https://cuda.juliagpu.org/stable/usage/memory/

12thwonder · on April 24, 2022

good to know, thanks!

I personally would like to know when the GPU resources become available (freed) thus still inclined to prefer RC over GC tho.

kazinator · on April 25, 2022

No it isn't. Chris put his foot in his mouth and went on about how an advantage of garbage collection is that it moves objects. Not all garbage collection moves objects. Garbage collection that does not move objects does not have any such interop problem. Furthermore, even GC that does move objects can avoid that particular interop problem if it simply avoids moving foreign objects. E.g. suppose a foreign object is managed by some gc-heap object of type "foreign pointer". That managing object can be moved during GC, but if the actual foreign pointer doesn't change. If that pointer is all that the foreign code knows about, then there is no problem. Don't give foreign code any pointers to your heap objects and you're good.

hmlongco · on April 26, 2022

Uh... I can solve entire batches of problems by just saying, "don't do that and you're good."

Unfortunately, real-world constraints always seem to get in the way...

nine_k · on April 24, 2022

Think about the typical heap size for a GC-based language. It's usually 50-70% larger than a working set, to let GC run efficiently.

throwaway2037 · on April 24, 2022

To be clear, in enterprise (non-embedded / non-mobile) programming, this is fine. Money (RAM) isn't the issue. Programmers with domain knowledge are much more expensive than adding more hosts or RAM. I am talking about the general case; I know HN loves to focus on "high performance / low latency", but that is less than 1% in "Enterprise Real World".

kazinator · on April 25, 2022

I don't know if there is a typical heap size.

My TXR Lisp is garbage collected. When I blow away all the compiled files of the standard library and run a rebuild, in a separate window running top, I do not ever see a snapshot of any txr process using more than 27 megs of RAM:

This is close to the worst-case observed:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  19467 kaz       20   0   27364  24924   3112 R 100.0  1.2   0:30.55 txr

When this is happening, the library code is being interpreted as it compiles itself, it by bit. E.g. when stdlib/optimize.tl is being compiled, I see this 27 megs usage. But after the entire library is compiled, if I just "touch stdlib/optimize.tl" to force that to rebuild, it's a lot leaner; I see it hit about 16 megs.

Sitting at a REPL prompt, compared to some bash processes, using ps -aux:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
  USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  kaz       2433  0.0  0.0   8904  1900 pts/0    Ss   Feb22   0:00 -bash
  kaz       6179  0.0  0.2   9072  4212 pts/1    Ss+  Apr22   0:01 -bash
  kaz      18551  0.0  0.1  11152  3932 pts/2    Ss+  Feb22   1:32 -bash
  kaz      18836  0.0  0.1   9064  2196 pts/5    Ss+  Apr13   0:02 -bash
  kaz      19079  0.0  0.2  13464  4880 pts/3    Ss   Mar01   0:40 -bash

  kaz      19849  0.1  0.2   9360  6024 pts/4    S+   23:45   0:00 txr

Oh, and that's the regular build of TXR. If that's too bloated, you can try CONFIG_SMALL_MEM.

Large is as large does.

Garbage collection was once used on systems in which most of today's software would not fit.

Garbage collection does have waste. You allocate more than you need, and keep it.

Precise, manual memory management does the same thing. Firstly, the technique of keeping around memory is practiced under manual memory management. E.g. C programs keep objects on recycle lists instead of calling free. And free itself, will not release memory to the OS, except in certain circumstances. Then there is internal fragmentation. Generally, programs that "get fat" during some peak workload will stay that way.

If you see some garbage collected system need hundreds of megabytes of RAM just for a base image, it may not be the GC that's to blame, or not entirely: it's all the bloated crap they are pulling into the image. Or it could also be that the GC was tuned by someone under thirty who doesn't think it's weird for a browser with three tabs open to be taking 500 megs. GC can be tuned as badly as you want; you can program it not to collect anything until the VM footprint is past the physical RAM available.

hmlongco · on April 26, 2022

"it may not be the GC that's to blame, or not entirely: it's all the bloated crap they are pulling into the image"

Again, you keep waving your hands in the air and saying if you don't do that, you'll be fine.

oikawa_tooru · on April 24, 2022

Hi, I am bootcamp grad currently working with fullstack ror development. I have started to get bored of this work and want to venture into other core cse fields. Compiler design and machine learning seems to have caught my interest the most. What would your advise to be someone who wants to break into this field by self study? Does compiler design involve any math that is also used in machine learning and deeplearning? And do they share any similar concepts?

verdagon · on April 24, 2022

That's great that you're interested in venturing into other fields! Switching fields every once in a while is a great way to grow and become a stellar engineer.

If you have equal passion for those two choices, I would suggest leaning towards machine learning for now. An ML engineer can make $300k-500k/yr at the big companies after a few years.

I left that world and started doing language development as an open source endeavor, and make only $700/yr [0] which isn't that much, though I'm oddly more proud of that than the salary I got from my years in Mountain View. I'm happier, being able to push the state of the art forward and show people what incredible potential there lies ahead in the PL realm. However, I'd recommend graduates to focus on finances for a few years before looking at what lies beyond, lest the financial stress overshadow the dream.

I'm only about 110,000 lines into my journey with Vale, but so far I'd say there's little math. There's a lot of data structures knowledge needed to make a compiler fast, and some interesting algorithms, but not much overlap with ML. But who knows, maybe there's some cool things languages can do specifically for ML, that haven't been discovered yet!

[0] https://github.com/sponsors/ValeLang and I'm grateful to all of them!

oikawa_tooru · on April 24, 2022

Thank you Sir for your gracious response.

chaosite · on April 24, 2022

ML utilizes continuous mathematics, while compiler design including static analysis usually favor discrete mathematics. I wouldn't say the concepts are interchangeable.

The classic text book is the Dragon Book. Another good start on the compiler side are projects like https://buildyourownlisp.com/.

A lot of people start with lexing, parsing, and so on before moving on, but you can start from the static analysis side too. I would look at the LLVM framework and the Clang Static Analyzer, and use one of the tutorials for that to write a small analyzer for C.

oikawa_tooru · on April 24, 2022

Thank you so much for the guidance!! I will look into it soon!!

eatonphil · on April 24, 2022

I'd encourage you to join the following subreddits that might be interesting to you: r/compilers, r/programminglanguages, r/EmuDev, r/databasedevelopment.

You'll find more people like yourself interested in core CS.

My company also runs a Discord community for those kinds of people: discord.multiprocess.io.

Good luck!

simscitizen · on April 24, 2022

Maybe start with a good book or two to see if you're actually interested. Crafting Interpreters (https://craftinginterpreters.com/contents.html) might be a good intro--yes, it's about interpreters, but there's a lot of overlap between creating an interpreter and creating a compiler.

kcartlidge · on April 24, 2022

As very much a non-expert in compiler design, I found two books to be helpful introductory reads (neither are free though). I'm not sure whether someone proficient in the field would recommend them as my knowledge in the area is small, but I enjoyed them.

- Writing An Interpreter In Go by Thorsten Ball (https://interpreterbook.com)

- Compiling to Assembly from Scratch by Vladimir Keleshev (https://keleshev.com/compiling-to-assembly-from-scratch)

The first produces an interpreter in Go for a custom language, but as simscitizen says there's a lot of crossover with compilers.

The second creates a compiler written in Typescript that compiles from a minimal form of Typescript into ARM assembly. It's not cheap but is good.

KMag · on April 26, 2022

I briefly scanned the Lobster reference, but I didn't see how they're dealing with the undecidable cases. Presumably it punts out to the garbage collector for objects where lifetime isn't statically decidable, right?

My understanding is the answer for "Why doesn't Rust use something like type inference for lifetimes?" is "In the general case, that reduces to the halting problem."

panick21_ · on April 24, 2022

There is also a lot still to do on the GC side. Azul GC with hardware support allowed for amazing results. Even without hardware support the results are quite good.

LadyCailin · on April 24, 2022

A little off topic, but may I ask how you managed to work on this full time? I am also developing a programming language, (methodscript.com if you’re curious) but I’ve never been anywhere near being able to work on it full time.

eatonphil · on April 24, 2022

Their site mentions they worked for Google for 6 years. The stock alone from doing that would be enough to self fund for a while. They may also be getting grants or sponsors for their language or game development. Or they could be doing contract work of any kind to pay the bills.

These are all common approaches to working on your own thing full time.

verdagon · on April 24, 2022

Yep, I worked at Google until they silently started sunsetting all the small endeavors and funneling everyone into Ads/Cloud/Search. Back in 2014, the internal position listings were full of fascinating projects that could bring a lot of good, and looking at it now, it's pretty clear the writing is on the wall for the culture. I left late last year.

Luckily, I've saved up enough to be able to work on Vale full-time for a year or two, hoping I'll have enough sponsors by then to be able to continue. If I play my cards right, I'll be able to make Vale into a nonprofit, a la ZSF. A worthy endeavor!

kazinator · on April 25, 2022

A simple (obj &val) parameter in C++ eliminates reference counting by lending a reference, which can go an entire tree of helper functions.

NeutralForest · on April 24, 2022

Hey thanks for mentioning Lobster, I didn't know about it and it looks cool, I'll probably try to draw a couple things with it =)

bmitc · on April 24, 2022

What's the state of implementations suitable for real-time systems?