Hacker News new | past | comments | ask | show | jobs | submit login
Datatype99: C99 with Sum Types, v0.1.0 (github.com/hirrolot)
157 points by hirrolot on Feb 4, 2021 | hide | past | favorite | 64 comments



Too often that I find the C extension projects introduced here on HN would omit examples for dynamically allocated objects. In the binary tree example, apparently I can't just memcpy an existing tree to a malloc'd tree because the nodes would point to the ones on the old tree. I was halfway writing a copy function that recursively traverses the source tree,

    void copy(BinaryTree *dest, const BinaryTree *src) {
      match(*src) {
        of(Leaf, x) {
          *dest = TREE(Leaf(x));
         }
      }
    }
then realised that this still points to a leaf on the stack. Need a way to replace all the pointers to be relative to the destination tree...


Isn’t that a problem even in managed, garbaged-collected languages? For most (procedural) languages it’s a bit involved if you want to deep-copy heap-allocated objects. If you don’t want the overhead of re-creating the tree from scratch, you can use indices instead of pointers and store the nodes in a heap-allocated contiguous array, and then when you want to create a deep copy of the tree you can just memcpy the whole array in one go.


I'm guessing they're coming from C++ or similar where the copy constructors already take care of making a deep copy.


Still, inside that copy constructor you're going to write the exact boilerplate code for recursively copying each node, so the complexity of code won't change that much. Also, you're going to call quite a lot of malloc/free, which could be a potential performance hit if you're creating/destroying large trees a lot, as well as fragmenting your memory if you're not careful.

Personally I have a much easier time dealing with tree/graph data structures using indices, regardless of the language I'm using. Using indices as handles in general are just so much easier than using pointers in terms of memory management, and generally performs much better in terms of cache coherency. Indices are also easier to serialize/deserialize, and sometimes it just "makes sense" (for example, if you're developing a physics simulation, representing objects as indices are natural because you're going to use those index values inside various math equations.)

The downside might be that if you're creating and destroying nodes in a tree continuously. then you have to be careful about reusing indices. There's a pattern in gamedev (generational indices) that solve the issue (basically, in addition to an index you also store a counter value with it, and you increment the counter everytime you allocate a new node in the same spot of the array. It's a bit more bookkeeping, but there's lots of code examples for it. It's kind of a well-known trick inside gamedev (and recently in the Rust community, in order to bypass borrow checker issues).

A few pointers:

- https://floooh.github.io/2018/06/17/handles-vs-pointers.html

- https://kyren.github.io/2018/09/14/rustconf-talk.html

- https://news.ycombinator.com/item?id=17995634 (steveklabnik explaining how generational indices work with code).


The allocation strategy used is only a tangential problem. The purpose of the example is to demonstrate the functionality of the library in brief, not to walk the reader through problems that C programmers face in general.

I would suggest recursively walking through the original tree, creating an equivalent tree in the process instead of copying the original elements.


Yeah, I suppose.

    /* get size required for malloc */
    size_t getsize(const BinaryTree *tree) {
      size_t acc = sizeof(*tree);
      match(*tree) {
          of(Leaf, x) {
              return acc;
          }
          of(Node, lhs, x, rhs) {
              return getsize(*lhs) + acc + getsize(*rhs);
          }
      }
    }

    BinaryTree* copy(BinaryTree *dest, const BinaryTree *src) {
      dest->tag = src->tag;
      match(*src) {
         of(Leaf, x) {
            *dest = Leaf(*x);
         }
         of(Node, lx, x, rx) {
            match(*dest) {
              of(Node, ly, y, ry) {
                *ly = dest + 1;
                *ry = copy(*ly, *lx);
                *y = *x;
                copy(*ry, *rx);
              }
              otherwise {}
            }
         }
      }
      return dest + 1;
    }
Kind of hackish, but I guess it works.


Hackish????


I created something similar [1] awhile back:

    SUM(foo) {
        foo_one,
        foo_two,
    };
    /* declare each sum case */
    CASE(foo, foo_one) { int i; char c; };
    CASE(foo, foo_two) { double d; };

    void do_bar(foo f) {
        MATCH(f) {
            AS(foo_one, y) printf("foo_one: %d, %c\n", y->i, y->c);
            AS(foo_two, y) printf("foo_two: %d\n", y->d);
            MATCHANY 
            fprintf(stderr, "No such case!");
            exit(1);
        }
    }
[1] https://github.com/naasking/libsum


It's syntax sugar for tagged unions, of course. It's not entirely out there that C could include such support natively in a future standard.


I predict all languages will adopt some form of sum types and pattern matching in the next 10 years. They will go from esoteric to mainstream in the same way lambdas did.


I find it really interresting that such a simple concept as a tagged union, which appeared at least as far back as ALGOL 68, could be completely abandoned during the 90's and 00's where OOP reigned surpreme, and only seems to be gaining popularity again now that pattern matching is all the rage.

It's just baffling to me that tagged unions could be shunned in favour of the visitor pattern. How could that ever look like progress?

Being a millenial, I don't know what programming was like the 80's, but surely tagged unions must have been bread and butter for Pascal programmers?


Tagged unions never went away, it's just that, as you mention, C++ style OOP became very popular and they worked the other way around: instead of having a single function that matches the tag to figure out what to do with the data, you ship data-specific functions straight inside the object (virtual methods).

In theory you can have both of course, and C++ now has support for variants (because we know C++ tries to be come a superset of all programming languages in existence). It's still a bit clunky to use though, especially because of the lack of "match" support in the language.

But I think a problem is that it's easy to mess up tagged union in low level, non-managed languages. Rust's enum work great, but that's thanks to the very strong type system and borrow checker which makes them effectively impossible to mess up in safe code.

In C or C++ it can become messy and hard to debug. In C you don't really have a choice, so you do it anyway but in C++ virtual methods and inheritance are usually a bit easier to manage.


Tagged unions have always been around as runtime constructs, they just had different names (e.g. "variant"). In C it's understandable why they aren't part of the language: they would be the only data type that requires a "builtin" opaque struct type to hold the tag and payload union (e.g. something like this, but hidden from the programmer):

  struct my_tagged_union {
      enum my_tag_enum tag;
      union {
        ....
      } payload;
  };
That would be a pretty big break with "tradition" for the C language ;) (e.g. it would be similar to a "builtin string type")

And of course tagged unions as "generic user-defined types" are much less convenient/useful than being directly integrated into the language syntax and type system.


Structs aren't really any more fundamental than tagged unions though. They also contain bytes which are hidden from the language (padding). If a struct can have automatically generated padding bytes, why shouldn't a tagged union have an automatically generated tag?


Mostly because C structs have a very exactly bound memory representation, and padding bytes are just about the only thing you can - mostly - get away with not caring about.

Coincidentally, even padding bytes break things technically. Since they're undefined, you - in theory - can't ever calculate a bytewise hash value over a struct that has any padding bytes. It's still done frequently, it's just nulled out beforehand and copying is bytewise anyway. Doesn't change the fact that to the spec letter it's undefined behaviour. GCC also added __builtin_clear_padding() recently.

Hidden tags ... no. Just no.


Doesn't need to be hidden. Just usable with language features (e.g. pattern matching) without accessing it explicitly.


Tagged unions built-in into the language can have layout optimizations, e.g. in Rust sizeof(Option<&i32>) == sizeof(&i32), because Rust knows it can repurpose NULL as the None value.


To be fair, tagged unions are too error prone to be useful without strong types. That never stopped people from using them, but when a statically enforced alternative came along, everybody jumped ship.


"I find it really interresting that such a simple concept as a tagged union, which appeared at least as far back as ALGOL 68, could be completely abandoned during the 90's and 00's where OOP reigned surpreme"

From a modern perspective, it may be difficult to understand just how crushing the OO consensus was in the 1980s and 1990s. It still reverberates in our language design today. OO was good design was OO. If it didn't fit into OO, that was ipso facto proof that it was bad. Close the book. No further evidence required. Sum types? Kinda overlaps inheritance. Bad design. Not OO, therefore bad design. Why is Java a better language than C++? It's more OO, so QED.

OO here, by the way, is specifically that variant that emerged in the form of C++ and Java. Very important to have defined, enforced "private" keywords and deep hierarchies. Smalltalk by this definition is not an OO language. (And also, therefore, bad. Also possibly heretical for trying to steal the term OO... and I mean all the implications of the word "heretical".)

My "Software Engineering" course circa 1999 was basically "How to Design Software via OO". (It has the distinction of being the one course from my entire college career that I'm not sure I agree with a single thing I "learned" in that course. Even at the time I had done enough real work to be suspicious, and from here in 2021 I find the whole thing risible. But I got a 4.0, so....) Formal diagrams with strictly enforced diagram languages like UML, because OO was the future of Good Design and UML was how OO was going to help bring programming to the masses who would only have to draw Object Hierarchies and then {magic goes here that never was worked out} and perfect programs!

The internet hit programmers and programming earliest, and now there's only a remnant of the OO dogma just hanging on by its teeth in school curricula that haven't hardly been changed in 25 years, and it lacks the power to envelop the students the way it used to.

There was of course always a counterculture that didn't listen, which is where you get Smalltalk, Perl, Haskell and its ancestry, Python, etc. But you have to bear in mind that that was a counterculture...

... and you know, even today, Java and C++ are still pretty darned popular and it's not that hard to find people who still only know one of those (or maybe both) and still think OO ≡ good design ≡ OO, and the vast chaotic landscape of languages that don't even have "private" is just kid stuff for people who can't handle Real Design. Which is OO. Because OO is good, and good is OO.


There weren't many Pascal programmers, but for C programmers, sure.


Sure there were, in Europe during the heyday of CP/M, MS-DOS, Mac OS (until MPW and PowerPlant).


Currently with D the two people in charge both favour library solutions, but so far I'm seriously leaning towards a language one - it seems to hard to do type erasure efficiently otherwise, for example.


If `match` compiles down to a switch case on the type discriminator then it can easily beat current implentations of C++'s std::visit on std::variant.

I assume pattern matching must be exhaustive. Can there be a default case?


Yes, just write otherwise { ... } to handle a default case.


What else can std visit generate? I’d imagine it does exactly that


std::variant is implemented with variadic templates; but C++ doesn't have a "variadic switch" statement. So std::visit cannot directly use a switch statement. Instead it's usually implemented as a recursive function with the hope that the compiler will unroll the recursion and optimize it into a switch. This hope usually doesn't work out in practice.


AFAIK the usual implementation is statically generating an array of function pointers and dispatching based on the type discriminator. This works even less for inlining.


On my phone so cannot check, but I thought this was mostly a solved problem? Inlining/folding a jump table should be easy enough especially since the table can be const


https://godbolt.org/z/d4838K

Doesn't happen, apparently.

edit: For a short while libc++ adopted mpark's variant visit, which AFAIK compiles down to switch case with some hand coding. AFAIK it had other problems so they rolled back to the constexpr array of function pointers.

https://github.com/mpark/variant


If you call foo with an int or with a double, then it gets inlined.

I don’t think they can inline foo as you’ve written it because the variant could be valueless by exception.


I just looked into this a bit more. For the simple examples I tried, clang and gcc inline + const prop calls to std::visit with O2.


> Pattern matching is exhaustive too.

Sweet. C seems to be getting there :)


Indeed, exhaustive pattern matching is sweet. What is the noddy explanation of how it is achieved, I wonder?


datatype99 generates a switch expression under the hood, and modern compilers check this switch expression for exhaustiveness.


How do compilers check for exhaustiveness when the domain is unbound? Or rather: how does one constrain an integer in C? C enums are glorified consts, it’s still legal to do `myenum_t x = 0xFFFF` where 0xFFFF is not a defined value. This is necessary to allow for flags/bitfield enums.


Oh, this is from the poica author! Neat!

That Poica can't work on C99 made it a non-starter for me; but all I really wanted was the sum types.


Yeah, actually the most missing thing in C is sum types, as for me. Poica was rather an experiment; datatype99 is rather a practical library. I'm planning to use it on my work.


How does it compare to the tagged unions in Zig, which aims to be “C with the bugs fixed”?


Zig has the advantage that it can add the syntax sugar right into the language instead of using macro "workarounds", but I'd guess in the end both have similar lowlevel memory layout (a struct with an enum tag, followed by a payload "raw union") and generated code (e.g. it would be interesting to look at the compiler outputs, they should be very similar).


This "advantage" is also its greatest disadvantage. I can stick these macro definitions in any existing project and start incrementally rolling them out. Migrating something — even partially — to Zig is a much bigger task.


Hmm, I don't use Zig so can't say much about it.


This is what it looks like in Zig:

https://ziglang.org/documentation/master/#Tagged-union

You create two types, a tag enum, and the tagged union itself, which has a typed "payload" for each tag.

Plus the necessary "syntax sugar" for initialization and extracting the payload by tag.


Can it be used to implement something similar to typen_ame() or type_id()?


Something similar, yes. Behind the scenes, datatype99 deduces a type of a matched expression using the typedefs generated earlier.


You mean, type_id and type_name would have to be handwritten functions, right?

If it is possible to automatically generate these, could you provide an example?


We can generate a type ID by defining an extern variable of a sum type inside the `datatype` macro. Its address then will be a type ID (though I am not sure that such two variables will always have different addresses, I need to check the standard).

About type_name: you can deduce a type name if you know at least one tag of a sum type. For example, if a tag is `Foo`, then `FooSumT` is a typedef to an outer sum type (datatype99 generates this typedef).


Two non-const variables will always have different addresses if they are of the same type. Different types can [AFAIR] per the standard theoretically be in different address spaces, though this is rare in practice.


> Different types can [AFAIR] per the standard theoretically be in different address spaces

If you can force the types to contain a member of the same type (say, a int or enum foobar_discriminant), the pointers &const_var_foo.disc and &const_var_bar.disc would have to differ for obvious reasons.


What is this sorcery?

    #include <datatype99.h>
    
    datatype(
            BinaryTree,
            (Leaf, int),
            (Node, struct BinaryTree *, int, struct BinaryTree *)
    );

What kind of preprocessor black magic can you do so that this code compiles? I don't dare to open the file datatype99.h lest it casts a dark spell on me.


Yes, as others have already noticed, it is implemented on top of https://github.com/Hirrolot/epilepsy, a metalanguage for preprocessor metaprogramming.


> a metalanguage for preprocessor metaprogramming

That sentence is everything I was taught to avoid. Props to you.

I like the drug/alcohol metaphor for these things, as a child avoid them, as an adult do your thing as long as you remain sensible.


Funny you use this example, the author's GH page says he's 16; in many countries he wouldn't be able to drink, event responsibly ^^


> in many countries he wouldn't be able to drink

But in most countries they would be able to drink (especially in a private setting), just not purchase. In Europe for instance less than half a dozen countries have a minimum drinking age for private settings, and a dozen more have an MDA in public (but not private), though for some 16 is enough to drink fermented alcohols (alongside a meal with adults for the UK, regardless for Germany and Austria).


In case anybody mistakenly thinks doing so is 'cheating', you're the author of both libraries!

This is absurd, but in a good way.


LOL, I cannot get enough of this shit! Can you point to some cool implementation tricks that you used to achieve that? I guess there's quite a bit of ## concatenation to create temporary names and stuff?


Well, mostly I use idioms from the Cloak Wiki (https://github.com/pfultz2/Cloak/wiki/C-Preprocessor-tricks,...). It takes some time to realise them :)


The whole system seems to be built on top of variadic macros (which had been standardized in C99).


https://github.com/Hirrolot/epilepsy

According to the GitHub, which appears to be a metaprogramming library by the same author.


Combine it with Checked C...Why do we need Rust? :)


- proper strings and arrays, bounded checked by default.

By the way, it doesn't need to be Rust, there are plenty of alternatives, some of them like NEWP and PL/I variants are about 10 years older than C.


- compiler prevents use of uninitialised or freed memory

- langage actually uses safety features in api design (good luck updating libc to use this)


working and sane package manager and build tool


> Pure C99. No external tools are required; datatype99 is implemented using only preprocessor macros.

And sure enough, the code is unreadable and unmaintainable as you'd expect.

You can build Boost out of C99 macros. There's a really good reason why you shouldn't.


The reason why I created this project is that I miss sum types in C all the time, literally everywhere I need a custom error type or just any other alternative data representation. And I do think the implementation of datatype99 is straightforward -- only long names, all the dirty hacks are encapsulated into the underlying framework.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: