Mildly interesting quirks of C (gist.github.com)
292 points by goranmoomin on Nov 20, 2022 | hide | past | favorite | 84 comments



Whenever the subject of C/C++ quirks is brought up, I always like to point out the Deep C/C++ presentation:

http://www.pvv.org/~oma/DeepC_slides_oct2011.pdf

Source: https://freecomputerbooks.com/Deep-C-and-Cpp.html#downloadLi...

Previous discussion: https://news.ycombinator.com/item?id=3093323

It could be considered a bit dated at this point (it predates C++11), but I find it still both entertaining and educational.


Loved that, read it start to finish. C is already a minefield and it looks positively tame when compared to C++!


Yeah, this is also one reason I don't like how these two languages are so often lumped together as C/C++.

These days they have diverged even further, so I find it funny to see C/C++ on job listings as if it's one thing. While you can write code in a very C-like manner in C++, that is not how typical C++ is written these days.


That was very interesting!


My favorite C "quirk": If you have an array and you want to access an item of it, you can swap the variable and the index number (put the variable name inside brackets and the number outside):

    a[5] 
is the same as:

    5[a]
why?

    a[5] is actually sugar for *(a + 5), so by commutative property, you can also do *(5 + a) to access the same memory position :-)


It's one of the B leftovers in C - In B, the only type is "machine word", and words are interpreted as ints or pointers depending on the operators used. Thus, distinguishing between a[i] and i[a] is impossible, so both were valid.

Array-to-pointer decay is another manifestation of this.


That's #15 on the list.


That's #list on the 15.



*(list + 14) ?


*15#


Connection problem or invalid MMI code.


One funny variant is this expression: "abcde"[4]


You mean: 4["abcde"]


Actually yes, oops :)


Was about to post this. I may have posted it before on HN. IIRC, I first read about it in the K&R C book.


Reminds me a bit of "Who Says C is Simple?" written by the people who wrote a C parser & analyser in OCaml (CIL):

https://cil-project.github.io/cil/doc/html/cil/cil016.html

Also: https://cil-project.github.io/cil/doc/html/cil/cil012.html


> "Who Says C is Simple?"

People who don't know what "simple" means and confuse it with "easy".

https://www.entropywins.wtf/blog/2017/01/02/simple-is-not-ea...

https://www.infoq.com/presentations/Simple-Made-Easy/

"Easy" things almost always lead to astonishing complexity.

Also it's easy to see just how complex C is: Have a look at a formal description of it! (And compare to a truly simple language like e.g. LISP).

https://github.com/kframework/c-semantics/tree/master/semant...

In contrast, some basic lambda calculus language semantics fit in half a page of K.

https://www.youtube.com/watch?v=eSaIKHQOo4c

https://www.youtube.com/watch?v=y5Tf1EZVj8E


+1 for simple is not easy, yet with enough thinking and ingenious ideas, it is achievable. Thanks for the links.

"simplicity is the ultimate sophistication." -- Leonardo da Vinci


> return ({goto L; 0;}) && ({L: 5;});

What in the world…!!

It’s in the GCC section so I assume it’s some kind of lambda-function-like compiler extension? That allows jumping between bodies of two different functions…!


It's "statement exprs": "A compound statement enclosed in parentheses may appear as an expression in GNU C. This allows you to use loops, switches, and local variables within an expression. [...] The last thing in the compound statement should be an expression followed by a semicolon; the value of this subexpression serves as the value of the entire construct." [0]

[0] https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html


Regarding 12, alignment of bitfields: how I believe it works is that when a bitfield of type long is laid out, the structure so far is considered to be a vector of storage cells whose size and alignment are those of long:

  struct foo {
    char a;
    long b: 16;
    char c;
  };
So, a has been laid into the structure, so the current offset is 1 byte. This is considered to be occupying a portion of an existing long type bitfield cell. In other words a is essentially taken to be an 8-bit field in the first long-sized cell of the structure. That cell looks like it has 56 bits left in it (if we assume 64 bit long). Since 56 > 16, the new bitfield b is placed into that cell. When that field is placed, the placement offset becomes 3. The type of c being char, that offset is acceptable for c.

I've painstakingly reverse engineered the rules when developing the FFI for TXR Lisp:

  1> (sizeof (struct foo (a char) (b (bit 16 long)) (c char)))
  8
  2> (alignof (struct foo (a char) (b (bit 16 long)) (c char)))
  8
  3> (offsetof (struct foo (a char) (b (bit 16 long)) (c char)) a)
  0
  4> (offsetof (struct foo (a char) (b (bit 16 long)) (c char)) b)
  ** ffi-offsetof: b is a bitfield in #<ffi-type (struct foo (a char) (b (bit 16 long)) (c char))>
  4> (offsetof (struct foo (a char) (b (bit 16 long)) (c char)) c)
  3
I've summarized my empirically-obtained understanding for the benefit of users and anyone else doing similar work in a different project.

https://www.nongnu.org/txr/txr-manpage.html#N-027D075C


The cell size is not necessarily the same as "long" - it can be whatever the compiler wants, so long as alignment of non-bitfield fields is appropriate. It doesn't even have to be the same for every bitfield.


If the bitfield is declared as long, then based on that specific cell size the decision is made whether to pack the bits into the current cell or a new cell.

If a leading char member is followed by a uint64_t bitfield that is 57 bits wide, a new cell will be allocated for those 57 bits at offset 8. The char is considered to be a field of 8 bits allocated in an existing 64 bit cell, leaving 56. 57 cannot fit, and so the offset is bumped to the next cell alignment.

This is testable.

I'm only writing about GCC, not about ISO C, which specifies very little, allowing implementations latitude in choosing the underlying storage unit size and alignment for bitfields regardless of their declared type.


Here are two of my favorite obscure quirks of C:

    struct X { char x[8]; };
    struct X awoo(void);
    printf("%s\n", awoo().x);
The above is UB in <= C99 and valid in >= C11. [0]

    struct X { char b[8]; } foo();
    char *b = foo().b;
    printf("%s\n", b);
The above is UB in >= C11 and valid in <= C99. [1]

[0] https://wiki.sei.cmu.edu/confluence/plugins/servlet/mobile?c...

[1] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1285.htm


I really wish both would be valid in C11. Or rather I wish I had "systems-C" where all the undefined behaviour added for high performance computing was filed off and defined as "whatever the platform does".


> Or rather I wish I had "systems-C" where all the undefined behaviour added for high performance computing was filed off and defined as "whatever the platform does".

Depending on what you mean by undefined behavior, now you've made register allocation an invalid optimization. You really don't want to use that version of C.


> all the undefined behaviour added for high performance computing

UBs were added because of cross-platform incompatibilities, where operations were too "core" (and/or untestable) for implementation-defined behavior to be acceptable. The reason was not performance (aside from not imposing a runtime check where that would have been possible) but portability:

> 3.4.3 undefined behavior behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

Those UBs were leveraged later on by optimising compilers, because they provide constraints compensating for C's useless type system.

So you can just use a non-optimising compiler or one which only does simple optimisations (e.g. tcc), and see what the compiler generates from your UBs.


The standard also has implementation-defined behavior, doesn't it?


Their existence is mentioned in my comment so I am not sure what you are trying to say.


I think volatile variables are a good example of how we really don't want to rely on platform semantics most of the time. We use volatile variables to tell the compiler "hey, the platform actually cares about these writes, so do them the way I wrote them."* But this is relatively rare! The vast majority of the time, we want the compiler to go nuts with constant propagation and reordering and all the other good stuff. A language where volatile was the default and "go nuts" was an explicit keyword would be really annoying to use.

* There are more things in heaven and earth (MMIO!) than are dreamt of in your memory model.


Any given implementation is free to define any particular instance of UB. "Whatever the platform does" is still UB for portable code though, since the set of possible platforms and their behavior is unbounded.


Another one, adhoc struct declaration in the return type of a function:

    struct bla_t { int a, b, c, d; } make_bla(void) {
        return (struct bla_t){ .a=1, .b=2, .c=3, .d=4 };
    }
https://www.godbolt.org/z/Pha7dPzeq

Also to be pedantic: "= {};" is not valid C (at least until C23) and fails to compile on MSVC - GCC and Clang accept it as a non-standard language extension though (the proper form would be "= {0};").


"Flat Initializer Lists" is given as an example in K&R C I think, at least the first edition, when writing those extra braces to fill out an initializer must have felt very redundant.

These days many compilers will warn if you do this, however, since people rarely do it and it usually indicates a misunderstanding of the type being used.

I think it's quite readable though, so it's a shame it causes warnings. What do you think?

  struct { const char *name; int age; } records[] = {
      "John",   20,
      "Bertha", 40,
      "Andrew", 30,
  };


I find it slightly worse to read. It's C, so my brain is in "newlines don't matter" reading mode, so I see an array of 6 things and then have to mentally split them back up.


    8. Modifiers to array sizes in parameter definitions [https://godbolt.org/z/FnwYUs]
    void foo(int arr[static const restrict volatile 10]) {
        // static: the array contains at least 10 elements
        // const, volatile and restrict apply to the pointer the array decays to.
    }
I imagine most of these depend on the C version, but this one specifically bit me because one tool only supported c99 and the other was c11 or something later.


> UB is impossible

What? UB is clearly undesirable, but assuming it is impossible and deducing other outcomes must be meant are clearly wrong assumptions by the compiler writer.

More sensible compilers (including older version of clang) do the right thing (TM) here and yield a compiler error.

There were earlier attempts at do-what-i-mean programming languages. They are rightfully buried in history.


UB is not impossible; I think the author is being a little cheeky there. But the standard does grant compilers extreme liberties as far as how they deal with programs which can execute UB. LLVM's choice of what to do with that liberty, in this case, seems to be to assume the UB is unreachable and continue legally optimizing the program under that assumption. That's not a wrong assumption according to the definition of C.

It's debatable whether it's a good assumption. But not wrong.


> UB is clearly undesirable, but assuming it is impossible and deducing other outcomes must be meant are clearly wrong assumptions by the compiler writer.

Compilers can and absolutely do assume that UB is impossible in this code (no integer overflow) and deduce other outcomes must be meant (the loop operates on contiguous memory):

  void foo(char* arr, int32_t end)
  {
    for (int32_t i = 0; i != end; ++i)
      arr[i] = 0;
  }
(Based on code from the gist comments.)


Assuming undefined behavior is impossible is a way for the compiler to optimize code. In fact, it is the main reason why UB exists.

It is exemplified by the C++23 std::unreachable() function, whose description is "invokes undefined behavior". It is intended to mark parts of the code that are unreachable, so that they don't appear in optimized builds, but may appear in debug builds. It is an explicit use of "the power of UB": an optimizing compiler considers that calling std::unreachable() is impossible, so all code paths that lead to it can be safely pruned. In a debug build, the code may be generated anyway, and the compiler will choose something sensible for what happens when it is called, typically a crash, but it can be anything, it is UB.


From where I stand many of do-what-i-mean programming languages are doing just great in distributed computing, Web and mobile OSes, taking over roles that used to be done in C and C++ during the last century.


A better way to phrase it is that UB means "anything can happen". This, by definition, includes calling the function even though the pointer is null.


I think 'anything can happen' needs to be extended to the C standards committee's bank accounts.


The top comment in the gist looks like from "Hacker News Parody Thread".


The parody thread should include a comment that references another comment by its position, not realizing that it might change.


It’s also not good advice, because if you put your code through that many off by default compiler warnings, you’ll just find bugs in the warnings. eg -Wstrict-aliasing in gcc can be wrong and -Wdeprecated can be literally impossible to fix.


Note that:

  void foo(int p[static 1]);
is effectively a standard way to declare that p must be a non-null pointer. I always wondered if any compiler actually makes use of this for optimization purposes.


Wonder no more: at least clang does, since around v3.5. https://godbolt.org/z/jEq5xbMna


If you have a null check on the pointer in the body, GCC will elide it.


After learning about a few of these I started to understand why people coming from C always said that PHP is a well designed language…

But OK, I understand that my mind is just not made for the complexity of C. Most likely I'm not a real programmer.

I instantly get knots in my brain and start banging my head against the wall when I have to look at C code for too long. Actually, even C documentation is enough to trigger this. (I get mad every time I have to read a Linux man page.)

This is highly subjective of course. Other people seem to love C!

I'm more of a grug brain¹, who mostly only understands plain pure functions.

Input in, output out. No magic. Everything else's too taxing.

¹ https://grugbrain.dev/


PHP is a well-designed language because it has value types (immutable data structures). That puts it far above any language without them for correctness. The poorly designed old database libraries aren’t really a language issue.


C quirks. This is interesting. I have used some of the tricks myself: #1, #2, #4, #5.

#2 and #5 can be combined to make an interesting hack. When combined with memcpy you can do

    int *a = memcpy(&(int){0}, b, sizeof *b);
C23's typeof makes this even more interesting.

If you want a challenge, here is some standard-compliant C code. Try to understand it. If you can, you are a master of C's type system:

    static int* (*const *(*restrict x)[5])(volatile union {struct{int a;int b;};}[static const restrict 5], register enum{HELLO,WORLD} a) = {0};


> Main directly calls this_is_not_directly_called_by_main in this implementation. This happens because: [...] LLVM assumes that bar() will have executed by the time main() runs.

I think this reasoning is slightly incorrect, although I don't blame the author as this is a very common misconception. I believe the correct reasoning might be as follows:

1. The compiler sees the pointer is dereferenced.

2. The compiler infers the pointer was not NULL.

3. The compiler determines a set of candidates for its target (which may be the universal set).

4. If it finds only one candidate, it just substitutes the target.

The critical thing to notice here is that the compiler doesn't need to care about the reachability of that candidate. It's making a conservative over-approximation, after all. You can witness the effect of this by formulating an impossible condition inside bar() that the compiler completely ignores: see [1]. Note the pointer assignment cannot have been implied by "bar() will have executed", as the execution of bar() could never lead to that assignment anyway!

[1] https://godbolt.org/z/W19javzqW


Are there any practical cases where you'd want "extern void foo"?


You could use it for getting an address that will be linked in later. On GCC I get a warning (which I don't think I can mask) for taking the address of such an object, because its expression is type void. A better way of achieving this is usually to declare something like extern unsigned char foo[] instead, but that has a type other than void*.


> Typedef goes anywhere

WTF. Here, have some keyword soup; the order doesn't matter at all. Must be fun to write a C compiler.


This one is actually pretty simple, it works a lot like static or typedef. It's really just a modifier for what is being declared - in a typedef we're not declaring a name to refer to an instance (variable) of a type, but we're declaring a name to refer to the type itself.


Quote: "4. Flexible array members ..... int elems[]; // <-- flexible array member"

TIL that a dynamic array is also called flexible. Is this generation, out of boredom, trying to redefine well-established paradigms? Because for me, a 90's-formed developer, "flexible" means maybe inheritance, or even better, polymorphism. There is nothing flexible about a dynamic array. Its structure is well defined on the stack/heap, and with current compiler optimizations it can even be demoted to a simple static array for faster access within CPU registers.


"Dynamic array" refers to a block of memory allocated via malloc() which you just happen to use as an array.

"Flexible array member" [0] is when you have a struct and its last member is an array with unspecified size.

An example:

  #include <stdio.h>
  #include <stdlib.h>
  
  struct Foo {
      int len;
      int* arr; // dynamic "array"
  };
  
  struct Bar {
      int len;
      int arr[]; // FAM
  };
  
  int main()
  {
      const int n = 12;
  
      // have to allocate it myself; no guarantee it will be near the rest of the struct
      struct Foo* a = malloc(sizeof *a);
      a->arr = malloc(n * sizeof *(a->arr));
  
      // array is part of the memory allocated for the struct
      struct Bar* x = malloc(sizeof *x + n * sizeof *(x->arr));
  
      return 0;
  }

[0]: https://en.wikipedia.org/wiki/Flexible_array_member


>"Dynamic array" refers to block of memory allocated via malloc() which you just happen to use as array.<

No. A dynamic array is an array which can be expanded or shrunk during its lifetime. The fact that C/C++ uses malloc for that (and btw, it's not the only way to do it) is its problem. In other languages you have dynamic arrays that can be expanded/shrunk without an extra line, the main reason why nowadays Rust is a replacement for C/C++.

>[0]< From you own wiki reference: "the flexible array member must be last"

LMAO, really? Well, that indeed is a bigger C quirk. In Pascal, as an example, I can have it anywhere inside the record (struct equivalent of C), and it can be just as "flexible".


It has to be last because it's not a pointer to the array, it is the array. The array elements are immediately after the struct in memory. You can't resize it without reallocating the whole struct.


While a few of these were interesting, I'd love to see a short technical explanation of each quirk for the feeble high-level programmer (me). The first one, for example: is foo initialised? How so?


The reason is that in C a struct doesn't introduce a new scope, unlike in C++. If you define something inside a struct it will also be available outside of the struct.


I think it's aimed at C programmers. foo is a struct, so it's a type, it's not a variable. The point is just that struct bar is also defined by the definition of struct foo.


Related: "A primer on some C obfuscation tricks"

https://news.ycombinator.com/item?id=22961054


Bundle these together as C That Will Get Your Ass Kicked


The "Compound literals are lvalues" is one that caught me out recently.

I've been doing C a long time and thought I knew all the "decent" tricks. When I saw it, I went "Oh, that's one of the silly new dynamic features that I ignore."

Nope.

It's been in the language forever. I'm surprised I never tripped over it before given all the embedded work I do.


I consider the array pointer stuff a bit of a foot-gun in C. I've seen too many examples of people mixing up uint8_t[][] and uint8_t**.

The "compound literals are lvalues" pattern I've seen many times for inline initializing a struct that's only going to be around as a parameter to a single function call.


I just fixed some neural-net code to use #7. I hate passing pointers to layers that have a fixed size, and passing an array sometimes causes problems that require too many casts. A typedef of a pointer to the sized array is precisely what I needed.


Special mention: #2 looks like it might be useful, but rightly produces errors in C++ (and warnings in C). OTOH, `__builtin_constant_p()` is true for things other than constant literals.


I wouldn't call function typedefs a quirk, they seem pretty useful and common, at least in my world of hobbyist microcontroller programming.


See also the comp.lang.c Frequently Asked Questions: https://c-faq.com/


The switch/case anywhere looks equally useful and dangerous, and is so close to assembly that it really illustrates the low-level capabilities of C.


Can someone explain how "A constant-expression macro that tells you if an expression is an integer constant" works ?


If `x` is a constant, `(x) * 0l` is a zero constant, so `(void*)((x) * 0l)` is a null pointer. When a null void pointer is one branch of a ternary conditional, the expression takes the (pointer) type of the other branch.

If `x` is not a constant, `(void*)((x) * 0l)` is a void pointer to address 0 (which may not even be a null pointer at runtime, since null may have a runtime address distinct from zero!). The ternary conditional then unifies the types of the branches, resulting in `void*`.


Thanks!


My understanding of how it works is, with constant value, the compiler replaces (x) with the constant 0 and converts (void *) into (int *) which makes the size equality to return true. But I am not entire sure :)


Fantastic language. Truly freedom for a programmer. Thank you Dennis. Modern languages program you. :)


The zero initializer tip looks interesting, can't believe I didn't know about it before.


As someone who's moved on to Rust, I see this as one long list of nightmares.


I don't think the "I use Rust btw" comments contribute much to the discussion.

C and Rust don't perfectly overlap, especially since Rust is more a replacement to C++ than C.


Most people that still cling to C instead of C++ do it because they are stuck in UNIX-clone kernel stuff or embedded, or are religiously against anything else.

So whatever language Rust "replaces" is a kind of moot point, and then there is the whole ongoing integration with Linux, a UNIX clone kernel.


Pointing out that a comment doesn't contribute to discussion doesn't contribute to discussion either. I'm definitely not contributing much by saying this.


I also see a fair few elements on that list as being problematic, to say the least. Can't stand Rust, though, so for those times I really need high performance I try and keep my C knowledge sharp-ish.

Fortunately GCC has a whole bucket-list of warnings that can be enabled (I like compiling with -Wall -Wextra -pedantic, myself) which can, combined with proper tooling, catch many issues.



