Rebuild of the Debian archive with clang (debian.net)
175 points by yungchin on Feb 29, 2012 | 54 comments



Looks like the most comprehensive list of differences between clang and gcc. I'm amazed.

Apparently, most differences are either because clang and gcc use different standards or interpret them differently.

Also, only 9% of the Debian packages have issues, meaning clang is getting more and more worth considering.


If you set basic standards compliance aside, the remaining errors range from arguable code to horrifying coding practice.

http://clang.debian.net/status.php?version=3.0&key=VARIA...

http://clang.debian.net/status.php?version=3.0&key=NON-P...


Really I never got the argument for disallowing variable-length arrays at the end of a C structure. I completely agree with disallowing non-PoD variable-length arrays as well as variable-length arrays in the middle of a structure though.


VLAs at the end of C structures could have been allowed.

However, if you read the standard, you'll realize that it would have involved specifying far more exceptional cases than the mostly equivalent solution via flexible array members (i.e., allowing the last member to have an incomplete array type).
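For readers who have not met flexible array members, here is a minimal C99 sketch of the idea. The names `int_vec`, `int_vec_new` and `int_vec_sum` are made up for illustration and do not come from the thread; the size computation assumes `len` is small enough not to overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* C99 flexible array member: the last member has an incomplete array type.
 * (Hypothetical example type, not taken from the discussion.) */
struct int_vec {
    size_t len;
    int data[];   /* storage comes from the allocation, not from the type */
};

/* Allocate the header plus room for len ints in one block.
 * Returns NULL if malloc fails. */
struct int_vec *int_vec_new(size_t len)
{
    struct int_vec *v = malloc(sizeof *v + len * sizeof v->data[0]);
    if (v != NULL) {
        v->len = len;
        for (size_t i = 0; i < len; i++)
            v->data[i] = 0;
    }
    return v;
}

/* Sum the elements stored in the trailing array. */
long int_vec_sum(const struct int_vec *v)
{
    assert(v != NULL);
    long total = 0;
    for (size_t i = 0; i < v->len; i++)
        total += v->data[i];
    return total;
}
```

Note that `sizeof(struct int_vec)` does not count the trailing array at all, so the caller has to add the array's size when allocating.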


I'm not altogether sure about VLAs' place in C itself, but they (along with contiguous storage of heterogeneous data) are neat ideas that let you do a few things cleverly. I think the usual example is a linked list of strings. Done the "normal" way each string will cause two cache misses - one for the string data, and one for the list node. If you're super unlucky you might get three cache misses (one for the node, one for the string object, one for the string data...). Done this way you only get one cache miss, which is pretty good and better than any other language will give you.
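A rough C99 sketch of the linked-list-of-strings example, assuming "VLA at the end of a struct" here means a trailing flexible array. The link and the string bytes share one allocation, so they sit next to each other in memory. All names (`str_node`, `str_list_push`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* List node whose string data lives directly after the link pointer. */
struct str_node {
    struct str_node *next;
    char text[];              /* NUL-terminated copy of the string */
};

/* Push a copy of s onto the front of the list.
 * Returns the new head, or NULL if allocation fails (list unchanged). */
struct str_node *str_list_push(struct str_node *head, const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;                 /* include the terminator */
    struct str_node *node = malloc(sizeof *node + n);
    if (node == NULL)
        return NULL;
    memcpy(node->text, s, n);
    node->next = head;
    return node;
}

/* Walk the list and add up the string lengths. */
size_t str_list_total_length(const struct str_node *head)
{
    size_t total = 0;
    for (; head != NULL; head = head->next)
        total += strlen(head->text);
    return total;
}

/* Free every node; one free() per node releases link and string together. */
void str_list_free(struct str_node *head)
{
    while (head != NULL) {
        struct str_node *next = head->next;
        free(head);
        head = next;
    }
}
```

Compared with a node that holds a `char *`, this needs one `malloc` and one `free` per element instead of two.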

Another few cool things: If elements "know about themselves" (have a size element or a vtable etc etc) you can make stacks and queues without the need for a separate index.

There are also a few things you can probably only do in ASM: Say you have an "array" of objects "inheriting" from a common base (with sizes depending on their type) and all you ever want to do with them is iterate over them calling virtual functions on them. A nice way to do it would be to keep the iteration logic at the end of the virtual functions (which know about the sizes of their respective objects) and do something that smells a little like tail-call-elimination, maybe with a sentinel object on the end to wrap things up nicely. To get the best bang for your buck you'd need an architecture that isn't preachy about the stack or calling conventions, of course, but it still might be worth doing for a laugh.

Erm, I guess this wasn't too clear. In some pseudocode below:

    virtual void Factorial::process() {
        //processing logic:
        this->n *= (++this->i);
        print(this->n);
    
        //iteration logic: step to the next object in the array
        this += sizeof(Factorial);
        this->process(); //must do TCO
    }
    virtual void Message::process() {
        //processing logic:
        print(this->message);
    
        //iteration logic: step to the next object in the array
        this += sizeof(Message);//message stored by pointer
        this->process(); //must do TCO
    }
    virtual void Sentinel::process() {
        //iteration logic: end of the array, so stop
        return;
    }
    funnyArray<Processable> fa;
    fa.add(Factorial(1));
    fa.add(Message("hello"));
    fa.add(Sentinel());
    fa.run(process); //prints something like "1hello"
I guess you'd mostly want the iteration logic and the sentinel to be taken care of by the language/library, too, but that's probably not worth thinking about unless there's actually a real use-case for something like this...


The entire concept of variable-length C arrays is at best iffy, but including them in structs is pretty crazy.

Consider these two questions:

1) What is the sizeof a struct containing a variable-length array?

2) How do you create an array of structs containing variable-length arrays?


1) The sum of the sizes of the fixed-size members of the struct, plus any padding. In the context of C, this is expected and fairly sane. It also matches the C89 idiom of ending a struct with a single-element array when you want a variable-length array. If C allowed zero-element arrays, they would be used instead.

    /* c89 */
    struct MyVLAStruct {
        int len;
        int vla[1]; /* really len elements long */
    };
2) Carefully.


> 2) Carefully.

Can you provide code demonstrating how you will "carefully" create a C array of structs with a variable-sized member? It will be very educational for me, at least, and I think others, as well.


Well, in the normal case, you wouldn't do it. These variable length structures need to be created on the heap to be able to be used in a variable length way, for the most part, so you'd just put a pointer to them into an array.

However, I can think of a couple of methods, such as packing into an array, and using a second one to index it, like so:

    a   = [aaaa,bb,ccc,dd]
    idx = [0,4,6,9]
To get to the i'th element of a, accesses would go through idx like so: a[idx[i]]. In general, of course, there's no way to allow O(1) access and updates without occasional repacking.
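A minimal C sketch of that packing scheme, for read-only lookups. The struct and function names are hypothetical, and the `total` field is an added assumption so that the length of the last record can be recovered.

```c
#include <assert.h>
#include <stddef.h>

/* Variable-length records packed back to back, plus an offset table.
 * (Hypothetical layout illustrating the a/idx scheme.) */
struct packed {
    const char   *data;    /* all records concatenated */
    const size_t *idx;     /* idx[i] = start offset of record i */
    size_t        count;   /* number of records */
    size_t        total;   /* total bytes in data (assumed extra field) */
};

/* Return a pointer to record i in O(1); if len is non-NULL, store the
 * record's length there. */
const char *packed_get(const struct packed *p, size_t i, size_t *len)
{
    assert(p != NULL && i < p->count);
    /* A record ends where the next one starts, or at the end of data. */
    size_t end = (i + 1 < p->count) ? p->idx[i + 1] : p->total;
    if (len != NULL)
        *len = end - p->idx[i];
    return p->data + p->idx[i];
}
```

Lookups are constant time, but growing any record other than the last one means shifting everything after it and rewriting the offsets, which is the repacking cost mentioned above.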


This is exactly the problem I'm getting at: you're not actually working within the confines of C here. You're creating funny workarounds for the fact that certain C features don't work how you want them to, and you're violating the type system in the process.

The concept of VLAs does not fit the language well; they're inherently something of an anomaly. Allowing them in structs would simply multiply the anomalies.

I'm frankly shocked that they were codified in C99 at all, rather than codifying something akin to alloca() with implementation-defined behavior, but I'm infinitely grateful the committee did not elect to make them anything more than they are -- which is a semi-portable mechanism for allocating arbitrary amounts of automatic memory.
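To make "automatic memory sized at run time" concrete, here is a small sketch using a C99 VLA as scratch space. The function `median_of` is a made-up example; it assumes `n` is non-zero and small enough to fit on the stack, since a VLA gives no way to detect allocation failure.

```c
#include <assert.h>
#include <stddef.h>

/* Return the median of xs[0..n-1] (the upper median when n is even),
 * leaving xs untouched. Hypothetical helper for illustration only. */
int median_of(const int *xs, size_t n)
{
    assert(xs != NULL && n > 0);   /* a zero-length VLA is undefined */

    int tmp[n];                    /* C99 VLA: released on return */

    /* Insertion sort: place each xs[i] into the sorted prefix of tmp. */
    for (size_t i = 0; i < n; i++) {
        size_t j = i;
        while (j > 0 && tmp[j - 1] > xs[i]) {
            tmp[j] = tmp[j - 1];
            j--;
        }
        tmp[j] = xs[i];
    }
    return tmp[n / 2];
}
```

The scratch array needs no `free()`, which is the convenience, and the hazard is that an oversized `n` overflows the stack instead of returning an error.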


What could be nice is if the error pages could suggest possible fixes for the errors that have "will never support". Might be helpful to guide bug-fixers (esp. ones that are starting out).


There's a clickable text on the top of those pages that points at the clang FAQ fragment suggesting possible fixes.


This is seriously amazing and will, certainly, improve all the codebases involved. Sylvestre and the others involved deserve a lot of good karma for this.

Going a step further, wouldn't it be great if all packages had automated tests that could easily be run on the 91% of the packages that were successfully built?


Thanks. I appreciate it! :)


Thank you.


If there was ever an easy way to get involved in open source development in general, this is it. It's pretty much a list of trivial 1-line bugs to fix!


And I will be happy to help contributors in this task!


And patches to generate, mailing lists to find and emails to write. If only every open source project were on GitHub.


That's also something I meant by getting involved in open source. I'm sure for some projects you'll have to figure out their local rules, and sometimes maybe even explain to the developers what clang is, why the change is required to support it, and why they should care. It's the whole experience, not only the code change ;)


Amazing list.

I was frustrated that I didn't manage to figure out how to locate the results for a given package, if there were any.

Without that, how should I (as a package upstream owner) know if I need to fix my code, or at least analyze the results with respect to my particular package?


The build logs are all together:

http://clang.debian.net/logs/2012-01-12/

(I guess the ones with a 'b' appended are for clang)


Fantastic, just what I was after. Thanks a lot!

And phew, my package was not listed. :)


Would be really interested to see what caused those segfaults! But the link in the table just goes to the table.


Here's one from one of the logs (http://clang.debian.net/logs/2012-01-12/libgtkada2_2.14.2-5_...):

  Building libgtkada.so.2.14.2
  cd obj-shared; x86_64-linux-gnu-gcc -shared -fPIC -Wl,--as-needed \
  	  -o libgtkada.so.2.14.2 -Wl,-soname,libgtkada.so.2.14.2 glib*.o gdk*.o \
  	  gtk*.o pango*.o misc.o misc_broken.o -lgtk-x11-2.0 -lgdk-x11-2.0 \
  	  -latk-1.0 -lgio-2.0 -lpangoft2-1.0 -lpangocairo-1.0 \
  	  -lgdk_pixbuf-2.0 -lcairo -lpango-1.0 -lfreetype \
  	  -lfontconfig -lgobject-2.0 -lgmodule-2.0 \
  	  -lgthread-2.0 -lrt -lglib-2.0   -lgnat -lX11
  Segmentation fault
In this case it looks like the compiler is crashing when invoking the linker (I'm assuming that x86_64-linux-gnu-gcc has been aliased to clang).

The others: http://clang.debian.net/status.php?version=3.0&key=SEG_F...


However, this doesn't prove that clang produces error-free (was afar the compilation goes) executables.


No one can ever prove such a thing. Neither can it be proven that gcc, or any compiler for that matter, produces error-free executables. What this does, however, is provide information and analysis about the quality of the clang compilation process as compared to gcc.


No one can prove it, but at least the GCC versions are actually run by people. These versions are created, but never actually used.

He should do some fuzz testing of these programs, with the exact same fuzz sent to the GCC versions, and then report any differences. But collecting the "results" would be hard - it's not always visible in output, you'd have to track system calls and IO.


The ideal scenario would be if upstream projects provided embedded test code with standardized hooks so that tests could be built, executed and results collected in an automated way.

Even if we started with a small test set for some projects, it would be a huge win in the long run just to have this scaffolding in place.

Any ideas on how to make a distro-agnostic testing hook?


It does not necessarily have to be distro-agnostic.

Debian packages can already hook the upstream's test suite (e.g. via dh_auto_test).

From my (extremely limited and mostly dynamic language) Debian packaging experience, it seems that more often than not, packages do not use this existing hook. Not sure why that is though.


As an example, gcc generates nearly unusable code for a Via Isaiah in amd64 mode if you compile with -O1, and this has apparently been a known issue for at least 2 years.


It can be proven, see CompCert C compiler: http://compcert.inria.fr/compcert-C.html.


That only proves that a compiler behaves according to a language definition. It does not prove the executables will behave as you expect them to behave, and thus there is no way of knowing the executables will be error-free.


Frownie said error-free compilation, not error-free in all possible aspects.


Doesn't Apple use clang for OS X? If it does, that's as good a validation as any for me.


Another (IMHO better) validation is that FreeBSD compiles the kernel with clang: http://wiki.freebsd.org/BuildingFreeBSDWithClang


FreeBSD can compile the kernel with Clang, but at least as of 9.0-RELEASE, the default is still GCC. They're aiming to switch to Clang for as much as possible for 10.0-RELEASE, in a couple of years.


FreeBSD has made a lot of progress with clang in the base system (which consists of both the kernel and standard userland); I run a clang-compiled FreeBSD on my laptop.

But that said, we have a lot more clang coverage through the work done by the ports team who have been running experimental builds with clang for more than a year. That work is documented here: http://wiki.freebsd.org/PortsAndClang


Isn't most of their code in Objective C? Does that use clang too?


clang does support Objective C (as does gcc).

Apple does look after a lot of plain old C (e.g. Core Foundation) and C++ (e.g. WebKit) too, so clang handles all three.


If I'm not mistaken, the drivers are also written in C++ (They used to be in Objective-C).


And? Nobody expects that gcc, or Intel's reference compilers, or Microsoft's compilers produce provably correct output.

How could they? The C standard contains ambiguity in several areas.


Oops, typo: "(as far as the compilation goes)"


Why is everyone itching to get off of GCC? Or is it all just posturing, to get GCC to work harder now that it has competition?


GCC is becoming larger and more unmaintainable with each release. Competition in this space has been pretty sorely lacking for quite some time. The OpenBSD folks are also pushing PCC as a GCC alternative.

It'll be nice to have a BSD-licenced GCC equivalent for distribution with FreeBSD/OpenBSD, too.

Clang also has REALLY great error messages compared to GCC. It will tell you exactly where you forgot a comma, semicolon, quote, etc. instead of randomly pointing to some line around the error in question.


It is useful to keep in mind a common confusion.

Apple and various BSDs stopped updating GCC around version 4.2, the last GPLv2 version, which is approximately 4 years old now. For people who only develop on those platforms, this is what "GCC" means. So when someone from Apple or someone with an obvious BSD bias talks about how much better clang is than "GCC", they're usually talking about how much better clang is than an old unmaintained version of GCC.

clang does have strengths, but when we make comparisons, we should be clear about what it is we're comparing.


Unless there's been a concerted effort in the last ~year or so to improve GCC's error messages and compilation speed, clang's primary technical strengths remain unchanged relative to GCC.

Saying we're comparing old versions isn't really useful unless the new versions have actually addressed the relevant issues.


Recent versions of GCC do, in fact, have improved error messages. For example, GCC 4.6 fixed a problem that the post I was replying to mentioned, where a missing semicolon after a struct definition would cause GCC to elicit an error message pointing to some other nearby line.

If you're comparing clang to just GCC 4.2, then say that. If you've compared it with recent versions of GCC too, then say that.


    $ gcc -v
    ...
    gcc version 4.6.2 20120120 (prerelease) (GCC)
When it comes to diagnostics, GCC isn't even competitive.


Well Clang/LLVM is also becoming larger with each release. As for GCC becoming more 'unmaintainable', can you point me to anything supporting that notion? It's certainly not unmaintainable now as proven by them regularly releasing new improved versions which serve as the de facto compiler toolchain for open source systems.

Also, when speaking of GCC and how old and full of cruft it is, you need to realize that its code base is being continuously improved with each new release, with a strong focus on modularity. It's not the same codebase as it was back in 1987.

I'm very happy Clang/LLVM exists and is progressing at a great pace, as I see increased availability of open source compiler toolchains as something awesome. Also, LLVM really does have some great features like the best-in-class error reporting, and it also serves an undeniable purpose as a JIT framework for numerous projects.

That said, I don't understand those trying to push the idea of GCC being obsolete (unless they are nurturing some political/licence-based crusade), as it's a very mature compiler toolchain targeting a large number of architectures and (at least for me) quite importantly generating faster code than Clang/LLVM.

Also the notion that companies would be afraid of GCC strikes me as odd given that we have Red Hat, IBM, Google, CodeSourcery, Suse etc employing full-time GCC developers.

I hope to see these compiler toolchains continuously going neck and neck in the future and thus provide the open source ecosystem with two first-class options, each with their respective strengths.


"the notion that companies would be afraid of GCC strikes me as odd given that we have Red Hat, IBM, Google, CodeSourcery, Suse etc employing full-time GCC developers."

Afraid? I expect that these companies have spent serious money on lawyers to figure out whether GPL 3 was acceptable for them before committing to it.

Also, Mac OS supports OpenCL (http://en.wikipedia.org/wiki/OpenCL). One could argue that that makes the compiler more a part of the OS than "just another application that runs on it". If they used GPL code there, a perceived risk could be that they could be forced to make their OS GPL.


Even if everyone doesn't switch, having more than one widely used open-source C compiler would be healthy. If anything, it allows some non-portable code to be found and fixed. Also, competition is healthy. The different compiler developers can choose to focus on specific areas of improvement (e.g. compile speed vs executable speed) rather than a one-size-fits-all approach.


Fully agreed, having two open source, free compiler toolchains with which you can fully build your software stack beats having to rely on just one. And like you said competition is always good. Of course that is why it's better if both compilers continue to flourish, which is what I'm hoping/expecting will be the case.


This kind of work is valuable even without getting rid of GCC. Each compiler has a slightly different implementation of warnings; getting the code to compile under two compilers can find issues that you wouldn't find under just one compiler.

It also increases flexibility for developers. Perhaps I prefer to use Clang for its better error messages, or because I find it easier to target at a new platform that I'm working on. If the code already runs under Clang, then that's work I don't have to do. Clang can also be used as an independent parser more easily than GCC, so tools that use it to generate code browsing information or the like benefit from the code already being parseable with Clang.

Compiling under another compiler will also catch more of the cases in which you're depending on non-portable behavior or extensions, which increases the likelihood that you will be able to run it under static analyzers like Coverity, Clang Analyzer, Klee, Splint, or the like.

Basically, beyond wanting to move away from GCC, there is real value in putting work into making sure that everything compiles under another compiler. It's like making sure your web site works on multiple browsers; it can help you catch errors earlier, allow people to use better debugging tools, widen your audience, and help future-proof you.

Now, there are some people who are itching to get away from GCC. For one thing, it is a bigger, older, wartier project, that is harder to work with. People find Clang and LLVM more modular and extensible. For another, there are some companies that are deathly afraid of the GPLv3, and won't touch anything to do with it (Apple in particular, as well as some other embedded device manufacturers I believe). But there are many reasons for wanting to make sure that your software compiles under Clang, and not all of them mean that you want to ditch GCC, just that Clang provides additional value as well.


sane error messages are worth the switch alone.



