The Shortest Crashing C Program (llbit.se)
270 points by cfj on May 24, 2013 | 89 comments



This reminds me of "A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux" [1]. The author tries to create the smallest possible ELF executable. You would think it'd be easy... :) Go read it. Very cool!

[1] http://www.muppetlabs.com/~breadbox/software/tiny/teensy.htm...


The way he just keeps pushing further and further pleases the hacker inside me. A recommended read for sure.


It depends on the definition. You can do better than this if you define a valid C program as anything that passes through the C compiler and generates an executable. Behold the zero-length program:

    $ touch a.c
    $ gcc -c a.c
    $ ld a.o
    ld: warning: cannot find entry symbol _start; defaulting to 0000000000400078
    $ ./a.out
    Segmentation fault


You can build it with a single command:

gcc -nostdlib ./empty.c -o ./empty

Edit: This one actually runs correctly:

    $ touch empty.c
    $ gcc -static -nostartfiles ./empty.c -e_exit -o ./empty
    $ ./empty && echo $?
    > 0


That's a shame. At least at one point in history, that was the shortest-known quine:

http://www.ioccc.org/years.html#1994_smr


It still is, provided you follow the build procedure prescribed by the author of that quine (check the Makefile from the contest):

    $ rm -rf a
    $ cp a.c a
    $ chmod +x a
    $ ./a
    $


Even with the regular procedure, it's a quine if you ignore stderr and only look at stdout, which should definitely count.


I find it interesting that zsh displays the error from exec, but bash does not:

    zsh: exec format error: ./a


Why?

The file is marked as executable, so the shell very reasonably tries to execute it by calling some well-chosen member of the exec() family (http://linux.die.net/man/3/exec).

The exec() function then needs to open and parse the file according to the formats it supports, which of course fails since the file is empty.

Do you simply mean that you expected the shell to validate this, and not try to execute empty files?


Traditionally, if the kernel cannot execute the file, then it is treated as a shell (/bin/sh) script. (Somewhere along the line, #! got added to specify an interpreter other than the shell.) I read POSIX as requiring this <http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu...>, so if zsh claims to be a POSIX-compatible shell, that's probably a bug.

In Seventh Edition UNIX, /bin/true is an empty file; it is a shell script that succeeds and does nothing.

Some later commercial UNIXes are noted to have /bin/true contain nothing but comments containing a copyright notice for that nothing.
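
To make the fallback described above concrete, here is a minimal sketch in C, assuming a POSIX system with /bin/sh and a /tmp that allows execution. The helper names `run_with_sh_fallback` and `run_empty_executable` are made up for illustration and are not taken from bash, zsh, or any other shell.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` the way a traditional shell would.
 * Try execv() first; if the kernel rejects the file with ENOEXEC,
 * hand it to /bin/sh as a script instead.
 * Returns the child's exit status, or -1 on failure.
 */
int run_with_sh_fallback(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char *const argv[] = { (char *)path, NULL };
        execv(path, argv);              /* returns only on failure */
        if (errno == ENOEXEC)
            execl("/bin/sh", "sh", path, (char *)NULL);
        _exit(126);                     /* could not execute at all */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Create an empty file with the execute bit set and run it. */
int run_empty_executable(void)
{
    char path[] = "/tmp/emptyXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);
    if (chmod(path, 0700) != 0) {
        unlink(path);
        return -1;
    }
    int rc = run_with_sh_fallback(path);
    unlink(path);
    return rc;
}
```

With the fallback in place, the empty file behaves like the Seventh Edition /bin/true: the shell reads an empty script and exits with status 0. Without the `ENOEXEC` branch, the child would exit with 126, which is roughly the situation zsh reports.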


The particular version of POSIX you linked to (2004) actually forbids the behavior you describe if you read it strictly. [1] defines a text file as "A file that contains characters organized into one or more lines.".

This was altered for 2008[2] to "A file that contains characters organized into zero or more lines."

The 2008 version is actually broken, since it contradicts itself -- a file cannot "contain characters" on zero lines.

[1] http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_...

[2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_...


> a file cannot "contain characters" on zero lines.

I disagree. To me this doesn't mean that a file "contains at least one character", but that files are containers and their contained values are characters. Like most containers in computer science, the set of contained values can be empty, but it's still meaningful to say that it's a container that "contains characters".


> Do you simply mean that you expected the shell to validate this, and not try to execute empty files?

I understand what happens here and why there is an error message in zsh, but I'm surprised by the fact that bash does not signal the error (exec returns -1, after all).

Bash includes logic to parse ELF[1], so I guess that after exec fails it tries to parse the file and has a special case for empty files.

    [1]: http://utcc.utoronto.ca/~cks/space/blog/unix/BashSuperintelligentExec


Does not compute. At least on OS X:

   $ ld a.o
   ld: warning: -macosx_version_min not specified, assuming 10.7
   Undefined symbols for architecture x86_64:
    "start", referenced from:
       implicit entry/start for main executable
   ld: symbol(s) not found for inferred architecture x86_64


Does it work if you specify an entry address? Something like this:

gcc -nostdlib ./empty.c -e0 -o ./empty


How about:

    $ ld a.o -U _main -U start


Since ANSI/ISO C, a "translation unit" (whatever is left of a file after preprocessing) has to have at least one declaration; a zero length source file won't cut it.


Originally I thought I'd skip mentioning compiling empty files because, unless you link separately, `gcc` will refuse to link the result. I updated the article with a reference to your comment.


Actually you don't have to link it separately if you don't link against stdlib. See my comment here: https://news.ycombinator.com/item?id=5762578


Cool!


The explanation is not quite correct - execution starts at &main rather than the address given by the value of main. On VC++, at least - well, on my PC anyway - the process halts because the data segment doesn't have the execute bit set. It isn't trying to run code at address 0.

(If execution of bytes in the data segment were possible, which I'm sure it used to be, then you'd still likely get a crash, but it's not guaranteed. (uint32_t)0 is a valid sequence of instructions - it's ADD BYTE PTR [EAX],AL - and so if EAX contained a valid value then it would execute without a problem. Then, if the following byte were 0xC3 (RET) then the program would execute. OK, so that's all rather unlikely, but you have to bear these things in mind. So I think 0xCC (INT 3) would be a better choice.)


You're right about this. On my GNU/Linux machine, main is in the data segment. The process receives a segfault for trying to execute from a page marked as not executable.

    (gdb) print &main
    $1 = (int *) 0x600864 <main>
    (gdb) run
    Starting program: /tmp/s 
    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000600864 in main ()

If we mark the data segment as executable (quite easy for ELF), we can see execution starting at &main and continuing until the end of the page, then segfaulting when it tries to execute from an unmapped virtual address.

    (gdb) print &main
    $1 = (int *) 0x600864 <main>
    (gdb) run
    Starting program: /tmp/s 
    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000601000 in ?? ()

If we change the source to main=0xC3; as per your suggestion and we mark the data segment as executable, the program exits correctly (but with an exit status we don't control).

    (gdb) x/i &main
    0x600860 <main>:	retq   
    (gdb) run
    Starting program: /tmp/s_ret 
    [Inferior 1 (process 10588) exited with code 0140]


No, a ret instruction would probably segfault, depending on the content of the stack. To terminate a program you have to use the corresponding system call. On Linux:

    mov $1, %eax
    int $0x80


That (or something like it) is true for the process as a whole, but not necessarily for main. It's usual to call main from a library-provided function, so it returns just like any other function. This removes the need to special-case main in any way, and provides a space for any system-specific startup and shutdown code.

If you've got VS2012, you can see this code in the file at something like "c:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\crt\src\crt0.c" (it should be easy to find for other versions - it's been in pretty much that place, with that name, probably with those contents, since VC5 I think).

For glibc, see http://sourceware.org/git/?p=glibc.git;a=blob;f=csu/libc-sta....

My post was a bit x86/VC++-specific but the principles have been common to all the C environments I've used. I don't think I've ever used one that by default called your startup function directly, bypassing C runtime initialisation. (Though it's very easy to set this up with Visual Studio.)


to3m is correct. On my GNU/Linux machine main() is called from __libc_start_main(). A ret instruction in main() returns to __libc_start_main(), which in turn calls exit().
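
The call chain described above can be modelled in a few lines of C. This is only a sketch: `sketch_startup` and `sample_main` are hypothetical names, and a real runtime such as glibc's __libc_start_main() does far more setup and calls exit() rather than returning.

```c
#include <stddef.h>

/* Signature of the entry function the runtime calls. */
typedef int (*main_fn)(int argc, char **argv);

/*
 * Hypothetical stand-in for a C runtime's startup routine.
 * It returns the status that a real runtime would pass to exit().
 */
int sketch_startup(main_fn entry, int argc, char **argv)
{
    /* Runtime initialisation would happen here. */
    int result = entry(argc, argv);  /* a plain `ret` in main lands back here */
    /* Runtime shutdown (atexit handlers, stdio flushing) would happen here. */
    return result & 0xff;            /* on POSIX only the low 8 bits reach the parent */
}

/* Example entry point that returns octal 0140, as in the gdb session above. */
int sample_main(int argc, char **argv)
{
    (void)argc;
    (void)argv;
    return 0140;
}
```

Calling `sketch_startup(sample_main, 0, NULL)` yields 96, which is 0140 in octal. In the main=0xC3 experiment nothing sets the return register deliberately, so the status is whatever happened to be in it when the ret executed.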


Seems to work really well: It even crashed the website.


My shitty server was never meant to handle HN traffic.


But now we can't marvel at how short it is.


I was about to say the same thing :D


Who says it will crash? Could run very nicely, printing a list of prime numbers, or write poetry, or anything else that undefined behaviour encompasses.


"global variables in C are initialized to zero implicitly"

NULL pointers will lead to a crash. It would be more interesting to have it as a random pointer, which could do just about anything.


> NULL pointers will lead to a crash.

Not in C. Dereferencing a NULL pointer is undefined behavior, so any of the actions described by the parent would be correct.


Is 0 really the same thing as 'NULL' in the context of C? If you actually wanted a pointer to the beginning of memory, you would dereference 0, which has the well-defined meaning of getting whatever is at memory address 0. When the program attempts to get that, it is shut down by the system.


> Is 0 really the same thing as 'NULL' in the context of C?

Yes. I don't have chapter and verse handy but it is in the standard. The bit pattern of a null pointer is not required to be all zeros (so memset(&p, 0, sizeof(p)) is not guaranteed to yield a null pointer), but a null pointer must compare equal to 0, and assigning 0 must produce a null pointer.

[Edit: OK, in C99 this is covered in 6.3.2.3: Pointers. "An integer constant expression with the value 0, or such an expression cast to type void * , is called a null pointer constant." Then 7.17 paragraph 3 says that NULL expands to a null pointer constant.]
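
A small sketch of the distinction between the constant 0 and an all-zero bit pattern. The function names are made up for illustration; only the first result is guaranteed by the standard.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if a pointer assigned the constant 0 compares equal to NULL.
 * The standard guarantees this on every conforming implementation.
 */
int zero_constant_is_null(void)
{
    int *p = 0;                 /* 0 here is a null pointer constant */
    return p == NULL && !p;
}

/*
 * Returns 1 if a pointer whose bytes are all zero compares equal to NULL.
 * This is NOT guaranteed by the standard; it merely happens to hold on
 * mainstream platforms where the null pointer is represented as all zeros.
 */
int zeroed_bytes_are_null(void)
{
    int *p;
    memset(&p, 0, sizeof p);
    return p == NULL;
}
```

On common desktop and server platforms both functions return 1, which is why the difference is easy to overlook.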

> If you actually wanted a pointer to the beginning of memory, you would dereference 0,

Yeah, it's really easy to set up an environment where that happens. At one point I was experimenting with writing a small/toy kernel for x86 and I mapped the virtual address 0 to a valid page, and boom, dereferencing NULL did stuff. Not a great idea to set up the page tables that way for obvious reasons, but I'm going to guess that lots of hardware out there will let you do it...

In the old days of 16-bit x86, linear address 0 held the interrupt vector table, so as I recall lots of DOS (maybe even Win9x) environments had dereferencing NULL do meaningful (surely confusing) things.


Is there any standard-compliant way to crash in C? A call to abort maybe?

Maybe the task should have been formulated as the shortest valid C program that invokes undefined behaviour instead. (Or maybe the task isn't very interesting either way.)
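
On the abort() question: the standard describes abort() as causing abnormal program termination unless SIGABRT is caught and the handler does not return. Here is a sketch of how one might observe that from a parent process. It assumes a POSIX system for fork() and waitpid(), and the function name is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that calls abort() and report how it died.
 * Returns 1 if the child was killed by SIGABRT, 0 if it ended some
 * other way, and -1 if fork() or waitpid() failed.
 */
int child_dies_from_abort(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
        abort();                /* well-defined abnormal termination */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

Whether this counts as a "crash" is a matter of definition: the child is terminated by a signal, but no undefined behaviour is involved.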



Wordpress strikes again; "error establishing database connection"


Bad hosting strikes again.


I have bad hosting with good software. It ran flawlessly with 8-15ms generation times while at #3 on the HN homepage for a couple of hours; only the network latency went up, peaking at ~1.2 seconds (I have less than 1 Mbps upload here). The page also executes multiple database queries for each page load, just like WordPress. No caching needed for me, it's all about optimization.


Well, the mistake in this particular case is probably allowing more web application processes than database connections from those processes, which is an easy thing to get wrong.


What software do you use?


Self-written, no framework used. It's a simple blog with quite custom requirements, so I figured why not just build a custom one. It runs on an Intel Atom with 1GB RAM (and there's more to run than a WAMP stack) and an 832kbps uplink.

As for software, I wrote it for PHP 5.3 (nowadays upgraded to 5.4 though) with MySQL and persistent database connections. The server is Windows 7 with Apache 2.4.


The first IOCCC winner declared main as a short[] of VAX machine code: http://www.ioccc.org/1984/mullender.c

You could probably do the same thing in x86 and it'd work on a modern compiler.


It won't work on modern Linux with modern CPUs because the array will not be in an executable mapping.


> Also, global variables in C are initialized to zero implicitly, so this is equivalent:

EDIT: this is wrong, see below.

That's wrong. 'static' variables are initialized to zero. Non-static variables are un-initialized, so they have a "random" value.

See:

    $ valgrind ./a.out
    ==5118== Memcheck, a memory error detector
    ==5118== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
    ==5118== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
    ==5118== Command: ./a.out
    ==5118==
    ==5118==
    ==5118== Process terminating with default action of signal 11 (SIGSEGV)
    ==5118== Bad permissions for mapped region at address 0x600864
    ==5118== at 0x600864: ??? (in /home/def/a.out)
    ==5118== by 0x4E54A14: (below main) (in /usr/lib/libc-2.17.so)


See my post: https://news.ycombinator.com/item?id=5762363

main will have a value of zero, and 0x600864 will presumably be &main (it's not the initial arbitrary value of main).

Auto variables are left uninitialized so that they don't have to be given a value when they're allocated. It's for efficiency, and it makes the compiler simpler to have this blanket rule rather than have it try to figure out the minimal initializations necessary (which probably isn't even possible). But this consideration doesn't apply to globals or statics, because the initialization can be done at compile time, or (sometimes, in C++) on program startup.


> the initialization can be done at compile time, or (sometimes, in C++) on program startup

With ELF binaries for C programs it's done at startup as well. The data segment is created as having memory size SIZEM and file size SIZEF. If SIZEF < SIZEM, memory from SIZEF to SIZEM is set to 0.
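
The zero-fill rule described above can be sketched as follows. The `struct segment` type is a hypothetical, simplified stand-in for an ELF program header (real loaders read the p_filesz and p_memsz fields), and `load_segment` is an illustrative name, not a real loader function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one loadable segment. */
struct segment {
    size_t file_size;   /* bytes stored in the file (SIZEF) */
    size_t mem_size;    /* bytes occupied in memory (SIZEM) */
};

/*
 * Copy the file-backed part of a segment into `dest` and zero the rest.
 * `dest` must have room for seg->mem_size bytes.
 * Returns the number of bytes that were zero-filled.
 */
size_t load_segment(unsigned char *dest, const unsigned char *file_bytes,
                    const struct segment *seg)
{
    if (seg->file_size > seg->mem_size)
        return 0;                       /* malformed header: load nothing */

    memcpy(dest, file_bytes, seg->file_size);

    size_t zeroed = seg->mem_size - seg->file_size;
    memset(dest + seg->file_size, 0, zeroed);
    return zeroed;
}
```

Variables without an explicit initializer, such as the `main` in the article, would live in the zero-filled tail, so they cost no space in the file.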


Actually it is both. In C, variables with static storage duration are zero initialized.

Global variables (variables at file scope) and variables declared with the static keyword both have static storage duration.


That's correct, my bad.


But global variables are static.


No, they're not. In fact, if the program used 'static main;' instead, it wouldn't even compile because the 'main' symbol wouldn't be visible to the linker.


Yes they are! Global variables have static storage duration and are therefore default-initialized. Be careful with the word 'static', which does not always correspond to the keyword static, which itself has several meanings. When used with a global variable, the static keyword does not mean "static storage duration"; it only means no external linkage.


Both have so-called "static storage duration", which is what influences the initial value. See C99 standard, section 6.2.4 paragraph 4:

"An object whose identifier is declared with external or internal linkage, or with the storage-class specifier `static' has /static storage duration/. Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup."

The default initial value of objects with static storage duration is dealt with in 6.7.8 paragraph 10. Basically: pointers are set to null, arithmetic types are set to zero, and aggregates are initialized recursively by the same rules.
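
A short sketch of those rules in action. All names here are made up for illustration; the point is that every object below has static storage duration, whether or not the static keyword appears.

```c
#include <stddef.h>

int file_scope_int;             /* external linkage, static storage duration */
static int internal_int;        /* internal linkage, static storage duration */
static int *internal_ptr;       /* pointers start out as null pointers */

/* Aggregates are initialized member by member, recursively. */
static struct {
    int a;
    double b;
    char *c;
} aggregate;

/* Returns 1 if every object above started out as zero or null. */
int all_start_at_zero(void)
{
    static int local_static;    /* block scope, still static storage duration */

    return file_scope_int == 0
        && internal_int == 0
        && internal_ptr == NULL
        && aggregate.a == 0
        && aggregate.b == 0.0
        && aggregate.c == NULL
        && local_static == 0;
}
```

An automatic variable declared inside `all_start_at_zero` without an initializer would get no such guarantee.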


Ok, well I agree. I mean global variables are not created dynamically. There is room reserved for them in the data segment which is initialized to 0. Can you give me an example where a global variable isn't initialized to 0? Your valgrind example doesn't say much about the value in the main variable ..

Edit, @deweerdt: ok :)


@bnegreve I can't reply to your post, but I was mistaken. Externally visible symbols are also initialized to 0.


You can go even shorter if you cheat:

     $ cat short.c
     M
     $ gcc -DM='main;' short.c -o short
     $ ./short
     Segmentation fault


The Shortest Crashing Wordpress Site


The site seems down, "Error establishing a database connection"


Which has its own irony


:) works as intended


I'm not convinced this is a C89 program. It is only an "accident" that the linker doesn't know about types.

I find it hard to believe that the C89 spec states that an integer called "main" is to be considered the main function, and suspect this is undefined behaviour (though I've not checked).


It compiles and runs with gcc -std=c89 and gcc -std=c99, so even if it's not a true C89 program, it's a compilable GNUC89 program.

  $ gcc -std=c99 -pedantic /tmp/main.c -o /tmp/main
  /tmp/main.c:1:1: warning: data definition has no type
      or storage class [enabled by default]
  /tmp/main.c:1:1: warning: type defaults to ‘int’ in
      declaration of ‘main’ [enabled by default]
  /tmp/main.c:1:1: warning: ‘main’ is usually a function
      [-Wmain]
  
  $ /tmp/main
  Segmentation fault (core dumped)


Not on mac: a.c -> main;

    gcc -nostdlib -std=c89 a.c -o a

    a.c:1: warning: data definition has no type or storage class
    Undefined symbols for architecture x86_64:
         "start", referenced from:
         -u command line option
    ld: symbol(s) not found for architecture x86_64
    collect2: ld returned 1 exit status


It can't be a valid C89 program. On many Harvard architecture based microprocessors data pointers and code pointers have differing size.


Is any crashing C program valid? You can only crash by invoking undefined behavior, and I believe that any program which invokes undefined behavior is "invalid". It's a major pitfall of C that determining "validity" requires solving the halting problem.


Different size maybe, but different busses^H^H^H^H^H^Haddress spaces definitely, and using an address on the wrong bus is a sure way to cause problems.


Of course, that is the definition of a Harvard architecture. It doesn't say that it won't link, just that it won't work. If compiling for an architecture with smaller code pointers than data pointers, the linker will most likely refuse to link it at all; otherwise it would have to truncate the addresses.


^W will delete the previous word.


How about:

    main(){*(int*)0=0;}
or:

    main(){*""=0;}
or:

    main(){main();}


The last one is just as likely an infinite loop as a crash. Even C compilers occasionally manage to do tail-call optimization these days.


All substantially longer than the example given.


The site was down when I made my suggestions. Still, I think mine may be a bit more language compliant than the shortest variants in the article.


Edit: deleted because I got to the Google cache and that's what the site was suggesting.


That's essentially where he's going with it, noting that you can even leave off the "=0". But as others here point out there's some question as to how many linkers will actually produce an executable image from that source.


Seems appropriate that the default C program, as it were, segfaults.


"address 0, which is not an address that we have access to"

If I'm not mistaken, there are platforms like TI C600 DSPs for which 0 is the start of the usable address space.


Interesting. We tried to do the same for Haskell. The shortest we could come up with:

import Unsafe.Coerce;main=unsafeCoerce()1


Why not:

    main=undefined


That throws an exception, I believe. We wanted something that actually segfaults.


That really depends on what you mean by "crashing"


With Visual Studio 6.0 (AFAIR), I made a short program that crashed the compiler:

int a;::a::b();


It's also the shortest C program that you can link at all.


    $ echo "m;" > short.c
    $ gcc -O0 -c short.c
    short.c:1: warning: data definition has no type or storage class
    $ ld -e _m -o short short.o
    ld: warning: -macosx_version_min not specified, assuming 10.7
    $ ./short
    [1] 2040 segmentation fault ./short


"The shortest crashing C89 program" to be more precise. :)


An interesting question: what would be the shortest non-crashing C program?

main(){}

??



reminds me of the recent fad of TAS (tool assisted speed run) videos of "fastest crash" of video games.


Shortest crashing website: Error establishing a database connection


for those who see the server crashing: The program is in C89

main;



