This reminds me of "A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux" [1]. The author tries to create the smallest possible ELF executable. You would think it'd be easy... :) Go read it. Very cool!
It depends on the definition. You can do better than this if you define a valid C program as anything that passes through the C compiler and generates an executable. Behold the zero length program:
$ touch a.c
$ gcc -c a.c
$ ld a.o
ld: warning: cannot find entry symbol _start; defaulting to 0000000000400078
The file is marked as executable, so the shell very reasonably tries to execute it by calling some well-chosen member of the exec() family (http://linux.die.net/man/3/exec).
The exec() function then needs to open and parse the file according to the formats it supports, which of course fails since the file is empty.
Do you simply mean that you expected the shell to validate this, and not try to execute empty files?
Traditionally, if the kernel cannot execute the file, then it is treated as a shell (/bin/sh) script. (Somewhere along the line, #! got added to specify an interpreter other than the shell.) I read POSIX as requiring this <http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu..., so if zsh claims to be a POSIX compatible shell, that's probably a bug.
In Seventh Edition UNIX, /bin/true is an empty file; it is a shell script that succeeds and does nothing.
Some later commercial UNIXes are noted to have /bin/true contain nothing but comments containing a copyright notice for that nothing.
The particular version of POSIX you linked to (2004) actually forbids the behavior you describe if you read it strictly. [1] defines a text file as "A file that contains characters organized into one or more lines.".
This was altered for 2008[2] to "A file that contains characters organized into zero or more lines."
The 2008 version is actually broken, since it contradicts itself -- a file cannot "contain characters" on zero lines.
> a file cannot "contain characters" on zero lines.
I disagree. To me this doesn't mean that a file "contains at least one character", but that files are containers and their contained values are characters. Like most containers in computer science, the set of contained values can be empty, but it's still meaningful to say that it's a container that "contains characters".
> Do you simply mean that you expected the shell to validate this, and not try to execute empty files?
I understand what happens here and why there is an error message in zsh, but I'm surprised by the fact that bash does not signal the error (exec returns -1, after all).
Bash includes logic to parse ELF[1], so I guess that after exec fails it tries to parse the file and has a special case for empty files.
$ ld a.o
ld: warning: -macosx_version_min not specified, assuming 10.7
Undefined symbols for architecture x86_64:
"start", referenced from:
implicit entry/start for main executable
ld: symbol(s) not found for inferred architecture x86_64
Since ANSI/ISO C, a "translation unit" (whatever is left of a file after preprocessing) has to have at least one declaration; a zero length source file won't cut it.
Originally I thought I'd skip mentioning compiling empty files, because unless you link separately, `gcc` will refuse to link it. I updated the article with a reference to your comment.
The explanation is not quite correct - execution starts at &main rather than the address given by the value of main. On VC++, at least - well, on my PC anyway - the process halts because the data segment doesn't have the execute bit set. It isn't trying to run code at address 0.
(If execution of bytes in the data segment were possible, which I'm sure it used to be, then you'd still likely get a crash, but it's not guaranteed. (uint32_t)0 is a valid sequence of instructions - it's ADD BYTE PTR [EAX],AL - and so if EAX contained a valid value then it would execute without a problem. Then, if the following byte were 0xC3 (RET) then the program would execute. OK, so that's all rather unlikely, but you have to bear these things in mind. So I think 0xCC (INT 3) would be a better choice.)
You're right about this. On my GNU/Linux machine, main is in the data segment. The process receives a segfault for trying to execute from a page marked as not executable.
(gdb) print &main
$1 = (int *) 0x600864 <main>
(gdb) run
Starting program: /tmp/s
Program received signal SIGSEGV, Segmentation fault.
0x0000000000600864 in main ()
If we mark the data segment as executable (quite easy for ELF), we can see execution starting at &main and continuing until the end of the page, then segfaulting for trying to execute from an unmapped virtual address.
(gdb) print &main
$1 = (int *) 0x600864 <main>
(gdb) run
Starting program: /tmp/s
Program received signal SIGSEGV, Segmentation fault.
0x0000000000601000 in ?? ()
If we change the source to main=0xC3; as per your suggestion and we mark the data segment as executable, the program exits correctly (but with an exit status we don't control).
(gdb) x/i &main
0x600860 <main>: retq
(gdb) run
Starting program: /tmp/s_ret
[Inferior 1 (process 10588) exited with code 0140]
No, a ret instruction would probably segfault, depending on the content of the stack. To terminate a program you have to use the corresponding system call. On linux :
That (or something like it) is true for the process as a whole, but not necessarily for main. It's usual to call main from a library-provided function, so it returns just like any other function. This removes the need to special-case main in any way, and provides a space for any system-specific startup and shutdown code.
If you've got VS2012, you can see this code in the file at something like "c:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\crt\src\crt0.c" (it should be easy to find for other versions - it's been pretty much in that place, with that name, probably with those contents, since VC5 I think).
My post was a bit x86/VC++-specific but the principles have been common to all the C environments I've used. I don't think I've ever used one that by default called your startup function directly, bypassing C runtime initialisation. (Though it's very easy to set this up with Visual Studio.)
to3m is correct. On my GNU/Linux machine main() is called from __libc_start_main(). A ret instruction in main() returns to __libc_start_main(), which in turn calls exit().
Who says it will crash? Could run very nicely, printing a list of prime numbers, or write poetry, or anything else that undefined behaviour encompasses.
Is 0 really the same thing as 'NULL' in the context of C? If you actually wanted a pointer to the beginning of memory, you would dereference 0, which has the well defined meaning of getting whatever is at memory address 0. When the program attempts to get that, it is shut down by the system.
> Is 0 really the same thing as 'NULL' in the context of C?
Yes. I don't have chapter and verse handy but it is in the standard. The bit pattern of NULL is not required to be zero (so memset(&p, 0, sizeof(p)) is not guaranteed to yield null) but it must compare equal to 0 and assigning 0 must produce NULL.
[Edit: OK, in C99 this is covered in 6.3.2.3: Pointers. "An integer constant expression with the value 0, or such an expression cast to type void * , is called a null pointer constant." Then 7.17.3 says that NULL expands to a null pointer constant.]
> If you actually wanted a pointer to the beginning of memory, you would dereference 0,
Yeah, it's really easy to set up an environment where that happens. At one point I was experimenting with writing a small/toy kernel for x86 and I mapped the virtual address 0 to a valid page, and boom, dereferencing NULL did stuff. Not a great idea to set up the page tables that way for obvious reasons, but I'm going to guess that lots of hardware out there will let you do it...
In the old days of 16-bit x86, linear address 0 had the interrupt vector, so as I recall lots of DOS (maybe even Win9x) environments had dereferencing NULL do meaningful (surely confusing) things.
Is there any standard-compliant way to crash in C? A call to abort maybe?
Maybe the task should have been formulated as the shortest valid C program that invokes undefined behaviour instead. (Or maybe the task isn't very interesting either way.)
I have bad hosting with good software. Ran flawlessly with 8-15ms generation times at #3 on the HN homepage for a couple of hours; only the network latency went up, peaking at ~1.2 seconds (I've got less than 1mbps upload here). The page also executes multiple database queries for each pageload, just like Wordpress. No caching needed for me, it's all about optimization.
Well, the mistake in this particular case is probably allowing more web application processes than database connections from those processes, which is an easy thing to get wrong.
Self written, no framework used. It's a simple blog with quite custom requirements so I figured why not just build a custom one. It runs on an Intel Atom with 1GB RAM (and there's more to run than a WAMP stack) and an 832kbps uplink.
As for software, I wrote it for PHP 5.3 (nowadays upgraded to 5.4 though) with MySQL and persistent database connections. The server is Windows 7 with apache 2.4.
main will have a value of zero, and 0x600864 will presumably be &main (it's not the initial arbitrary value of main).
Auto variables are left uninitialized so that they don't have to be given a value when they're allocated. It's for efficiency, and it makes the compiler simpler to have this blanket rule rather than have it try to figure out the minimal initializations necessary (which probably isn't even possible). But this consideration doesn't apply to globals or statics, because the initialization can be done at compile time, or (sometimes, in C++) on program startup.
> the initialization can be done at compile time, or (sometimes, in C++) on program startup
With ELF binaries for C programs it's done at startup as well. The data segment is created as having memory size SIZEM and file size SIZEF. If SIZEF < SIZEM, memory from SIZEF to SIZEM is set to 0.
No, they're not. In fact, if the program used 'static main;' instead, it wouldn't even compile because the 'main' symbol wouldn't be visible by the linker.
Yes they are !
Global variables have static storage duration and are therefore default initialized.
Be careful with the word 'static', which does not always correspond to the keyword static, which has several meanings! When used with a global variable, the static keyword does not mean the same thing as "static storage duration". It only means no external linkage.
Both have so-called "static storage duration", which is what influences the initial value. See C99 standard, section 6.2.4 paragraph 4:
"An object whose identifier is declared with external or internal linkage, or with the
storage-class specifier `static' has /static storage duration/. Its lifetime is the entire
execution of the program and its stored value is initialized only once, prior to program
startup."
The default initial value of objects with static storage duration is dealt with in 6.7.8 paragraph 10. Basically: pointers are set to NULL, arithmetic types are set to zero, and aggregates are initialized recursively, member by member.
Ok, well I agree. I mean global variables are not created dynamically. There is room reserved for them in the data segment which is initialized to 0. Can you give me an example where a global variable isn't initialized to 0? Your valgrind example doesn't say much about the value in the main variable ..
I'm not convinced this is a C89 program. It is only an "accident" that the linker doesn't know about types.
I find it hard to believe that the C89 spec states that an integer called "main" is to be considered the main function, and suspect this is undefined behaviour (though I've not checked).
It compiles and runs with gcc -std=c89 and gcc -std=c99, so even if it's not a true C89 program, it's a compilable GNUC89 program.
$ gcc -std=c99 -pedantic /tmp/main.c -o /tmp/main
/tmp/main.c:1:1: warning: data definition has no type
or storage class [enabled by default]
/tmp/main.c:1:1: warning: type defaults to ‘int’ in
declaration of ‘main’ [enabled by default]
/tmp/main.c:1:1: warning: ‘main’ is usually a function
[-Wmain]
$ /tmp/main
Segmentation fault (core dumped)
gcc -nostdlib -std=c89 a.c -o a
a.c:1: warning: data definition has no type or storage class
Undefined symbols for architecture x86_64:
"start", referenced from:
-u command line option
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
Is any crashing C program valid? You can only crash by invoking undefined behavior, and I believe that any program which invokes undefined behavior is "invalid". It's a major pitfall of C that determining "validity" requires solving the halting problem.
Different size maybe, but different busses^H^H^H^H^H^Haddress spaces definitely, and using an address on the wrong bus is a sure way to cause problems.
Of course that is the definition of a Harvard architecture. It doesn't say that it won't link, just that it won't work. If compiling to an architecture with smaller sized code pointers than data pointers then the linker will most likely refuse to link it at all - otherwise it will have to truncate the addresses.
That's essentially where he's going with it, noting that you can even leave off the "=0". But as others here point out there's some question as to how many linkers will actually produce an executable image from that source.
[1] http://www.muppetlabs.com/~breadbox/software/tiny/teensy.htm...