I've been playing around with a tool to answer the more generic question: why is my binary (written in C, C++, Rust, etc) so large?
We use CPU profilers to tell us why our programs are slow. My tool is intended to be a size profiler: why is my program big?
It turns out there are some really interesting analyses and visualizations you can perform to answer this question.
A few direct comments on the article:
- I do not recommend stripping your binaries completely! Use "strip --strip-debug" instead (or just don't compile with -g). You should realize that debug information and the symbol table are two separate things. The debug information is much larger (4x or more) and much less essential. If you strip debug information but keep the symbol table, you can still get backtraces.
- I don't believe -Wl,--gc-sections has any effect unless you also compile with -ffunction-sections and/or -fdata-sections, which these examples for C/C++ do not.
If you care about binary size in C or C++ you should probably be compiling with -ffunction-sections/-fdata-sections/-Wl,--gc-sections these days. Sadly, these are not the default.
- It is fine to strip both debug information and symbol table if you want the distributable (I mentioned the ISP at the beginning for a reason) and the program does not rely on them in any way. Actually, I wonder why separate debug files [1] are not the norm on Unixes.
- `-Wl,--gc-sections` does have an effect even in the absence of `-ffunction-sections` (and so on), because libc and libstdc++ are already compiled in a way that allows for GC, perhaps necessarily. The example had a single function anyway, and I was lazy enough to exploit the fact that `-ffunction-sections` wouldn't have made a difference...
Without symbols you can't get backtraces, profile the program, use function-based DTrace probes, readably disassemble it, etc. I'm not saying it's impossible to distribute stripped binaries, I'm just saying I don't recommend it. Compared with the debug information, I think the symbol table is much bigger bang for the buck, considering how much smaller it is.
It's strange -- I can verify with -Wl,--print-gc-sections that this is indeed discarding some sections from a static glibc link, so I was wrong about that. On the other hand I can also see plenty of sections in libc.a that have more than one function in them -- not sure why this would happen if libc was indeed compiled with -ffunction-sections.
I agree that separate debug files are nice, and OS X's implementation of them is especially nice, since it does a very good job of finding the right debug information for an executable.
Yeah, I don't doubt the usefulness of the symbol table. I'm probably thinking of the distributable in Windows, where you... don't really do such things.
IIRC the coding convention of glibc is that most functions (and probably all public functions) are contained in their own files, so it effectively ends up with a similar result even when `-ffunction-sections` is missing.
Actually, stripping debug information (or compiling without -g) makes it very difficult to troubleshoot a problem in production, where the binary might be deployed at a customer's site and where the source code might not be available (firewall, no network connection, ...). The additional information is stored in separate ELF sections that are ignored by the runtime linker, so it does not hurt the performance of the program.
The space savings come nowhere close to the benefit of having the additional information available when debugging.
Some operating systems, for example SmartOS and any other OS based on the illumos source code, even embed type information derived from the source code into ELF binaries and libraries, in a special compact format (CTF), during the build with the ctfconvert(1ONBLD) / ctfmerge(1ONBLD) tools[1][2].
If you are ever considering stripping the binary just to save some disk space but do not have a good reason for it (like building for a space-constrained appliance), please abstain from doing so; every developer and engineer trying to debug your program will be thankful to you if you do not remove the debugging information.
Are you sure those flags have any effect? In the C++ project I'm trying it on, -ffunction-sections/-fdata-sections/--gc-sections causes the object files to grow a bit, but the resulting binary is exactly as big, to the byte, as it was without the flags.
The short version is: as small or large as you want to trade-off convenience vs performance vs size.
I've made useful firmware for a microcontroller (yes, in Rust) which is just 5KB. You can, as the article shows, dynamically link, and then it's about the same size as the equivalent C/C++.
The point is, there's nothing inherent about Rust - the language - which results in binaries appreciably different in size to what you can achieve in C/C++.
> I've made useful firmware for a micro-controller (yes in Rust)
Would you mind sharing which microcontroller and how to get a Rust compiler for it? I would love to use Rust to program microcontrollers. Every time I've looked into this, I've thought it wasn't possible, because I don't see any microcontrollers in the list of supported platforms here (https://doc.rust-lang.org/book/getting-started.html#platform...) or here (https://github.com/rust-lang/rust/tree/master/mk/cfg). I've considered trying avr-rust (https://github.com/avr-rust/rust), but the README says "NOTE: This does not currently work due to a bug." Any pointers to get started would be appreciated.
I've used Rust on a variety of ARM Cortex-M based microcontrollers including Atmel SAMD21, NXP LPC1830, and STM32F4. AVR is tougher because that's an 8-bit architecture with very new support in LLVM.
On microcontrollers, you use libcore instead of the full libstd, for lack of an OS or memory allocator. Libcore is easy to cross compile because it has no dependencies. You provide a target JSON file to specify LLVM options, and a C toolchain to use as a linker, and nightly rustc can cross-compile for platforms supported by LLVM.
Things like interrupt handlers and statically-allocated global data require big chunks of unsafe code. Rust has a promising future in this space, but it will take more experimentation to get the right abstractions to make microcontroller code more idiomatic.
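To make that concrete, a minimal libcore-only program looks roughly like this (a sketch only: the register address is made up, and a real project would use a proper linker script plus something like the cortex-m-rt crate for the entry point and vector table):

    #![no_std]
    #![no_main]

    use core::panic::PanicInfo;
    use core::ptr;

    // Hypothetical memory-mapped GPIO register, purely for illustration.
    const GPIO_OUT: *mut u32 = 0x4000_0000 as *mut u32;

    #[no_mangle]
    pub extern "C" fn main() -> ! {
        loop {
            // Touching hardware registers directly is inherently unsafe.
            unsafe { ptr::write_volatile(GPIO_OUT, 1) };
        }
    }

    // With no OS and no std, you must supply the panic handler yourself.
    #[panic_handler]
    fn panic(_info: &PanicInfo) -> ! {
        loop {}
    }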
> I've used Rust on a variety of ARM Cortex-M based microcontrollers including Atmel SAMD21, NXP LPC1830, and STM32F4
That's an encouraging start.
Having been there, and as reported on a weekly basis by people like @internetofshit, much of the code that runs the current IoT hype is utter and complete tripe. I have a faint hope that Rust could help in that regard, even though I recognize that much of the badness in IoT has to do with financial considerations (it's too costly to make a good-quality talking coffee-maker). Maybe precisely because IoT seems to be all about getting cheap stuff out cheaply, I have this hope that a language and development ecosystem that helps devs instead of shooting them in the foot would at least be a good start.
For 8-bit platforms, bytes/sbytes are still 8 bits. Most of them have native instructions that work on 16-bit integers; they just take up a pair of registers. You have access to the same array of integer sizes: plain int is 16 bits, longs are 32 bits, and long longs are still 64. Still, I much prefer using the integer definitions included in C99, where you define the bit size explicitly. uint16_t is a lot more explicit than int, especially if you've got code that's being shared between a few different micros with different word sizes.
On AVR 8 bit microcontrollers, at least: yes. Pointers are 16 bits, and there are a handful of instructions specifically for 16 bit integer and pointer operations (which operate on pairs of 8 bit registers). For everything else, operations are performed by chaining together 8 bit operations. Adding two 32 bit ints for example would need 4 add instructions.
Last time I looked at this, it seemed the number of supported microcontrollers was limited. It usually depends on whether LLVM supports them or not.
I was wondering if the Rust front end could be ported from LLVM to other compiler back ends. That could grow the number of supported microcontrollers quite quickly.
A related but distinct question might be: how does Rust file size scale? Meaning, a Hello World application is X, and X is bigger than in C/C++. But do Rust projects outgrow C/C++ projects as complexity increases? Or is the code generation actually consistent, and all of this simply about the corelib size and nothing more?
Given that Rust monomorphizes generic functions like C++, and given that Rust uses a high-quality C++ backend for code generation, I'd assume that default binary sizes would be comparable. Producing a more scientific comparison would require implementing a large project more-or-less identically in both languages, which is unlikely in the near future (it might be instructive to compare Servo to Gecko, but Servo isn't near complete yet, and even then Servo does many things differently that might influence the comparison).
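As a toy illustration of the monomorphization point (not a measurement): each concrete type a generic is used with gets its own copy of the machine code, just as with C++ templates, so generic-heavy code grows the binary in the same way in both languages.

    // Two instantiations, largest::<i32> and largest::<f64>, both end up in the binary.
    fn largest<T: PartialOrd + Copy>(items: &[T]) -> T {
        let mut max = items[0];
        for &item in &items[1..] {
            if item > max {
                max = item;
            }
        }
        max
    }

    fn main() {
        println!("{}", largest(&[1, 5, 3]));
        println!("{}", largest(&[1.0, 5.0, 3.0]));
    }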
> Just to be cautious: Is it a valid question to ask after all? We have hundreds of gigabytes of storage, if not some terabytes, and people should be using decent ISPs nowadays, so the binary size should not be a concern, right?
$ find /usr/bin -type f | wc -l
2254
$ du -sh /usr/bin
445M /usr/bin
If all these programs were written in Rust and statically compiled, assuming only a 600K difference per binary, that would make my /usr/bin 1.3G (or 300%) larger.
But in reality all those programs are dynamically linked against many libraries in /usr/lib, so the difference would be even bigger, with libraries duplicated between all those programs.
Sure, you can dynamically link with Rust too, but then, you hit the other problem that there is no stable ABI (yet?), and that upgrading Rust (every 6 weeks) means recompiling everything.
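For reference, dynamically linking the Rust standard library is one flag away; a minimal sketch (the caveat is exactly the ABI one above: the resulting binary only runs against the libstd .so shipped with the same compiler build):

    // hello.rs -- build with: rustc -O -C prefer-dynamic hello.rs
    // The binary then links against libstd-<hash>.so instead of embedding it,
    // so upgrading the compiler means rebuilding (or redistributing) everything.
    fn main() {
        println!("Hello, world!");
    }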
I would gladly sacrifice that disk space in exchange for system utilities written in Rust. We're still finding vulnerabilities in core utilities, even after all these decades.
Most recent vulnerabilities in core utilities really don't have a lot to do with memory safety, though - Shellshock & ImageMagick were input-sanitization failures, and other common ones are injection vulnerabilities or authentication weaknesses. Heartbleed excluded, most major vulnerabilities these days aren't related to memory safety.
Sure, but Rust isn't just about dealing with memory safety. The language also lends itself well to solving other common mistakes by virtue of its design and by being built on modern principles. Idiomatic C promotes throwing around pointers/arrays and hoping that the next coder who comes along to consume a struct reads the docs/header and understands how the data in that struct is supposed to be used. Idiomatic Rust uses its type system to strictly enforce how a struct and its data can/should be used. It's a world of difference and results in drastically fewer bugs. Not to mention the rest of the Rust ecosystem works in harmony with the language to further reduce bugs; testing as a first-class citizen of the language and its tooling is one of the big ones.
That isn't to say that you can't do something similar in C, but it is an order of magnitude more challenging to design a "module" in C that is explicit and robust compared to the effort to do the same in Rust. I've coded my fair share of cryptographic systems in both C and Rust. Bulletproof C is just _exhausting_ to code and work with. The same kind of code in Rust is, dare I say, fun to write. It's just a joy to use Rust's type system to enforce rules and invariants, and then codify those rules in the documentation comments above the structs/functions, and then have "cargo test" actually run the code in that documentation automatically to check it for validity.
And yes, as you point out, some of the big bugs lately have been logic bugs resulting not necessarily from poor code but from poor design. Thing is, the less mental capacity a language requires from a coder the more mental capacity that coder has to use for thinking about the application logic. i.e. in C when you get a string you have to think about how to handle the UTF-8 encoding and what to do about path names that somehow ended up with a non UTF-8 character, and whether the string is NULL terminated or pascal, and is memmove (src, dst), or (dst, src)? In Rust, well, that's all handled, so you think about what the string actually means and, hopefully, you'll realize that hey you should probably sanitize that string so it can't be used to gain shell access from an SVG file.
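A small sketch of that style (made-up names, nothing from a real codebase): a newtype whose only constructor validates its input, plus a doc example that `cargo test` compiles and runs as a doctest.

    /// A path component guaranteed not to contain separators or parent references.
    ///
    /// ```
    /// use mycrate::SafeName; // replace `mycrate` with the real crate name
    /// assert!(SafeName::new("notes.txt").is_some());
    /// assert!(SafeName::new("../etc/passwd").is_none());
    /// ```
    pub struct SafeName(String);

    impl SafeName {
        /// The only way to obtain a SafeName; downstream code can rely on the invariant.
        pub fn new(raw: &str) -> Option<SafeName> {
            if raw.is_empty() || raw.contains('/') || raw.contains('\\') || raw.contains("..") {
                None
            } else {
                Some(SafeName(raw.to_string()))
            }
        }

        pub fn as_str(&self) -> &str {
            &self.0
        }
    }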
Heartbleed is not a real memory-safety bug where the program reads beyond allocated memory. It is more a case of improper reuse of a previously allocated buffer, and could exist in safe Rust just as well.
You're right, there isn't a classic simple buffer overrun that Rust would trivially catch, but you're missing two things:
1) The problem was really sending back uninitialised memory. In Rust you can't have uninitialised memory. The oversize allocated buffer would have to have initialisation data passed in (possibly zeroes)
2) You'd never write the Rust code like that anyway. The abstractions available mean that you aren't separately carrying around the contents of some data and a length to pass to allocators.
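A sketch of what that looks like (a hypothetical function, not OpenSSL's actual API): the payload arrives as a slice that carries its own length, so an attacker-supplied length can only cap or reject the reply, never read past the data.

    fn heartbeat_response(payload: &[u8], claimed_len: usize) -> Option<&[u8]> {
        if claimed_len > payload.len() {
            // A mismatched length is a protocol error; and even if this check
            // were forgotten, slicing past the end panics instead of leaking
            // whatever happens to sit next to the buffer.
            return None;
        }
        Some(&payload[..claimed_len])
    }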
Forgetting about Rust for a second. Talking about dynamically linked libs in general.
Dynamic linking was an optimization which came about when memory was expensive. Memory is no longer expensive.
Is 1.3GB (or even 13GB) a lot on your hardware[1]?
According to "Do not prematurely optimize": Pretending we never had dynamic linking, and given today's hardware constraints[2], as a community would we choose to reimplement and universally adopt this optimization?
[1] Keep in mind that a single "modern" application, on average, weighs in the 10s of MBs or GBs.
[2] I'm talking about the general case, for the majority of OS distributions, ignoring the relatively exceptional case of embedded systems, which do in-fact still need it.
Main memory is cheap but slow. Having a frequently called function in a shared library vs. statically linked code could mean the difference of the code executing from CPU's cache or from main memory.
Even latest desktop processors have a L3 cache of only a few MB.
Inlining saves time lost due to jumping about, but it can cost time if it causes code replication (same as loop unrolling), because it can bloat the hot code to larger than the smallest cache.
So the arguments against inlining apply even more strongly when talking about every program being statically linked, the same code (standard library) will exist in memory in many places, and will get dumped and reloaded to L2/L3 every process swap. Nothing slower than having to wait for something to be faulted in.
And sufficiently aggressive inlining will increase the program size further. This might or might not be compensated for by the increase in instruction-pointer locality.
> Having a frequently called function in a shared library vs. statically linked code could mean the difference of the code executing from CPU's cache or from main memory.
I am under the impression that when a process switch happens, the CPU caches are flushed.
No. Only TLBs are flushed (and probably only partially). TLBs are used to associate virtual addresses with physical addresses, and memory maps differ per process.
(That's one reason why it's beneficial to schedule a process on the same CPU if possible - the data is still in the cache)
Dynamic linking offers modularity and separation of concerns.
I don't really care which point-release of zlib my program is linked with, I just want to decompress stuff. If someone finds a bug (or exploit), I am not the best person to quickly realize it and release an update -- the maintainer of zlib, and the packagers, and OS distributions, and sysadmins are in a much better position. But if it's statically linked, then developers have to be involved.
You could say that we could invent a mechanism to allow sysadmins to rebuild with patched libraries, but then we'd still need to reinvent all of the versioning and other headaches of dynamic libraries.
I think dynamic libraries are kind of like microservices. Sure, they can break stuff, but they allow higher degrees of complexity to still be manageable.
Memory is terribly expensive and I have to fight all of the other developers/product folks/upper management for every byte in my environment (hundreds of thousands of servers).
I have no choice but to use DSOs for our Rust code.
Dynamic linking also lets you update libraries due to things like security issues, it's not just a memory thing. Kinda agree on the space thing too (plus much less chance for things like buffer overflows..)
FWIW: I think everything has its place, and everything has tradeoffs. I can definitely see a lot of usefulness for dynamic linking. The point you raise probably being the best current reason.
... but since I'm already playing devil's advocate :)
Dynamic linking also lets you update libraries ... and cause security issues simultaneously across all applications. Increasing the number of possible attack vectors to successfully utilize that vulnerability.
Actually, it's a wash. If all we had was static linking, people would statically link the same common libraries. So you'd have to update multiple binaries for a single vulnerability.
I've seen this in my day job at Pivotal. The buildpacks team in NYC manages both the "rootfs"[0] of Cloud Foundry containers, as well as the buildpacks that run on them.
When a vulnerability in OpenSSL drops, they have to do two things. First, they release a new rootfs with the patched OpenSSL dynamic library. At this point the Ruby, Python, PHP, Staticfile, Binary and Golang buildpacks will be up to date.
Then they have to build and release a new NodeJS buildpack, because NodeJS statically links to OpenSSL.
Buildpacks can be updated independently of the underlying platform. The practical upshot is that anyone who keeps the NodeJS buildpack installed has a higher administrative burden than someone who uses the other buildpacks. The odds that the rootfs update and the NodeJS buildpack are updated out of sync are higher, so security is weakened.
This was a much more powerful reason before things like docker became common, and methodologies adapted to provide updates for docker images, which for this purpose are functionally identical to a static binary.
At least I hope "methodologies adapted", I don't use docker images, so that's an assumption on my part, but I feel it's a fairly safe bet.
Docker images don't have a nice way of updating without "rebuilding everything". There's a tool called zypper-docker that does allow you to update images, but there's no underlying support for rebasing (updating) in Docker. I was working on something like that for a while, but it's non-trivial to make it work properly.
Hmm, I assumed it would be something along the lines of the images being fairly static, and updated as a whole, and you just apply your configs and data, possibly through mount points.
I was responding to the comment that security updates to libraries make it harder to update static binaries. Docker has revived the problem, and there isn't a way of nicely updating images without rebuilding them (which in turn means you have to do a rollout of the new images). While it's not a big deal, it causes some issues that could be avoided.
Yes, but presumably you're running far fewer docker images than you have binaries that would be affected if you statically compiled everything. For example, I assume in a statically compiled system, an update to zlib will likely affect a lot more packages than docker images you are running (on a server I admin, there's 3 binaries in /bin that link to zlib, and 374 binaries in /usr/bin, which will condense down to some smaller, but still likely quite large set of OS packages). It's easier in a dynamically linked system, where you can just replace the library, but it's not that much better for the sysadmin, as if you want to make sure you are running the new code, you need to identify any running programs that have linked to zlib and restart them, as they still have the old code resident in memory.
> "Do not prematurely optimize" is not a software design rule, it's a time-management rule.
No, it's both. Optimization often affects the cost of later decisions, and the reason not to prematurely optimize is that it can easily take you to a local optimum which is not very optimal at all. This is a perfect example of that, as the GP comment points out. If memory were not as constrained as it was when this trade-off became common, it might not have become prevalent. Static binaries are faster (the degree to which depends on a lot of factors), while dynamically linked binaries are smaller on disk and in memory, provided the shared libraries are already used elsewhere. Modern optimizations at the OS level for forking and threading should make consideration of those negligible.
Dynamic linking wasn't invented as a premature optimization though. And if it didn't exist today, it would still not be premature to invent it, because dynamic linking does not only concern how large and fast your program is, but also how it is interacted with.
So here is my point: optimization that can affect the relevant interfaces of your software is not premature because deciding on the interfaces your software exposes is not premature.
You are choosing to focus on the "premature optimization" wording, which is fair, it was said. I'm focusing on "as a community would we choose to reimplement and universally adopt this optimization?" (emphasis mine). I think it would be implemented, I do not know of any evidence that makes me believe it would become universally adopted given modern resources.
I'm not sure exactly what was originally meant. I interpreted it as: dynamic linking is currently the norm, used in every mainstream OS, in most of the applications that run on them, and on all the major mobile platforms. If we had to make the choice right now, without the history of dynamic linking behind us, would we still choose to use it for the majority of platforms?
Dynamic linking certainly has no technical advantages like lower memory/disk usage or faster processing. Its main advantage, which has been cited before, is that it forces cohesion in the Linux community.
E.g if an author of a program finds a problem in the dynamic library he or she is using, the problem is forced to be solved upstream, benefitting all users of the library. If instead static linking was the norm, it is much more likely that the author would just solve the problem for him or herself and the solution would never reach the wider community.
In the best of worlds, we would have static linking everywhere but the "social contract" of dynamic linking would be enforced just as strongly.
You could go the way of busybox or uutils and have a single binary with many hard links. So 'ls', 'wc', 'grep', etc can all point at a single executable which dispatches to different functionality based on argv[0].
Then you can even share code between the binaries, which should make them even smaller.
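Roughly like this (a sketch with stub applets; busybox and uutils do the same dispatch on argv[0]):

    use std::env;
    use std::path::Path;

    fn main() {
        // The name we were invoked as: "ls", "wc", "grep", ... via hard links.
        let argv0 = env::args().next().unwrap_or_default();
        let applet = Path::new(&argv0)
            .file_name()
            .and_then(|n| n.to_str())
            .unwrap_or("")
            .to_string();

        match applet.as_str() {
            "ls" => run_ls(),
            "wc" => run_wc(),
            "grep" => run_grep(),
            other => eprintln!("unknown applet: {}", other),
        }
    }

    // Stubs standing in for the shared implementations.
    fn run_ls() { println!("ls: stub"); }
    fn run_wc() { println!("wc: stub"); }
    fn run_grep() { println!("grep: stub"); }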
> Just to be cautious: Is it a valid question to ask after all? We have hundreds of gigabytes of storage, if not some terabytes, and people should be using decent ISPs nowadays, so the binary size should not be a concern, right?
Phones? IoT? Embedded? OS devs? Someone just checked in a 14KB binary size reduction to our shell (by removing unnecessarily virtual methods in some C++) that was widely celebrated.
5 KB. There are a few unfortunate reasons this isn't even smaller, one of them being a bug report[2] I filed in LLVM.
Also note that Rust is a lot further along than Zig right now. Zig does not have backtraces or threads yet. But I believe that the executable size for hello world in release mode will not contain backtrace code, or threads, or a memory allocator, even when Zig catches up to Rust in terms of std lib functionality.
"C and C++ folks had been fine with that approximation for decades, but in the recent decade they had enough and started to provide an option to enable the link-time optimization (LTO)"
There has been a technique (the 'unity build') that approximates a poor-man's LTO for a long time. Basically you #include all your cpp files in to a giant translation unit and then compile it :)
VC++ first shipped it in Visual Studio .NET (early 2002), and I don't remember it being touted as the first production-quality implementation of the concept, so I assume it was around elsewhere before that.
I dunno about "much" older but I'm sure it was part of the Xbox tools in 2001... they provided some kind of early build of the VS2002 compiler. VS2002 proper was released in 2002 (oddly enough).
Go doesn't pride itself on not having a runtime though. People aren't surprised that there's space taken up by the garbage collector and green thread management and such.
I don't understand your remark. To the contrary, a runtime should decrease binary size. The same PHP script (interpreted, not JIT'd or run in a VM, obviously) will be only a few hundred bytes.
The runtime in this case is compiled into the binary. With php, the runtime is all contained in the php binary that runs your scripts, in Go's case, the runtime is copied into every binary it produces.
Aren't you just assuming the runtime is dynamically available? This is not a necessary feature of a runtime. You could compile PHP and add the size of the interpreter to the executable.
Before I found Rust I used to be a big advocate for Scala. What eventually drove me away from Scala and towards Go and Rust was that my Scala JARs were clocking in at 300 to 400 MB, so I consider the fact that a Rust binary has a pretty fixed overhead of only a few hundred KB a big win.
I recall someone once getting a Squeak Smalltalk image down to 384k. However, there was a Digitalk-originated project called "Firewall" that could produce Smalltalk images as small as 45k, suitable for writing command-line programs, even on early 90's machines. (Even recent versions of VisualWorks can get their memory footprint down to around that of Perl 5, and even beat Perl 5 in terms of startup speed, provided you are prepared to dig around and shut a whole lot of things off.)
That's fairly impressive. Perl 5 is pretty quick to start:
# perl -E 'my $cmd = shift; use Time::HiRes; my @times; for (1..10) { my $start=Time::HiRes::time; my $out = system($cmd); my $stop = Time::HiRes::time; die if $out>>8; my $time = $stop - $start; push @times, $time; printf "%0.4f\n", $time; } @times = sort {$a<=>$b} @times; @times = @times[1..8]; my $cumulative=0; $cumulative += $_ for @times; my $average = $cumulative/8; printf "Average time of 10 runs of \"%s\", dropping best and worst: %0.4f\n", $cmd, $average;' "perl -e '1;'"
0.0039
0.0032
0.0031
0.0031
0.0032
0.0035
0.0036
0.0030
0.0033
0.0032
Average time of 10 runs of "perl -e '1;'", dropping best and worst: 0.0033
That's Perl 5.22.1. For "python -c '1'" (Python 2.7.5) I get 0.0173. A minimal C program (just return success from main after including stdlib and stdio) built with default gcc opts is <9k in size, and vacillates between averaging 0.0007 and 0.0010 in the benchmark above when I run it.
That perl code uses Time::HiRes to measure the time it takes to start perl via system(), which includes the time it takes to fork a shell and parse the command and then fork to spawn perl. `time perl -e '1;'` is more representative of the raw perl startup time, which on my machine is reliably 0.002 real seconds, ⅔ of your times (that is, there's a lot of overhead in your measurement).
A minimal C program (just return success from main after including stdlib and stdio)
You do realize that a "including stdlib and stdio" means nothing for a C program, right? These are, literally, just the API definitions, in the Oracle-vs-Google Android sense. The default gcc options probably produced a dynamic executable; if you compile it statically, you might be able to shave a few more microseconds in startup time.
> includes the time it takes to fork a shell and parse the command and then fork to spawn perl.
The only difference from the shell builtin time, or /usr/bin/time, should be the shell startup and exec call. Every command is fork+exec, so you can't really get away from that. If I really cared, I would try to account for that, but I don't. I thought it was pretty obvious that I wasn't being very rigorous. I just did the minimum to make the values I got not useless.
> which on my machine is reliably 0.002 real seconds, ⅔ of your times
And on mine it fluctuates between 0.002 and 0.009. You mention your times relative to my times, but that's useless. I ran mine on a 512mb VPS. The only relevant measure to determine overhead would be my method vs the shell builtin. Is that what you were actually referring to?
> You do realize that a "including stdlib and stdio" means nothing for a C program, right?
Truthfully, so I wouldn't have to remember the setup for C, I just googled "minimal C program" and removed the printf it had. I only mentioned the includes for completeness sake, and I didn't want to clutter the comment with the source.
> if you compile it statically, you might be able to shave a few more microseconds in startup time
Possibly, but that's not really what I was trying to convey. I was just pointing out that a minimal Perl program isn't slower than a minimal C program, and it's notable if you can get your startup times close to that for a system with a virtual machine.
Err "minimal Perl program isn't slower than a minimal C program" was supposed to be "minimal Perl program isn't that much slower than a minimal C program".
I really like executable shrinking posts. However, isn't it the case that the size of the executable won't increase significantly if you use --release to distribute bigger programs? After all the size comes from the library and memory allocator being included in the executable. As long as the libraries are not heavy, the executable should stay reasonably small.
That matches my (amateur) understanding of how it works. The executable would increase by small amounts because presumably your own code is a relatively small portion of all the code included in the default binary. Then again, since it's not dynamically linking, every crate you use increases the size...
Another example of how languages by themselves don't matter: developing a good enough language, with all the nice ecosystem that goes with it, is a massive amount of work that can only be accomplished by OS vendors or by years of dedicated OSS contributors.
At this point, I think most language developers should turn to LLVM to spread their languages more rapidly, rather than trying to mess around with all the realities of language making on their own, especially when it involves statically typed and compiled languages.
*BSD and GNU/Linux ecosystems are a bit different in this regard, but on other OSes, systems programming languages that aren't part of the OS vendor's offerings usually have a hard time getting wide adoption and support.
How much of that initial 650k is constant in size? Would we expect it to increase with programs of substantial complexity, or is the overhead relatively constant?
The cause of this has always been known (within the Rust community), but it is not considered a problem. For most users, the advantages of static linking and using jemalloc significantly outweigh the cost of a 0.5mb constant overhead on binary size. Those users under different circumstances can configure their build using the tactics in this article to make a different trade off more suited to their circumstances.
I don't understand the article's point about ISP speeds. Really if you're on a slow enough ISP that a <1MB file is too large for someone to download, then practically speaking they won't be able to download anything smaller either.
Because it's not that single ~1MB file that's a problem, it's the cumulative effect. There are people on dialup and GPRS and even people who pay by the megabyte.