Why can't we all use standard libraries for commonly needed algorithms? (acm.org)
108 points by snth on Feb 23, 2011 | 46 comments



Counterexample: zlib.

Everyone uses zlib. It's hard to find a binary it's not linked into. I doubt anyone reimplements it -- what's the point? Maybe some people copy it into their source tree, but what's wrong with that?

The key? Make sure people have heard of it, and make it as unobtrusive as possible. The author's "libmd" fails at the former, and PGP, GnuPG, and OpenSSL fail at the latter. There is no good reason not to Just Use Zlib. Quoting from their website: "A Massively Spiffy Yet Delicately Unobtrusive Compression Library."
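A minimal sketch of how small that surface area is, using zlib's real one-shot calls (compress2/uncompress; error handling elided):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char *msg = "hello hello hello hello";
        unsigned char packed[128], unpacked[128];
        uLongf plen = sizeof packed, ulen = sizeof unpacked;

        /* one call to compress, one to decompress */
        compress2(packed, &plen, (const Bytef *)msg, strlen(msg) + 1,
                  Z_BEST_SPEED);
        uncompress(unpacked, &ulen, packed, plen);
        printf("%s\n", unpacked);
        return 0;
    }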

zlib is my inspiration.


We tried to use something other than zlib once. The application needed to make a copy of a CD-ROM and compress the copy. This was back in the day when a couple gigs was a big hard drive, and 200 MHz was a fast CPU. We first tried zlib but it wasn't fast enough for the client. We looked around, and found another library, written by a graduate student, that was a lot faster. It didn't compress as well, but it was good enough for the client, given the speed. The only problem was it was under GPL, and no way was the client going to open source the application.

We contacted the grad student, worked out a tentative deal to license the library for around $2k, and started hammering out the details. Then he got all paranoid about Microsoft. Most of our products were Windows products--so what if Microsoft bought us? Would they end up with a Microsoft-wide site license for his library? He wanted to put into the contract terms that would terminate the license if our company changed ownership. Those kinds of terms are tricky. Would we lose the license if we went public? What if one of the two major owners (who together owned about 99% of the stock) died and the stock went to their kids?

The CEO (who was one of those two owners) didn't have time to deal with that, said screw it, and told me to find something else.

So I went back to zlib, and put a slider in the ripper settings to control speed vs. compression. The implementation was to simply skip compressing some of the blocks when ripping. The slider controlled what percentage of blocks were compressed.
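In rough outline, the idea was something like this (a sketch, not the shipped code; the helper names and the cycle of 100 are made up):

    #include <stddef.h>

    /* Hypothetical stand-ins for the ripper's real output routines. */
    void write_compressed_block(const void *blk, size_t len); /* zlib deflate */
    void write_raw_block(const void *blk, size_t len);        /* stored as-is */

    /* percent is the slider value, 0..100: compress that share of the
       blocks and store the rest uncompressed. */
    void rip(const unsigned char *data, size_t nblocks, size_t blksz,
             int percent)
    {
        for (size_t i = 0; i < nblocks; i++) {
            const unsigned char *blk = data + i * blksz;
            if ((int)(i % 100) < percent)
                write_compressed_block(blk, blksz);
            else
                write_raw_block(blk, blksz);
        }
    }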

PS: ~12 years later, we're still owned by the same owner. Microsoft has not ever tried to buy us.


"So I went back to zlib, and put a slider in the ripper settings to control speed vs. compression. The implementation was to simply skip compressing some of the blocks when ripping. The slider controlled what percentage of blocks were compressed."

How did that resolve your inability to license zlib?


zlib didn't have licensing issues, it just wasn't fast enough


Was LZO available at the time?


I don't recall. It was about 12 years ago so details are fading.


If there's a usable "libgpg", I'm unaware of it; the closest thing I know of to it (in C) is GPGME, which actually forks and execs GPG.

It would be incredibly useful to have such a Zlib-caliber GPG library.


Yes, as I recall the GPG developers unfortunately think it would be a bad idea, security-wise: http://www.gnupg.org/faq/GnuPG-FAQ.html#cant-we-have-a-gpg-l...


This attitude drives me all sorts of nuts. Everybody with any familiarity with PGP that has ever thought about integrating it as a tool in a broader system always gets hung up on the notion that they don't want to invest their personal public keys in some library. "Give my public keys to the browser? Crazy!"

These people don't get it. The value of libgpg isn't that it makes it easier to build apps on your personal public keys. It's that it allows you to build applications that use crypto without reinventing an entire crypto stack. The products that used libgpg would be generating their own keys, often ephemerally.

This attitude has resulted in untold hundreds of shipping software products with trivially exploitable crypto flaws that were resolved by PGP's designers in the 1990s. It has been a massive net loss for information security.


Reminds me a bit of SQLite in those respects.

Why come up with your own binary file structure? Just use SQLite!
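A quick sketch of what that looks like with SQLite's actual C API (error handling elided; the table layout here is just an example):

    #include <sqlite3.h>

    /* The .db file on disk *is* the application's file format. */
    int save_document(const char *path)
    {
        sqlite3 *db;
        sqlite3_open(path, &db);
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS doc(key TEXT PRIMARY KEY, val BLOB);"
            "INSERT OR REPLACE INTO doc VALUES('title', 'example');",
            0, 0, 0);
        return sqlite3_close(db);
    }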


Also worth mentioning is BSDDB. Sometimes you don't want to bother with SQL and BSDDB is always there for you :).


PhysicsFS is pretty excellent if you need more of a traditional hierarchy layout.

http://icculus.org/physfs/


It's not just the quality of the library; compressing data isn't likely to be tightly coupled to your internal data structures and program logic.


What's wrong with copying zlib around to every project? Well, have you forgotten the 2002 zlib vulnerability, when hundreds of copies of the zlib source had to be fixed up? I believe there were half a dozen separate copies of the zlib source code in the Linux kernel alone (e.g. ipsec, ppp). Tools had to be developed to scan the binaries on your system to determine whether they stealthily used zlib for something and thus were potentially exploitable.


Linux developer Rusty Russell wrote a helpful article about designing C libraries to be easily configured and reused:

"On C Library Implementation": http://rusty.ozlabs.org/?p=140

For such an important subject, there is little practical discussion or good examples of library design. Compare the number of books about C++ object-oriented design versus the design and packaging of C/C++ libraries.

The best reference I know is John Lakos's excellent (but ancient: 1996!) book on the subject: "Large-Scale C++ Design." Here are some excerpts from the book.

http://blog.codeimproved.net/2009/03/the-large-scale-c-softw...

http://www.cs.unc.edu/~stotts/COMP204/lakos/


PHK picked a really bad example with secure hash functions.

All of these algorithms are so simple, and have such a simple interface, that copy-pasting the C code is actually easier than linking a library.
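For example, the common RFC 1321-style reference code boils down to three calls around a small context struct (a sketch; exact names vary between implementations):

    #include "md5.h"  /* the md5.{c,h} you copy-pasted into your tree */

    void hash_buffer(const unsigned char *buf, unsigned int len,
                     unsigned char digest[16])
    {
        MD5_CTX ctx;
        MD5Init(&ctx);
        MD5Update(&ctx, (unsigned char *)buf, len);
        MD5Final(digest, &ctx);
    }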

And in their most common use case, they're coupled with a bunch of other crypto code which already exists in one of several large crypto libraries (like cryptlib and openssl). So most of the critical mass of "roll this into a library" effort goes into those libraries. Meanwhile, the people who just want hashing don't want to link all of OpenSSL.


Incidentally, I really want to use the shared memory logger from Varnish in my own app, but it seems to be quite embedded in Varnish. (It is not tightly coupled code-wise, but it is build-wise.) Anyway, although it's commonly needed, it's not a library... and Varnish is the project of this article's author! :)

(I'm not complaining, I'm going to try and library-ize this code very soon. But it's odd that someone that feels the need to write about how people should do this hasn't done it in his own project yet.)


Interesting, indeed.

> But it's odd that someone that feels the need to write about how people should do this hasn't done it in his own project yet.

I'm afraid it's not that odd at all. When I hear people talk about code reuse these days it means something about how they design things, not about whether they actually reuse the code or make it available for reuse by someone else. This detachment from actual reuse has also allowed some drift in the design principles, so that "reusable code" may not be. Sigh.


People often point to monoculture as a security problem: "Clones are vulnerable. Computers need sex!" I'm not endorsing this argument, but it's curious how neither side seems to mention the other when they're making their points.

I once reimplemented much of SNMP because the standard library for it seemed too crufty; a year or two later a whole lot of programs were hit by the announcement of vulnerabilities in that library. (Unfortunately the employer who was going to open-source my code never got around to it.)


I remember when Communications of ACM used to have computer science articles.

Who says "one of the bedrock ideas of good software engineering is reuse of code libraries holding easily accessible implementations of common algorithms and facilities" is really the case? One of the better pieces of software known (qmail) avoids libraries because of unknown side effects.

Software reuse, I think, is the desire of a certain kind of management and not so much computer scientists or engineers. How much reuse is in PAIP or TAOCP?


This is ACM Queue rather than CACM, which is more software-engineering/practitioner oriented rather than CS-oriented.

On the other hand, CACM isn't really that CS-oriented anymore, either: http://cacm.acm.org/magazines/2011/3. Partly it's because "CS" is so big a field that it's not clear what a technical yet completely general CS journal would look like. So it writes more about computing as a field, with a few articles on emerging areas thrown in, while the technical articles go to more specialized venues (like a PL, systems, or AI journal).


To be fair, Bill, qmail doesn't avoid libraries; it avoids libc. Much of qmail is structured reusably, and, indeed, the core libc-replacement libraries are reused in djbdns as well.


Knuth has argued specifically against reuse in the past. He prefers "re-editable code".

That being said, the sorts of problems concerning the Knuths of this world are decidedly not the sorts of problems I find myself facing every day. I'm happy enough not hacking an HTTP client into shape for every project that needs one, for instance.


"overwhelmingly this is just pointless copy-and-paste of identical source code in blatant disregard of Occam's three-quarters-millennia-old advice."

So isn't this analogous to asking why there are so many implementations of the quadratic formula in various books? I think the problem is real, but this doesn't seem to be the proof of it.

If programming allowed us to use equational reasoning then such cutting and pasting wouldn't be considered a problem at all. We would simply recognize the formulas, perhaps with the aid of computer. Unfortunately we tossed away that option the first time someone designed languages that allowed statements like "x = x + 1". Maybe it's time to rethink that decision.


Hmm. An interesting observation, but is mutation really the problem here? I'm not sure it is... we are able to transform any program with mutation into an equivalent program where each variable is assigned once upon allocation. (If you don't already know what it is, this transformation is called static single assignment.)
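A tiny example of the transformation:

    /* with mutation */
    int x = 1;
    x = x + 1;
    x = x * 2;

    /* the same code after SSA: every variable is assigned exactly once */
    int x0 = 1;
    int x1 = x0 + 1;
    int x2 = x1 * 2;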

Anyway, it's really hard (undecidable in the general case) to compare two programs and determine whether they do the same thing. Is that what you're proposing?

And even if you could compare programs like that reliably, it seems impractical to apply it to the giant body of today's source code, but maybe you have an idea for doing that?


In Haskell I frequently find myself searching libraries for functions (algorithms) based on the input and output type. The list is often short. Also there is some literature on deriving algorithms simply by specifying such types.

If we had a standard algebra for programs, written in some relative of Haskell or an EDSL thereof, then searching for algorithms that mapped between certain types could be made easier. True, the case of proving two versions equal would be undecidable sometimes, but not always. In that case it could time out.

I don't believe this would work for imperative languages because the semantics of side effects are often not available, but I could be wrong, or they could be restricted enough so that an SSA specification could do what you say. Bottom line: types would help, purity or the equivalent would help.

Until then please excuse me while I reinvent the wheel...


I have always wondered this. Even algorithms commonly discussed in computer science books are not in the C or C++ standard libraries. (PS: This discussion is not about the negatives of C++.) I once emailed Dr. Bjarne Stroustrup about this; more precisely, about the non-existence of even simple tree algorithms in the C++ standard library. He just pointed me to the implementation of some sorted list that uses a red/black tree "internally".


Can you explain in more detail what you were looking for?

> I once emailed Dr. Bjarne Stroustrup about this; more precisely, about the non-existence of even simple tree algorithms in the C++ standard library. He just pointed me to the implementation of some sorted list that uses a red/black tree "internally".

What tree algorithms did you need to use?

Also, I suspect "some sorted list" was std::set or std::map.


I think you are right about std::set or std::map.

I was looking for standard tree algorithms like AVL trees, splay trees, etc. and graph algorithms like shortest path, minimum spanning trees, etc.


> I was looking for standard tree algorithms like AVL trees

What did you need to use AVL trees for that std::set/std::map could not provide?


If you want the same algorithm to work everywhere, you can't write the library in C. It will have to be written in some intermediate language with generators for all common scripting languages. The closest thing I know of that's widely used is the generators for Protocol Buffers. We need something like Protocol Buffers for algorithms.

Ben Laurie has started on this: http://www.links.org/?p=864


Reading his pseudo code with its explicit size information ("pad_byte = pad_byte plus32 1;") made me think, of all things, of LLVM's SSA assembler syntax ("%pad = add i32 %pad_byte, 1"). :) Not surprising of course; making the type visible in each operation is typically necessary in assembly.


If only standard libraries were always well suited, and there were many "variants" with different language bindings.


phk. definitely one of my heroes. big time contributor to FreeBSD, but most importantly, inventor of "THE BEER-WARE LICENSE".


Weak article. I thought he would talk about some more examples. And then talk about the difficulties in learning a library vs. rolling your own. Easy-to-copy code (MD5) was copied. Well duh.

Lots of people have talked here about the difficulty in reusing OpenSSL. I once had the distinct displeasure of trying to reuse ffmpeg as a library.

In addition, why must every language have its own standard library? Is it possible to have a source code format for standard libraries which can be translated into other languages?

There's a good article waiting to be written about this topic, but this is not it.


I have to disagree. For its length and scope it summed up the current problem nicely and clearly pointed out that the thinking that got us into this mess won't get us out of it, to paraphrase Einstein.

I don't think the intent of the article was to answer the question of what to do, but rather to point out its currently intractable nature. Furthermore, he points out that he's even "done something about it" by writing a catch-all library, but that it hasn't had the anticipated or desired effect.

Clearly, something else is afoot here. Where next?


> In addition, why must every language have its own standard library? Is it possible to have a source code format for standard libraries which can be translated into other languages?

It exists. It's called C.

You wouldn't want a standard library of standard widgets because different languages have different approaches to problems. What's suitable for Lisp isn't suitable for C++, isn't suitable for Prolog, isn't suitable for awk.


A lot of languages simply take parts of the C standard library and make it their own. For example, hardly anyone re-implements system signals or process forking--they just re-use the C library, warts and all.


What problems did you have with FFmpeg libraries?

Assuming you mean the libraries that come with FFmpeg, and not turning ffmpeg.c into a library.


Wasn't intuitive, undocumented, had to jump through lots of setup even for simple things (no smart defaults), and I had to manage the memory for all buffers, including temp buffers, myself. It almost seems like ffmpeg allocates no memory on its own. Sorry I can't be more specific, it was a while back.


In a language that left the string library as an exercise for the student?

Where you can't even link libraries built with different versions of the same compiler on some platforms.


To be fair, Unicode didn't exist until the 1990s. C was originally developed in the early 70s.

The C language and standard library are meant to be minimal. They assume very little; they barely assume that Unix/POSIX exist. string.h has to work on server, desktop, console, and mobile CPUs as well as microcontrollers. And this is the minimal set of functions that will work almost everywhere: http://en.wikipedia.org/wiki/String.h


C is outdated, even by systems programming language standards. Sorry, but it's true.

There's still room in the standard for one's complement; it denies all knowledge of multi-threaded programming; it has a bunch of impossible-to-use-safely functions in its standard library; etc. (This is just a tiny sample; there's so much else wrong that you could almost write an entire book about it.)

In my opinion, they should start by removing the substantial cruft. That would leave plenty of room for some sorely needed additions/modernizations.

In the name of backwards compatibility, they'll remove too little, they'll add too much unnecessary stuff, and eventually, another systems programming language will catch on and start to slowly-but-surely replace C.


> and eventually, another systems programming language will catch on and start to slowly-but-surely replace C.

My personal impression from Linus Torvalds' posts (manifestos :) over the years is that it won't happen while he's conscious ...

To me it looks like there is a Catch-22: whenever a language fixes a "deficiency" of C, it makes the language unsuitable for systems programming. It is similar to the multiplication table: 99% of people can rely on calculators (very trustworthy, much safer, no side effects on the brain tissue :), except for the people who design the calculators.


C isn't outdated. The fact that nothing has fully replaced it is a testament to its continued practicality.

As for the impossible-to-use-safely functions, use only string manipulation functions that have an n in their name.
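For example (the classic unbounded call next to its bounded counterpart):

    #include <stdio.h>

    void make_label(char *out, size_t outsz, const char *host, int id)
    {
        /* sprintf(out, "%s-%d", host, id);      -- unbounded, can overflow */
        snprintf(out, outsz, "%s-%d", host, id); /* bounded, NUL-terminated */
    }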

Finally, compared to the size of most standard libraries, I wouldn't say the cruft that exists is "substantial." Everything the libc package installs in /lib is less than 3MB on ARM, and that includes crypto, math, networking, and a bunch of other code. The cruft makes up a rather small fraction of that.


I was thinking of C++. C is very simple to make reusable libraries for; C++ is a nightmare. Even 10 years after we pretty much agreed on what language features we could use, it's still a pain to reuse any significant-sized library.



