An unexpected benefit of open-sourcing our code

dekhn · on July 9, 2015

I'm glad you're contributing to genomics technology, but I feel compelled to point out that zlib is a terrible algorithm for compressing BAM files. The reason is that over 50% of the compressed BAM file space is spent on encoding the quality scores at full resolution, which is just wasted effort. Quality scores are very hard to compress- they are very close to completely random data- and the best compressed use complex probabilistic models to switch the encoding technology depending on what category quality scores fall into.

chuckcode · on July 9, 2015

I'm all for a replacement for BAM file format and the quality scores. Ideally something that supports delta based encoding to a reference genome similar to CRAM and something that more compactly represents quality of bases sequenced.

In the meantime as a user of BAM I'm very very very grateful for faster zlib as it makes life a lot better with the current large installed base of programs that use BAM and data sets in BAM. These sort of improvements really do make a meaningful difference in the same way that switching to SSDs from magnetic disks didn't improve the implementation of my software but makes my computer a lot faster to boot up and compile things.

minimax · on July 9, 2015

The CloudFlare x86-64 optimized version of the longest match algorithm is especially well documented.

https://github.com/cloudflare/zlib/blob/31043308c3d3edfb487d...

It's great material for anyone interested in these types of low level optimizations.

Matt3o12_ · on July 9, 2015

Thanks for the link, although I understand only half of what it is.

Do you know what the file extension inc stands for?

munificent · on July 9, 2015

"include"

It's a fairly common extension for a file containing C implementation code (and not just headers) that is intended to be #included by some other file.

lfowles · on July 10, 2015

You know what, I was trying to find an example for Matt3o12_ where it was #included, but I can't even find the file at the tip of any of the branches. Can a better git archaeologist explain what happened to it? Did it get merged into another file?

Edit:

Here's an example of a .inc file being used

https://github.com/cloudflare/zlib/blob/31043308c3d3edfb487d...

Double Edit:

git log --all -p contrib/amd64/longest-match.inc

It was merged directly into deflate.c

astazangasta · on July 9, 2015

Can you argue that there is ever a social (not private) benefit to NOT open sourcing your code?

Mahn · on July 9, 2015

There is one case where the benefit would be evident: if the code in question is another Javascript framework.

drzaiusapelord · on July 9, 2015

A lot of serious security advisories affecting closed source programs often don't become exploits in the wild due to the nature of closed source and the impied secrecy involved.

For example, IE had a major, major issue regarding SChannel that more or less trivially gave attackers an ability to run arbitary code at either admin or SYSTEM level. It was scary. It was reported privately and then patched. Never in this process did anyone have the source code to analyze and publish an early PoC like they did shellshock and heartbleed. When the patch was released, it was a binary, so no one could just compare the old code to the new and figure out exactly what the problem was and launch an attack. Sure, they could analyze the binary, but that gives limited and often unusable results. Or at the very least puts up enough barriers to buy time for patch installs.

Its funny, years ago we used to worry about our Windows servers, now only worry about our linux servers. FOSS's transparency is ugly when it comes to exploits because they go from discovered to in the wild very, very quickly. Even when they don't, once the patch is released, the hackers have the exploit instantly, and that means if your organization can't patch for a couple hours, you're screwed. The recent Drupal exploit is a good example of this. It went from published to bots hacking Drupal installs within seven hours. Millions of sites were affected.

yaps8 · on July 9, 2015

"it was a binary, so no one could just compare the old code to the new and figure out exactly what the problem was and launch an attack"

Sorry but this is plain false, people doing vulnerability research for closed source software do compare the binaries to understand the patch.

rudolf0 · on July 9, 2015

You're right, but there's no question that being somewhat vague with the patch details and producing only a changed binary means attackers will be slowed down a bit from producing a functional exploit. Even if you're only buying yourself maybe ~6-24 hours, that might mean N extra hours for millions of machines to patch.

moviuro · on July 9, 2015

This is the debate about Full/Responsible/Non-Disclosure. With Full Disclosure, users, admins and pirates alike get the same information at the same time, meaning that you, user might be able to protect yourself (add a line to your WAF filter, add a block on your FW, etc.).

On the other hand, I note that proprietary software is flawed with tons of 0 day (I'm thinking about Flash lately), whereas the self-proclaimed most security-oriented open-source projects only have a tiny number of unsafe code (I'm thinking about OpenBSD "Only 2 remote holes in the default install, in a heck of a long time!")

drzaiusapelord · on July 9, 2015

>Only 2 remote holes in the default install

Except no one uses the default install and these types of claims just incentivize making the default as sparse as possible. Things change when you deploy your stack, use ssl, etc, etc.

spacemanmatt · on July 9, 2015

You're actually making the argument that Windows is now secure and Linux isn't, because Linux gets attention around security events? I'm afraid that seems outlandish to me.

EGreg · on July 9, 2015

I think he is talking about smaller and less mature projects than Linux or BSD variants. Say you're a small developer who's spent years on a new platform that will power your social network with millions of people. You might want to keep it closed source for a few years and only release the source code to a select number of people until you've let security researchers take cracks a it. Once it's had a few years of battle testing, you can release it. As opposed to releasing the source code to your social network and making it 100x easier for anyone to come up with an exploit against your network.

In fact I'd say it might be better that a relatively new app (especially ANY APP that powers servers) to remain closed source until given the green light by security researchers. And even then...

Imagine if Facebook open sourced the code running their social network? I guess the question could equally be ... is there ever a social good from centralizing your social network and not letting it be distributed across all machines in the world?

I would say security.

rudolf0 · on July 9, 2015

I don't think Facebook's business would be at risk at all if they open sourced most of the backend code. As long they kept things like spam/malware detections closed. Facebook's main value proposition is the existing network. Another network could possibly overtake them some day, but they're definitely not going to be doing it with Facebook's source; not even a heavy fork of that source.

EGreg · on July 10, 2015

Right. But imagine if Zuck opensourced the exact facebook source from day 1 or day 100. It would have been hacked a long time ago to the point of collapse, and who would that have benefited?

mkingston · on July 9, 2015

The answer depends somewhat on the license. However, for the purpose of my answer I shall assume a product or service for which releasing the source code would destroy an ability to generate a return on investment. (Of course this isn't always true, however I suspect it's not always false, either).

If the only way to develop your product or service is to generate a return on investment, and society will benefit as a result of the provision of said product/service. I think in many cases it's safe to say that society has benefitted if you generate a return on your investment.

Having said this, in principle I would advocate for an enthusiastic open-sourcing strategy once the societal benefit has been realised. Practically, it's a bit more complicated than that. For example: are you able to continue to generate a greater societal benefit by maintaining a monopoly on your source code? (Tough to answer, I'm sure).

btown · on July 9, 2015

If your industry is run by unethical people who would use your code, if developed in the open, in a way that would put your firm out of business and create orders-of-magnitude-fewer jobs than your firm would IF it kept that code proprietary and used it as a way to leapfrog the status quo... then I think you have a very good argument for your company keeping its core code under wraps.

And even once you're at the scale where you can compete head-to-head with those people, you still might want to keep things proprietary so that you can encourage ethical and aligned behavior across the industry. Because the stakes are much higher than falling asleep happy because you maintained the "purity" of the open-source-software movement.

You'd still share your changes to zlib and other non-core components, though. Because of articles exactly like this. But none of the secret sauce.

As a side note, it is not at all my intent to imply that any of this refers to the finance industry. Not at all. Nope. (But if this piques your interest, send me a message at the email in my profile.)

s73v3r · on July 9, 2015

"If your industry is run by unethical people who would use your code, if developed in the open, in a way that would put your firm out of business"

Which is entirely the case in software.

jahewson · on July 9, 2015

Yes, if your product provides a social benefit but would not be able to exist if its source code were free. Google's ranking algorithms would seem like a good example of this.

kwhitefoot · on July 9, 2015

What is your reasoning for this?

nathanlied · on July 9, 2015

If the algorithm's nooks and crannies were fully known, website owners could manipulate it to display their results at the top, which may not be in the web's best interest.

ManFromUranus · on July 9, 2015

I'm not OP but my guess is that without the revenue generated by the proprietary algorithm, there wouldn't be enough money or incentive to keep improving it.

With an open source google algorithm there would be possibly 1000's of lesser google's (lesser as in inferior search) the lesser googles might even get eclipsed by Bing or some other proprietary search which has the resources to improve it's search.

jeffdavis · on July 9, 2015

I think it's a myth that you can flip a switch and open source your code. Technically, you can, but in many cases it would be no better than closed anyway. Proprietary code often has a lot of dependencies and it can be hard to understand without being devoted full-time. Just building it often requires a fairly complex procedure with its own dependencies (often on specific versions of libraries and compilers).

So throwing the code up on github isn't really "open sourcing" it.

The question then becomes: is it a net social benefit to spend the large effort to truly open source a particular code base? And the answer is that it depends.

baudehlo · on July 10, 2015

Absolutely. I contributed heavily to the code that ran an anti-spam DNS blocklist (that is used by most of the world). Open sourcing that code would have given away how we populated that blacklist, which would have allowed spammers to bypass it completely. After 6 years the technology and concept is still working strong.

gus_massa · on July 9, 2015

I've never seen an open source AAA game, perhaps it's impossible go get enough money to pay the developers and designers of a AAA game with an open source project.

The social benefit is that some members of the society can get fun from a AAA game if they want.

maccard · on July 9, 2015

There's a few. ID software[0] open source their games, but require the assets to actually build and run the game (which you obtain by buying the game). Epic [1] have open sourced Unreal Tournament (note I'm not sure on the licensing, just that the code is available). Natural Selection [2] source code was released

Star wars Jedi Knight [3] had the source released.

Note that all the AAA games that I'm aware of that have been open sourced are older games, I don't know of any in development AAA games that are open source.

[0] https://github.com/id-Software [1] https://github.com/EpicGames/ [2] http://unknownworlds.com/ns2/natural-selection-source-code-r... [3] https://github.com/grayj/Jedi-Academy

angersock · on July 9, 2015

ID, at least while Carmack was there, was very very good about releasing their last-gen engine tech once all licensees had published their games. A whole lot of people learned about how to do game development from those things.

Unknown Worlds only released the source code for the NS1 mod for Half-Life--and as Valve has yet to release the Gold Src source code, that's of questionable utility.

Mahn · on July 9, 2015

See also: http://www.codersnotes.com/notes/the-man-behind-the-curtain and https://news.ycombinator.com/item?id=9738215

gurkendoktor · on July 9, 2015

First, responsible disclosure.

And second, I think (but cannot prove) that people need time to adapt to technological progress, which is also driven by software. For example, we cannot figure out how to integrate Google Glass into our social norms over night. And we needed time to respond to the Snowden revelations. Who knows how we will handle drones and the IoT? Accelerating progress has upsides and downsides.

(I feel this is more about hardware than software, though. More efficient Big Data might be scary?)

wang_li · on July 10, 2015

>and the IoT?

Just don't connect your things to the internet. Problem solved.

blfr · on July 9, 2015

If there is a private benefit that can be realized then it will incentivize people to write more code, making more tools and solutions available to everyone, even if for a fee.

pjc50 · on July 9, 2015

A sufficiently bad FOSS implementation can drive out a better paid-for solution, especially if the latter is from a small company.

(I can't immediately name an example of this)

jahewson · on July 9, 2015

A sufficient bad FOSS implementation can also prevent a better FOSS implementation from ever being built, e.g. Xpdf/Poppler.

TD-Linux · on July 10, 2015

It didn't prevent PDF.js from being built (though maybe it's not "better")

Though I haven't had problems with poppler, so I don't quite understand what is bad about it.

wnoise · on July 9, 2015

See also OpenSSL.

joeyo · on July 10, 2015

For what it's worth, MuPDF is quite good.

lmm · on July 9, 2015

I've heard server licensing suggested as a reason the web displaced Gopher.

I remember being amazed by AOLserver which I was surprised to discover is now open-source; possibly by that point it was too late though? My impression was that everyone used Apache because it was FOSS, but that could be wrong.

JoshTriplett · on July 9, 2015

"better" is relative; if it isn't FOSS, then many projects won't be able to use it.

kwhitefoot · on July 9, 2015

Surely it would have to be sufficiently good.

scrrr · on July 9, 2015

How about software that could break into military, bank, etc. systems and manipulate them?

Mahn · on July 9, 2015

You could argue it's beneficial to open source and make it public since it would raise awareness and probably get the root problem fixed rather quickly, as opposed to keeping it to yourself.

jonathanpoulter · on July 9, 2015

That would probably be a longer term social benefit, increased safety. Probably a short term social catastrophe.

aurelianito · on July 10, 2015

Do you mean like metasploit ?http://www.metasploit.com

higherpurpose · on July 9, 2015

So you'd rather not know how that software works, while criminal organizations get to use it anyway?

gonzo41 · on July 9, 2015

Manage the risk. Lots of people knowing = many new threats. keep that stuff a secret or hard to access then only a certain class of threat will present itself. still not great but better.

logn · on July 9, 2015

Things like backoffice code which are very specific to one business. Or something like the Javascript that ties together the widgets of a GUI. These types of things are basically only useful in exactly the context they're currently being used. When your potential user base are people who'd like to run an exact clone of your business, open sourcing is not a good idea (EDIT: and it's potentially a social benefit to maintain your business's existence and unique identity). I guess you could step back and question the utility of your business altogether, but that's a separate consideration.

nugga · on July 9, 2015

The way the question is posed sounds a little entitled. There is no right for anyone to have anyone's code until it's granted.

glomph · on July 9, 2015

I didn't read it like that. The point was only that from an external perspective the act of open sourcing doesn't have any cost to be given code.

fastball · on July 9, 2015

Who said anything about a "right" to have code?

mrob · on July 9, 2015

Demoscene demos. A big part of the fun is the difficulty in coding them and if source is available for everything it will be too easy.

mikekchar · on July 10, 2015

I'm going to have to disagree here. If source is available, it simply raises the bar.

carapace · on July 9, 2015

If you discovered that, indeed, P = NP?

In any event how would you measure "social benefit"?

pdknsk · on July 9, 2015

> Both formats are supported by the absolute majority of web browsers [...].

Unfortunately this isn't true. IE still does not support deflate as specified in the RFC, with a zlib header. It only supports raw deflate.

https://www.ietf.org/rfc/rfc2616.txt

  deflate
      The "zlib" format defined in RFC 1950 [31] in combination with
      the "deflate" compression mechanism described in RFC 1951 [29].

This has lead to most browsers also accepting raw deflate, to be reverse compatible.

cwp · on July 9, 2015

He didn't say "all browsers", he said the majority. It's been a long time since the majority of web traffic came from IE.

rurban · on July 9, 2015

I have to applaude CloudFlare for their zlib version. It really is the best.

Some versions on my system even use the SW crc32 variant, macports on an i7: 200x slower

j_s · on July 9, 2015

The previous discussion of another zlib alternative last month mentioned unexpected issues incorporating them into other software (that reaches into internals), as well as the potential for licensing issues.

https://news.ycombinator.com/item?id=9664655

nadams · on July 9, 2015

I was actually just wondering if they made it better - why not create a patch and push it upstream?

Mojah · on July 9, 2015

I've experienced something similar, although on a much smaller scale.

A few years back, I got an e-mail that one of my open source tools was being used as a monitoring solution on a hospital ship. I felt pretty damn proud too, not having imagined the impact of open sourcing a project could have.

I wrote a quick blogpost on the matter back then; https://ma.ttias.be/this-is-what-open-source-is-about-mobile...

halosghost · on July 9, 2015

> -O3

yikes. I didn't realize anyone actually used -O3 in-production. Isn't that widely considered to be a BadIdea™?

majke · on July 9, 2015

I use it all the time. I've encountered more bugs with -O0 and lack of -O at all, than with explicit -O3. My experience is with gcc.

lfowles · on July 10, 2015

Funny, building -O3 actually flushed out some segfaults due to timing... everything was running too damn fast for my assumptions to hold!

It looks like Debian policy is to build at -O2 with exceptions for building at -O0 or -O3

halosghost · on July 10, 2015

I've /never/ seen -O3 used in the wild (nor -O0 actually). I've essentially only seen -O2 and -Os actually being used in-production. Perhaps that has just been luck of the draw…

rollo2 · on July 10, 2015

"Because if we can make your photos smaller...we can make your cancer smaller!"

jackgavigan · on July 9, 2015

So... Cloudflare cures Cancer? :-)

daenney · on July 9, 2015

So about that...

> The files used for this kind of research reach hundreds of gigabytes and every time they are compressed and decompressed with our library many important seconds are saved, bringing the cure for cancer that much closer. At least that's what I am going to tell myself when I go to bed.

I realise it's not meant entirely seriously and I applaud any effort that helps speed up the exchange and storage of information and this type of research. But "bringing the cure for cancer that much closer" I find a bit of a stretch.