I was curious about the disagreement over his predictions from 1998, so I looked it up. I don't think it reflects well on him.
Spiegel interview 2010:
SPIEGEL: The genome project has been called the Manhattan Project or Moon Landing of its era. It has also been said that knowledge of the genes will change the future of humanity and become a "main driver of the world economy."
Venter: Who said that? I didn't. That was the people at the consortium.
SPIEGEL: You're wrong. You made all those statements in an interview with DER SPIEGEL in 1998.
Venter: Really? Those are Francis Collins' lines. So I may have said that that's how he describes it. I, on the other hand, have always said, "This is a race from the starting line to the finish."
Spiegel interview 1998 (translated into German by Spiegel and back into English with Google Translate[1]):
SPIEGEL: What will knowledge of the genetic material bring?
Venter: The end of ignorance, a completely new understanding of the human body, and a revolution in medicine. What we are planning is on the order of the Manhattan Project or the moon landings. The decoding of the genome will change humanity's self-image.
[1] http://translate.googleusercontent.com/translate_c?hl=en&...
Venter also wanted to patent the whole human genome once he'd sequenced it. It was only by a huge and expensive effort by the Wellcome charity that the genome was sequenced and made freely available. If you enjoy reading ..AGAAACCATCAGCACA.. then:
Venter also wanted to patent the whole human genome once he'd sequenced it. It was only by a huge and expensive effort by the Wellcome charity that the genome was sequenced and made freely available
Not only Wellcome (although their Sanger Centre was a major player), but a whole worldwide bucket of public money.
It also bears mentioning that on March 14, 2000, Bill Clinton and Tony Blair issued a joint statement saying that the results of genomic sequencing shouldn't be patented. The biotech sector lost $40bn in value as a result[1], as it became evident that the patent business model was dead.
I don't think the comments themselves reflect poorly on him. Anything said in the late '90s would have been heavily influenced by the bubble, and who wasn't thinking about how much the sequenced genome would be worth? Twelve years is a long time to reverse-map current feelings and knowledge onto previous memories.
The part that reflects poorly, IMO, is how he tries to pin it on Collins when faced with the quotes. I can understand why the interviewers didn't press the issue, but I would have liked to see them press it.
SPIEGEL: How much money do you think the human genome is worth?
Venter: In any case, a lot. Each of the major pharmaceutical companies is constantly working on 50 to 60 new drugs. If even a fraction of those are based on future knowledge of the human genome, that alone will bring in hundreds of billions of dollars over the next decade. Knowledge of the human genes is one of the strongest drivers of the global economy.
As I pointed out the first time this article was posted (http://news.ycombinator.com/item?id=1559056), the article title is a truncated and distorted version of the actual quote: "We have learned nothing from the genome other than probabilities."
I would not have expected such a misunderstanding of the fruits of the knowledge of the human reference sequence from J Craig Venter. Frankly, it sounds like sour grapes. For some reason, I feel inclined to try to explain some of the technical reasons why the public project's approach was necessary for the completion of the human genome.
There is broad agreement (perhaps "universal - 1") that Venter's sequencing approach would never have been able to achieve the level of completeness that we have now thanks to the Human Genome Project. This is because of duplicated DNA in the human genome and the limitations of short read lengths.
Let's try this with colors. Say you have a terrible sequencer that only reads 7bp at a time, max. You come across the following 2 strands of DNA:
GREENgattacaGREEN
REDgattacaRED
They're on 2 entirely different chromosomes, but you don't know this, since there exists no reference sequence - you're assembling de novo, remember? Good. So you get readouts like "GREENgattac" and "gattaca" and "attacaRED" - but you don't have any reads >7bp, so you never see what is on both sides of gattaca at once.
Thus, since you can never span gattaca, you will never know whether your DNA in this region should be GREENgattacaRED + REDgattacaGREEN, or REDgattacaRED + GREENgattacaGREEN. For de novo assembly with short reads, this type of shortcoming is deadly.
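Here's a tiny sketch of that ambiguity in code (toy Python, error-free reads, with GGG/RRR standing in for the unique GREEN/RED flanks - all names and sequences made up for illustration):

```python
# Toy illustration of the repeat problem described above.

def reads(seq, k):
    """Every error-free k-bp 'read' from one strand of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def read_set(chromosomes, k):
    out = set()
    for chrom in chromosomes:
        out |= reads(chrom, k)
    return out

REPEAT = "gattaca"           # a 7 bp repeat, as long as our longest read
GREEN, RED = "GGG", "RRR"    # stand-ins for unique flanking sequence

# Two mutually exclusive reconstructions of the same pair of chromosomes:
same_pairing = [GREEN + REPEAT + GREEN, RED + REPEAT + RED]
swapped_pairing = [GREEN + REPEAT + RED, RED + REPEAT + GREEN]

# With 7 bp reads, nothing spans the repeat plus both flanks, so both
# reconstructions produce exactly the same evidence:
assert read_set(same_pairing, k=7) == read_set(swapped_pairing, k=7)

# Longer reads bridge the repeat, and the ambiguity disappears:
assert read_set(same_pairing, k=11) != read_set(swapped_pairing, k=11)
print("7 bp reads cannot tell the two assemblies apart; 11 bp reads can")
```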
Because the public consortium / Human Genome Project used much longer read lengths and a method involving bacterial artificial chromosomes, they were able to map the more difficult, repeat-containing regions of the genome.
Edit: I should have addressed the public consortium's approach more thoroughly, though to be clear, I am less familiar with it and have never used it myself. Essentially, you chop up the 3-gigabase genome into a bunch of 150-kilobase segments (in this case, bacterial artificial chromosomes). You can map these back to their source chromosomes, of which we already knew we had 23. You then chop these BACs up into smaller pieces and sequence. In the worst case, you might be unsure of where some of your reads from each BAC map within the 150kb window they come from, but hey, at least you know which chromosome, and more or less which portion of the chromosome, they come from. In other words, in the situation described above, the RED and the GREEN never co-occur, since they are on different BACs and are not sequenced together. Thus you know that RED/RED and GREEN/GREEN are the right pairings (and you know which belongs on which chromosome).
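To make that concrete, here's the same toy example with a BAC-style partition (illustrative Python only; real BACs are ~150kb cloned fragments mapped to a chromosome before shotgun sequencing, not string slices):

```python
# Toy continuation of the color example: hierarchical (BAC-style) sequencing.

def reads(seq, k=7):
    """Every error-free k-bp 'read' from one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Each BAC contains a contiguous chunk of exactly ONE chromosome and is
# assembled on its own, so GREEN and RED flanks never appear in the same
# assembly problem:
bacs = {
    "chr1 BAC": "GGGgattacaGGG",   # known to come from chromosome 1
    "chr2 BAC": "RRRgattacaRRR",   # known to come from chromosome 2
}

for name, bac in bacs.items():
    # Within one BAC, the only flanks present are its own, so the
    # GREEN/GREEN and RED/RED pairings fall out unambiguously:
    print(name, "->", sorted(reads(bac)))
```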
Sure, but Venter and Celera do deserve props for introducing whole genome shotgun, which became the de facto standard for getting most of the genome.
This doesn't invalidate anything you're saying, because finishing is important, but the whole trend towards short reads and pushing more of the job to the computer was begun by Celera. Eugene Myers is/was an amazing computer scientist, and the public project would never have finished as quickly without the push from Celera.
Also, all of this really deals with the technical aspects of obtaining a sequence, which is not really what Venter is talking about. He's saying that a good sequence isn't that useful! I'm not sure what he means by this, or whether he's just trying to be controversial.
I agree with you on all points. Like I said, I've never actually used the BAC approach but I've definitely had my hands all over short read data. I think that underscores your point about the popularity of short read length methods. It's good stuff - but definitely aided by the presence of a relatively complete reference.
It is true that my post is somewhat off topic, too; I didn't want to dwell on Venter's naysaying.
Can you say a word or two about the current view of short reads for re-sequencing? I worked for a time on software for a particular lab to do alignment of short(-ish) reads like 13-bp+gap-of-100-200+7bp+gap-of-5-8+7bp+gap-of-100-200+13-bp. The idea was that you give me millions or more of those reads, and a reference like the human genome, and I give you back all of the alignments including partial matches (a few bp off, here and there).
Interesting programming problem. Those aren't the actual bp counts and gap sizes I was working with, but they were in that range. I got concerned that the reads we were working with were too ambiguous -- similar to the problem you describe with de novo assembly. I was having trouble doing the combinatorics to prove myself wrong, and started asking: hey, where's the mathematical analysis that shows reads of this size can possibly work? Nobody could produce it, and this led to some internal strife.
That was a few years back. What's the current thinking on what kind of reads plausibly work for re-sequencing? (If you happen to know. And where can I read something about the math that proves it?)
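For what it's worth, this is the sort of back-of-envelope estimate I was after, using the illustrative numbers above and a random-genome model (which is precisely the assumption that repeats violate):

```python
# Rough expected number of purely-by-chance placements for a read shaped like
# 13bp + gap(100-200) + 7bp + gap(5-8) + 7bp + gap(100-200) + 13bp, matched
# against a random genome. Illustrative numbers; real genomes are not random.
from math import comb

informative_bp = 13 + 7 + 7 + 13    # 40 bases that must (mostly) match
gap_configs = 101 * 4 * 101         # choices for the three gap sizes
positions = 2 * 3_000_000_000       # both strands of a ~3 Gb genome
max_mismatches = 3                  # allow "a few bp off, here and there"

# P(one specific placement matches with <= max_mismatches errors),
# assuming independent, uniform bases:
p_match = sum(comb(informative_bp, m) * 3**m * 0.25**informative_bp
              for m in range(max_mismatches + 1))

print(positions * gap_configs * p_match)   # ~5e-5: chance hits are negligible
```

So on a truly random genome this structure is plenty specific; the ambiguity all comes from the genome's repeat structure, which a back-of-envelope model like this can't capture.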
It's more an empirical question rather than a theoretical one, and short-read sequencing wouldn't be as popular as it is if it didn't work out most of the time.
I don't have all the exact numbers handy, but 76bp will uniquely map about 80-90% of potential reads, just with single ends. A high proportion of reads can be mapped uniquely even down to 25 or 30bp. There are some 'mappability' tracks available in the UCSC browser if you're curious about a particular region.
Of course there are repetitive and low complexity regions that can't be sequenced at all with these read lengths, or which haven't yet been properly sequenced or assembled at all even with much longer reads.
I'm not aware of a paper that has fully explicated how read length and the structure of the reads affect mappability (especially with pairs or reads with multiple gaps), though it's certainly something that all the people building the instruments (and the mapping algorithms) have thought about quite a lot.
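If you want to play with the idea, a toy 'mappability' count is only a few lines (illustrative: single-end, exact-match, made-up sequence):

```python
# Toy single-end 'mappability': the fraction of read positions whose k-mer
# occurs exactly once in the genome, i.e. reads that would map uniquely.
from collections import Counter

def unique_fraction(genome, k):
    n = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n))
    # a read maps uniquely iff its k-mer occurs exactly once in the genome
    return sum(1 for c in counts.values() if c == 1) / n

# A tiny genome with a repeated segment shows the effect of read length:
toy = "ACGTACGTAAGGTTAACCGGACGTACGT"
for k in (4, 8, 12):
    print(k, round(unique_fraction(toy, k), 2))
```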
You probably know more about the intricate details than I do! I think perhaps one of the more important recent approaches for potentially gapped or slightly mismatched reads is the use of the Burrows-Wheeler transform. http://en.wikipedia.org/wiki/Burrows–Wheeler_transform
Heng Li wrote BWA based on this, and it seems to handle some of the issues you were wondering about.
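The transform itself is tiny, for the curious (a naive sketch; BWA actually builds an FM-index on top of it rather than using this quadratic construction):

```python
# Naive Burrows-Wheeler transform: sort all rotations, keep the last column.
# Real aligners construct this via suffix arrays and query it via an FM-index.

def bwt(text, sentinel="$"):
    s = text + sentinel                                # unique end marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("gattaca"))   # -> "actga$ta"; on long texts the transform clusters
                        # like characters, which is what makes indexing compact
```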
I've been out of the field for a while, but I see a parallel between the old BAC vs. shotgun debate and the newer SNP vs. full-sequence one. While my academic friends scoff at SNP-based products like the ones commercialized by 23andMe, I believe there is value in going for reach over depth. You can draw an analogy here with coding: running up technical debt is fine in the short term as long as you are sure someone will come in later and clean up the mess.
An interesting point of view, especially since it somewhat opposes the perspective of Sergey Brin described in this article: http://www.wired.com/magazine/2010/06/ff_sergeys_search/all/... Brin knows he carries a mutation in a specific gene (LRRK2) that has been associated with a higher rate of Parkinson's disease. But the conclusion looks the same: we still don't understand well how the genome interacts with our actual biological operation, and we need more research, especially in data analysis.
Looking at it from a programming perspective, it seems that we have a large pile of "source code" and we are still evaluating individual functions (or even individual variable names) without grasping the overall mode of operation. When we are doing binary reverse engineering, we are always looking for an entry point where the program starts executing. It seems that the genome has many entry points...
Like a program binary which has been developed by trial and error without a design, which uses overlapping instructions that mean different things depending on the entry point, and which makes extensive use of self-modifying code (via gene expression). And which goes through binary-to-binary translation in stages: DNA -> RNA -> proteins.
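A toy version of that pipeline (with a deliberately tiny codon table; the real genetic code has 64 entries, and real transcription copies the complement of the template strand):

```python
# Toy DNA -> RNA -> protein "translation" pipeline.

CODON_TABLE = {  # partial table, just enough for this example
    "AUG": "M", "GAU": "D", "UAC": "Y", "UAA": "*",  # * = stop
}

def transcribe(dna):
    """DNA -> mRNA, modeled here as T -> U on the coding strand."""
    return dna.replace("T", "U")

def translate(mrna):
    """mRNA -> protein: read codons in triplets until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate(transcribe("ATGGATTACTAA")))  # -> "MDY"
```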
And that's just the simple stuff. Now imagine that a lot of the binary code is silenced, with only small parts of it running - and you may not know which parts. I think this is one of the big things to come out of the Human Genome Project: exactly how little is there. And since there is so little code to work with, there is a bigger need for epigenetic regulation and alternative splicing.
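Alternative splicing in the same toy style (one "function", several products; purely illustrative exon names and sequences):

```python
# One gene, several proteins: toy alternative splicing by exon reuse.
exons = {"e1": "ATG", "e2": "GAT", "e3": "TAC", "e4": "TAA"}

# Different isoforms stitch together different subsets of the same exons:
isoforms = [("e1", "e2", "e4"), ("e1", "e3", "e4"), ("e1", "e2", "e3", "e4")]
for iso in isoforms:
    print("+".join(iso), "->", "".join(exons[e] for e in iso))
```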
I'm equal parts impressed by, annoyed by, and intrigued by Craig Venter and his undertakings.
Impressed because of the relentless energy he pours into his ventures and their success rate, annoyed because of his tendency to take credit and self-aggrandise at every opportunity, and intrigued because I wonder where it will all lead.
Damn, I missed this article while it was higher up on the page. Not sure if there are enough people around but I'll ask my question nonetheless.
"""Venter: Well, the goal is multifold. We have to start by creating minimal cells. A human cell is too complex -- we have no idea how any human cell works. We don't even know how the simplest bacterial cell works."""
Do we have no idea how a cell works? I know we understand the anatomy, but I had no idea we hadn't moved beyond that. Is this some kind of satire/joke, or is this where the state of the science is at?
I think what he's getting at is that you can look at the genome of E. coli, which was sequenced in the nineties, and is one of the most common model systems used in research, but we don't know what the functions of a lot of those genes are. Last time I heard anything about this we had no idea what twenty percent of its genes did. That number may be off by ten to fifteen percent either way, since I haven't read anything on the topic in years, and it's going to depend on the parameters you set for how sure you are about a gene's function.
Keep in mind that this doesn't mean we're completely clueless about how the cell works, just that there's still a lot of work to do.
Unfortunately, as I understand it, that is where the state of the science is. Physics can now explain most chemical processes, but it's still unable to explain even basic biological processes.