The human genome is, at long last, complete (rockefeller.edu)
468 points by marc__1 on April 2, 2022 | 181 comments



I’m bothered by the description of the history of “junk” DNA. Going by this article, researchers labeled it junk just because they couldn’t analyze it well, prioritized the easier 92%, and thus didn’t understand it. Calling it junk seems like compensating for not understanding it: “I don’t understand it, but that’s fine, because it’s junk anyway.”

And the scientist’s quote seems so wrong. If you’re missing almost 10% of something, and that ~10% is not like the other 90%, it seems like a very bad assumption that it won’t contain a lot of important features.

The quote: “You would think that, with 92 percent of the genome completed long ago, another eight percent wouldn’t contribute much.”


Context: I have a PhD in genomics

The label “junk DNA” was one of the biggest mistakes in the history of genetics. A lot of high school textbooks still reference this term and it’s worse than misleading.

In many ways, non-coding DNA is just as important as the parts of the genome that code for proteins. Non-coding DNA determines expression levels, genome conformation (shape), and replication efficiency, among other things.

The term junk DNA misleads students into thinking that these sections of DNA play little part in how a cell functions. Quite the opposite, the “junk DNA” is responsible for orchestrating the “non-junk” bits.


The term junk DNA triggers a lot of confused discussion (on HN and everywhere else), and I suspect a part of that is our getting defensive about the idea of our DNA containing "junk". That term is just more loaded than saying something more benign like "non-functional".

But another part is that the term is poorly defined. This article seems to use junk DNA to mean the until-recently unsequenced portions of our genome (which I think is an unconventional usage), some comments here take it to mean non-protein-coding DNA, and another common use is to mean non-functional DNA.

If it helps, a defensible recent accounting is probably something like 1% of our genome being protein coding, perhaps 10% being functional in some way but not protein coding (e.g. regulatory, or transcribed to RNA that is functional etc), and the remaining 90% being without known function and likely non-functional.

After further years of much painstaking work we'll perhaps learn that a bit more is functional, though it may end up being, say, 11% functional vs 89% non-functional. And that's OK! I wouldn't worry about progress being stunted by assumptions that too much of the genome is non-functional; the greater risk is the opposite: continuing to believe there is function where there is little evidence to warrant it.

disclaimer: not a geneticist, but sometimes write tools they might use.


Ah thank you! This is the right way to think and comment.


I guess you would be the person to correct me here. I was led to believe there are several competing mechanisms that collectively lead to "junk DNA".

1. Junk DNA is irrelevant from the perspective of the cell but in some way selected for by the fitness function of likely DNA transcription errors.

2. Junk DNA is formerly-coding legacy DNA that was "turned off" by natural selection but never fully deleted.

3. Junk DNA is in some way error-correction / check-sums for the coding part of DNA.

4. Junk DNA is physically protecting the coding parts of DNA by letting most stray radiation act inconsequentially.


I would only disagree with (1). I think a lot of the confusion comes from the fact that coding DNA is much easier to understand. Coding DNA essentially has one function, to code for proteins. Only the local sequence matters when considering coding sequences.

Non-coding DNA is much more difficult to understand in terms of function. It can act to regulate gene expression both through the local sequence (promoters, enhancers) and through long-distance effects. For example, very distant promoters can be brought together within the nucleus and hence interact with each other. These interactions are a result of both the local sequence and the sequence of the very distant interacting region. The interactions are also dependent on the concentrations of proteins/transcription factors that also interact with these sequences. We have no good way of modelling that kind of complexity.


In defense of #1, I've read that retrotransposons can be irrelevant to the function of the cell. They survive random mutations through making more copies of themselves, so selection will favor retrotransposons that A) increase their number of copies and B) don't harm the fitness of the host. They don't need to actually benefit the host to survive selection pressure; if they do, that's gravy.

I suppose they could also survive selection pressure by hurting the organism if they're not expressed. for instance, by capturing a vital protein-coding gene from the host, and arranging things so the transposon must stay active for the gene to be transcribed.


Yes, you are right. The genome is a battlefield between human reproduction and viral sequence integration and reproduction in our “shared” genomes. We think of it as a human genome, but it is a hodgepodge of what we call “our” genome and half a million viral-derived sequences that also want to replicate every once in a while.

Eukaryotic genomes are the worst noodle code you have ever seen, but miraculously they help to make humans and other creatures, bugs, and plants with some consistency. That is a miracle to even the most hardened atheist!


Science has a long history of giving things arbitrary or whimsical names, and regretting it. Consider "real" and "imaginary" numbers. Or names that have different technical and popular definitions, such as "introversion."


Imaginary isn't exactly wrong though. Imaginary co-ordinate results in quantum mechanics correspond to non-physical observations - they "exist" but are unobserved until something squares them into reality.

Same with AC power transfer - the imaginary power doesn't drive the load, but it's quite real because I^2*R turns the term into a real power for resistive losses.


> Imaginary co-ordinate results in quantum mechanics correspond to non-physical observations - they "exist" but are unobserved until something squares them into reality.

This sounds wrong for three reasons:

1. You can multiply any vector of qubits representing a quantum state by any complex number and it's the same state.

2. Measurement is always with respect to a basis. A result can be measured with 0% probability in one basis, 50% in another, 100% in another.

3. It doesn't take into account interference, so it's missing the whole point of how quantum differs from regular probabilities.


Infrastructure, scaffold, or interface DNA, then, would be a better term.


All better terms. I think “regulatory DNA” is also a good term that helps emphasise the role in controlling gene expression.


No, way too generous a definition; one that posits a “selected” function for every nucleotide. Human effective population size is much too low to effectively clear the genome of junk.

Read Michael Lynch: The Origins of Genome Architecture.


How about highly repetitive non-functional viral-derived DNA, aka junk ;-)


Yes, and I do too, and you are wrong: it is not misleading at all, despite what ENCODE has claimed.

I suspect you have not had a recent course in population genetics.

Read Michael Lynch’s “The Origins of Genome Architecture” before you make dubious pronouncements. The human genome is a jungle of code and under rather weak selection due to small effective population size. There is a tremendous amount of cruft in all human genomes due to mobile element insertions and spread.


I cannot follow your comment. "Junk" as a noun refers to rubbish/garbage and therefore implies something worthless. The parent comment says that it isn't worthless. My questions are:

Are you saying this is wrong and so-called junk DNA is indeed worthless?

Why do you say to disregard ENCODE's claims, and are the functions mentioned in the parent even related to them?

Can you give a synopsis of what the book you're recommending says?

Also note the book predates the ENCODE project results, and being 15 years old it doesn't consider any newer developments either.


Question from the peanut gallery: if you were to flip a single bit in this junk DNA, are the outcomes only slightly different or could they be wildly variable depending on which bit was flipped?


Wildly variable. That’s true for all DNA including protein coding DNA. It’s much more likely that changing a base in a protein coding region will result in an effect, but still nowhere near guaranteed.

Statistically, the average human has ~1 base variant (SNP) in a protein coding region somewhere in their genome. In almost all people the effect of that SNP is no more apparent than the effect of SNPs in “junk DNA”.


What do you mean by "~1 SNP"? One de novo amino acid substitution, or inherited? We have millions of variants.

The redundancy of codons provides some protection: several codons code for most amino acids, and one-base-pair changes are often synonymous or will code for similarly charged residues.
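
To make that redundancy concrete, here's a toy Python sketch using a small subset of the standard genetic code (the full table has 64 codons; the specific codons here are just illustrative):

    # All four GGx codons encode glycine, so any third-position variant
    # of GGU is synonymous. GAU (Asp) -> GAA (Glu) swaps two similarly
    # (negatively) charged residues.
    CODON_TABLE = {
        "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
        "GAU": "Asp", "GAC": "Asp", "GAA": "Glu", "GAG": "Glu",
    }

    for base in "ACGU":
        variant = "GG" + base
        print("GGU ->", variant, "=", CODON_TABLE[variant], "(synonymous)")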

on the other hand, if you have a mutation in, say, a non-coding intronic splice site, that can really fuck a protein up. I got unlucky there.


Not true at all.

So much charming and well-intentioned misinformation on this thread; as much as if I were to weigh in on the pros and cons of MariaDB versus PostgreSQL.

Coding sequence is under significantly higher levels of active “purifying” selection than regulatory DNA, and much higher than de novo mutations in repetitive mobile elements, which only rarely perturb phenotypes.


Single point mutations in regulatory elements such as enhancers have been shown to change the expression levels of multiple genes. These elements are under just as much selective pressure as any coding region.


not all noncoding dna is going to be a regulatory element.


You can't change 1 bit; each base contains 2 bits of information.

We can compare each base against ancestral bases, e.g. in primates and other vertebrates; these comparisons are called conservation scores.

Some areas are highly conserved, so changing those will probably have worse outcomes than less conserved regions.
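
To make the 2-bits-per-base point above concrete, a tiny Python sketch (the 00/01/10/11 assignment is just an illustrative convention, nothing biological):

    # Flipping either single bit of a 2-bit base code always yields a
    # different base, so the smallest possible "bit flip" a genome can
    # experience is a full base substitution.
    ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
    DECODE = {bits: base for base, bits in ENCODE.items()}

    for base, bits in ENCODE.items():
        low, high = DECODE[bits ^ 0b01], DECODE[bits ^ 0b10]
        print(f"{base} ({bits:02b}) -> flip a bit: {low} or {high}")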


Are there any parts of the theory that are still foggy?


Junk DNA is, AFAIK, not actively expressed (used to create proteins). It's important, though, in the sense that spacing between gene expression sites is a control on which genes get expressed under which conditions (so the junk adds necessary spacing between important genes).

I did _some_ research on epigenetics during my MS degree. Spacing between sites was an important factor in our modeling of gene expression.


> spacing between gene expression sites is a control on which genes get expressed under which conditions

This makes it sound like it represents control flow rather than data. If its presence does or can make a material difference to the output encoding, it strikes my non-expert ears as actively perilous to label such DNA 'junk'.


so DNA is written in Python then. it's settled.


and pythons are written in DNA. a fully bootstrapped system!


* self-hosted system

We still don't know how hard bootstrapping it would be :)


What research have you seen on modeling gene expression? I'm genuinely curious, as I haven't really seen many convincing ab initio studies towards this. I could see finding certain features like this spacing as predictive of perhaps some other feature, but I haven't seen any research that really tackles generation of gene expression data from first principles and input of DNA sequence. It's my understanding that modeling the kinetics is difficult, as we really haven't tried making the full network of differential equations. Does anyone have a project that points to the 'final solution' to this? I know recently there was a paper in Cell that modeled the cell with the smallest viable genome to predict cell division, but that's a bit further away from complete modeling of our 30k genes' (much less isoforms') dynamics.


Global models of gene expression for an entire cell are fairly distant at this point, but there is quite a bit of work into modeling transcriptional activity from sequence. If you're interested in reading more, a relevant technology to search for would be the "Massively Parallel Reporter Assay", or MPRA, which couples pools of 10⁴–10⁵+ synthetic DNA sequences with RNA sequencing to measure transcriptional output. Data from MPRA experiments is being used to train models, although these models are not anywhere near a point where you could model the gene expression of all regulatory elements in a cell; they are usually focused on a specific factor or regulatory sequence.
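
For a sense of what the basic readout looks like: the core MPRA measurement is essentially a normalized log ratio of RNA to DNA barcode counts per synthetic sequence. A minimal Python sketch with made-up counts (real pipelines add replicates, barcode aggregation, and error modeling; the element names are hypothetical):

    import math

    dna = {"enh_1": 950, "enh_2": 1020, "enh_3": 980}   # input library counts
    rna = {"enh_1": 4100, "enh_2": 990, "enh_3": 120}   # transcribed RNA counts

    def activity(rna, dna, pseudo=1.0):
        # Normalize to library size, then take log2(RNA/DNA);
        # > 0 suggests an activating element, < 0 a repressive one.
        rn, dn = sum(rna.values()), sum(dna.values())
        return {s: math.log2(((rna[s] + pseudo) / rn) / ((dna[s] + pseudo) / dn))
                for s in dna}

    for seq, a in activity(rna, dna).items():
        print(f"{seq}: log2(RNA/DNA) = {a:+.2f}")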


The "train models" or ML portion is what I'm disappointed with unfortunately. I make ML models to predict things from genetic information somewhat regularly, but we all are aware of the enormous issues with that. I am more interested in the ab initio methods, as I have seen them be spectacularly useful in other fields - like Bethe salpeter equations in condensed matter physics.


Notably DeepMind had a recent paper on using transformers to predict long range interactions in gene expression: https://www.nature.com/articles/s41592-021-01252-x


Nope, this claim is not well supported. Yes, sure, there will be a handful of examples, mostly in promoter and enhancer regions, but insertions or deletions in introns and repetitive non-coding DNA will generally have a very modest impact on phenotypes, and these mutations generally cannot be extirpated by natural selection. That is why the term “junk” DNA is actually not that bad or incorrect in many cases.

If you prefer, think of it as “spandrel” DNA in the sense used by SJ Gould.


I remember reading how some of the "junk" DNA turned out to be important because while it doesn't make proteins, the "non-coding" RNA it gets transcribed into regulates something.


Not just spacing. The sequence also matters, as these regions serve as binding sites for enzymes that can promote or repress the expression of downstream genes. As just one (relatively simple) example of how complicated genetic circuitry can be, I really enjoyed & recommend Mark Ptashne's A Genetic Switch: Phage Lambda for anyone who doesn't mind doing some slightly technical reading.

Disclaimer: not a biologist, and would be interested in hearing from someone more knowledgeable than I, both about the Ptashne book and about recommended reading.


So junk DNA is like code styling and comments in programming?


It's closer to a config file / internal functions that modify the state variables of a system instead of generating objects. The junk DNA doesn't explicitly get read, but it interacts in nonlinear ways with the executable "text" portion of the DNA.

Also disclaimer: My only knowledge of this is from Nessa Carey's The Epigenetics Revolution and some additional online reading.


Or, more like Makefiles, and an enormous amount of cache for the runtime and build system, but this cache never really gets invalidated.

Basically it's like using a "buggy filesystem" (like ext2) that only shows the first ~16000 files in a directory. So having too much junk can hide stuff, having too little of it can uncover files that evolution carefully hid, and so on.


No, this is the wrong metaphor. You want the right metaphor: one brilliant coder, 10 total newbie coders, and five cats walking all over the keyboards. And no delete function at all!


depends on what you consider non-coding, too. the ribosome is mostly made of RNA, but it's not turned into a protein. if your ribosome breaks you can't make proteins.


Imagine copying and pasting random chunks of code from around Stack Overflow into your script and commenting it all out; that's kind of a better analogy. 20 lines there, 400 lines in reverse over there. Sometimes they get uncommented in cancers.


What do you mean by gene expression sites? If you mean enhancers, afaik you can excise and insert them anywhere in the chromosome and they will still affect function in cis.


There is a very clear difference between "junk" DNA and "non-junk" DNA: the latter encodes proteins. That doesn't mean that the DNA parts that don't encode proteins are junk; this is more of an exaggeration or misunderstanding that is often repeated, not what scientists actually thought.

There are clearly parts of DNA that are not essential; this is clear if you compare genome sizes between different organisms. They can vary enormously, and not in a way correlated with any complexity of the organism. There are also parts of DNA that are remnants of viruses inserting many copies of their DNA, which are the parts that could be considered junk. Even those might have an effect simply due to their presence, but essentially everything in the cell has some effect if you look closely enough.


Not really. Effect sizes that are small are not subject to selection in most metazoan species because of a small effective population size. These types of variants either drift to fixation or extinction. See Michael Lynch for details.


This is why in school these days they say noncoding DNA or noncoding regions of the genome.


or the viral DNA might have an effect if demethylated.


The term "junk DNA" was coined very early in our understanding of DNA. Even when we had no idea what it was for, very few respectable geneticists actually believed it was "junk" - basic evolutionary theory argues pretty strongly against it. But the name has stuck around for far longer than it deserves to.


>basic evolutionary theory argues pretty strongly against it.

That's not true. Over millennia, leftover chunks of DNA can accumulate for no good reason: duplication mistakes, viral infections, etc. The term "junk DNA" originated from the initial assumption that all noncoding DNA was useless. Evolutionary theory has nothing to do with this.


There is a huge difference between "large parts of it are useless" and "all of it is useless". And large parts of the non-coding DNA are probably useless, unless you're extremely generous with what counts as "function" when examining this.


Yep!


Evolutionary theory says that it would not be preserved if it wasn't being used. If it was never read or relied on, then random errors would accumulate dramatically faster than in coding DNA, because in coding DNA mutations generally lead to problems and so are selected against.

IIRC, many of these non-coding sequences (so-called "junk DNA") are preserved in a way that shows they are being relied upon for something.


Another voice of reason.


Basic evolutionary theory may argue that most of it is "junk" in the sense of being non-functional (even though some may be species-specific or under selection too weak or recent to be detectable). One paper that lays out this argument has a title with the memorable beginning "On the Immortality of Television Sets." https://academic.oup.com/gbe/article/5/3/578/583411


One of the authors, Dan Graur, has a blog where he discusses evolution and junk DNA, among others: https://judgestarling.tumblr.com/


Ah lovely. Voice of reason.


Absolutely wrong. Evolutionary theory has to take a back seat to highly quantitative population genetics.

If you do not know what “effective population size” means then you should watch from the sidelines.

I think I am a respectable quantitative geneticist. I do not use the term myself but I have no fundamental distaste for the idea of junk DNA.

It is inevitable given our constant war with viruses that want to hijack our genomes.


It would have been wise to declare it "the unknown regions" or "the frontier," I think. Something to more closely indicate that our gap of understanding was wider than indicated by the moniker the project chose.

And I also agree with the bad assumption on the remaining 8% not being significant when we knew that it was structurally different. Less than 8% of an ELF is the header, but boy howdy will that thing not run well if you cut it off.


Calling the "junk DNA" a "header" is closer to what it may do, but still slightly different because in computer software a header is still read with the same codec (bits) as the data, much like DNA is usually read with the protein codec.

Instead, we are learning the "unknown" DNA performs biological functions due to its physical nature, such as physically blocking things from binding.

Imagine if a small section of a hard drive was so strongly magnetized that it repulsed the read head - if you were trying to translate it into binary it would appear to be nonsense.


We are not learning that. Much unknown DNA is repetitive cruft. Get used to it. Biology is even messier than coding.


I'm not sure which of your many comments to respond to here, but we certainly have learned that a small part of "junk" DNA (as that term was originally meant) performs useful functions not directly related to protein coding. So the comment you replied to is absolutely accurate.

Of course you are accurate as well: if we subtract the protein coding and the known non-coding but useful bits the rest is almost certainly absolute junk... duplications, bits of retroviruses, etc. But it would be inappropriate to claim we are absolutely certain none of it is useful. Will we find out 50% is useful? Obviously not. But I would rather say "current evidence suggests at least 97% is true junk; we are not likely to find large sequences with hidden uses at this point".


To build a (bad) analogy to human cultures... There are two things we lock up tight. Things that could cause damage if freely accessed or openly roaming, and things that it is vital to protect and must not be changed.

My rough read of biology is that it could use the same mechanism to lock down both.


No. If a cat walks all over your keyboard while you are away getting a cup of coffee and types in several lines of gibberish, would you call that “code of unknown function”? By the way, the delete key on your keyboard is broken.

Why is a viral insertion any different? Why is a recombination error any different?


It's not, but I think it was premature to assume that's what the un-decoded sections are.

Looking at the hex dump of RAM, some repetition is garbage, some is lookup tables with multiple entries into the same routine, some is initialized scratch memory for distant functions. It'd be premature to assume it's worthless.

Software code analogies to biological encoding break down, of course, because code is built intentionally and biological systems aren't. On the flip side, that very lack of intentionality in biological systems means it's risky to assume something is junk without a really holistic understanding of the context.


I think it's more that geneticists have a sense of humor.


Yep: especially the Drosophila geneticists: sonic hedgehog is a seriously important gene (even in humans).


The usual expression nowadays is "non-coding DNA".

Undoubtedly much of it could be pruned out with no undesirable result, but there does not seem to be any ongoing process to do that, so stuff piles up. As it will.


>Undoubtedly much of it could be pruned out with no undesirable result

Such hubris as this is what led us to:

- define DNA we didn't and still don't understand as useless "junk"

- call the appendix a useless vestigial organ

- declare "silenced" B cells useless

The list goes on and on and on... When will somebody compile a list of how often science is wrong just to slap the arrogance out of people before they cost more time and lives with such reckless and impatient reasoning?


There is a very large difference between "much of" and "all". And we have at this time no way to distinguish which bits are in the "much of" and which the rest.

There is so very much of it that even were the actually-junk just 5% of that, it would still qualify as "much of".


It is way more than that. Among all vertebrates, the pufferfish has a famously small genome of only about 342 million basepairs (Mb) (x2). Compare that with the human genome at 3,200 Mb (x2).

Now you might claim humans are more complex than pufferfish but that is mainly human hubris at work and would be exceedingly hard to prove in a court of law.

Fish genomes range “in size from 342 Mb of Tetraodon nigroviridis to 2967 Mb of Salmo salar” (from https://bmcgenomics.biomedcentral.com/articles/10.1186/s1286...).

Do you think all of that difference between two teleosts is functional?


Is it not arrogant to think that DNA code should be perfectly functional down to the last nucleotide?

Show me 100 million lines of human computer code.


This perspective is true to some extent, but it’s counter-productive to think of science as being wrong.

You should think of science as the “least wrong” set of beliefs we have at any point in time. It will never be perfectly right, and every day it’s less and less wrong. The reason it’s so reliable is because it embraces (and doesn’t dismiss) this uncertainty.


Rather, scientists are wrong.

Terribly often, science grinds to a dead halt on some range of subject matter until certain scientists retire or die. It is easy to list remarkably recent cases where they did finally die and work could proceed, and many others where they have not died yet and the field is still stuck fast.

Science will always be incomplete. It has nothing to say yet about most possible questions. What it does pronounce upon should be reliably correct, but that is often not true, traceably to those individuals who maintain falsehoods. Sometimes the falsehoods become doctrine and hang on even after the offenders have obliged by dying.


I don't think it's counter productive at all. There was a period of time, when the appendix was (absurdly) considered vestigial, that surgeons would remove the appendix as a side quest if they happened to have the area opened for some other purpose.

That was a terrible idea, but one that was supported by science at the time. There are practical reasons to be skeptical about scientific assumptions.

Science becomes less wrong faster if we allow history to remind us that a lot of what we believe will likely turn out to be wrong.


A better example would be irradiating thymus glands.

Appendices are still removed, to this day, and people lacking them make do without. A thymus gland is harder to dispense with.


I’m happier without mine. It was causing me no end of digestion issues culminating with an attempt on my life.


Yes, but too simple: there are many types of non-coding DNA, some very important, others truly junk.

Promoters, enhancers, insulators, CpG regions, 5’ and 3’ UTRs, splice junctions in introns, transposons, retrotransposons, LTRs, …


It's a common trope. Anything that cannot be intellectually digested is labeled "junk", ignored and, eventually, becomes invisible.

You would be astonished at how much of reality falls into that category.


Wrong. It is not called junk DNA out of ignorance but because we know much of it is ancient viral insertion. There is no way to cleanse the genome effectively.


there must be some genome-cleaning processes active, surely. otherwise wouldn't the size of all genomes slowly ratchet up in every species, until organisms just can't cope anymore and everything goes extinct?

also, do you consider introns to be junk DNA? those didn't evolve from viral insertion (probably?) so though they might end up hosting "junk", presumably they're not all "junk."


There is a balance to genome size in terms of selection. If you have a smaller genome, mutations are more likely to have an effect, which means faster evolution (good if you are a pathogen for example). There are smaller energy requirements with replicating a smaller genome, and fewer atoms needed from the environment with shorter DNA. There is probably some optimal physical size to the nucleus, which gets larger as there is more DNA inside. Deletions can occur during cell division. A deletion bias is observed over a duplication bias:

https://en.wikipedia.org/wiki/Bacterial_genome#Deletional_bi...


Yet the trope exists, my fervent friend.


Didn't you watch the movie Twins? Arnold got all the good DNA and Danny DeVito got the junk DNA.



LoL: will have to watch.


Junk is an appropriate easy-access word. Would you be happier if geneticists told you junk DNA was mainly highly repetitive mobile element DNA derived originally from exogenous viruses? In general this junk does not make proteins and its expression as RNA is often actively suppressed.

The human genome is littered with cruft and dumb code comments just like you would expect for very very old code.


Ha, they might as well call it high-yield DNA, as in the other "junk" nomenclature.


Now on to the more important challenge. Making our understanding of the human genome more diverse and less specific to certain geographic areas. This is already having an impact in studies, drug development, etc based on genomics.

Investing in organizations such as H3Africa will be important.


I don’t think it’s “more important”; without a reference genome it’s impossible to take the next step, and the Human Genome Project successfully took us from 0 to 1. Going from 1 to n is much easier. The Human Pangenome Project is working on this and should have 350 diverse genomes sequenced within the next couple of years.

Note that this has nothing to do with collecting variations in individual genes - that’s easy and widely available. But about collecting variations in the actual content and structure of the genome. e.g. Some populations have a bunch of extra DNA that most other humans lack, amazing.


One of the issues is sample size. African populations are the most genetically diverse, so you would need a much larger sample to get the same power you get from a less diverse population of, say, Icelanders (1), who are even less genetically diverse than western Europeans. Another favorite population for geneticists, for similar reasons, is a group of Mormons from Utah.

1. https://en.wikipedia.org/wiki/DeCODE_genetics


Yes. Which is better or worse depends on the use case. The bias in available samples causes bias in things like medical treatments.


Maybe someone can explain what exactly it means. Are all the variants of every allele now mapped? Of course everyone might have a slightly different variant, so what does it mean?

What does complete mean?


Complete here means the full end-to-end sequence of all chromosomes in a single human cell line named CHM13. The typical human cell has 46 chromosomes, in 23 pairs (one from our mother, one from our father), named chromosome 1, chromosome 2, and so on. This CHM13 cell line is special in that each of its pairs is (nearly) identical. Each chromosome is a long string of A,C,G,T nucleotides. So, this complete genome is a full set of 23 sequences without any "not sure" positions or "gaps" in the sequence.

One common analogy is to consider the genome sequence (a.k.a. assembly) as a map. Since the initial publication of the human genome in the early 2000s, most regions of human DNA have been known in full resolution. Other portions, most prominently the repetitive centromeres that lie at the middle of chromosomes, have remained unmapped. It was known that they exist, approximately how big they were, and which types of sequences lay inside, but the full order of the sequence had never been determined for any human genome until this work.

You could consider the genome like the earth and the centromeres like a dense rainforest. Previously we had detailed maps of most of the earth, and we had mapped the boundaries of the rainforest and had satellite-level images (i.e. we knew they were full of plants). Now we have on-the-ground pictures with full detail.

Having a map of these sequences makes them accessible to study. One of the most valuable uses of the human genome is as a shared coordinate system used by scientists to compare different individuals and to identify and name genetic variants that explain human traits. We lacked that coordinate system for a big chunk of the genome until now.

As you say, this paper reports the sequence of a single human cell line named CHM13. Each of us has a slightly different genome sequence (really two of them, one from each parent). Now when scientists sequence the genomes of more individuals, they can look at these regions that were previously ignored. Certainly understanding those regions will improve our understanding of human biology. Exactly how much will remain to be seen.


> So, this complete genome is a full set of 23 sequences without any "not sure" positions or "gaps" in the sequence.

Well, not quite: there is still a lot of ambiguity and compression in the centromeres. But I agree that we are almost there.


What's a cell line, and do we know anything about who CHM13 is?


CHM13 is from a "complete hydatidiform mole" (https://en.wikipedia.org/wiki/Molar_pregnancy), and the paper says "Local ancestry analysis shows that most of the CHM13 genome is of European origin, including regions of Neanderthal introgression, with some predicted admixture". Fig. 1 shows a cool breakdown of the regions of the genome with different ancestries.


Seems to be an immortalized (telomerase*-transformed) cell line from a female fetus with near-complete homozygosity (https://sites.google.com/ucsc.edu/t2tworkinggroup/chm13-cell...).

* Telomerase is a reverse transcriptase that allows cells to achieve replicative immortality (https://academic.oup.com/hmg/article/9/3/403/715108).


> The typical human cell has 46 chromosomes, in 23 pairs

Mitochondria have their own DNA, which is also sequenced.


Imagine a sequence in the DNA like this: TAAAAAAAAAAAACAAAAAAAAAAG. The way sequencing worked is that the DNA is split into small pieces, which are then aligned back together by their overlaps. But if we split the sequence above we might get these pieces: AAAAAA AAAAAA AAACAA AAAAAG TAAAAA. You know the pieces overlap, but due to the long runs of A there is no way to figure out what the proper order is.

With new techniques you generate much longer pieces, so there is much less confusion.
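
Here's a toy Python sketch of that ambiguity (read lengths and spacing are arbitrary; real assemblers build overlap graphs from millions of randomly placed reads):

    genome = "TAAAAAAAAAAAACAAAAAAAAAAG"

    def reads(seq, length, step):
        # Simulate evenly spaced reads of a fixed length.
        return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]

    short = reads(genome, 6, 3)
    print(short)  # several identical 'AAAAAA' reads: neither their order
                  # nor the true run lengths can be recovered
    long_ = reads(genome, 15, 5)
    print(long_)  # every long read spans the unique 'C' anchor,
                  # so the layout is unambiguous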


From the Science article:

"However, limitations of BAC cloning led to an underrepresentation of repetitive sequences, and the opportunistic assembly of BACs derived from multiple individuals resulted in a mosaic of haplotypes. As a result, several GRC assembly gaps are unsolvable because of incompatible structural polymorphisms on their flanks, and many other repetitive and polymorphic regions were left unfinished or incorrectly assembled (5)."

Looks like there were "gaps" in the sequence due to technical limitations associated with the original sequencing methods and the authors have filled in those gaps. I haven't read the full paper, though.


Yes.


No, not all variants of every allele are now mapped. You would have to sequence a significant fraction of the human population, and imho the very idea of mapping all the variants of alleles doesn't quite square with what would be the most useful way to understand human genotype to phenotype variation.


Ain’t gonna happen.

Every human has unique de novo germline mutations (estimated to be in the range of 100 to 200).

In addition every cell (other than RBCs) will have a small number of its own unique somatic mutations.

And then there is radiation!


We'll get another batch of "the human genome is complete" articles when we start publishing graph genomes.


I think those will be saying "we have just begun the sequencing of the human pangenome".

The problem is endless. Life on earth is huge.


One does not “sequence” a pangenome. Pangenomes are single graphical assemblies of many diverse genomes—ideally T2T genomes. Pangenome assemblies will improve accurate health care.

Pangenomes allow us to discard the kludge of a “reference” genome.


Too cynical. Pangenomics is a different kettle of fish entirely. It attempts to provide a full view of all types of the more common and rare germline variants in a family, population, species, genus…

And can be applied to cancers too.


As a human geneticist, I'm not being cynical: I look forward to taking advantage of graph genomes. I'm just making a prediction about what people will say.


> CHM13 lacks a Y chromosome, and homozygous Y-bearing CHMs are nonviable, so a different sample type will be required to complete this last remaining chromosome.

(from the paper itself)

It is a respectable achievement. But the Y chromosome is too important to be left out in order to call this the complete human genome.


The Y chromosome was added after the preprint was made (https://twitter.com/aphillippy/status/1509594880623796226) and was built from the HG002 cell sample (which is heavily analyzed by the Genome in a Bottle project: https://www.nist.gov/programs-projects/genome-bottle).


LoL: Chr Y is famously junk DNA —to come back to the head of this amusing thread.


without the SRY gene you probably wouldn't be named Rob.


It was enlightening to learn that, to further increase the diversity and functioning of proteins, sugar groups are added to our proteins (glycosylation) at specific organelles within the human cell (think Golgi apparatus, endoplasmic reticulum). Historically, proteins were studied first as they were more readily purified; then in the 1990s, decoding our genome took precedence in research funding. What is the next key funding target?


The human genome, or a human genome?


A single one. But completely mapped.


Man/woman?


The answer is complicated. The molar pregnancy from which CHM13 was made had two copies of one man's X chromosome and, separately, the project sequenced another man's Y chromosome.

https://www.science.org/content/article/most-complete-human-...

The genome’s Y chromosome came from Peshkin, and the rest of the DNA sequenced by the Telomere-to-Telomere (T2T) Consortium comes from a so-called molar pregnancy, a uterine growth that results on rare occasions when a sperm enters an egg that has no chromosomes. The fertilized cell can copy the sperm’s 23 chromosomes, creating two identical sets, and begin to replicate.

The question remains open of whether the owner of CHM13’s genome could be identified using public DNA sequences in genealogy databases. Phillippy thinks not because CHM13’s genome only represents one-half of that person’s DNA. Even if it were possible, NHGRI officials argue it would be unethical to reveal him for any reason, including to get consent.

Because CHM13 has an X chromosome but no Y, the T2T Consortium added Peshkin’s DNA


Almost certainly.


And yet neither: it's "a complete hydatidiform mole (CHM)".


Now that is a critical distinction—actually only one half of a single full human genome—one human “haplome”.


I feel I've read the same headline quite a few times over the years. Here's one from last year: https://www.theatlantic.com/science/archive/2021/06/the-huma...


These are both the same paper. The earlier link points to the preprint. The paper has now been published in Science.


Touché. There's no shortage of these headlines though, here's one from 2003: https://www.nytimes.com/2003/04/15/science/once-again-scient...


To be fair, they address exactly that point in the first few sentences.


I don't think the content which addresses the title is OP's point, just the fact that there have been numerous publications with this title or something similar.


I'm not sure what the suggestion is here. Scientists should really work harder at their jobs and simplify the real world to the point that a headline somebody absent-mindedly read a decade ago doesn't sound repetitive?


I think the suggestion is that we should use the word "complete" when we mean it. Presumably, unlike say a software project, there is an actual state of completion possible in sequencing the human genome. Why has that mark been supposedly met so many times over the past couple of decades, only to be called complete again a few years later? When is it actually complete? Does it even matter anymore?


What human endeavor could ever be considered complete? What you propose is not a useful definition of the word.

If somebody completes a version of software, are they not allowed to announce future versions? Is scientific understanding not allowed to advance and change? Such views of science, as fixed and unchanging, with rigid definitions of inherently fuzzy concepts, are inherently anti-science.


Smart and correct rebuttal.

We all need to be able to live with the notion of levels of completeness. This is biology—not engineering.

What is the precise definition of a filled glass of water?


There is still room for improvement though, parts of the Y chromosome are still not there.

Is it gonna be GRCh39 or will they change the naming scheme again?


> [We] have decided to indefinitely postpone our next coordinate-changing update (GRCh39) while we evaluate new models and sequence content for the human reference assembly currently in development.

https://www.ncbi.nlm.nih.gov/grc

The complete Y chromosome from HG002 was added with v2 (after the paper was written). Probably a patched form of GRCh38 will be made using T2T sequence, but IMO it makes more sense to use T2T-CHM13 as a reference with its single origin instead of a weird chimera, at least until pan-genome graph methods mature.


We will be killing the archaic concept of a reference genome. All will soon be dynamic pangenomic assemblies that grow and flex and branch as more genomes are added.


Yes, and I've had a hard time parsing what's different this time than last time. Anybody?


There are telomeres and centromeres, at the ends and middles of chromosomes, that have really long repeating sequences, which until recently were hard to sequence. This work sequenced those regions, from a fertilized egg that apparently didn't have any female DNA. Next steps are sequencing more individuals. If this sequencing tech gets cheap enough, individualized medicine could take a big leap forward. I hope that's a good summary.

https://www.washingtonpost.com/science/2022/03/31/human-geno...


Has no male DNA (no Y chromosome), and is completely homozygous which simplified the assembly. https://web.expasy.org/cellosaurus/CVCL_VU12


The cell was female, but due to a quirk, the DNA was all male? Or I'm not understanding what the WaPo article said. The genome does have a Y chromosome: https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4#/def


That Y chromosome was added by applying same analysis workflow to a different biological source (CORIELL:NA24385, a NIST standard material used in the Genome in a Bottle project). The other chromosomes are all from the CHM13htert line (if you click on the individual chromosomes at your link above, you can scroll down to the "/isolate" feature to see what material the sequence was derived from). There's a long tradition of having a reference assembly be a combination of different individuals. Even the "standard" GRCh38/hg38 reference doesn't represent any single individual.


Thanks for that explanation. I wonder how much harder it'll be to sequence a heterozygous genome.


First part true.

Second part is doubtful.

Individualized medicine needs a much more robust computational framework than it currently has.

Each of us has a unique genome, a unique developmental history, and has lived in a unique environment.

We are n=1 experiments. Organ culture and iPSCs are a pipe dream.

Here is our approach to addressing this barrier:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979527/


The Atlantic article is based on a preprint of one of the papers now formally published in Science. They both describe the same T2T-CHM13 assembly.


Telomeres and centromeres, major histocompatibility complex, and other highly repetitive and complex regions with segmental duplications. This stuff matters!


Eh, it's referring to base pair variations. The title is on the sensationalistic side when you consider how most lay people will interpret it.

The cool stuff people imagine about in response to the title won't happen until researchers finish figuring out regulatory regions in the DNA; and, how DNA interacts with itself and environment, both spatially and temporally. Regulatory regions are promoters, enhancers, silencers, and insulators, and impact gene expression and regulation.


> it's referring to base pair variations

No.

> The title is on the sensationalistic side when you consider how most lay people will interpret it.

The title refers to a large scientific collaboration that has succeeded in utilizing single-molecule sequencing technology that only matured in the last 3-5 years to sequence regions of the human genome that were previously unmapped, bringing the completeness of the mapping to 100%. That doesn't seem sensationalistic.


After ten years of "we are getting to the last mile" and "we are finally completing," it just feels like bio people cannot give up this convenient justification for their funding, so we have to re-complete the human genome every other year.


Not so. This work is real progress brought about by significant technical innovations over the last 5 years. I name three:

1. Two cool long-read sequencing technologies (PacBio HiFi and Oxford Nanopore)

2. Hydatidiform moles carrying human chromosomes for easier sequencing

3. Excellent code for assembly through complex regions of the human genome (e.g., code for assembling pangenomes by Garrison and Li).


“Junk DNA”, i.e. “non-coding DNA”, makes up the code segments of the DNA, while the “coding DNA” (“non-junk”) makes up the data segments.

As a programmer, I find the code segments the most interesting.


Where can I download it? (I still have a copy of the originally-announced one from like 20 years ago, somewhere...)


How long would it take to sequence the genome of an individual human instead of an immortal human cell line?


how much do you want to spend? you could do it in a couple days if you use a few sequencers. how long would it take to assemble? how much do you want to spend, etc.

how useful is it? not very at the moment. most analysis is still around specific variants that are understood. most function is not understood.


The DNA prep and sequencing would take about two weeks if you worked hard using both PacBio HiFi and the latest Oxford Nanopore systems.

Cost would be under $25,000 for great depth (about 20X on both platforms, which is overkill).

The hard part is the de novo assembly. At least a few weeks in cottage industry mode. A few days if in full factory mode.


I think we're on the same page. It's a matter of $$ for speed in the sequencing and analysis phases. However, the big problem that Isaac Asimov posed, and which is still relevant, is understanding the results. Having your entire genome sequence is achievable now, but understanding it isn't. As far as I know, 23andMe and other services only look for a subset of well-understood variants and that's it.


Nebula has a 23andMe-like service with 100x WGS and provides traits that are actually polygenic and not just SNPs, so have some hope of being causal and not random associations… but it's still practically useless for all kinds of obvious statistical bias reasons.

For instance it says I'm "99th percentile genetic predisposition" for melanoma - yes, thanks, I know I'm Scottish - and left-handedness, which I'm not.
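
For anyone curious what "polygenic" means computationally: a polygenic score is typically just a weighted sum of allele counts across many variants. A minimal Python sketch with hypothetical rsIDs and effect sizes (real scores use thousands to millions of variants, then compare against a reference population to report a percentile):

    # Hypothetical GWAS effect sizes (betas) and one person's dosages
    # (0, 1, or 2 copies of the effect allele per variant).
    effects = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}
    dosages = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

    score = sum(beta * dosages[snp] for snp, beta in effects.items())
    print(f"polygenic score: {score:+.2f}")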


Not quite: centromeres are still a messy tangle of repeat sequences. Give it a few more years.


> The Human Genome Project essentially handed us the keys to euchromatin, the majority of the human genome, which is rich in genes, loosely packaged, and busy making RNA

> Jarvis and Formenti hope that their contribution will not only help tie a bow on the Human Genome Project, but also inform research into diseases linked to the heterochromatic genome—chief among them cancer

So the TL;DR or ELI5 version is that this completion can help fight cancer. I had to wade through the article to find out why we would want a complete sequencing. Any other non-obvious things we can do after this? Like perhaps life extension or other diseases we can cure?


Cancerous cells can have a rapidly changing genome, with the heterochromatin “glue” between genes playing an important role in this, including altering how much those genes get expressed.

This work has sequenced the “glue” so we know what’s supposed to be there and can better understand what’s different about cancerous cells beyond the usual gene mutations.


Literally any genetic disease (or shortcomings like aging) could have missing facets hidden in these newly-complete regions of the genome. It's kind of the same reason you would want a complete anything, it's not ideal to go hunting for knowledge while blinded to a nonrandom 8% of the territory.


"Cancer" is the gimme-funding word. It is quite doubtful that this work will enable much better cancer treatment than what we already had.

But it's science. Nobody knows what might come out, which is really the point.


It is infrastructure!

What can you do with a road or a bridge?


Is it free?


The data from the project is released to the public domain (CC0). The research article is also free to access.

See https://github.com/marbl/CHM13 and https://www.science.org/doi/10.1126/science.abj6987.



The 2013 Supreme Court case Association for Molecular Pathology v. Myriad Genetics, Inc. says yes. Although I would guess the right to read the results of the study is not necessarily free.


I think most people ship with a copy from birth.


Yeah, but it's in compiled form. The average person does not possess the tools or skills to read the code and see what it does or modify its behavior.

We need to stop using proprietary genomes. Free genomes, free society.


I'm pretty sure there are a bunch of very popular, high-traffic sites with tons of content that demonstrates the build process.


This guy's interested in reading the make file. Those sites just show people running make.


There are sites to cater to those who prefer reading about the build process, too. Uh, so I hear.


Can't wait for v2 to ship. Maybe it will have drivers for the latent psychic hardware.



I would settle for a plugin API and a decent man file


now at long last we can be... better


when can I grow a second set of arms?


That's just the Homeobox[1] genes. They're actually incredibly simple given their complex function.

[1]: https://en.wikipedia.org/wiki/Homeobox


If you're in the US you have the Constitutional right to bear's arms. I'd choose grizzly bear, or maybe panda.


But no right to bear claws, so you have to find your own bakery.


Excellent, now we can watch it change!


if the planet shuts down today and we all melt into a hydro-carbon haze, then, this will be why. project complete


It is unfortunate we are still misdirecting funds to fruitless endeavors like genetics. As we know, genetics has little influence on your individual biology or behavior, as race is a social construct and each human shares 99.9% of their DNA with each other. Further, hegemonic tools of assigning assumed traits to people, like gender and IQ, are also social constructs, so any connection they have to genetics is moot. If we are truly interested in having a diverse, equitable understanding of people, we should instead invest in efforts that actually seek to understand them as people: therapy, rehabilitation, and decolonization work.

https://www.discovermagazine.com/planet-earth/race-is-real-b...

https://en.m.wikipedia.org/wiki/Race_and_genetics#Race_and_h...

https://www.independent.co.uk/news/science/iq-tests-are-fund...


If you do a few basic searches on the applications of this knowledge then you will see that the vast majority of research & benefits that have built off of it have nothing to do with anything you mentioned here.

Information about the benefits of genomic research is trivially easy to find. To get you started in general, check out [1] below. For one of the most prominent examples, the way it fundamentally transformed cancer research, check out [2].

[1] https://www.google.com/search?q=genomic+research+application...

[2] https://www.icr.ac.uk/news-features/latest-features/how-the-...


> As we know, genetics has little influence on your individual biology or behavior, as race is a social construct and each human shares 99.9% of their DNA with each other.

The first part of your sentence doesn't follow from the second despite the second part being true. (well… the "99.9%" part is pop science bullshit.)

Genomes do run in families, that being the whole point of them. And they certainly do have health effects whether they're shared with everyone or unique to you.



