Researchers generate complete human X chromosome sequence (genome.gov)
123 points by mglauco on July 15, 2020 | 48 comments



Dumb question: is there an x_chromosome.txt with the sequence in order? Why do geneticists not talk about it this way?


There is! You can find the current "agreed upon" human genome reference segmented by chromosome here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/. (It's not the assembly that's described in the article here.)

People do talk about the genome and its elements using the location by chromosome number and range like you'd describe an index in a string. There has even been special notation developed to do so [1]. However, it depends on _how_ you're looking at biology.

I think an analogy would be: you can describe all code as machine code, but when there are higher level abstractions you wouldn't choose to do so.

[1]: https://en.wikipedia.org/wiki/Locus_(genetics)
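If you want to play with this "indexing into a string" view yourself, here's a minimal Python sketch (assuming you've downloaded and gunzipped chrX.fa from the UCSC link above; the coordinates are arbitrary, and note that locus notation is usually 1-based while Python slices are 0-based):

    # Load the chromosome as one long string and slice it by coordinates,
    # the way a locus like chrX:153,000,000-153,001,000 names a range.
    with open("chrX.fa") as f:
        f.readline()                                # skip the ">chrX" header line
        chrx = "".join(line.strip() for line in f)  # one long A/C/G/T/N string

    start, end = 153_000_000, 153_001_000           # arbitrary example range
    print(len(chrx))          # roughly 156 million letters for GRCh38 chrX
    print(chrx[start:end])    # the 1 kb of sequence at that locus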


It’s a good question. The answer is no, before this study we didn’t have a gapless “x_chromosome.txt”. We did have 97% of it, but there were parts that were missing here and there. In fact, because the answer is no - which admittedly probably seems wild - this work is very important.

Now, there are much more sophisticated answers, and downstream points to be made about graph genomes instead of a reference, etc (which would also get to your point about why geneticists don’t talk about it this way). But, that’s a broader scope.


At a certain level of abstraction, we can treat it that way and it is good enough for many use cases. In biological and physical reality, no.

Each human started with between 1 and 5 copies of the X chromosome. Those copies are different in various ways. Many of the differences are single nucleotide variation: identical in a region but with a single letter changed. There are also tandem repeats, where a sequence like CAG might occur once or dozens of times in a row. (Counting the number of repeats like this is often used for DNA fingerprinting.) There is also ample larger-scale structural variation, which includes whole regions of the genome present or absent in one copy or another, or maybe copied multiple times in a row, or moved in from another chromosome, or reversed.
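As a toy illustration of the tandem-repeat counting, here's a minimal Python sketch (made-up sequence and function name; real repeat genotyping is of course much more involved):

    import re

    def longest_cag_run(seq):
        # longest run of consecutive CAG units anywhere in the sequence
        runs = re.findall(r"(?:CAG)+", seq)
        return max((len(r) // 3 for r in runs), default=0)

    print(longest_cag_run("TTACAGCAGCAGCAGGGC"))  # 4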

Complicated enough? On top of that you have to add the fact that there are trillions of cells in each human and in those trillions of cells you will have many slightly different copies of the original 1 to 5 X chromosomes from when that human was a single-cell organism. You will definitely have changes at the ends of the chromosomes, the telomeres, as they are made up of variable tandem repeats. You'll also have single nucleotide mutations, and if you're unlucky, bigger changes. On some chromosomes (not chromosome X), there's also V(D)J recombination, where our immune "memory" is actually encoded in changes to genome sequence in particular cells. Cancer or a pre-cancerous syndrome will increase the frequency and severity of these changes.

If you want to sequence a whole chromosome you have to contend with the fact that the most accurate methods for sequencing generally give you reads of 1000 nucleotides or less each, and you have to assemble them together. People liken the problem to putting together a jigsaw puzzle, but it's not like assembling a jigsaw puzzle from a single box. It's more like taking hundreds of boxes of supposedly the same jigsaw puzzle (but in reality with small changes that make things fit together not quite right), dumping them all in a pile, randomly removing a bunch of the pieces, and then trying to figure out how everything fits together. Also, there are many parts of this puzzle with identical artwork that fit together interchangeably! Good luck!
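To see the "identical artwork" problem in miniature: two sequences that differ only in how many times a repeat occurs produce exactly the same set of short reads (toy 12-base flanks and 12-base reads, nothing like real data):

    def read_set(seq, k=12):
        # every read of length k that could come from this sequence
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    left, right = "ATGGTACCTGAC", "GGATCCTTAGCA"
    genome_a = left + "CAG" * 10 + right   # 10 copies of the repeat
    genome_b = left + "CAG" * 20 + right   # 20 copies of the repeat

    # True: short reads alone cannot tell the two apart
    print(read_set(genome_a) == read_set(genome_b))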

Scientists have been applying a lot of ingenuity to this puzzle for decades and getting a whole chromosome assembly like this is a big milestone.


The article doesn't mention it, but now I'm curious what kind of read lengths and error rates they're achieving. This could have huge impacts across all sequencing.


It looks like they're using Oxford nanopore and PacBio sequencing technologies for the long reads. These are two up-and-coming sequencing technologies focused on extremely long reads. My understanding of both is that their error rates on individual base pairs are too high to reliably determine the actual sequence on their own (something like 15% error rates). Typically the long reads from these technologies are used as a "scaffold" to resolve the large-scale structure of a DNA sequence, while another sequencing technology, usually Illumina, is used to resolve the actual sequence. (Illumina produces short reads, but it produces a lot of them, and the error rate is much lower, about 1%-5%.) In addition, since PacBio and Oxford Nanopore are very different technologies, I'm guessing that they probably have different "error profiles", so they probably partially cover for each others' deficiencies when you use both of them at the same time.

Note: Don't take any of the specific numbers above as gospel. These technologies develop extremely quickly, so it's quite likely that my knowledge of typical error rates is out of date.

In any case, here's the relevant quote from the original link (to phys.org), before it was changed to the less technical press release, which doesn't mention any specific technologies used:

"The new project built on that effort, combining nanopore sequencing with other sequencing technologies from PacBio and Illumina, and optical maps from BioNano Genomics. Using these technologies, the team produced a whole-genome assembly that exceeds all prior human genome assemblies in terms of continuity, completeness, and accuracy, even surpassing the current human reference genome by some metrics."


Illumina error rates are <<1% (~0.1%), whereas Nanopore with newer basecalling software is 5-10%. With UMIs you can get a consensus error that's also <<1%. The error profiles are indeed different: Illumina generally creates substitution errors, whereas Nanopore has trouble with "homopolymers" -- counting how many of the same letter occur in a row.
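For intuition on the consensus point, a grossly simplified sketch: combine several error-prone reads of the same molecule by per-position majority vote, which is roughly what UMI-based consensus does (real tools align reads first and handle insertions/deletions, which this toy version ignores; the sequences are made up):

    from collections import Counter

    def consensus(reads):
        # majority vote at each position; assumes the reads are already aligned,
        # equal length, and contain only substitution errors
        return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

    reads = ["ACGTTAGC",   # three noisy copies of the same molecule
             "ACGTTGGC",   # one substitution error
             "ACCTTAGC"]   # a different substitution error
    print(consensus(reads))  # ACGTTAGC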


Oxford error rates are up to 15%; they have optimized published runs that show 5% or even better, but in the real world the error rates are much closer to 15%. However, Oxford read lengths can be absolutely massive compared to even PacBio. PacBio's sequencing is actually much more accurate than Oxford's, but read lengths top out at about 15,000 bases, I think. Illumina read lengths are a bit less than 100 bases, but the systems are massively parallel as compared to both PacBio and Oxford.


I don't think you can call PacBio up-and-coming at this point, but nanopore certainly is.

And those error rate examples are way too high: Illumina is closer to Q30, which is a 1/1000 error rate [0]. 15% would result in an unusable sequence.

https://emea.illumina.com/science/technology/next-generation...
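For anyone unfamiliar with the Q30 shorthand: Phred quality scores map to error probabilities as p = 10^(-Q/10), so Q30 means a 1-in-1000 chance that a base call is wrong:

    def phred_to_error(q):
        return 10 ** (-q / 10)

    print(phred_to_error(30))  # 0.001
    print(phred_to_error(20))  # 0.01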


The sequencer may report a quality score of 30, but that doesn't guarantee that the error rate when you align to the genome will actually be 1/1000. Still, you're right that good quality Illumina data can do significantly better than 1% error rate. You can't always get "good quality" data, but I imagine that the researchers on this project probably could, given the well-controlled experimental setup.

And yes, a 15% error rate does result in a sequence that is unusable for the purposes of actually knowing the sequence. But a bunch of really long reads with 15% error can still be used to resolve the large-scale structure of a sequence, and then the lower-error-rate Illumina reads can be aligned onto this large-scale scaffold in order to resolve the actual sequence. At least, this is my understanding of how these technologies are typically used together, and given the mention of PacBio, nanopore, and Illumina, that seems to be what was done in this case.


Yes, the higher-error reads are used for alignment, and even then too high an error rate in a very repetitive region (especially depending on the error type: misreads vs. skipped bases, etc.) makes it too challenging to build a scaffold to align your Illumina reads against.

As of 2018, the error rate for alignments with nanopore was around 3-6 percent:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6053456/


15% is quite outdated. There have been major updates to nanopores and software. Typical single read error rate is less than 5% these days.

Single read accuracy is not as important for such projects. As coverage gets to 50-60X, expected assembly accuracy is Q30 on human.
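A back-of-the-envelope way to see why coverage rescues noisy reads, with the big caveat that it assumes independent substitution errors and perfect alignment, which nanopore's systematic homopolymer errors violate:

    from math import comb

    def majority_error(p, n):
        # probability that more than half of n reads carry an error at a
        # position, if each read is independently wrong with probability p
        k0 = n // 2 + 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

    print(majority_error(0.05, 50))  # ~5e-21: vanishingly small, which is why the
                                     # practical floor comes from systematic errors
                                     # (e.g. homopolymers), not independent ones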


The "ultra long" nanopore reads used in this study are often greater than 100kbp in length and occasionally up to 1Mbp


In the video he mentions Oxford Nanopore tech and using reads of 100,000 to 1,000,000 base pairs.


In simple English, could someone explain why this is good and what it all means?


From the article: "Repetitive DNA sequences are common throughout the genome and have always posed a challenge for sequencing because most technologies produce relatively short "reads" of the sequence, which then have to be pieced together like a jigsaw puzzle to assemble the genome. Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are. . . . Filling in the remaining gaps in the human genome sequence opens up new regions of the genome where researchers can search for associations between sequence variations and disease and for other clues to important questions about human biology and evolution."

Or, to make this simpler: Finding the complete DNA sequence of chromosomes is difficult. That's because some parts of the sequence are highly repetitive. Using a new type of lab machine, the scientists were able to sequence the repetitive parts of the X chromosome. This gives a more complete picture of the X chromosome. And that can help scientists fight diseases and understand human biology better.


So 20 years ago when we "sequenced the human genome", we actually didn't? If you'd asked me whether or not we had a complete sequence of an X chromosome before I saw this I would have said, "Of course we have one, for over 20 years".


So much about the original announcements was overhyped PR. The original assembly was super-crappy and super-gappy. The folks running the two projects were exhausted and declared victory, then moved on.

Of all the fields that I've worked in, genomics has been one of the most overhyped (virtual drug discovery is the other) and it takes a ton of training just to understand how messed up the field is.


Wow, this is news to me. What a farce haha. Thanks for adding clarity here and breaking a bad assumption I had!


That's right -- sequenced genomes are typically assemblies of short fragments. The assembly algorithms fail in areas of low complexity or when large sequences are repeated an unknown number of times.


So what DID we achieve 20 years ago? Like, what happened then that led to the claim of "human genome has been sequenced" and what is the additional progress that was made now?


What was announced 20 years ago was an incomplete assembly that met certain metrics that made sense at the time. https://www.nature.com/articles/35057062 is the paper. It describes the assembly as a "partial draft".

The section "Background to the Human Genome Project" gives some color on why they did what they did (TL;DR there was an ostensibly competitive race between the public project and a private one).

I ended up providing some useful resources for helping uncover just how bad genomic assemblies were (at the computational level): most genomic assemblies using whole genome shotgun sequencing relied on a number of heuristics which were believed to be correct, but I suspected that the heuristics failed to deal well with repetitive regions and short sequences. So I built a computing system with >1M Xeon cores (Google Exacycle) and we provided the system to Gene Myers (who did the original WGS assembly for Celera). He used the system to do an all-vs-all comparison of sequence pairs, which found numerous bugs and problems with the heuristics that were being used. It was a huge amount of compute, but the result was that Myers was able to use PacBio data to assemble a significantly better genome, faster, on a laptop: (https://www.yuzuki.org/favorite-talk-agbt-2014-gene-myers-ma...)


Ultimately, is there anything you could do with repetitive regions? Like, were the problems with the heuristics or just a mismatch between ~100 base reads and multi-kb repeats?


Most people have moved to techniques that produce longer reads with higher error rates. Good coverage + longer reads overcomes the higher error rate. You can read the DALIGN paper by Myers to learn more, or read https://dazzlerblog.wordpress.com/author/thegenemyers/


For a good read, I recommend The Gene: An Intimate History by Siddhartha Mukherjee


> Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are

Ah, https://en.wikipedia.org/wiki/Clock_recovery !

Too bad DNA isn't a run-length limited code. (Wouldn't that be something.)


DNA is a code, but it's not just a code. It is also a really long molecule that has to bend and fold up in certain ways.

There are error-detecting codes, in a way. Proteins are encoded by three-base codons, and if you insert or delete bases not in a multiple of 3, the reading frame is shifted; this will eventually, most likely, produce a stop codon and cause the bad protein to be truncated and likely removed via nonsense-mediated decay.
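A toy sketch of the reading-frame point (made-up 15-base "gene"; TGA is one of the three stop codons):

    def codons(seq):
        # split into 3-base codons, dropping any trailing partial codon
        return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

    gene = "ATGGCCGAAACCTGA"
    print(codons(gene))                 # ['ATG', 'GCC', 'GAA', 'ACC', 'TGA']  (in-frame stop)
    print(codons(gene[:4] + gene[5:]))  # ['ATG', 'GCG', 'AAA', 'CCT']  (one deleted base
                                        # shifts every downstream codon and the stop is lost)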


It's crazy to me that we still haven't hacked RNA polymerase or some other such "obvious" method to just linearly read all of the strand. The machinery is all there, by definition!


Most sequencing methods do use DNA polymerase to effectively copy the DNA while reading out the bases that are being incorporated. However, even when the genome is being replicated in vivo, the polymerase doesn't just copy the entire chromosome in one go from end to end, for various reasons. For example, for larger genomes such as those of animals, it would just take too long to copy it that way. But even for small genomes such as bacteria, DNA replication is still naturally done in fragments. One simple reason is that at any given time the polymerase can dissociate from the template and have to reattach. Even if you could engineer the polymerase to bind more tightly, you'd have to deal with the tradeoff between binding strength, replication rate, and error rate (i.e., a more tightly binding polymerase would likely copy more slowly).


Imagine you have a bunch of aerial photographs that you're trying to assemble into one big mosaic of the entire region by matching up the overlaps on the edges. The problem is, this is a desert, and significant portions of the region are just flat expanses of empty sand, so all the aerial photographs from those regions look pretty much identical. Even worse, there's several of these flat sandy regions in the area, so you don't even know which region those photos came from.

So despite having taken multiple photos of every square inch of land in your target area, there's no way you can assemble them into one big image just by matching up the overlaps. Without a source of larger-scale information about the region, like a satellite photograph or GPS coordinates for the photos, you have no way of knowing how wide that desert is. All you know is that it's wider than one or two photographs.

This is essentially the same problem that current genome assemblies have: there are regions of repetitive sequence in the genome, so all the sequencing reads from those regions look identical to each other, just like the photographs of flat sandy desert, and there's no way to tell how they're supposed to overlap to form the full sequence. The only way to resolve these regions is with a technology that can read all the way through from one end to the other without stopping, producing a single contiguous sequence.

The link here describes the fruits of an effort using exactly those sorts of long-read technologies to fill in all the gaps in the X chromosome sequence, thus generating a single contiguous sequence from end to end, something that hasn't previously been possible for DNA sequences of this size.

As to why this is important, these repetitive sequences, despite being apparently featureless, still sometimes have important effects (not unlike the apparently dead and featureless desert in the analogy). In addition, sometimes there are "oases" of functionally important non-repetitive DNA sequence within the "desert" of repetition, and previous genome assembly methods would not be able to tell where these oases belonged. All of this is important because many functional DNA elements are cis-acting. That is, they exert effects on genes that are nearby on the genome. So if you don't know where they belong, then you don't know what they're doing.

If you can assemble one big chromosome sequence from end to end, all of the above problems go away, and you can finally get on with the analysis you wanted to do anyway and stop worrying about not being able to calculate meaningful distances between DNA elements.


Having a complete and accurate map is important for researchers who study the X-chromosome.

Long sequences reads allowed them to map the highly repetitive chromosome. Most sequencing is done by high-throughput short reads.

The technology can theoretically be used to map other regions of the genome which are highly repetitive.


I thought the human genome was mapped in 2003, when the Human Genome Project wrapped up: https://en.wikipedia.org/wiki/Human_Genome_Project

What's different about this?


https://news.ycombinator.com/item?id=23852177 explains it, with a quote from the article. They weren't able to map the whole thing, because of repeating patterns. That's now starting to change.


There is a space missing in the title ("X chromosome").


OP probably ran into the HN character limit.

Edit: Upon testing, that appears to not be the case. Probably a typo.


So how do you actually isolate one chromosome to sequence it?


Their github has lots of information about what they do:

https://github.com/nanopore-wgs-consortium/chm13


Step 1: the researchers use a cell line in which the genome is effectively haploid; in particular, both copies of the X chromosome in this cell line are identical. This is part of why the researchers chose to look at the X chromosome.

> To circumvent the complexity of assembling both haplotypes of a diploid genome, we selected the effectively haploid CHM13hTERT cell line for sequencing (abbr. CHM13)

Incidentally, they do capture the other chromosomes in this process:

> Several chromosomes were captured in two contigs, broken only at the centromere (Fig 1a).

> https://www.biorxiv.org/content/10.1101/735928v3.full.pdf

Step 2: Follow a procedure for DNA prep that results in long stretches of DNA (though not an entire chromosome-length) and amplify (make multiple copies of) the mixture, per this reference:

> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889714/

Step 3: Run the mixture through a nanopore sequencer (essentially a hole a few nanometres across), reading the change in current in response to the different bases, including methylated bases.

Step 4: Repeat this many times to get multiple reads of each region of the genome (a quick sanity check of the quoted coverage numbers is at the end of this comment):

> In total, we sequenced 98 MinION flow cells for a total of 155 Gb (50× coverage, 1.6 Gb/flow cell, SNote 2). Half of all sequenced bases were contained in reads of 70 kb or longer (78 Gb, 25× genome coverage) and the longest validated read was 1.04 Mb.

Step 5: Overlay the data from the long measurements

> Once we had collected sufficient sequencing coverage for de novo assembly, we combined 39× of the ultra-long reads with 70× coverage of previously generated PacBio data [18] and assembled the CHM13 genome using Canu [19]. This initial assembly totaled 2.90 Gbp with half of the genome contained in contiguous sequences (contigs) of length 75 Mbp or greater (NG50), which exceeds the continuity of the reference genome GRCh38 (75 vs. 56 Mbp NG50).

> The read was placed in the location of the assembly having the most unique markers in common with the read. Alignments were further filtered to exclude short and low identity alignments. This process was repeated after each polishing round, with new unique markers and alignments recomputed after each round.

Step 6: Check up the data against the reference genome:

> The corrected contigs were then ordered and oriented relative to one another using the optical map and assigned to chromosomes using the human reference genome.

> The final assembly consists of 2.94 Gbp in 590 contigs with a contig NG50 of 72 Mbp. We estimate the median consensus accuracy of this assembly to be >99.99%.

Essentially, this work closes up difficult-to-read gaps in the reference genome ( https://en.wikipedia.org/wiki/Reference_genome#Human_referen... )
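And as a quick sanity check on the coverage numbers quoted in step 4 (coverage is just total sequenced bases divided by genome size, roughly 3.1 Gb for a human):

    total_bases = 155e9    # 98 MinION flow cells, 155 Gb total
    ultralong   = 78e9     # bases in reads of 70 kb or longer
    genome_size = 3.1e9    # approximate human genome size

    print(total_bases / genome_size)  # ~50x, matching the quoted figure
    print(ultralong / genome_size)    # ~25x, matching the quoted figure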


Great write up! Thanks!

Regarding step 1, how can any human have an entirely homozygous X chromosome?

Also/rather why not just use a male with one X chromosome?


There are long regions on the Y chromosome that are very similar to the X chromosome, which would make the analysis difficult:

https://en.wikipedia.org/wiki/Pseudoautosomal_region


In the good old (?) days some chromosomal libraries were constructed by using flow sorting. Not sure how often this is being used nowadays for genome/chromosome sequencing projects.


One minor correction: in step 2 the DNA is not amplified, as this would reduce the fragment length and also lose the methylation information.


Good point. Here they're starting from a cell line, so presumably just starting with as much DNA as they can get from the cells. Amplification is usually needed in other scenarios where the sample is more finite, though from what I've read, nanopore sequencing tech doesn't need much DNA.


I don’t believe you do.


When I read the title my first thought was "Scientists found complete assembly language of human Xchromosome".


go banana slugs!


The press release is better than the link:

https://www.genome.gov/news/news-release/NHGRI-researchers-g...




