Genomics – A programmer’s guide (gist.github.com)
274 points by andy-thomason on May 17, 2019 | 51 comments


Along with the annual BOSC (Bioinformatics Open Source Conference), the OBF (Open Bioinformatics Foundation) hosts a free and welcoming collaborative event called CollaborationFest (CoFest).

CoFest is a collaborative two-day working session. The only requirement for attendance is that you have an interest in open source software and solving scientific problems. We will have contributors to open source bioinformatics tools present to collaborate with, and we welcome new attendees who want to learn and contribute to open source code, documentation, workflows, or training.

This year CoFest is July 26-27 in Basel, Switzerland. There is no registration fee to attend in person or virtually.

Disclosure: I'm on the organizing committee for BOSC and CollaborationFest

[1] https://www.open-bio.org/events/bosc/collaborationfest/


For a more realistic and comprehensive guide to genomics from a data science perspective see the "Biostar Handbook"

https://www.biostarhandbook.com/

The book is inspired by "Biostars", the StackOverflow-like Q&A site for genomics:

https://www.biostars.org/


I often recommend Aaron Quinlan's excellent course if people want to learn a more hands-on and practical set of skills related to computational genomics and bioinformatics.

https://github.com/quinlan-lab/applied-computational-genomic...


More genetics than genomics, but a cool introduction [1]!

There's a lot of subtlety in genomics analysis, and learning about all of it is a deep dive into chemistry, biology, and statistics, as well as a lot of literature search.

One fun fact about this is that most of the data in genomics is some form of plain text table, because this makes interoperability between different programs easier. Many people have tried to create more efficient formats, but their application is usually limited to very specific use cases.

[1] https://en.wikipedia.org/wiki/Genomics


Yes, this is a deep subject and "Genetics" is indeed the correct term. Our company is called "Genomics" because of the origins of the founders in the Oxford stats department and Big Data Institute.

Both Gil and Peter have done some incredible things in this space, but the language of scientists can be difficult for a lay audience to understand.


The hardest part of genomics for me has honestly been figuring out which poorly maintained open source tool I should use for a particular problem, which options it should be run with, and how the data need to be preprocessed beforehand.

I mean has anyone ever actually read the documentation of the GATK? It is famously dreadful. And that's professionally maintained.

Honestly a nice addition here would be a "so you want to" with snippets of raw FASTQ or VCF data and working code for various operations, maybe with an accompanying Docker container.
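
As a minimal sketch of the kind of snippet suggested above: a FASTQ file usually stores each read as four lines (a header starting with `@`, the bases, a `+` separator, and one quality character per base). The two records and the function name `parse_fastq` are made up for illustration, and the code assumes the strict four-line layout rather than wrapped sequences.

```python
import io

# A made-up two-record FASTQ snippet, for illustration only.
FASTQ_TEXT = """@read1
GATTACA
+
IIIIIII
@read2
ACGTN
+
II#I!
"""

def parse_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored in this sketch
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], seq, qual

records = list(parse_fastq(io.StringIO(FASTQ_TEXT)))
for name, seq, qual in records:
    print(name, seq, len(qual))
```

Real files are usually gzip-compressed and far too large to hold in a list, so a generator like this would normally be consumed one record at a time.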


I feel like ADAM (https://github.com/bigdatagenomics/adam) is a huge step in the right direction. You convert from standard genomics formats to Parquet and then work with the resulting data in Spark with genomics-specific libraries.

My experience has been that translating domain data into Spark gives a 100X improvement in data analysis.


> I mean has anyone ever actually read the documentation of the GATK? It is famously dreadful.

Reference for "famously"?


TRUWL had a poster at the Biology of Genomes conference last week. Sounds like they're working on this problem. I hope they succeed, because it really needs to be solved.

[0] https://truwl.com/


Have you ever looked at the test suites for Picard? All regression tests, and the library is OO hell, lol.

I was taught a decade ago that rolling your own in genomics isn't as bad of a decision as it seems.


>I was taught a decade ago that rolling your own in genomics isn't as bad of a decision as it seems.

Famous last words.


This is sadly very true, especially if you have any real software training.


Genetics is an extremely fun topic but the learning curve for entry has been non-trivial. It reminds me of the struggles I had when I was first learning to program recursive functions. With that said, it's been far, far more complicated than recursion. So, thank you for the resource, I'm always happy to find content sources that are helpful on the subject. If you enjoy hard CS problems, bioinformatics has been booming and I don't really see it slowing down anytime soon.

As for some resources, these books have been the most helpful for me.

[0] https://www.amazon.com/Molecular-Biology-Cell-Bruce-Alberts-... // I've linked and have seen mentioned here in the past, a great intro to cell biology.

[1] https://www.amazon.com/System-Modeling-Cellular-Biology-Conc... // Modeling with math. I would consider this high-yield as it gave me a great deal of insight into what different code bases are actually attempting to do.

[2] https://www.amazon.com/DNA-Nanoscience-Prebiotic-Emerging-Na... // Blew my mind when I first read it but the content is pretty standard as far as genetics course material goes.


For those of you interested in bioinformatics I can recommend the Rosalind project [1]. It's like Project Euler, but for bioinformatics.

[1] http://rosalind.info/problems/locations/


If anyone is interested in playing with a full 23andMe raw data file (VCF), I have mine on GitHub: https://github.com/blopker/DNA PRs welcome!

If you're also interested in working on this stuff, shoot me an email ;) blopker@23andme.com


Keep in mind that companies like 23andMe and Ancestry typically only provide a small fraction of your genome (the parts we currently consider most important, which is a moving target). If you want your whole genome you'll need to go with something like Dante Labs (~200-300 USD during sales).


For academic use, UK Biobank, 1000 Genomes and other resources offer variants for large groups, my wife included.

You may (at your own risk) take a look at

https://opensnp.org/


Very cool, thanks for the resource. I see I can upload my own data there too. Probably a more useful place for it than GitHub!


I'm not an expert but AFAIK VCF isn't "raw" data. Raw data would be the output of a sequencer (fastq) which would be several gigabytes. I recently processed raw data from sequencing a tiny virus (~20k base pairs) and it was around 13GB. Human genome sequence data would probably be tens if not hundreds of GB.
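
A back-of-the-envelope sketch of those sizes. The inputs are assumptions, not measurements: a human genome of about 3.1 billion base pairs, 30x coverage, and roughly two bytes per sequenced base in uncompressed FASTQ (one for the base call, one for its quality character), ignoring header lines.

```python
def fastq_size_gb(genome_bp, coverage, bytes_per_base=2.0):
    """Rough uncompressed FASTQ size in GB: one byte per base call plus one per quality score."""
    return genome_bp * coverage * bytes_per_base / 1e9

def implied_coverage(file_bytes, genome_bp, bytes_per_base=2.0):
    """Average coverage implied by a FASTQ file of the given size."""
    return file_bytes / (genome_bp * bytes_per_base)

# Assumed human genome size and coverage.
print(f"human at 30x: ~{fastq_size_gb(3.1e9, 30):.0f} GB uncompressed")

# The 13 GB / ~20k bp virus example from the comment above.
print(f"virus coverage: ~{implied_coverage(13e9, 20_000):,.0f}x")
```

Under these assumptions a small genome with a multi-gigabyte file simply means very deep coverage, while a human genome at ordinary coverage lands in the low hundreds of GB before compression.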


23andMe does genotyping not sequencing


Ah.. okay. I didn't know that. Yes, that makes sense now that I think about it. They won't be able to do full sequencing at their current price.


Sorry, but that's not a vcf. It's a tsv of genotypes. Here's the spec for VCFs: https://samtools.github.io/hts-specs/VCFv4.3.pdf
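
To make the distinction concrete, here is a minimal sketch. Per the spec, a VCF data line has eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), whereas a 23andMe-style export, as far as I know, has four (rsid, chromosome, position, genotype). Both example lines and the function name `parse_vcf_line` are invented for illustration.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns (sample columns are ignored)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("not a VCF data line: expected at least 8 tab-separated columns")
    record = dict(zip(VCF_COLUMNS, fields))
    record["POS"] = int(record["POS"])        # positions are 1-based
    record["ALT"] = record["ALT"].split(",")  # ALT may list several alleles
    return record

# Invented example lines, not real data.
vcf_line = "1\t12345\trs0000001\tG\tA,T\t50\tPASS\tDP=100"
genotype_tsv_line = "rs0000001\t1\t12345\tAG"

record = parse_vcf_line(vcf_line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])

try:
    parse_vcf_line(genotype_tsv_line)
except ValueError as err:
    print("genotype TSV rejected:", err)
```

The key difference is that the VCF line states the reference and alternate alleles explicitly, while the genotype table only reports what was observed at each probe.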


How far are companies like 23andMe from entire genome sequencing? That's kind of what I'm waiting for. Can you still get valuable data from genotyping?


You can get valuable data from genotyping. SNPs contain the bulk of variation between you & me.

For your first question it depends on your definition of "companies like 23andMe". There are numerous companies that'll do a whole genome for you, but I don't know if any of them do the writeup about it that 23andMe provides. 23andMe did at one time offer an exome product, but stopped that a while back.

The largest hurdle is cost. Whole genomes, even exomes, are significantly more expensive than a SNP chip. As most would-be users don't know enough to care, it doesn't make much economic sense to offer those to the masses at the moment.


Actually about half of variation is private (not common) and commercial services will only look for common SNPs. So you will have some unique variants that would show up in a whole genome but not a SNP test.


True. I was trying to say that your average lay person is unlikely to understand the difference well enough for it to be a big deal.


Veritas Genetics offers full-genome sequencing for 1000 USD. I purchased their kit for 200 in a limited-time offer. Unfortunately, although I sent it in early February, there is a backlog which is delaying my results until late this summer.


I hope you got all of your kids and relatives to sign off on that, if you have any.


That's a useful guide. A description of how sets of three letters translate to amino acids or stop commands would be handy, because that bit is quite mind-blowing and also quite reminiscent of machine code. And from there you can explain different sorts of mutation, like truncation, substitution and frameshift.
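
A rough sketch of that idea, using the standard genetic code written as DNA codons. The example sequences are made up, and `translate` is a hypothetical helper that reads from the first base, stops at the first stop codon, and ignores any trailing partial codon.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA codon by codon until a stop codon ('*') or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

original = "ATGGCCTGGTAA"      # ATG GCC TGG TAA
substitution = "ATGGTCTGGTAA"  # GCC -> GTC changes one amino acid
truncation = "ATGGCCTGATAA"    # TGG -> TGA introduces an early stop
frameshift = "ATGCCTGGTAA"     # one base deleted, every later codon changes

for label, seq in [("original", original), ("substitution", substitution),
                   ("truncation", truncation), ("frameshift", frameshift)]:
    print(f"{label:12s} {seq:13s} {translate(seq)}")
```

The machine-code analogy holds up reasonably well here: a fixed-width opcode table, a start marker, and a halt instruction, where deleting a single symbol misaligns everything downstream.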

Also a guide to how to use all that to interpret the medical nomenclature of mutations, like c.345G>E, would be handy.


> Also a guide to how to use all that to interpret the medical nomenclature of mutations, like c.345G>E, would be handy.

Those mutation descriptions are called HGVS (Human Genome Variation Society) nomenclature. In the example you give, "c." means that it's in a (protein) coding region, 345 is the position within the region, and G>E would be the change (although E isn't a valid "letter" in DNA sequence, even if you allow ambiguity codes -- you'd normally see something like G>T there instead).

Complications include:

1) You need to know which gene this is relative to.

2) The "coding sequence" for the gene isn't always perfectly defined, due to splice variation and different versions of the annotation. Ideally, you'd see this code relative to a specific splice variant (which might have an ENST identifier, from http://www.ensembl.org/). But it depends...

More at http://varnomen.hgvs.org/ if you're curious.
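
A minimal sketch of parsing the simplest case described above, a single-base substitution in coding-sequence coordinates. The regular expression and the function name `parse_simple_hgvs` are my own illustration and cover only this one form; real HGVS descriptions also include a reference sequence prefix, intronic offsets, deletions, insertions and much more.

```python
import re

# Matches only descriptions like "c.345G>T": position, reference base, new base.
SIMPLE_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")

def parse_simple_hgvs(description):
    """Parse a coding-DNA substitution such as 'c.345G>T' into its parts."""
    match = SIMPLE_SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple coding substitution: {description!r}")
    position, ref, alt = match.groups()
    return {"position": int(position), "ref": ref, "alt": alt}

print(parse_simple_hgvs("c.345G>T"))

try:
    parse_simple_hgvs("c.345G>E")  # 'E' is not a DNA base
except ValueError as err:
    print(err)
```

Note that the parsed position is meaningless on its own: as the complications above say, it only identifies a base once you also know the gene and the specific transcript it is counted against.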


How to represent variants is a whole can of worms. There are a number of competing systems.

* RSIDs (from dbSNP https://www.ncbi.nlm.nih.gov/snp/)
* HGVS, as mentioned.
* Ensembl chrom-pos-ref-alt (CPRA).
* Variant key (Nicola Asuni)

As dasmoth says, there is no fixed coding sequence for a gene or location in the genome.
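
As a rough sketch of the chrom-pos-ref-alt idea: a variant can be treated as a four-field tuple and used as a lookup key to compare call sets. The `Variant` type, the `-` separator in `key()`, and the example variants are all assumptions for illustration, not any tool's actual format.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position on the reference
    ref: str
    alt: str

    def key(self):
        """Hypothetical string key; the separator is an arbitrary choice."""
        return f"{self.chrom}-{self.pos}-{self.ref}-{self.alt}"

# Invented call sets from two hypothetical pipelines.
calls_a = {Variant("1", 12345, "G", "A"), Variant("2", 500, "C", "T")}
calls_b = {Variant("1", 12345, "G", "A"), Variant("2", 500, "C", "G")}

shared = calls_a & calls_b
print(sorted(v.key() for v in shared))
```

Such a key is only comparable across sources if both use the same reference assembly and the same normalization of indels, which is part of why this is a can of worms.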


NCBI (Natl Center Biotech Info) and other related hackathons, https://biohackathons.github.io/


If you're a programmer/CS person interested in genomics/bioinformatics, I can't recommend UCSD's Coursera courses[0] enough.

[0] https://www.coursera.org/specializations/bioinformatics


Maybe it's off topic, but anyway:

I'm a CS student, and in my thesis I'll be working on an NGS C++ application. I need at least a brief introduction to "basic" sequencing but I'm struggling to find something accessible. Every book I find seems superspecialized. Right now I'm reading "Insect Molecular Genetics: An introduction to principles and applications", but I'd like to read just a book chapter that is a little bit more advanced than the contents shown in this video: https://youtu.be/ONGdehkB8jU

Any suggestions?


I studied Biochemistry/Comp Sci and the foundational biochemistry book, imo, is Lehninger Principles of Biochemistry. It goes over the basic biochemistry, and once you understand that, things just start to “make sense”. Once you have those basics you can read the Wikipedia article and things start to click.

On the other hand, as a person who’s worked on sequencing software I’ve found the biochemistry knowledge to only be incidentally useful - though I may be underestimating some of the “basic” assumptions that were used day to day.


>as a person who’s worked on sequencing software I’ve found the biochemistry knowledge to only be incidentally useful

I have the same feeling but I'm uncomfortable working on something knowing so little about it. I'll check out the book, thanks!


On the practical side, if you're working on a low level with NGS data, htslib[1] may be worth looking into. It is a C library for reading, writing, and manipulating data structures that are commonly used in NGS (BAM, VCF, etc). I have used it and can attest to its quality. However, as is the issue with all software related to genomics, its only documentation is its header files and example programs. Here is the very example I used to get started[2]. The comments in the header files are usually good enough.

The reason I'm recommending it is the quality of its interfaces. It can seamlessly handle (input or output) virtually any kind of file you throw at it (SAM, BAM, CRAM). I can't say the same for a lot of other software I have run into in this space.

[1]: https://github.com/samtools/htslib

[2]: https://gist.github.com/PoisonAlien/350677acc03b2fbf98aa


That video describes the process used before NGS was around. These days, using anything with plasmids would be pretty unusual.

There are several next generation sequencing technologies:

1) short read - Illumina - dominates most next-generation sequencing
2) long read - nanopore or PacBio.

These have very different analysis methods, have measurement errors that are very different, and even have different file formats, etc.

Short read is far more common, so you're probably in the "Data Analysis" of this:

https://www.youtube.com/watch?v=fCd6B5HRaZ8

But you need to know about the adapters and indices (how multiple samples can be sequenced at the same time).
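
A toy sketch of what the indices are for: each sample gets its own short index sequence, so pooled reads can be sorted back into samples afterwards (demultiplexing). Everything here is a simplifying assumption: the index is treated as the first four bases of the read, only exact matches count, and the sample sheet and reads are invented. On real instruments the index is often read separately rather than inline.

```python
from collections import defaultdict

# Hypothetical sample sheet: index (barcode) sequence -> sample name.
SAMPLE_INDEX = {"ACGT": "sample_A", "TTAG": "sample_B"}

def demultiplex(reads, sample_index, index_len=4):
    """Group reads by their leading index sequence; return {sample: [insert sequences]}."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:index_len], read[index_len:]
        # Reads whose index matches no sample are kept in a separate bin.
        bins[sample_index.get(barcode, "undetermined")].append(insert)
    return dict(bins)

# Invented pooled reads.
pooled = ["ACGTGGGG", "TTAGCCCC", "NNNNAAAA", "ACGTTTTT"]
by_sample = demultiplex(pooled, SAMPLE_INDEX)
for sample, inserts in sorted(by_sample.items()):
    print(sample, inserts)
```

Production demultiplexers usually tolerate a mismatch or two in the index, which is why index sets are designed to differ from each other at several positions.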

But as another commenter mentions, knowing some particulars about the project would really help know what sort of tutorial would be appropriate. You'll need to also know about the biology of the application, in addition to understanding the sequencing technology.


The program will work on FASTQ files. The sequencing technology produces long reads.

As another commenter said, I don't need superdeep sequencing knowledge because my work will mostly be on the programming side (enhancing performance, not adding new functionality), but anyway it could be useful to have a clear picture of the process.

Thanks for your help


Unfortunately I don't have many long-read resources to share, but here's a short video about the process for the MinION nanopore sequencer for long reads:

https://www.youtube.com/watch?v=Wq35ZXyayuU

At about 1:30 there's a cartoon of the data signals that get processed into sequencing data.

It's been a while since I looked at long read data, but last time I did, the individual base calls in FASTQ files (A, C, G, T) had a fairly high error rate, and there were systematic biases in the errors, which made them harder to correct. Most of the processing of these data is trying to correct these errors, either by looking at a known reference sequence or by sequencing many times.
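
For the programming side, the per-base error estimate is what the quality line of each FASTQ record encodes. A minimal sketch, assuming the common Phred+33 encoding (quality Q is the character code minus 33, and the estimated error probability is 10^(-Q/10)); the quality string below is made up.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Convert FASTQ quality characters to per-base error probabilities (Phred+33 assumed)."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]

def expected_errors(quality_string, offset=33):
    """Expected number of wrong base calls in a read: the sum of per-base error probabilities."""
    return sum(phred_to_error_prob(quality_string, offset))

# Made-up quality string: Q40, Q20, Q10, Q0.
quals = "I5+!"
for ch, p in zip(quals, phred_to_error_prob(quals)):
    print(ch, ord(ch) - 33, p)
print("expected errors:", expected_errors(quals))
```

These are the instrument's own estimates, so systematic biases of the kind mentioned above will not necessarily show up in them.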


What sort of sequencing data are you planning to process? Are you planning to re-implement algorithms used by bwa/samtools or come up with something on your own? NGS is a very specialized field, so it's very easy to get stuck in the weeds.


This is a nice intro. For a good collection of "worked out" "pipelines" to analyze different kinds of genomic data types (RNA-seq, ChIP-seq) in the R environment (the concepts are universal, even if you don't use R), take a look at Bioconductor:

https://www.bioconductor.org/packages/release/BiocViews.html...


There are many people looking for an introduction to genomics and NGS applications. One of the books I found to be extremely useful is Genomic Quirks (https://www.amazon.com/Genomic-Quirks-Search-Spelling-Errors...). This book explains genomic concepts with several case studies.

Here is a video by the author - https://www.youtube.com/watch?v=BfVo8EkeDVI


Unfortunately, the article handwaves the one thing I’ve been struggling with the most: chromosomes.

What is the relationship between chromosomes and the human genome?

I somewhat get that a chromosome is somewhat of a partition of the genome, but how does the „two copies“ phenomenon of the human genome and the „two copies“ thing of chromosomes fit together? Are those one and the same concept?

Are there two copies of the XY chromosome, too?


A chromosome is a physical object. You can see them under a microscope. In eukaryotes, they consist of a DNA double helix supercoiled and wrapped around big protein structures called nucleosomes, along with a bunch of chemical modifications of certain bits sticking off the nucleosome and of the DNA structure which are involved in a bunch of different functions in the cell. In bacteria and archaea, it's still supercoiled, and there are some proteins that are similar to nucleosomes in function, but the picture is much more diverse.

Animals tend to have the same number of chromosomes at all times, and they tend to come in pairs that are nearly identical. There are various ways of mapping chromosomes that yield unique fingerprints that are stable under the level of variation we typically see in a species (see restriction mapping for example), so we can take a particular fingerprint and call it chromosome 1 or 2 or whatever. Animals have two copies of chromosome 1 and two copies of chromosome 2, etc. There are individuals who don't, who have a single copy of one or extra copies, and this causes problems, such as Turner syndrome. Similarly, when animals reproduce, each parent produces a germ cell (sperm or egg) that has one of each of the chromosome pairs. One of the reasons that many hybrids like mules are sterile or nearly infertile is that their chromosomes, coming from different species, aren't in pairs, so when they pass on half of them, there may be necessary hunks of DNA that just aren't passed on.

Other species have different numbers of copies of chromosomes, and may vary. Depending on point in lifecycle and conditions, some plants range from two copies to hundreds. Dinoflagellates tend to have four for interesting reasons that I believe are related to the Byzantine generals problem.

There is no XY chromosome. Females in mammals follow the normal pattern with the X chromosome: they have two of them and they are passed on like any other chromosome. Males are weird. They have one copy of X and a copy of a shrunken chromosome called Y. Note that this is only mammals. Birds and reptiles have a totally different set of chromosomes for sex determination, and in some vertebrate orders the chromosomes don't fully determine sex. Incubation temperature often changes it.


This is a movie of a human kidney cell dividing [1].

The red chunks are its DNA. But they're 'chunky' because the 'whole genome' is partitioned into 23 chunks - each chunk is a 'chromosome'. And each chromosome comes as a pair. And in the movie you can see the cell split the pairs, where one set goes to one cell, and the other set goes to the other.

If you notice, in the cells just prior to 'condensing', the nuclei (red stuff) look kind of brain-like in their topology. Those are the chromosomes as well, just relaxed and spread out.

[1] https://www.youtube.com/watch?v=N97cgUqV0Cg


You have 46 chromosomes. Each chromosome is one massive contiguous molecule of DNA, which can be represented as a string of (up to) roughly 250 million letters ("nucleotides") drawn from {A, C, G, T}. Together, the whole human genome is roughly 3 billion nucleotides. Since you have two copies in each cell, you have 6 billion nucleotides in total.
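
Since the alphabet has four letters, each nucleotide fits in 2 bits, which gives a feel for the raw size. A minimal sketch with hypothetical helper names; it ignores the `N` and ambiguity codes that real reference files also contain.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack an ACGT string into an integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def unpack(value, length):
    """Recover the ACGT string of the given length from a packed integer."""
    return "".join(BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length))

def packed_megabytes(n_bases):
    """Size in MB of n_bases nucleotides at 2 bits each."""
    return n_bases * 2 / 8 / 1e6

packed = pack("GATTACA")
print(packed, unpack(packed, 7))
print("one copy of the genome:", packed_megabytes(3e9), "MB")
print("both copies:", packed_megabytes(6e9), "MB")
```

So one copy of the genome is on the order of 750 MB at 2 bits per base, and the two copies in a cell about 1.5 GB, before any compression.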

Each chromosome is either an autosome or sex chromosome. The autosomes are chr1 to chr22. All people (excepting those with chromosomal disorders, like trisomy 21) have two copies of each autosome, with one copy coming from each parent. Then, for the sex chromosomes:

* If you're male, you get an X chromosome from your mom and a Y chromosome from your dad.

* If you're female, you get an X chromosome from your mom and another X chromosome from your dad.

So, yes, "two copies of each chromosome" and "two copies of the genome" are the same concept, since the genome consists of chromosomes.


> how does the „two copies“ phenomenon of the human genome and the „two copies“ thing of chromosomes fit together? Are those one and the same concept?

Which phenomena are you referring to exactly? That we have two copies of each chromosome, and whether that means we have two copies of the human genome?

> Are there two copies of the XY chromosome, too?

Each of our non-sex cells[1] contains two sex chromosomes: one from our father and one from our mother. Since your mother always passes on an X chromosome, your sex is determined by which sex chromosome you got from your father. If you are a female (XX), your father passed on his X chromosome. If you are a male (XY), your father passed on his Y chromosome.

This rule makes for some interesting inferences. For example, your father in turn got his X chromosome from your grandmother and his Y chromosome from your grandfather (both on your father's side, of course). Your mother, on the other hand, got one X chromosome from each of your grandparents on her side.

So if you're a male, your Y chromosome was passed on from your grandfather on your father's side. If you're a female, one of your X chromosomes comes from your grandmother on your father's side, but your other X chromosome may come from either of your grandparents on your mother's side.

You can trace this Y-chromosome lineage back to what's called the Y-chromosomal Adam, who is the most recent common ancestor, through the paternal line, of all currently living human males[2]. You can make a similar inference using your mitochondrial genome[3] and arrive at what we call the Mitochondrial Eve[4].

Our sex cells[5] are different, since they only have one copy of our chromosome set. The number of chromosome sets a cell has is called ploidy[6], so our sex cells are haploid, as opposed to our non-sex cells, which are diploid.

If you're a male, a single mature sperm cell in your body contains either an X or a Y chromosome. For females it's different: since they only have X chromosomes, each of their mature egg cells contains exactly one X chromosome.

[1] https://en.wikipedia.org/wiki/Somatic_cell

[2] https://en.wikipedia.org/wiki/Y-chromosomal_Adam

[3] https://en.wikipedia.org/wiki/Mitochondrial_DNA

[4] https://en.wikipedia.org/wiki/Mitochondrial_Eve

[5] https://en.wikipedia.org/wiki/Gamete

[6] https://en.wikipedia.org/wiki/Ploidy
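
The inheritance rule described above can be sketched as a toy model: each parent contributes one of their two sex chromosomes to a haploid sex cell, and the child gets one from each. The function name is hypothetical, and the model ignores chromosomal disorders.

```python
from itertools import product

def possible_children(mother, father):
    """All sex-chromosome pairs a child can inherit: one chromosome from each parent."""
    return sorted({"".join(sorted(m + f)) for m, f in product(mother, father)})

# The mother (XX) can only pass on an X; the father (XY) passes on X or Y.
print(possible_children("XX", "XY"))
```

Because the mother's contribution is always an X, the only thing that varies between the outcomes is the father's contribution, which is the point made above.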


Thank you for this detailed and useful guide, Andy.


Thanks, Terry.

I hope it was useful. It is just an introductory jargon-free guide. You can find more on Ensembl and Wikipedia.


There are a lot of caveats in representing a genome as some point diffs from a reference. I worry that your description might promote a way of thinking about genomes that ignores the more complex things that can and do happen.
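
A small sketch of where the point-diff picture works and where it stops working. The sequences and the helper `apply_snvs` are invented for illustration; real genomes also have deletions, duplications, inversions and rearrangements that this model cannot express at all.

```python
def apply_snvs(reference, snvs):
    """Apply single-nucleotide substitutions given as (1-based position, ref, alt)."""
    seq = list(reference)
    for pos, ref, alt in snvs:
        if seq[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)

reference = "ACGTACGT"

# Substitutions keep every other base at the same coordinate.
sample = apply_snvs(reference, [(2, "C", "T"), (7, "G", "A")])
print(sample)

# An insertion shifts everything downstream, so positions in the sample
# no longer line up with positions in the reference.
with_insertion = reference[:4] + "TT" + reference[4:]
print(with_insertion, len(reference), len(with_insertion))
```

Once lengths differ, "position 7" means different things in the sample and in the reference, which is the simplest version of the caveat raised above.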



