
Can you provide a list of the top problems in that space? I'd much rather try to understand them deeply myself and build a company solving them than just get a job.



This please. I would love to start working on (or create from scratch) some software that helps people in that field.


Creating pipelines is still a problem. Typically one needs to call a bunch of other tools to get to the final result. There can be map/reduce behavior in the middle, where chunks of data are processed in parallel to gain speed. You also need some kind of data management/tracking (putting samples in groups, ingesting raw data, exporting results), and sane monitoring, especially when something breaks or fails.
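As a minimal sketch of that map/reduce pattern in plain Python with concurrent.futures: the chunking, the external tool name ("some_tool"), and the file layout are all placeholders for illustration, not any particular pipeline framework's API.

    import logging
    import subprocess
    from concurrent.futures import ProcessPoolExecutor, as_completed
    from pathlib import Path

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline")

    def process_chunk(chunk: Path, outdir: Path) -> Path:
        """Map step: run an external tool on one chunk of the data."""
        out = outdir / (chunk.stem + ".processed")
        # Placeholder command; a real pipeline would call an aligner,
        # variant caller, etc. here.
        subprocess.run(["some_tool", str(chunk), "-o", str(out)], check=True)
        return out

    def run_pipeline(chunks: list[Path], outdir: Path) -> list[Path]:
        """Process chunks in parallel, track what succeeded, log failures."""
        results, failed = [], []
        with ProcessPoolExecutor() as pool:
            futures = {pool.submit(process_chunk, c, outdir): c for c in chunks}
            for fut in as_completed(futures):
                chunk = futures[fut]
                try:
                    results.append(fut.result())
                except Exception as exc:
                    failed.append(chunk)
                    log.error("chunk %s failed: %s", chunk, exc)
        # Reduce step: merge per-chunk outputs into the final result
        # (merging is format-specific and omitted here).
        log.info("done: %d ok, %d failed", len(results), len(failed))
        return results

Most of the real work in production pipelines is in what the placeholders hide: resuming after failures, provenance for each output, and consistent sample/file naming.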

There are probably hundreds of tools written for this, but no clear winner so far. Traditional software engineering approaches like Git and CI/CD seem too heavyweight (or rather too low-level), especially during development. IMHO there could be space for a fully remote/cloud solution where one could code/debug/deploy from the browser, optimized for writing and maintaining pipelines.


I also found the quality and proliferation of data pipeline tools baffling. Putting these together is somehow always more painful than it seems like it ought to be.

At one point we wrote an internal tool (I think lots of organizations do this: since none of the hundreds of existing tools quite fit, you invent #101), and while it was tremendously satisfying to see batch jobs churning away on thousands of CPUs, that kind of data infrastructure really needs to be standardized. Some companies are working on this, e.g. I saw a presentation about Arvados/Curii that seemed interesting (but I haven't used it, so I'm not sure). Maybe CWL will turn out to be the way forward here?


Protein structure prediction was a huge deal, which is why AlphaFold received so much fanfare. It is actually pretty good. The next step is to predict where multi-protein complexes would interact, which is not as simple as predicting the structure of two proteins independently and then trying to fit them together like a puzzle, because the interactions can also change the structures. While it's not as hard as it used to be to experimentally determine the protein targets of, for example, a protein kinase, it's still not a routine or cheap experiment, and doing that for many thousands of such proteins, across different conditions (stress, presence of co-factors, etc.) and in different organisms, would be rather a lot of work. Something like AlphaFold, which makes reasonable predictions and can help you focus on what's most likely to be relevant to your disease or process of interest, helps quite a bit.

There's also a growing need for integrating "multi-omics" data, where you have data from multiple assays (gene expression, phospho-proteomics, lipidomics, epigenetics, small RNA expression, etc.) and the goal is to somehow combine these different assay results from various levels of gene regulation to get closer to figuring out actual mechanisms for complex processes. Building on that, we can also do single-cell multi-omics to some extent, where you have results from different sequencing-based assays at the level of the same individual cell. This is still pretty limited, but it's exciting and advancing quickly. It will eventually be combined with things like spatial transcriptomics, which is useful for mapping out what's going on in heterogeneous tissue samples like tumors, so we'll end up with spatial single-cell multi-omics. At that point you're looking at:

1) some quantitative trait for multiple genes/loci/molecules, often 10k+ such features at the same time per assay,

2) multiple assays, such as DNA accessibility and gene expression, in

3) single cells, of which you might have 10k in a single sample,

4) across a physical tissue sample where individual cells are spatially mapped, and where you probably want to figure out how cells influence the state of the cells around them, and

5) in multiple different samples, where you might want to compare disease vs. control, or look for correlations with heterogeneity of results within one group.
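To make the shape of that concrete, here is a toy sketch in plain Python with numpy/pandas; all the dimensions and labels are made up, and in practice this usually lives in containers like AnnData/MuData rather than raw arrays:

    import numpy as np
    import pandas as pd

    n_cells, n_genes, n_peaks = 10_000, 20_000, 50_000  # made-up sizes

    # 1) + 2) one cells-x-features matrix per assay (usually sparse in practice)
    rna = np.random.poisson(0.1, size=(n_cells, n_genes))        # gene expression counts
    atac = np.random.binomial(1, 0.02, size=(n_cells, n_peaks))  # DNA accessibility peaks

    # 3) per-cell metadata, e.g. which sample each cell came from
    obs = pd.DataFrame({
        "sample_id": np.random.choice(["disease_1", "disease_2", "control_1"], n_cells),
        "cell_type": "unknown",  # typically assigned later by clustering/annotation
    })

    # 4) spatial coordinates of each cell within its tissue section
    spatial = np.random.uniform(0, 1, size=(n_cells, 2))  # x, y positions

    # 5) sample-level metadata for disease-vs-control comparisons
    samples = pd.DataFrame({
        "sample_id": ["disease_1", "disease_2", "control_1"],
        "condition": ["disease", "disease", "control"],
    })

Even this toy version makes the scale obvious: the RNA matrix alone is 10k x 20k per sample, and the interesting questions are joins across all five of those axes at once.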

There's a lot of public data already available for single-cell gene expression projects if you want to get a feel for how these things are structured and how (passable but not amazing) the existing tooling is. One of the main repositories for this data is the NCBI's SRA https://www.ncbi.nlm.nih.gov/sra but you'll quickly notice that searching and browsing is not as easy as you might think it would be, because one of the main limiting factors in bioinformatics is how bad everyone is at keeping terminology consistent. For many bioinformaticians, the majority of their time is spent in the data cleaning phase. It's awful. Sometimes the experimental parameters make it into SRA or GEO, but sometimes you have to read through the associated paper to pull them out. Often it's only large consortium projects like the Cancer Genome Atlas (TCGA) or the Genotype-Tissue Expression project (GTEx), which have enough funding for staff dedicated to data management, that end up publishing datasets that are easy to "consume" without having to jump through a whole bunch of hurdles to figure out how the data was produced.
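For a feel of what programmatic access looks like, here is a minimal sketch using Biopython's Entrez module to search SRA; the query string is just an illustration, and the returned IDs still point at metadata you'd have to fetch, clean, and cross-reference by hand:

    from Bio import Entrez

    Entrez.email = "you@example.org"  # NCBI asks for a contact address

    # Search SRA for single-cell RNA-seq experiments in human (illustrative query).
    handle = Entrez.esearch(
        db="sra",
        term='"single cell rna seq" AND "Homo sapiens"[Organism]',
        retmax=20,
    )
    record = Entrez.read(handle)
    handle.close()

    print(record["Count"], "matching records")
    print(record["IdList"])  # numeric SRA IDs; run/sample metadata needs a second fetch

Getting from those IDs to usable per-run metadata, and then to labels that are consistent across studies, is where most of the cleaning work described above comes in.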

I have a BS/MS in bioinformatics and I'm presently a PhD candidate in genetics and computational biology defending in February.


So if I understood you correctly, further lowering the cost of experimentally determining protein targets could be a viable way forward that is completely orthogonal to computational methods?


I'd like to hear about this too!



