Making the collective knowledge of chemistry open and machine actionable (nature.com)
102 points by bryanrasmussen on June 17, 2022 | 29 comments



Chemist here. Every few years, someone has the novel idea that we should have open data for all chemistry laboratories, so then we can do Better Science. And like every other proposal I've seen, this one will get approximately zero traction because it doesn't address any of the core issues behind why laboratory data is currently closed.

I try not to be too pessimistic about it, because it really would be great if there were more open chemical data. I just really doubt anything could accomplish that without remaking the US university research system from top to bottom.


There are initiatives in the EU to require, by law, that publicly funded research be released to the public. And there are official guidelines on how to do so:

https://hal.archives-ouvertes.fr/hal-03318932

I believe such an initiative for chemistry could very well succeed, even if it takes 10 years.

Hopefully this can percolate to other countries and continents too, through the EU's normative power.


That could be very valuable. In many ways it's like materials science and parts of chemistry are coasting along on the fumes of basic science done from the 1950s through the 70s at national labs. Good experimentalists made solid careers doing core research without chasing endless grants or the latest fads. It seems pretty much all publicly available chemical and materials databases come from that era. Some specialty areas have progressed way beyond that, but it's rarely systematically collected, unless you're willing and able to pay lots of money for private databases. Those private databases are of course largely built from publicly funded research.

I hope this pans out.


It’s the same law in the US.

However, just because the research is public doesn't mean it's easily searchable.

Someone still has to do that work.


What are the core issues?


Off the top of my head:

-Academic researchers are already overworked, underpaid, and undertrained. Asking them to spend even more of their time to meticulously upload all their notes and data to an electronic notebook is going to be an uphill battle.

-Academic scientists live or die by their ability to publish. Open data, especially if you're sharing in real time, makes you vulnerable to being scooped by competing researchers. Even disclosing data after the fact makes it easier for others to benefit from the work, with nothing in return for the people who collected it. Given how cut-throat academia is, you're also not going to get many researchers on board with this idea.

-Interoperability of most laboratory software is poor. People have been trying to get laboratory instrument manufacturers to support open data standards for years with little success. They don't have any financial incentive to allow competitors to have easy access to their data.


good points, and even when software is open source and standards exist, the incentives may not be there to use them or to build anything that lasts in the commons. So every lab ends up writing its own toolkit, science gateway, or data portal, which lives for a few years and then perishes when the grad student maintaining it moves on.


> Open data, especially if you're sharing in real time, makes you vulnerable to being scooped by competing researchers.

Why didn't something like standards and patents emerge in the scientific world?


It's not needed. Scientists are gentlemen.

E.g. in ~1900 three different people, Hugo de Vries, Carl Correns, and Erich Tschermak von Seysenegg, independently rediscovered the laws of genetics, checked the literature, and found the work of Mendel. Each of them credited Mendel.


The scientific world rewards people in glory and honor much more than in money. If you want more money, go corporate. If you reward people more with money, they'll pay less attention to the glory, but that's really expensive.


No economic incentive


>Interoperability of most laboratory software is poor.

In computerized chromatography, we have "standard" .CDF files.

For the handful of us who have actually been computerized for over 40 years, but who were very familiar with pre-computerized performance, a common file format was supposed to be something to look forward to.

We watched through the 1980s as each major instrument vendor brought computerization to the benchtop as a derivative of how they had been using mainframes and minis. That meant already-diverse data file formats diverged further as they became more advanced, while compromises were made so each would stay optimized, with a degree of backward compatibility, between the limited memory and storage of the mainframes and the even more limited memory and storage of the emerging pre-PC benchtop units. This was all very high-dollar stuff.

Then the office PC arose and started to offer much more cost-effective processing power, memory, and storage than any instrument vendor could match, so they quickly abandoned the established application-specific hardware/firmware approach. Instruments then began to be designed so that some or all of the most numerically challenging data-handling functions were no longer fully possible on the vendor's electronics alone. You would need the vendor's specialized software as well, running on a DOS and later a Windows PC, using a recommended supported interface such as COM ports or HPIB adapters to connect the cheap PC to the expensive instrument.

The incentive remained for each vendor to continue its own proprietary optimized file format, even if they were now all storing them on FAT32 volumes which had become standard in offices.

Each vendor had its own ecosystem from the beginning.

From their point of view, ideally each entire oil company or drug company would be locked in to that one vendor; failing that, at least on a facility-by-facility basis.

But the problem became obvious more easily to those researchers who had the top model from each of the top vendors. You still could not use one company's software to access a different company's data file, and you could not exchange files between facilities unless they had the same instrument vendor.

This was still before Hewlett-Packard gained a very reliable reputation and became the biggest chromatograph company, after which the whole instrument division was spun off as Agilent.

So it was still considered a level playing field, and the Analytical Instrument Association (AIA) was formed, including the major vendors. The purpose was to define a common nonproprietary file format with all the metadata, etc.

This couldn't be done overnight, but steady progress was made for a few years; HP was a major AIA contributor of very worthwhile personnel and effort.

During those years HP became the biggest manufacturer and one day they quietly lost their incentive for the nonproprietary effort.

Progress rapidly crumbled as embrace-and-extend began leading toward extinguish. Momentum could not be maintained, but the work was preserved: it was dropped in the lap of ASTM, where it was quickly approved in the late 1990s without deep understanding or support.

This is one of the best examples of a true standard since it has remained unchanged. Each vendor had already implemented PC support as AIA progress was made, upgrading across versions as eventual finality was anticipated.

And one day it just stopped, and it has been frozen in time since then. Even though it's an incomplete and unfinished standard, it turns out to be far better than a continuously "upgraded" approach.

CDF does not really stand for "Chromatography Data File"; it is actually shorthand for netCDF, which was a well-established storage and communication format for early data-intensive work by the government weather service. Using the freely available netCDF.dll, a CDF file can be parsed into a nice extensible text file.

Basically the extensibility of the netCDF layout was utilized and curtailed by the AIA participants, focused on chromatography, and each vendor's software has been able to "export" and "import" a fairly compatible CDF file ever since.
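
For the curious, here is a minimal sketch of opening such a file with the freely available netCDF4 Python library rather than netCDF.dll. The file name is made up, and "ordinate_values" is the usual ANDI convention for the detector trace, so check the variable listing if your vendor's export differs:

    from netCDF4 import Dataset  # pip install netCDF4

    # Open an AIA/ANDI chromatography export (hypothetical file name)
    ds = Dataset("run_001.cdf", "r")

    # netCDF is self-describing: list the variables and global attributes
    print(list(ds.variables.keys()))
    print({name: ds.getncattr(name) for name in ds.ncattrs()})

    # The detector trace is conventionally stored as "ordinate_values"
    signal = ds.variables["ordinate_values"][:]
    print(len(signal))

    ds.close()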

People just don't know much about this kind of thing and it's only been about 25 years, so very many chromatographers are not familiar with it yet.

Lock-in still rules and most vendors won't have this in any of their workflows or standard training.

It'll just have to be something to look forward to like it was 40 years ago.

Now OpenChrom is a project where so much progress is needed that one professor's lab is not enough. This is what needs to be followed and contributed to, in a modern engineering way, until it can be cemented in stone the old-fashioned way, ASAP.


If I were in charge of granting federal grants, I would demand the recipients open-source the data and upload everything in an orderly manner.

It would just be: if you want this money, do the above.


The hard part is probably dealing with enough metadata to capture things like: the reaction only works because the supplier of one of the reagents used by that lab had ppm-level copper impurities.


For chemical reaction prediction, see the Open Reaction Database, a collaboration including the Coley lab at MIT (surprisingly not cited by OP):

Paper: https://pubs.acs.org/doi/10.1021/jacs.1c09820

Docs: https://docs.open-reaction-database.org/en/latest/overview.h...

It’s an incredible effort to collate and clean this data, and even then a substantial portion of it will not be reproducible due to experimental variability or outright errors.

For computational methods development it’s extremely useful, maybe even necessary, to have a substantial amount of money and one’s own lab space to collect new data and experimentally test prospective predictions under tightly controlled conditions. The historical data is certainly useful but is not a panacea.
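
For anyone who wants to poke at the data itself, here is a rough sketch of loading one dataset from the ord-data repository with the ord-schema Python package. The module path, file name, and field names are my assumptions; check the project docs for the exact imports:

    import gzip

    # Assumed import path for the protobuf definitions shipped with ord-schema
    from ord_schema.proto import dataset_pb2

    # Datasets in the ord-data repo are distributed as gzipped protocol buffers
    with gzip.open("some_ord_dataset.pb.gz", "rb") as fh:
        dataset = dataset_pb2.Dataset()
        dataset.ParseFromString(fh.read())

    # Each entry is a structured Reaction message (inputs, conditions, outcomes)
    print(len(dataset.reactions))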


Relatedly (and also not cited), from a couple of weeks ago: https://news.ycombinator.com/item?id=31566200 Call for a Public Open Database of All Chemical Reactions


The notion of open-source scientific discovery is a good one, but some of the suggestions here seem very unlikely to catch much traction, and even if they do, problems will remain.

For example, say an academic chemical research group synthesizes a series of novel compounds in the lab - they're not going to just release the raw data on everything they did immediately. The thinking might be, 'we can give this MS student this compound to work out a better synthesis route for, or this PhD student can try to extend the synthesis and make other compounds'.

A more realistic scenario mentioned in the article would be to require publication of the raw data to a database as a condition of publication. This is already done to some extent in journals, but materials and methods sections are notorious for leaving out some key factor or other, meaning repeatability is an issue and other labs will generally only try to replicate the more interesting results (possible new antibiotic, etc.).

This worked out fairly well with GenBank, the database of published gene sequences, and also with the protein crystallography databases, but everyone in the molecular biology world knows that not all sequence data is of the same quality, so cross-referencing the more reputable researchers and reading their papers to see whether their methods are transparent and robust is still an important step. A database clogged with low-quality data isn't as valuable as a more carefully curated one, certainly.

It would be nice though, to have a database where you could look up everything there is to know about something like the antibiotic ciprofloxacin, including all the spectral identification data, optimal reaction conditions, etc. - but this is also a molecule that researchers are busy making derivatives of, likely with the hope of patenting some novel knockoff and getting an exclusive licensing deal with a major pharma corp, and so they won't be releasing any data, or even publishing in a timely manner (at least not until the patent application goes through, and maybe not even then).

That leads to a controversial question: should research universities and academics financed by taxpayers behave like for-profit startups pitching to a VC outfit?


It would be wonderful to see something like the Materials Project (https://materialsproject.org/) but for Chemical research/knowledge.
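
For a sense of what "something like the Materials Project" means in practice, here is a rough sketch of pulling one structure through pymatgen's MPRester client. You need your own API key, and this reflects the legacy client interface (the newer mp-api client looks similar):

    # Legacy pymatgen client for the Materials Project API
    from pymatgen.ext.matproj import MPRester

    with MPRester("YOUR_API_KEY") as mpr:
        # Fetch the computed crystal structure of silicon (material id mp-149)
        structure = mpr.get_structure_by_material_id("mp-149")
        print(structure.composition, structure.lattice.a)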


Good luck with that...lol. The ontological / informatics space for chemicals is a mess.

To make the collective knowledge of chemistry open and available, you need to represent, organize, and index it. This problem is not as sexy, but it is orders of magnitude more important.


This is a huge problem. Arguably it's one of the primary technical reasons that 'web 2.0' was such a dud.


"Alchemists turned into chemists when they stopped keeping secrets." -- Eric S. Raymond

Open Science (in the publishing sense) used to be fringe just a decade ago. It's very much mainstream now.

Open Data will be a much tougher (and long-term) battle, but it's inevitable.


Can someone in the field explain how this "machine actionable" would be different from a Galaxy pipeline [0], or a Chemputer [1]?

[0] https://en.wikipedia.org/wiki/Galaxy_(computational_biology)

[1] https://www.chem.gla.ac.uk/cronin/news/cronin-group-builds-c...


This kind of endeavor should be a common theme to all science, not just chemistry.


It's certainly a goal to work towards. However, it's pretty difficult to build One ELN to Rule Them All given how flexible many kinds of biological experimental designs are - especially when you're working on the bleeding edge.

A good first step is to require that supplemental materials be published in a machine-readable format (e.g. not manually thrown-together Excel files that lack any kind of normalization or rational schema), as sketched below.
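
As a sketch of what "machine readable" could mean at minimum: one observation per row, explicit units, and a small metadata sidecar. The file and column names here are made up for illustration:

    import json
    import pandas as pd

    # One row per observation, no merged cells, no colour-coded meaning
    df = pd.DataFrame({
        "sample_id": ["S1", "S1", "S2", "S2"],
        "replicate": [1, 2, 1, 2],
        "yield_percent": [62.1, 60.8, 71.4, 70.9],
    })
    df.to_csv("yields.csv", index=False)

    # Sidecar so a machine (or a stranger) can interpret the columns
    metadata = {
        "yield_percent": "isolated yield after column chromatography, %",
        "instrument": "HPLC-UV, 254 nm",
    }
    with open("yields.meta.json", "w") as fh:
        json.dump(metadata, fh, indent=2)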


> it's pretty difficult to build One ELN to Rule Them All given how flexible many kinds of biological experimental designs are - especially when you're working on the bleeding edge.

RDF is quite flexible, and using a combination of domain-specific ontologies like cheminf[1] and other top-level ontologies like BFO[2] should allow you to capture most of the semantics.

[1]: https://www.ebi.ac.uk/ols/ontologies/cheminf [2]: https://en.wikipedia.org/wiki/Basic_Formal_Ontology?wprov=sf...
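
A toy sketch of what that looks like with rdflib. The CHEMINF class ID and the example predicates are placeholders, not the real ontology terms:

    from rdflib import Graph, Literal, Namespace, RDF

    # Assumed namespace IRIs -- take the real ones from CHEMINF/BFO directly
    CHEMINF = Namespace("http://semanticscience.org/resource/")
    EX = Namespace("http://example.org/lab/")

    g = Graph()
    g.bind("cheminf", CHEMINF)
    g.bind("ex", EX)

    # A measurement about a sample, typed against an illustrative CHEMINF class
    measurement = EX["measurement-7"]
    g.add((measurement, RDF.type, CHEMINF["CHEMINF_000000"]))  # placeholder class ID
    g.add((measurement, EX.aboutSample, EX["sample-42"]))
    g.add((measurement, EX.hasValue, Literal(153.2)))
    g.add((measurement, EX.hasUnit, Literal("degC")))

    print(g.serialize(format="turtle"))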


But then there are things like GPT-3, which means stashing everything in a rigid schema isn't as hard a requirement as it used to be.

OTOH, facilitating:

    1. access to the raw data
    2. access to the metadata
    3. access to the source code of whatever software was used / created to run the experiment
    4. making sure everything is computer readable (i.e. not a 256x128 graph as a PNG embedded in a bloody PDF)
should be a requirement for any scientific publication worth its salt.


I'm surprised that no one has mentioned RDKit. It's an open-source cheminformatics platform. It provides software to read and write multiple molecule file formats. It even has a chemical search feature implemented as a PostgreSQL extension. https://www.rdkit.org/
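
For example, a few lines of what it handles (standard RDKit calls; the SMILES here is just benzene):

    from rdkit import Chem
    from rdkit.Chem import Descriptors

    # Parse a molecule from SMILES
    mol = Chem.MolFromSmiles("c1ccccc1")

    # Compute a simple descriptor and run a SMARTS substructure search
    mw = Descriptors.MolWt(mol)
    ring = Chem.MolFromSmarts("c1ccccc1")
    print(round(mw, 2), mol.HasSubstructMatch(ring))

    # Write the molecule out in another format (MDL mol block)
    print(Chem.MolToMolBlock(mol))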


Can someone with more knowledge of chemistry enlighten me as to why chemistry experimentation isn't the killer app for the Metaverse, at least for low-order reactions? I know that, e.g., the protein-folding class of problems is prohibitively computationally expensive, but surely there's some low-hanging fruit?


If you're talking about computational modeling of chemical reactions, for example getting a computer to figure out a novel low-cost synthesis route for an important molecule, well... This becomes incredibly complicated very quickly. It's generally more likely to get a result using the traditional experimental methods, with some exceptions for very small molecules perhaps.

The field of physical inorganic/organic chemistry is one of the more difficult ones to build accurate models for. A first step is to calculate the electronic structure of products, reactants, and possible intermediates, and this blows up fast for even moderately complex molecules. A lot of work has been done with simpler systems like 2 H2O -> 2 H2 + O2, but even that's ridiculously complicated, as you have to model the catalyst and the surrounding environment as well, and then get the kinetic model right. The computational power required is on the supercomputer scale, and the level of background knowledge required to even start implementing something like that is pretty high; for a taste see:

https://h2awsm.org/capabilities/dft-and-ab-initio-calculatio...
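
To give a flavor of what even the cheapest electronic-structure calculation looks like, here is a minimal sketch with the open-source PySCF package (not the tooling at that link): a Hartree-Fock ground-state energy for a single water molecule, before you ever get to catalysts, solvents, or kinetics:

    from pyscf import gto, scf

    # Build a water molecule in a minimal basis set
    mol = gto.M(
        atom="O 0 0 0; H 0 0.757 0.587; H 0 -0.757 0.587",
        basis="sto-3g",
    )

    # Restricted Hartree-Fock: about the cheapest method there is,
    # and still only a starting point for real reaction modeling
    mf = scf.RHF(mol)
    energy = mf.kernel()  # total energy in Hartree
    print(energy)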

This is an area where quantum computers may have applications (2021):

https://www.energy.gov/science/ascr/articles/quantum-computi...



