BioNTech-InstaDeep Early Warning System to Detect High-Risk SARS-CoV-2 Variants (biontech.de)
149 points by adsodemelk on Jan 11, 2022 | 64 comments



> More than 10,000 novel variant sequences are currently discovered every week and human experts simply cannot cope with complex data at this scale

This is interesting. I think the Greek alphabet naming system would lead some people to believe the virus only mutates once every few months. Of course, the reality is that every infected individual will produce hundreds of mutations within their body. I think there's a gap in the public messaging here which, if addressed, could help people understand what the future direction of the pandemic might be.


This is the kind of education that needs to be done in high school during health and biology classes. We can’t hope the general public will become epidemiologically literate from 240-character tweets and 30-second television quotes. Heck, my old high school biology teacher remains a source of coronavirus FUD, so maybe that’s not even enough.


I think most high school (USA) biology courses probably touch on the concept of viruses? This level of detail was certainly taught in high school Advanced Placement (AP) Biology when I took it many years ago.


I took both high school and university biology. We had two or three chapters on viruses in each textbook. But they were mostly discussed at the cellular and chemical levels, largely abstracted from disease and epidemiology. Epidemiological concepts were barely taught, except for maybe a brief mention of exponential growth.


We didn't learn much about bacteria and viruses, we learned much more about cells with human DNA. We definitely didn't touch on rapid evolution of microscopic life, even in courses where we did spend a little bit of time talking about viruses.


I believe I heard about this in this Radiolab episode from last year. It does a good job of explaining how the virus replicates in the body and how it behaves differently if the patient is immunocompromised.

https://www.wnycstudios.org/podcasts/radiolab/articles/dispa...


I don't disagree with you, but I think the more fundamental issue is that people should be more humble in listening to the advice and guidance of epidemiologists, and indeed other experts and specialists.

We as a society tacitly "outsource" the task of becoming an expert to a bunch of mostly smart and well-meaning folks, in all sorts of areas, because we cannot all know "everything". Yet for some reason, on vaccines and epidemiology, great swathes of the population, including senior political leaders, now choose to ignore their clear guidance on how best to act.


Much of it is already covered in an optional biology module in my country. Of course we don't do anything quantitative about mutation rates etc., but at least you know what mRNA is instead of relying on politicized news pieces written by someone who doesn't even know what chirality is.


Public messaging often includes the more extensive PANGO lineage (e.g., B.1.1.529 for Omicron). I think there is a limit to what can be conveyed in each news story.


Well, to be classified as a variant it would have to be phenotypically different, right? And also viable enough to infect a number of people.


Well, there is a lot of overlapping nomenclature. Here, by variant we mean any sample that is not identical, sequence-wise, to sequences seen before. Most of the observed mutations are innocuous and do not lead to a potential new lineage (which better fits the definition of variant above). However, it is difficult to tell by eye - or even with the standard computational tools - whether a new mutation is (a) deleterious (harms the virus' capacity to spread), (b) neutral, or (c) fitness enhancing.


Impressive if true, although I am very sceptical of the results. The real test is to put their neck on the line and predict/warn of the next significant variant. If the authors are not willing to do this, I doubt that the algorithm is useful.

As a side note, the first author on this paper is the CEO of InstaDeep. I see it as a red flag that the CEO of a 150+ person company would put themselves as first author on this paper. Perhaps I'm unfairly judging and the CEO really was the lead contributor to the study.


I am the second author of the paper. We have been putting our necks on the line for the last half year. We detected Lambda, Mu (with the caveat that we did not consider it competitive) and Omicron - all blindly.

We have been verifying all our predictions experimentally, post factum. And the method is purely data driven - with no fitting to the experiments or observations.

The first author came up with the approach, participated in the analysis and got his hands dirty like everyone else. While he is the CEO of a 160+ person company, this has been a labor of love for all of us, done to a large extent in the evenings, during weekends and holidays. It is indeed an unusual situation. But this was not a regular project, and InstaDeep is not a regular company either.


Thank you for chiming in. It’s great that you detected them. I’m curious how many false positives you had during that same time? Did you detect many others that just didn’t pan out to be of significance?


This is something that amazes me. Any time a true High Risk Variant appears, it is clear as day in the system. This was the case with Lambda, Mu (which we predicted to have a limited propensity to proliferate), and now with Omicron. However, we also detect lineages that are prospectively dangerous. As the classification is relative, there is no fixed threshold beyond which we would call for an alert.

In the evaluation, we have been looking at 20 sequences per week, as this was roughly the testing capacity of our partners. Going down to ONE sequence a week makes us detect some variants a bit later, but still. The sequences and lineages we predicted to be of interest were predominantly spreading afterwards. Some were just a blip on the radar, though. We aimed at sensitivity and not missing ominous signs, rather than specificity.

NB: Each week there are thousands (now 12k+) of new sequence variants. Out of them we consider 20. And we detect most of the variants as early as on the first day.
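For the curious, here is a minimal sketch of what such a fixed-size weekly watch-list boils down to, assuming each new sequence has already been given some per-sequence risk score. The scoring is the hard part and is only a placeholder here, not the actual EWS model; sequence IDs and scores are made up.

```python
# Illustrative sketch only: rank the week's new spike sequences by a
# precomputed risk score (a hypothetical stand-in for the EWS scoring)
# and keep a fixed-size watch-list for experimental testing.
from typing import List, Tuple

def weekly_watchlist(variants: List[Tuple[str, float]], k: int = 20) -> List[str]:
    """Return the k variant IDs with the highest risk score."""
    ranked = sorted(variants, key=lambda v: v[1], reverse=True)
    return [variant_id for variant_id, _score in ranked[:k]]

# Example: thousands of new sequences arrive per week; only the top k go to the lab.
new_this_week = [("seq-001", 0.12), ("seq-002", 0.97), ("seq-003", 0.55)]
print(weekly_watchlist(new_this_week, k=2))   # ['seq-002', 'seq-003']
```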


A massive thanks for the work you put in - I am sure it will pay off manifold. Assuming your models are robust over time, this will clearly help the development pipeline of BioNTech / Pfizer to come up with adjusted vaccines should the need arise (e.g. an especially nasty mutant showing up). Shortening the detection from weeks (months?) to days is an order of magnitude gain in lead time. From a commercial point of view this agility will give BioNTech / Pfizer a lead over its competitors (assuming this is not public / shared information?)


What we hope for, rather, is to inform public policy. Right now, any time a scary-looking variant is sequenced, there is public commotion and uncertainty about the future repercussions. We want to be able to gauge the appropriate level of concern in these situations. It is not an all-knowing oracle, but rather a way to distill the insights from prior observations and simulations, the same way a human expert would do. This being said, we believe that the insights provided by EWS can be of use in designing new vaccines, as well as deploying the existing ones most effectively.


Dear Marcin - congratulations on the manuscript!

Do you think the immune escape parameters would need to be retuned in a post-Omicron world? Do you need the actual epitopes recognised by antibodies, or can you guess this from structure?

Do you capture any aspects with respect to changes in spike glycosylation in your models?

Finally, as with another reply, do you have a guess about the specificity of this system? Is it good enough to get production of vaccines going on variants that are flagged, just in case?


The system is constantly learning, so it "retunes" itself. We can infer epitopes from structure, but we found that data derived from known complexes is sufficient for our purpose.

The current version of EWS does not explicitly account for glycosylation. It is implicitly handled by the ML models, though.

For specificity, it is difficult to estimate. We know that for each of the named Variants of Concern, the signal from EWS was unmissable. Considering that any new vaccine would need to go through a stringent approval process, I don't think that EWS should be a major determining factor in the process. However, it can certainly help resolve the doubts.


Thanks for the info! That ability to retune sounds excellent. I wonder if you could spin this out to have an EWS for influenzas too (although hopefully it’ll never get as severe!)

I guess for the glycosylation, when you say it’s implicitly handled, there’s an N-linked sequon pattern somewhere in the language model, which I guess covers a good deal of info :)

Jury still out on effects of O-linked glyco on spike, but give me a shout if you’re interested in it!

Anyway, cool work, and all the best in where you’re taking the work next!


Wow, amazing to see you here. I have somewhat of a bioinformatics background, and this is the sort of work that I always found very interesting - do you think your team will produce some sort of high level architectural and process breakdown at some point?


I am sure we will, in due time. If you have any particular hopes or wishes, we will try to accommodate as much as we can.


Thanks for adding this information. I was unfair to judge so harshly and be so sceptical.

So the $1T question is ... what (if anything) is this model currently predicting about the next variant?


The method, in its current incarnation, is detecting dangerous variants and not foreseeing them. We can use the same approach to forecast plausible developments, but there are so many latent variables (intrapatient evolution, mobility, vaccination status, restriction compliance) that it is more of an informed "What If?" exercise than a true prediction. As with many good stories about predicting the future, we can only see what is likely to happen, not what will necessarily happen... Out of many paths only a few will be explored in the end :-)


Because of both founder effects and potentially people taking earlier containment actions, I’m unsure that’s a realistic measure.

It’s like how people said Y2K was just a conspiracy theory because most systems kept running — without realizing the early warnings and preparations are the reason Y2K wasn’t a civilizational disruption.


> As a side note, the first author on this paper is the CEO of InstaDeep. I see it as a red flag that the CEO of a 150+ person company would put themselves as first author on this paper. Perhaps I'm unfairly judging and the CEO really was the lead contributor to the study.

Authors on scientific papers are ordered by name, so whoever's name is first is pure coincidence.


The order of authors varies by field. I believe in biology it usually starts with the person who did most of the work (or people, but if more than one it's explicitly noted), with everyone else who contributed in the middle, and the principal investigator (i.e. head of lab) last.

In this case I notice that they first sorted all the InstaDeep people to the front, and the BioNTech people to the end... not sure what I'm supposed to take away from the ordering, but it's not random.


It is not random. InstaDeep designed, developed, built, and benchmarked the method. BioNTech performed the experimental validation and provided expert insight into the problem domain. Hence, the authors are segmented by contribution areas.


Or sometimes PI first, depending on the scope of the PI's ego


This isn't true for the field of biology or medicine. If you take a look at the order of authors on the paper in question you can also see that the ordering is not alphabetical.


Regardless of the specifics here, I'm pleased and envious of being reminded that some people have meaningful jobs that actually make a difference.


Another author of the paper here. We are recruiting, in case you would like to contribute to further meaningful projects. https://www.instadeep.com/about-us/careers/


We are actively looking for motivated colleagues. Feel free to look at the page above or drop us an email at hello[at]instadeep.com. We are also happy to make new friends and work together on exciting projects - same email as above works.


I'm all for stopping new variants of COVID. Why in the world do we have a treatment for COVID that intentionally creates mutations in COVID (https://en.wikipedia.org/wiki/Molnupiravir)? I'm not a doctor, but it seems pretty clear to me that Molnupiravir creates a great environment for new variants to arise (along with cancer, but that's another issue).


From the wikipedia page you linked:

> Molnupiravir is indicated for the treatment of mild-to-moderate coronavirus disease (COVID-19) in adults with positive results of direct SARS-CoV-2 viral testing, and who are at high risk for progression to severe COVID-19.[1][5]

So after the risk of creating mutations in the Covid virus has been assessed and weighed against the chances of patients just outright dying due to not having this medicine available, the board full of medical professionals trained in this matter voted 13 to 10 that they thought the risk was acceptable.


Is the risk different from the risks of antibiotics and antifungals with respect to creating resistant strains?


Well, first off, "normal" antibiotics don't kill viruses, so that kinda makes it different from the start. But even then, most antibiotics kill bacteria by disrupting their cellular processes and leave the DNA mostly alone. Resistance against antibiotics arises through natural evolution: those bacteria that can survive better in an environment with antibiotics will reproduce more than their cousins which can't.

OTOH, Molnupiravir works by: (quoting from Wikipedia again)

> [it] exerts its antiviral action through introduction of copying errors during viral RNA replication.

So it deliberately stops virus reproduction by introducing errors in RNA copying. The vast majority of the time this just makes the virus nonfunctional, but it is technically not impossible that it creates a viable mutated strain. This mutated strain may or may not be worse than the original virus. It is not dissimilar to how `cat /dev/urandom | bash` just MIGHT start off with `rm -rf /` and delete everything, but usually it will just crash because the random bytes don't parse into a valid bash command.
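To make that intuition concrete, here is a toy simulation of error-prone copying. All numbers (error rate, chance a mutation is tolerated) are made-up assumptions for illustration, not pharmacological parameters of molnupiravir.

```python
# Toy illustration of the "most random copying errors break things" intuition.
# The rates below are assumptions chosen for the demo, not real drug parameters.
import numpy as np

rng = np.random.default_rng(0)
GENOME_LEN = 30_000          # roughly the length of the SARS-CoV-2 genome
ERROR_RATE = 1e-4            # assumed per-base copying error rate
P_TOLERATED = 0.01           # assumed chance a single error is tolerated
N_COPIES = 1_000_000         # number of simulated replication events

# Number of copying errors introduced in each replication event.
errors = rng.binomial(GENOME_LEN, ERROR_RATE, size=N_COPIES)
# A copy stays viable only if every one of its errors happens to be tolerated.
survives = rng.random(N_COPIES) < P_TOLERATED ** errors
print(f"{survives.mean():.4%} of copies remain viable")
```

Crank the error rate up and essentially nothing viable comes out; the worry voiced upthread is about the rare viable survivors.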


"I'm all for stopping new variants of COVID"

That's very likely not even possible [1], and since it is not realistic, it would be a waste of time, effort, and money. There are lots of coronaviruses that we have never stopped, and we live with via herd immunity, hygiene, and therapeutics [2].

[1] https://www.acsh.org/news/2020/11/05/covid-why-we-will-never...

[2] https://www.cdc.gov/coronavirus/general-information.html


Because it helps keep people alive and the risks are judged to be worth the cost in high-risk patients. Hopefully Paxlovid will obsolete this one soon enough though.


How does it seem clear to you? What's the mechanism through which Molnupiravir creates new CoV mutations?


GP explicitly referenced the drug's Wikipedia page:

> "The emergency use authorization was only narrowly approved (13-10) because of questions regarding efficacy and concerns that molnupiravir's mutagenic effects could create new variants that evade immunity and prolong the COVID-19 pandemic."

The quote above has its own citations in the Wikipedia page posted by GP.


He linked to a source that contains that claim; it includes several sources for that specific claim.

It's unclear why that is relevant in this thread though.


Because it’s possible to justify the risk with saving lives.

Because creating mutations and prolonged pandemics are more profitable.


The government loses money from pandemics, so there’s no profit motive for FDA regulators to prolong the pandemic. There are both the direct costs of the coronavirus response and indirect costs (lower tax revenue from empty downtowns, less travel, more workers taking from disability insurance instead of paying into it, etc).


Pharma companies are making profits and there is a revolving door between regulators and industry, so your dismissal of profit motive is a joke.



Pretty cool stuff!

> When using a weekly watch-list with a size of 20 variants (less than 0.5% of the weekly average of new variant sequences), EWS flagged 12 WHO designated variants out of 13 (Fig. 4.A), with an average of 58 days of lead time (i.e two months) before these were designated as such by the WHO (Table S.4).

> Our system however does not accurately pinpoint the emergence of the B.1.617.2 Delta family of variants. Delta is known to be neutralised by vaccines24 and its global prevalence can be attributed to other fitness-enhancing factors [than immune escape]. These factors, such as P681R mutation, which abrogates O-glycosylation, thus further enabling furin cleavage, are outside of the scope of our approach.

> Specifically, the EWS identified Omicron as the highest immune escaping variant over more than 70,000 variants discovered between early October and late November 2021.
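A quick back-of-envelope check on those quoted figures (assuming roughly 10k new sequences per week, as quoted at the top of the thread; the press release's ">90% sensitivity" appears to follow directly from the 12-of-13 count):

```python
# Sensitivity implied by the quoted counts, plus the watch-list fraction.
flagged, designated = 12, 13
print(f"sensitivity ≈ {flagged / designated:.1%}")            # ≈ 92.3%

watchlist, weekly_new = 20, 10_000                            # assumed weekly volume
print(f"watch-list fraction ≈ {watchlist / weekly_new:.2%}")  # 0.20%, i.e. "less than 0.5%"
```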


>More than 10,000 novel variant sequences are currently discovered every week and human experts simply cannot cope with complex data at this scale

I suspected something like this, given the frequency of viral mutation. Officials announce a dominant global strain but what proportion of positive cases are actually sequenced and evaluated for confirmation? How many undiscovered strains are actually in circulation at any given time, in geographically isolated areas? Could that explain variability in severity and/or long covid?


This is not my area and I read the press release but not the paper -- but I cannot help noticing that they mention a sensitivity/recall number (>90%) but not a specificity or precision number. Even if you're not trying to be cynical or skeptical about this, when there are this few true positive examples available, how can one plausibly do a good job calibrating such a system?


That's the key bit here. Supervised learning is not applicable here, and any sort of fitting to known labels is doomed to fail. The system is not calibrated, tuned, or parameterized. The ML part learns in a self-supervised manner which spike sequence features are important for a successful (proliferating) coronavirus (or, more correctly, it learns low-dimensional embeddings of multi-point interactions/co-occurrences of different amino acids in spike proteins). The rest is based on either frequentist statistics or computational biochemistry. At no point during training is any information about certain sequences belonging to High Risk Variant classes (Variant of Concern, Variant of Interest, etc.) fed to EWS.
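Not the authors' architecture, but as a minimal sketch of the underlying "flag the unexpected outliers" idea: score each new spike sequence by how surprising its residues are relative to per-position amino-acid frequencies observed so far. The EWS itself uses learned embeddings plus biophysical modelling rather than this simple frequency model, and the sequences below are toy fragments.

```python
# Deliberately simplified stand-in, NOT the EWS described in the paper:
# flag sequences whose residues are unexpected given per-position
# amino-acid frequencies from previously observed sequences.
from collections import Counter
import math

def position_frequencies(past_sequences):
    """Per-position amino-acid counts over previously observed sequences."""
    length = len(past_sequences[0])
    return [Counter(seq[i] for seq in past_sequences) for i in range(length)]

def surprisal(sequence, freqs, pseudocount=1.0):
    """Sum of -log p(residue | position); higher means more unexpected."""
    total = 0.0
    for i, aa in enumerate(sequence):
        counts = freqs[i]
        n = sum(counts.values()) + pseudocount * 20  # 20 standard amino acids
        total += -math.log((counts[aa] + pseudocount) / n)
    return total

past = ["MFVFLVLLPL", "MFVFLVLLPL", "MFVFLVLLSL"]  # toy "spike" fragments
freqs = position_frequencies(past)
print(surprisal("MFVFLVLLPL", freqs))  # low: matches what has been seen
print(surprisal("MKVFLVLLPR", freqs))  # higher: two unusual residues
```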


Given that we still don’t know the mechanism by which variants achieve their increased fitness I’m a little suspicious of claims of using “AI” to identify new variants.

As far as I know, increased spike affinity for ACE2 (or indeed changes to the spike protein at all) hasn't been conclusively proven to increase fitness. It's possible that this tool is simply overfit to the known variants and wouldn't detect a new one.

I’d be interested to hear what real virologists think of this.


Is the concern that this tool would be prioritized over other information?

Like what is the harm in doing the data collection and publishing about how it is going?


I assume this is also part of a product development process: if the EWS flags a particular variant, BioNTech can begin crafting a new vaccine component and have it ready sooner, earning a competitive advantage in the (capitalist) vaccine marketplace.


Great, finally one real example where AI can really help with this pandemic


Is this literally a joke? This might as well be a doomsday machine driven by weak AI selecting new Greek glyphs to drive demand for new superfluous covid vaccines.

I'm fully vaccinated, but more data points to come up with media fear mongering is the last thing anyone in the developed world needs right now.


One of the authors states here that the goal was to create sanity (data) for the press to reference on each new variant.


Very good news. I hope this will save a lot of souls.


Amazing. Who would have thought machine learning could help find out whether a variant is dangerous or not, without going to the lab!


Not exactly "finding out", but rather estimating. It's all about prioritizing testing by detecting suspiciously "good" and unexpected outliers popping up.


Is the idea to develop mRNA vaccines that can boost immunity against variants that don't exist yet?


The idea is to introduce a bit of sanity into the current world, to calm down overreactions to every new, scary looking variant. And on the flip side - to identify the truly scary ones early on, so that proper measures can be taken.

But the idea above directly follows from the work in the paper.



[flagged]


That's what happens when you offer a good product


Or when you're one of the most coercive lobbying forces on the globe.





