A biological camera that captures and stores images directly into DNA (nature.com)
191 points by giuliomagnifico on July 10, 2023 | 119 comments



Something that I always found fascinating is how DNA is a base 4 information format. There's this thing called radix economy, which is basically an expression of how efficient a number system is. Base e is the theoretical optimum, and base 3 is the closest integer.

Obviously if you have a special use case, then that may dominate your radix economy (like hex, b64, etc...), but for general purpose information purposes, the order is base 3, then base 4, then base 2.
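To put rough numbers on that, here's a quick sketch (my own toy calculation, nothing rigorous) of the asymptotic cost factor b/ln(b) that radix economy boils down to; note that by this asymptotic measure bases 2 and 4 tie exactly, and only per-number digit rounding separates them:

    import math

    # Asymptotic radix economy: representing a large number N in base b takes
    # about log_b(N) digits, each of which can hold one of b states, so the
    # "hardware cost" scales like b * log_b(N) = (b / ln b) * ln(N).
    # The base-dependent factor b / ln b is minimized at b = e ~ 2.718.
    for b in (2, 3, 4, 8, 10, 16):
        print(f"base {b:2d}: cost factor b/ln(b) = {b / math.log(b):.3f}")

    # base 3 comes out best among the integers; 2 and 4 tie exactly here,
    # since 4/ln(4) = 4/(2*ln 2) = 2/ln(2).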

This presents a lot of interesting questions to me. Like, why didn't DNA end up as base 3? (probably because 4 naturally lends itself to pairs of 2).

Also, this idea of radix economy goes beyond just the encoding of information and is represented in logical economy as well. So for example, ternary logic is (much) more efficient than binary logic. Having that 3rd state just makes problem solving much more elegant.

To that end, I have always wondered how nature has exploited this 4-state number system logically. Like, are there all sorts of exotic logic gates that come from a 4 state system?


> Like, why didn't DNA end up as base 3?

Why did we end up with only 20 proteinogenic amino acids? Why are vertebrate neural architectures inverted (cell bodies on the inside, connections on the outside), even though the other way round (eg. how a squid's brain is organised) is easier and less inhibitive to growth?

2 Reasons:

a) Because nature and evolution cannot engineer. Random mutation, recombination and natural selection are the only mechanisms available. Things get selected if they outcompete existing alternatives, they don't need to be the best solutions.

b) All solutions have to be built by modifying what already exists. Evolution doesn't get to do greenfield projects, because anything that has to start from scratch is so disadvantaged in natural selection compared to already evolved complex life, it will fail.

This leads to systems that, from an engineering point of view, don't always make a lot of sense.

Eg. the architecture of the vertebrate neural system creates a lot of issues (eg. our light sensitive cells point in the wrong direction). The only way this makes any sense is when one looks at how the neural tube (the precursor to the brain and spinal cord) is formed by the ectoderm folding in on itself. This process is so deeply at the root of the Chordata, and so many other things depend on it, that it simply cannot change any more.

Many, many biological systems are "legacy systems" in the truest sense of the word: solutions produced a long time ago that may have many problems, but are so deeply enmeshed with everything that came after that they are now impossible to change.


A classic armchair response. DNA has complementary nucleotides (A-T, G-C) that facilitate its pairing. Base 3 wouldn't work in that sense. Also, you can't forget about the genetic code. See https://arxiv.org/pdf/q-bio/0605036.pdf for interesting thoughts. Remember, evolutionary biology is a field and people think about these questions!


This is pretty smug for someone who seems to have managed to miss the point entirely. Yes, DNA has certain features that require a base 4 system. That is not necessarily true of all possible systems with DNA-equivalent function, which is the point this whole thread is making.


How have I missed the point? The answers that nature cannot engineer and can't start de novo are trivially true statements that provide no actual insight into the question. I fully agree the original question itself is a deep one. A quick literature search is more productive than pontificating with weak analogies. See https://www.math.unl.edu/~bdeng1/Papers/DengDNAreplication.p... for what seems to be an interesting analysis regarding base number and DNA replication rate.


> that provide no actual insight into the question

Mind elaborating on that?

Because there is no biochemical reason why DNA could not have incorporated, say, a third complementary pair, so while base-3 (which I don't specifically mention in my post btw.) wouldn't work, base 6 or 8 would have been possible. "Unnatural Base Pairs" are even known to work in laboratory settings.

There is also no biochemical reason why base2 life wouldn't work. Expand the reading frame of the translation machinery to 5 bases instead of three, and you have enough coding space for polypeptides (2^5 = 32 codons, more than enough for 20 amino acids).

My answer addresses the question completely, because the only reason behind these "decisions" is an ancient system that simply got "frozen", and now cannot change any more.


> There is also no biochemical reason why base2 life wouldn't work.

are you sure about that? are you sure there are no weird effects that might destabilize very long sequences of 2-nucleotide DNA? or on how wide DNA-binding domains have to be to cope with reduced information density, and how that might sterically hinder smaller arrangements of proteins?

> My answer addresses the question completely, because the only reason behind these "decisions" is an ancient system that simply got "frozen", and now cannot change any more.

your answer is just a hypothesis, not a proof. these things can be studied (by studying abiogenesis in-vitro), and it's not certain these decisions were "flash frozen" like you describe. 2-, 4-, and 6- nucleotide coding systems might have coexisted in the RNA world, and 4- could have won out for some reason.


> are you sure about that?

Yes, I am sure about that, because I used to study biology before going into IT. And we had a lovely lecture in which we discussed theoretical setups for lifeforms at a molecular level.

2-nucleotide DNA isn't necessarily less stable. AT-rich domains have fewer hydrogen bonds, but if stability is the issue, use CG instead (3 bonds)...although that is also a compromise, because then opening DNA for transcription gets more difficult.

> your answer is just a hypothesis, not a proof.

My answer is what we observe in evolutionary biology.

I have given an example outside of the molecular world for a reason. There is no real advantage to the inversion of the neural architecture in Chordata; it just didn't matter when the neural tube formation mechanisms came to be. Now, with mammals having huge brains and complex sensory organs, the warts in that design show.

The proof for that is easy to come by (also a reason btw. why the neural inversion is my favorite example for this): Look at any Protostomia. Their neural system isn't inverted. Consequently, squids don't have a visual blind spot.


your example of the blind spot is quite elegant and convincing. I think it's partly so convincing because there's a large fossil record and diverse phylogenetic tree, with many gaps covered. conversely, we're missing direct evidence for the pre-LUCA era, and what we have is bottlenecked. this makes me more skeptical.

for instance, I've seen arguments that the codon mapping, and even the particular set of protein-coding amino acids, that we ended up with was arbitrary, but I've also read papers arguing that the amino acids include a sort of spanning set of different structural scaffolds with different polarity that happen to mesh well with DNA, and that the particular choices of codons were influenced by how the RNA t-acyl transferases arose, etc.

so, I'm still unconvinced, but I find this area fascinating to read about.


Idk enough about this discussion to argue it, but his hypothesis does not imply your second point couldn't be true.

> your answer is just a hypothesis, not a proof. these things can be studied (by studying abiogenesis in-vitro), and it's not certain these decisions were "flash frozen" like you describe. 2-, 4-, and 6- nucleotide coding systems might have coexisted in the RNA world, and 4- could have won out for some reason.

His hypothesis is, at least in part, “4- won for some reasons for which we have no explanation, and it stayed that way for some reason [that we may or may not know].” I suppose the reason would be that 4- was somehow better suited for the particular use-case at the time.

Of course there's a ton of interesting details to discover, and whether multiple systems coexisted is one of many fascinating things to discuss, and his response never said otherwise.


If you steelman the argument, then it's an error-correction argument, in that this simple ECC-like method could be what favours a base-4 encoding instead.


> Base 3 wouldn’t work in that sense.

That's true, but a) not the point I am making, and b) I am pretty sure it says nowhere in my post that it would.


Short answer: Likelihood of noise (Brownian motion) producing the element and keeping it interacting. Then once it gets going, likelihood of keeping state while interacting.


> (eg. our light sensitive cells point in the wrong direction)

Can you expand on that? Are you talking about front-facing eyes vs. birds' eyes? Or something else like retinal structure?


https://en.wikipedia.org/wiki/Cephalopod_eye#/media/File:Evo...

On the left-hand side is a vertebrate eye, on the right-hand side a squid's eye.

In vertebrates (really in all Chordata), the light-sensitive "tips" of the sensory cells point inwards, aka. the exact wrong direction. At the base of the cells are the axons (nerve connections) which transmit the information into the brain.

Due to the aforementioned orientation, these axons run along the outer layer of our light-sensitive cells, and at some point have to travel "inwards" towards the brain. At that point there can be no light-sensitive cells, and that's the "visual blind spot" of our eyes.

A squid's eye doesn't have that problem; all the light-sensitive cells point outwards, the axons are at the innermost layer, and connectivity can be achieved without a blind spot (also, they don't need a reflective layer).


I had to look this up, and I guess what usrbinbash was referring to was the layout of the retina, which places the rods and cones behind layers of transparent neurons.

https://en.wikipedia.org/wiki/Retina#/media/File:Retina-diag...

Edit: ninja'd


Yet, it doesn't really have a strong impact, as it's been determined that humans can see individual photons and we aren't dependent on night vision for hunting.


It doesn't have a strong impact, and the design also doesn't prevent good night vision (the basic structure of a cat's eye is similar; ALL Chordata have an inverted neural makeup).

But that doesn't mean the setup makes sense, and that is exactly my point.

And long term, this has an impact. For example, vertebrate brain size is limited by the simple fact that we have to put all the connections on the outside. The more neuronal bodies we have, the more connections they require.

    N <---> N
In this clumsy diagram, 2 neurons talk with the connections on the inside. However, vertebrate brains have to do this instead:

    +--------------+
    |              |
    +-> N      N <-+
It's easy to see how the second setup becomes prohibitive when more neurons are added to it. The brains of Protostomia again don't have that problem...they can have the connections on the inside, and the neuron bodies on the outside, aka. the logical setup.

Now there are ways around that, eg. reptile and bird brains grow in bulbs that theoretically allow sustained growth without the connective layer getting in the way. But similar to the reflective layer in many vertebrate eyes, this is not a setup that's there because it makes a lot of sense...it's a hack, a workaround for some "legacy system" that is now so enmeshed, it's impossible to change.



It's a bit anthropocentric to talk about not making sense from an engineering point of view. One example is the recurrent laryngeal nerve, which appears to people to take unnecessary detours because of what is thought to be its evolutionary history. There is deep wisdom and insight we have gleaned from this, but I think it's not for us to say we could engineer this better; we don't have the total knowledge or tools yet, and it is dismissive, even disrespectful, of the wondrous biological systems that have been made to sustain life.

Hubris and attempts to alter inherent nature are, ironically, often tied up together. But we can benefit a lot more from biological humility, realizing there are many unknown unknowns.


DNA has the same limitation that many serial protocols have: if you repeat the same base pairs (e.g. "AAAAAAAAAAAAAAAAAAAAAA") you will have trouble w/ the DNA not spiraling correctly. Some sequences of 2-6 repeated base pairs seem to "deliberately" cause variant behavior in DNA and RNA, see

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)

Many real wire protocols have mechanisms to prevent repeated sequences entirely

https://en.wikipedia.org/wiki/8b/10b_encoding

DNA coding for real proteins is unlikely to be too terribly repetitive, but I imagine a long α helix could have a repetitive amino acid sequence. Many amino acids can be coded with variant codons, so I guess if repetition were a problem in a particular gene, natural selection could step in.
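As a toy illustration of that last point (not a claim about how cells actually choose codons), here's how the same poly-leucine stretch can be written with or without nucleotide-level repetition, using leucine's six synonymous codons:

    import random

    LEU_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]  # the 6 leucine codons

    def encode_poly_leucine(n, vary=True):
        """Encode a poly-leucine stretch with a single codon or a random mix."""
        if vary:
            return "".join(random.choice(LEU_CODONS) for _ in range(n))
        return "CTG" * n  # maximally repetitive encoding of the same peptide

    print(encode_poly_leucine(8, vary=False))  # CTGCTGCTG... (repeat-prone DNA)
    print(encode_poly_leucine(8, vary=True))   # same protein, less repetitive DNA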


Repetitive DNA sequences do occur and do cause problems.

https://en.m.wikipedia.org/wiki/Trinucleotide_repeat_expansi...

TLDR: DNA triplet repeat expansions cause diseases like Huntington's disease.


DNA is not really processed like that, afaik. Mostly, each group of 3 bases codes for an amino acid; these are glued together into a string (a protein), which folds into a 3D structure based on the characteristics of all its amino acids.

Some DNA is used to attract other proteins, or even interact with DNA elsewhere on the strand, or is transcribed to RNA (one-to-one), which can then have a function based on its sequence or the structure it folds into.

Any 'logic' there is, is built _on top_ of this.
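For anyone who wants that first step in code: a minimal sketch of the "each 3 bases -> one amino acid" reading, with a deliberately truncated codon table (the real table has 64 entries, and real translation goes via mRNA, ribosomes and tRNAs):

    # A handful of entries from the standard codon table (truncated).
    CODON_TABLE = {
        "ATG": "Met", "TTT": "Phe", "TTC": "Phe",
        "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
        "TAA": "STOP", "TAG": "STOP", "TGA": "STOP",
    }

    def translate(dna):
        """Read the sequence in steps of three and map each codon to an amino acid."""
        protein = []
        for i in range(0, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i + 3], "???")
            if aa == "STOP":
                break
            protein.append(aa)
        return "-".join(protein)

    print(translate("ATGTTTGGCTGA"))  # Met-Phe-Gly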


Any logic that we are currently aware of. DNA contains many unsolved mysteries and I expect that gift to keep on giving for a long time to come.


On paper this might be an interesting game, but you have to think of things in terms of crystal structure, what is able to form hydrogen bonds, what ends up being sterically hindered, and what that means for the molecule. This is why Watson, Crick and Franklin's work was so seminal: it showed how genetic information was inherited through the mechanical logic of these molecules alone. Before the structure of DNA was solved, there were a lot of competing theories over which molecule was the source of heritable information, and how exactly this information was passed down between generations.


Might error correction play a role? Having a slightly inefficient base-4 system might provide the surplus capacity needed for error-correcting-code information?


DNA mostly relies on the fact that there are 2 strands that are (logically speaking) a mirror copy of each other (a C is paired with a G and vice versa, an A with a T and vice versa); it's like RAID 3 with only 2 disks (one being parity).

Apart from repairing structural damage such as missing bonds, the cell can even repair missing bases or non-straight breaks without loss. This mechanism is also used for replication: the entire strand is split and each half is completed with its mirror counterpart.
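A toy sketch of that redundancy (ignoring that real strands pair up antiparallel): a base missing from one strand can be filled back in from its partner.

    PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def repair(damaged, partner):
        """Fill in unknown bases ('N') using the complementary partner strand."""
        return "".join(
            base if base != "N" else PAIR[comp]
            for base, comp in zip(damaged, partner)
        )

    damaged = "ACNGTNCA"
    partner = "TGCCAAGT"   # the intact complementary strand, in pairing order
    print(repair(damaged, partner))  # ACGGTTCA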


I'm aware of that, but was rather thinking about ECCs like Hamming codes, which are able to correct single sequences of info based on surplus info in that same string.


Nothing algorithmically sophisticated, but DNA repair enzymes already do this.


One mutation in a base pair can lead to a totally different amino acid (cf. the genetic code), so I doubt it?


BUT, if you look at the codon table, precisely because it's base-4 and not base-3, many base flips are silent when coded.

By using base-4, there's enough space to permit lossiness of the coding itself - given the number of amino acids and the 3-NT encoding.

So you really aren't optimizing JUST for nucleotide encoding, but you're also optimizing in concert with 3-nt/AA, and 20AA codes.

So if you have to optimize for information density and fidelity, given X nucleotides, Y nucleotides/AA, and Z AAs, while sampling as much chemical and physical diversity in those AAs as possible, life has settled upon: X=4, Y=3, Z=20.

If we went with X=3, you might need Y=4 to get the same kind of fidelity, but that cranks up your energy costs by 30% (from 3 to 4 NT per AA).


True, if the error persists through the correction cycles that are present; how exactly those work, and whether they're comparable with ECCs, I don't know.


You (at least) have 3 systems that are optimized in concert in a (our) DNA/Protein world.

DNA base set, Amino acid set, Translation layer between DNA/Proteins.

Currently, we've got: 4 DNA bases, 3 bases/AA, 20 AAs; 4^3 = 64 codons => 20 AAs

If you change one of those numbers, you'll need to rejigger the rest, and you'd need to reoptimize. And there are competing goals which at least include:

- maximize access to biophysical/chemical diversity
- minimize energy expenditure to produce each component, chemically
- minimize energy expenditure to both copy instructions & produce products
- maximize information fidelity
- minimize errors, or at least degrade gracefully when they occur

In the context of a 3-base system, you very well could throw off those optimizations given the consequences for the other 2 parameters (#AA & nt/AA). 3^3 = 27, which is barely above the 20 amino acids needed. Which means you'd probably need a 4nt->AA translation layer to keep the same number of AAs, and that alone would add 30% more energy expenditure. If you kept the 3nt->AA system you'd BOTH need to reduce the number of accessible amino acids AND you'd lose some of the error correction mechanisms of having degenerate codons code for the same amino acid.
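A rough sketch of that trade-off, treating it as pure combinatorics (no chemistry, just coding capacity vs. codon length, with the amino acid count held at 20):

    from math import ceil, log

    AMINO_ACIDS = 20  # size of the standard proteinogenic set

    for bases in (2, 3, 4, 6):
        codon_len = ceil(log(AMINO_ACIDS, bases))   # minimum nt per amino acid
        capacity = bases ** codon_len               # number of distinct codons
        print(f"{bases} bases -> codons of length {codon_len}: "
              f"{capacity} codons for {AMINO_ACIDS} AAs "
              f"(degeneracy headroom ~{capacity / AMINO_ACIDS:.1f}x)")

    # Copying cost per amino acid scales with codon length, so going from
    # 3 nt/AA (base 4) to 4 nt/AA (base 3) is the ~30% jump mentioned above.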


There's a lot of interesting things to consider.

One is that base 4 makes a lot of sense for the stability of DNA structures. You have two purines, two pyrimidines.

Another is that, partly because codons are degenerate, the base composition is way off a uniform distribution. For chemistry and mol bio reasons, the distribution of A, G, T and C is very skewed.

When I fully wake up, this might be a fun blog post to draft.


> (probably because 4 naturally lends itself to pairs of 2).

Why would pairs of two be favorable?

> So for example, ternary logic is (much) more efficient than binary logic. Having that 3rd state just makes problem solving much more elegant.

What do you have in mind here?

> Like, are there all sorts of exotic logic gates that come from a 4 state system?

I don't know but you may be interested in this [1].

1. https://en.wikipedia.org/wiki/Catu%E1%B9%A3ko%E1%B9%ADi


Maybe it has something to do with protein folding, since apparently scientists do not understand the physical mechanisms that translate a given DNA sequence into a given 3D shape. The best they can do is infer probabilistic patterns using ML as a proxy for a mechanism of action. Maybe there's something quantum at work that means it will never be fully deterministic.


Why do my eyelashes, meant to protect my eyes, fall into my eyes? Why do my cheeks/tongue sometimes get in the way of my teeth so that I bite them? And why do they then get inflamed so that I continually byte them for the next few days?

We are all a bunch of biological goop resulting from random processes. Don't expect optimal solutions from evolution. There is no "why".


There is a why, and reasons we find, and insights we glean. The denial of a why is itself an anthropocentric take on biology, influenced by Darwinian thought. But for any biological system we study, we need a why; otherwise the corollary of your statement essentially leads to no learning or understanding, because everything becomes "arbitrary" and is explained away by randomness.

Ironic to say it's not optimal when really we don't have the full knowledge. Often when we learn more, we learn how little we really know about biology.


It's probably not base four because you have to stretch out more pairs to match up four pairs and that's entropically disfavored. However, ribosomes can accommodate a four-pair matching, though at a very reduced yield (unless you think Schultz's postdoc fabricated those data).


>So for example, ternary logic is (much) more efficient than binary logic. Having that 3rd state just makes problem solving much more elegant.

Binary for electronics is obvious because there are 2 states in electric components: on or off. There is no 3rd option.


I believe "on" and "off" in electronics typically correspond to different voltage levels. So you absolutely could have a third intermediate state if you wanted to. Flash memory does this (and even sometimes has 4 states). I guess designing switches (transistors) that could take advantage of and propagate these extra states could be tricky though.




Radix is important for digit-efficiency, but in a biological system that's not necessarily related to molecule size efficiency.


I’m also failing to see how digit efficiency would be important in DNA. In fact, it seems that a high base system would be more efficient. If you had 80 nucleobases instead of 4, each base pair would contain far more information


> If you had 80 nucleobases instead of 4, each base pair would contain far more information

Which is a problem given DNA is a lossy format.


Strictly speaking, you could encode more error-correction bits into a higher nucleobase count but that'd pretty much require intelligent design, and wouldn't necessarily be viable for microorganisms to have that many proteins handling all the metabolic scaffolding.


The efficiency comes from the ratio of the alphabet size to the number of character places needed to express a value. Otherwise why not base a million? Or a billion?

This ratio is what leads to base e being the theoretical maximum.


Man, I have a hard enough time trying to keep track of a micro SD card, imagine misplacing your DNA based files?

Seriously tho, using DNA as an information storage medium is a pretty neat concept.


>using DNA as an information storage medium is a pretty neat concept

And billions of years old.


> And billions of years old.

With not quite good backup strategies.


If you create your data right, the actual data can make backups of itself. There are even built-in ways for it to improve itself over time using genetic algorithms.


A kind of implied meaning of the term "data", especially in the context of storage and archiving, is that we do not want it to "improve itself".


DNA/RNA looks to be more like storing heuristics, landmarks and clues, not data.


Can you elaborate on that? A significant portion of DNA in organisms literally encodes for protein sequences. It also has functional parts (binding sites for proteins, promoter sequences). Some RNAs are not translated because the RNA itself has function, but I don't see that same argument for DNA.


Only like 1.5% of the human genome is protein coding.


And like 90% of E. coli genome is protein coding. I intentionally wasn't limiting it to humans because humans make up a very small portion of total DNA in the world.


Having an error rate means a small chance of gaining an edge that makes up for having it.


Could make for some interesting decoding errors as your original data mutates


It's not often that HN makes me laugh.


You could even say we're talking about a legacy storage medium ;)


Ok, who ate the family movie archive?


Man, what a confusing title. It's not a single strand of DNA that gets the image information. You get a pool of DNA, which collectively holds the image's information.

This is done in a pretty obvious way: each "pixel" is a well in e.g. a 96-well plate; you expose the bacteria in these wells to different light, which triggers the DNA transformation, and then you harvest the DNA from the bacteria and get your image pool library.


Would be neat if you could somehow splice them into a single sequence, but I think you'd need some alternating sequences that determine position and sequences that really code the information.


yes, I was very impressed by the title, but once I dug into this it was sort of like "well, we could probably have accomplished this ~10 years ago when optogenetics first came out". Definitely a situation where branding something "so silly that no one did it before" got it to be noticed.


Assassin's Creed is becoming reality soon.


Once upon a time, the Wikipedia article about hacking provided the following as a sort of "canonical" example of hacking: using an optical mouse as a barcode scanner. In some ways, this incredible paper feels like an iteration of that example.


Just so we're clear, this is ONE pixel per, I don't know, 10000 cells or so. So one bit per DNA chain, with that bit repeated thousands of times to get redundancy. Still an incredible achievement.


The super neat thing is that they tag each DNA chain with the pixel coordinates, so you can afterwards mix those ~10,000 DNA strands per pixel, for all 96 pixels, into one million-strand DNA soup and still recover the image successfully.
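A toy model of the addressing idea (just the concept, not the paper's actual encoding scheme): tag every redundant copy with its pixel coordinate, shuffle everything into one pool, and the image still reassembles.

    import random

    WIDTH, HEIGHT, COPIES = 12, 8, 100
    image = [[random.randint(0, 1) for _ in range(WIDTH)] for _ in range(HEIGHT)]

    # Build the "soup": many tagged copies per pixel, then shuffle them all together.
    pool = []
    for y in range(HEIGHT):
        for x in range(WIDTH):
            pool += [(x, y, image[y][x])] * COPIES
    random.shuffle(pool)

    # Recovery: the address tags make strand order irrelevant.
    recovered = [[None] * WIDTH for _ in range(HEIGHT)]
    for x, y, bit in pool:
        recovered[y][x] = bit

    print("recovered intact:", recovered == image)  # True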


>successfully

Except when it recombines weirdly and gets mixed up as per the article.


Those are just mutations in the image


"Hallucinations" seems to be the modern term.

/s, but only slightly.


Am I alone in thinking that using regular DNA is a terrible idea for data storage?

I mean, that would make your storage medium a potential biohazard. Although it probably would all be cool until someone put smallpox.bin on a major torrent tracker.

If it’s really that good, we should come up with a variant using slightly different chemistry so that biocontamination is not a factor.


> that would make your storage medium a potential biohazard

generally no. If you're worried about random DNA being a biohazard there are way worse things to worry about, like how your immune system uses random stretches of biologically primed dna to create antibody diversity.

The real reasons why it's terrible are that write speed is atrocious and read speed is bad (on the order of 2-3x that of Amazon Glacier's robotic tape handlers, with WAY MORE expensive robots, and a way more expensive cost to read -- you're bulk-polluting rivers in China to make the reagents).

The only use case I can think of is deep generational archival (like the Svalbard seed bank, but for information), where cost to store by volume is at a premium, where you'd like to have many, many, many copies, and where you don't mind the cost to read because you won't be reading it more than once every 10 or so years, if even.

Store your logs in DNA. You're never going to read them anyways.


Having DNA as a storage medium is the best way to store actual biological data. Currently, we do things like having seedbanks, which need periodic replacing as seeds grow to be nonviable. A library of genomes is a much smaller physical footprint than a seedbank. It doesn't need periodic replacing, provided it's not getting bombarded by radiation or anything unusual like that. DNA doesn't even have to be stored frozen; you can freeze-dry it and store it at room temp for a very long time before any significant degradation. You can also just have the sequence stored digitally, and synthetically build out the DNA molecule as you need it (I think this is still pretty costly though, and not that efficient). With the right molecular biological tooling, one could conceivably introduce these genomes into a plant cell line and grow them up in tissue culture; you don't have to, for example, grow a tree and let it mature and go to seed, since plant cells are pluripotent, so everything can be done in a lab much faster.


> Having DNA as a storage medium is the best way to store actual biological data

100% agreed. I thought that was obvious; I was mostly sniping at "DNA-based digital storage startups". Thanks for clarifying for me.


My scenario is based on the idea that someone would write code that generates an "archive" that is biologically active, like a prion or a virus or something else that has dangerous properties outside of its context as a data storage device. Not so much random DNA but very specific, targeted DNA.

I mean, DNA storage means you have an advanced, high performance DNA synthesis, copying, and sequencing machine. That’s fine in general, because most people using such a machine have a concept of the risks and responsibilities.

But imagine what would happen if we had millions of those machines connected to the internet with unpatched vulnerabilities. It seems like that could have some negative outcomes.


Unlikely those sequences would pass checksum

Padme meme with checksum


imagine downloading a tv show and it turns out its shit


This is great and all, but the merge process is still messy as hell


Anyone else think that there is already primitive image data encoded in biological data? Essentially basic shapes and patterns which are passed down semi-generationally.


Anton Petrov has a recent video on YouTube. I haven't watched it yet, but its title is "Could Life Be Transmitted Via Radio Waves? Information Panspermia". Just a bit of fun, I'm sure; Anton isn't too wild, he puts out some interesting videos but not in a way that pushes quackery.

https://www.youtube.com/watch?v=K4Zghdqvxt4


Fascinating, thank you

Recently here on HN someone posted a quote saying something like “if you shine light at something for a long enough time, don’t be surprised if you end up getting a plant”

It was about how the environment seems to reorganize in certain ways to use up energy (the latest Veritasium video about entropy also talks about this)


I do not know of any true "image" data. Most complex patterns in nature are created by generative processes rather than direct encoding.


I guess it's possible if this conferred some survival advantage.

It can be useful to work from the evidence to a conclusion instead of the other way round.

But wondering and philosophising can be fun :]

It would be cool if humans could pass knowledge via their offspring. But I always get worried thinking if I'm the asshole, I wouldn't want my kid to be one too.


I always get worried thinking if I'm the asshole, I wouldn't want my kid to be one too.

If you were, you would.


Just has to not be a disadvantage.

Plenty of mutations have no purpose whatsoever and were unrelated to survival or manifest after reproduction so are not selected for or against.


I think it would have very high energy requirements. For this trait to survive over generations there would need to be a tremendous evolutionary benefit. What would that be for "primitive image data"?


Maybe things like "long green shape" (cats' fear of cucumbers because they resemble snakes), or "a series of black and yellow stripes", or even "a black blob with many appendages" to watch out for spiders? Encoding some primitive image data so that further generations know what to avoid or pursue seems like a very tremendous evolutionary benefit.


Yeah, I expect this isn’t going to be how that sort of mechanism works, but it’s always been an interesting concept for me, that while “genetic memory” as presented in much fiction is extremely unlikely just from the sheer entropic hill such mechanisms would have to evolutionarily climb to be able to pass on so much information (on top of the baseline necessary information for reproduction, the majority of memory won’t on average confer a lot of reproductive advantages, so it’s statistically more likely to get optimised out by the random mistakes of evolution, hence entropically “uphill”) …

Yet while this fictional form is unlikely we have quite a lot of good examples and evidence for “inherited information”. You have to be careful with it since it’s too easy to accidentally include side channels for organisms to learn the information and thus break the test. Such as insects being genetically driven towards food by smell at a molecular chemical interaction level, and the smell becoming associated with the information you wish to test. A bee colony can’t be reliably tested unless you raise it from a new queen in an odourless environment if you wish to see if bees genetically know that the shape of a flower is associated with food. It’s tough to subtract the potential that a colony will have learned and “programmed” later generations of bees with things like the classic waggle dancing in order to more efficiently gather food.

We do have good ones though, like cats and snake-shaped objects; it's surprisingly consistent, and pops up in some other animal species. It's wired into our brains a bit to watch out for such threats. There's a significant bias towards pareidolia in human brains, and it's telling how deeply wired some of these things are, but it is there, and studies show it seems to form well before our cognitive abilities do… these all have some obvious reproductive advantages however, so it makes sense that the "instinct" would be preserved over generations as it confers an advantage. But it's still impressive that it can encode moderately complex information like "looks like the face of my species" or "cylindrical looking objects on the ground might be dangerous"… even if it's encoded in a lossy subconscious instinctual level.


> But it’s still impressive that it can encode moderately complex information like “looks like the face of my species” or “cylindrical looking objects on the ground might be dangerous”… even if it’s encoded in a lossy subconscious instinctual level.

I think it helps that the encoding does not have to be transferable in any way. This kind of "memory" has no need for portability between individuals or species - it doesn't even need to be factored out as a thing in any meaningful sense. I.e. we may not be able to isolate where exactly the "snake-shaped object" bit of instinct is stored, and even if we could, copy-pasting it from a cat to a dog wouldn't likely lead the (offspring of the) latter to develop the same instinct. The instinct encoding has to only ever be compatible with one's direct offspring, which is a nearly-identical copy, and so the encoding can be optimized down to some minimum tweaks - instructions that wouldn't work in another species, or even if copy-pasted down couple generations of one's offspring.

(In a way, it's similar to natural language, which rapidly (but not instantly) loses meaning with distance, both spatial/social and temporal.)

In discussing this topic, one has to also remember the insight from "Reflections on Trusting Trust" - the data/behavior you're looking for may not even be in the source code. DNA, after all, isn't universal, abstract descriptor of life. It's code executed by a complex machine that, as part of its function, copies itself along with the code. There is lots of "hidden" information capacity in organisms' reproduction machinery, being silently passed on and subject to evolutionary pressures as much as DNA itself is.


Oh absolutely... and that's a great analogy for the more computer oriented, "Reflections on Trusting Trust" highlights how it can be the supporting infrastructure of replication that passes on the relevant information... a compiler attack like that is equivalent to things like epigenetic information transfer... and for fun bonus measure since it came to mind... the short story Coding Machines goes well for really helping to never forget the idea behind "Reflections on Trusting Trust" https://www.teamten.com/lawrence/writings/coding-machines/

It definitely would be minimised data transfer, be it via an epigenetic nudge that just happens to work by sheer dumb luck because of some other existing mechanism or a sophisticated DNA driven growth of some very specific part of the mammalian connectome that we do not yet understand because we've barely got the full connectome maps of worms and insects, mammals are a mile away at the moment... no matter the mechanism evolution will have optimised it pretty heavily for simply information robustness reasons, fragile genetic/reproductive information transfer mistakes that work, break and get optimised out in favour of the more robust ones that don't break and more reliably pass on their advantage.


You need to compare that with an alternative solution where this information is learned by each generation, and then assess the survival advantage of having it encoded in DNA. This is outside my field and I don't have a strong opinion.


Exploit idea: create an image which, when a shot of it is taken, would be written to DNA as a virus.

(I know viruses are RNA)


There are plenty of DNA viruses. They aren't limited to RNA at all


And create a never to be deleted record of images across the infected population?


Cue "tasteless porn in bitcoin blockchain forever".


Luckily, DNA is mutable over generations, so all such noise can and will eventually be filtered out.


Do a Quine now!


> DNA synthesis remains a bottleneck in the adoption of DNA as a data storage medium.

Yes, one of many.

Another one is a simple question: What exactly is the use case again? Because, storage isn't something we lack. Especially when talking about storage where, obviously, fast random access isn't a requirement, aka. data archiving.

We have good solutions for that; an LTO-9 tape can hold 18TiB of data native and up to 45 TiB of data compressed, with denser capacities planned: https://en.wikipedia.org/wiki/Linear_Tape-Open


Encode wikipedia into DNA, then insert it into a horseshoe crab. In a few million years it may still be around to be decoded.

>The fossil record of Xiphosura goes back over 440 million years to the Ordovician period, with the oldest representatives of the modern family Limulidae dating to approximately 250 million years ago during the Early Triassic. As such, the extant forms have been described as "living fossils".[9] https://en.wikipedia.org/wiki/Horseshoe_crab


And yet, in those 250 million years, it's very likely that the genome was completely rewritten. The phenotype of the organism is more stable than its genome. Not a shred of Wikipedia would be left after such a time scale.


There's a lot of redundancy and "archaic" DNA in most multicellular organisms. I imagine that over 50% of wikipedia would still be present after a million years.

The inserted wiki-DNA could also conceivably be constructed so as to ensure its own perpetuation.


Some numbers I found online, while trying to multiply the 30 trillion human cells by the data storage of DNA per cell:

"one gram of dried DNA can store 455 exabytes of data"

Seems like a pretty sweet use case to me!

I definitely do lack storage by the way. Say I want to download the common crawl data set, 380 TiB. And for redundancy I'd need multiple copies of the data too. That's a lot of disks for the home. "18TiB ought to be enough for everyone" really doesn't cut it.


> one gram of dried DNA can store 455 exabytes of data

Yes, and half a gram of Hydrogen could produce ~500 Megawatts of power in a fusion reactor. However, that theoretical value will remain irrelevant, as long as we cannot build a practically useful fusion reactor. And even if we could build one, it still has to compete with all other forms of producing power for scalability, reliability, efficiency and cost.

The fact that there is a very high theoretical number that seems really impressive, isn't a use case.

So, with that being said: how long does it take to write these 455EiB? How long does it take to read them? How error prone are both processes? And how much does it cost to write/read them?

> "18TiB ought to be enough for everyone" really doensn't cut it.

Pretty sure I never said that.

Also pretty sure common crawl can be compressed. Even assuming only a 2:1 compression rate, that means it fits comfortably on 11 LTO-9's. Now, a quick Google search churned out tape prices of about $110-140 per LTO-9. Let's say ~$150 per tape; that means the whole thing fits on $1650 worth of storage. About 5000 bucks with 2 backups included. Double that for uncompressed storage.

Alright, so how does that compare to DNA storage?

https://www.nanalyze.com/2023/03/dna-data-storage-solution/

quote:

These days, it costs $600 to sequence a complete genome which contains around 200 gigabytes of data or about $3 per gig. Today, magnetic tape technology offers the lowest purchase price of raw storage capacity at around two cents per gigabyte

end quote.

So just reading the 380 TiB back from uncompressed storage ONCE would cost ~1,140,000 dollars.

And that's just for reading. At a price differential that is measured in multiple orders of magnitude, a technology had better offer some REALLY good, REALLY tangible advantages to compete.
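Back-of-the-envelope version of that arithmetic, using the rough prices quoted in this thread rather than anything current:

    dataset_tib     = 380
    lto9_native_tib = 18
    tape_price_usd  = 150
    dna_read_per_gb = 3            # ~$600 per ~200 GB genome, as quoted above

    tapes_compressed = -(-dataset_tib // (lto9_native_tib * 2))  # 2:1 compression, rounded up
    tapes_raw        = -(-dataset_tib // lto9_native_tib)
    print(f"LTO-9, 2:1 compressed: {tapes_compressed} tapes, ~${tapes_compressed * tape_price_usd}")
    print(f"LTO-9, uncompressed  : {tapes_raw} tapes, ~${tapes_raw * tape_price_usd}")

    dna_read_cost = dataset_tib * 1024 * dna_read_per_gb        # treating GiB ~ GB
    print(f"DNA sequencing       : ~${dna_read_cost:,} just to read it back once")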


I of course wouldn't want to store my data in it today; I wouldn't even trust that I'd get it back reliably, because DNA reading comes with a relatively big error rate for storage purposes (of course error correction can mitigate that). But it would be cool if the technology progresses. All technology, including disks, magnetic tapes, and new alternatives. Whether DNA is viable in the end or not, I don't know. I do know that tech has always been progressing and new alternatives are sometimes found, and that I do see a use for more storage.

But an argument about whether DNA is a viable option in the future would have to say, technically, what the issue with DNA is even with future tech.

Whether it's more expensive today, or whether there's no need for more storage today, are not really arguments against it.

I do not intend to be arguing for snake oil or anything here though. If "DNA storage" is in a similar category to "perpetual motion machines" and "cars that run on tap water", then count me out.

I don't even know how our comments ended up arguing against each other. The only thing I really didn't agree with in the original comment was "Because, storage isn't something we lack", because I do find it lacking, both at home and in the cloud.


Let's take a moment to appreciate how the classic "bandwidth of a station wagon filled with tapes" scales with tape technology. Too bad there aren't many station wagon choices nowadays, but I guess any minivan would do in a pinch.


Can your tapes self replicate? /s

The goal of replacing memory cards is dumb; the tech that enables the storage is a foundational step forward in bioengineering.


Can't wait until they do a worldwide investigation to find patient zero from the selfie encoded in the next pandemic.


> Can't wait until they do a worldwide investigation to find patient zero from the selfie encoded in the next pandemic.

Then they realise the picture is this dude: https://blogs.loc.gov/loc/2022/07/robert-cornelius-and-the-f...


Reminiscent of a very interesting company I interviewed for last year called Cache DNA

https://www.cache-dna.com/

This is the future. I don't think it will look exactly like this, and I don't think it will be here any time soon, but I'm excited to see these advancements.

What Cache is doing presently is trying to do archival storage in DNA - it has a lot of potential to be cheaper, more energy efficient, and more redundant. But some of the processes still aren't there yet.


Even just storing family photos would require DNA sequences that are orders of magnitude larger than the human genome, so you're going to be looking at very expensive or very time-consuming read/write (and certainly no instant read/write at any cost; the turnaround time can't be less than hours, even for small files, even with high-end HTS or nanopore approaches afaik). What is the plan for getting around this?
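Rough scale check with made-up but plausible numbers (a hypothetical 200 GB photo library, an ideal 2 bits per base, no ECC or addressing overhead):

    GENOME_BASES  = 3.1e9           # ~3.1 Gbp haploid human genome
    BITS_PER_BASE = 2               # ideal-case density
    photo_library_bytes = 200e9     # hypothetical 200 GB of family photos

    bases_needed = photo_library_bytes * 8 / BITS_PER_BASE
    print(f"bases needed: {bases_needed:.2e}")                      # ~8e11
    print(f"~{bases_needed / GENOME_BASES:.0f}x the human genome")  # a couple hundred x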


Something like this was posited in Banks' Excession, and I remember thinking how advanced passing messages via DNA embedded in bacteria seemed.


It’s also a key plot point in Tchaikovsky’s recent Children of … trilogy.


It reminds me of the children's story https://en.m.wikipedia.org/wiki/The_Mystery_of_the_Third_Pla..., which had flowers that captured the surroundings in layers, like a film camera.


I've always wondered if the plant/animal shapes and sizes were represented literally in a 3D mapping of the DNA. Like we could already have a picture of what it will become, if we could just decode the DNA sequences properly.

Like we have the sequence of numbers for a jpg, but we've never seen the picture.


Related: The Verge's DNA time capsule [0].

[0]: https://www.theverge.com/c/22173998/dna-time-capsule


I've been up close to one project working in this space. The obstacles are obviously many, but it's fascinating to see that progress is being made nonetheless. Clearly a piece of the future-puzzle.


this is like spy technology, super cool



