Literally the modern equivalent of the old video-based backup systems. I remember they existed for both the PC and the Amiga. You would load a blank VHS tape into a VCR, connect the output of the computer to that VCR's input, and then tell the program which data you'd like to backup to the tape. It would generate this flashing "mess" of black and white pixels that you'd record to the tape. To restore, you'd connect the VCR output to a little box that came with the product, it would convert the black and white data in the video signal to a data stream that the program would use to restore your data.
A portion of the signal would be used for timing, metadata and error correction, so the program could tell you if the data was sufficiently damaged upon restore.
We used to use regular audio cassette recorders to store/restore data on the TRS80 before hard/floppy drives. It's also how you backed up/restored midi data from early synths. It basically just sounded like an early dial up modem transmitting data when you played it back as audio.
> It's also how you backed up/restored midi data from early synths.
In a weird closing of the circle, I now store the internal sounds backup of my vintage Juno 60 synthesizer as a WAV file recorded from that tape backup output.
So the digital info of the internal synthesizers gets converted to analog audio in the synth, then passed as audio to my modern computer’s audio interface, which converts it to a digital representation of the analog audio.
And vice versa to restore the backup into the synthesizer’s memory.
Incidentally those backups are more reliable now than when using analog tape decks, since one doesn’t encounter physical tape degradation or a cassette deck “eating” the tape.
I haven’t done any testing with compressed audio formats, but I would expect even lossy formats to perform well, if one keeps the lossiness within certain bounds, so that the highest frequencies in the audio file are preserved.
> I haven’t done any testing with compressed audio formats, but I would expect even lossy formats to perform well, if one keeps the lossiness within certain bounds, so that the highest frequencies in the audio file are preserved.
MIDI as a compression format for that kind of audio data would be a lossy way to encode such an audio stream, and it certainly would perform well, so yes, such lossy formats do exist.
Most research in audio compression has been done on compressors that exploit the limits of human perception, though, so off-the-shelf lossy compressors may not do very well.
My Apple ][+ had a tape interface. It mostly worked - if the tape stretched or if the tape speed changed for some reason (dirty capstan, power supply fluctuations, low volume, high volume, evil pixies) then you wouldn't be able to read it back.
This site describes the format, which was basically a header tone, a sync tone, data bits, and then a checksum (not described there but other sites say it was just an XOR). When we got a Disk ][ (5-1/4" floppy drive) all those issues went away.
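A minimal sketch of such a checksum: a running XOR over the data bytes, which is enough to flag (though not correct) a read-back error. The 0xFF seed is how the Apple monitor ROM reportedly initialized its checksum register; treat that detail as an assumption.

```python
def xor_checksum(data: bytes) -> int:
    """Running XOR of all bytes, reportedly seeded with 0xFF
    by the Apple ][ monitor ROM's tape routines."""
    c = 0xFF
    for b in data:
        c ^= b
    return c

# A single flipped bit anywhere changes the result, so a stretched
# tape or speed glitch shows up as a checksum mismatch on load.
```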
That's how you loaded games onto my Spectrum 48k; you had to make sure the volume on the output wasn't too high or too low. I'd guess it had a bandwidth of about 2-3 kbit/s from how long I think it took to load (nearly 40 years ago).
There was also DVStreamer for Windows and other tools for other platforms which would store data on MiniDV tapes. This is of course a bit less interesting than storage to VHS, since MiniDV was already storing a bitstream, but still a clever oddity. I think you could store ~13.5GB in SP mode or 20GB in LP mode (reduced error correction).
Had one of these with my C64. When my floppy drive broke down, I actually ended up ordering a copy of "Silent Service" on cassette from someone in Great Britain. It kept me sane while I saved for the floppy repair.
Minor nit - the Sony PCM adaptor worked with a Betamax deck. My parents did a documentary in the late 80s where almost all the original music for the soundtrack was recorded with a Sony PCM-1 and SL-F1. I wish I still had the masters for it.
Beta had a slightly higher bandwidth, but probably not enough to make a difference.
These days you get four channels of 96kHz 20-bit audio in a wee box the size of a Betamax tape, with hours and hours of recording on an SD card. That physical size is mostly a function of needing half a dozen XLR connectors on it and a big enough screen to see what you're doing.
(IMO there are not enough of these posts, and they're getting rarer over time.)
A refreshing "actual hacker" project that makes me look anew at the tools I always use...
So, my coffee maker is sending data to the net - maybe I can use that for backup, and have it replicated both in the fridge and in the living room lights...
But how would I retrieve that? Hmm. I assume that both Alexa and Google assistant are tracking everything that goes through my IoT devices. I'll ask GPT how to hack my Nest device to pull back data on demand, that oughta work, surely?! :D
Aha! I was wondering where the github stars were coming from. :)
I did get a kick out of this from the OP:
> Binary: Born from YouTube compression being absolutely brutal. RGB mode is very sensitive to compression as a change in even one point of one of the colors of one of the pixels dooms the file to corruption.
It's more than youtube compression -- video compression in general wreaks absolute havoc on our meticulously arranged (and sometimes colored) pixels. It's actually pretty fun/instructive to step through the transition between (what you want to be) two distinct frames when you're trying to (ab)use video for this sort of use case -- there are segments of the frames that get correlated and "flip" together first, resulting in in-between frames that are gibberish even with a modest amount of ECC in play.
It starts with "no monitors facing windows" and "all visitors hand over phones and any other devices with photographic possibilities" and moves up the paranoia/professional caution scale from there.
It isn't exactly a "glitch", just something Google doesn't care about (but absolutely will care about if too many people start doing it).
I remember way back in the day someone came up with a clever way of using Gmail attachments to build a cloud storage drive mounted to your filesystem. Then Google themselves released Drive soon after.
I doubt "too many people start doing it" is ever going to happen.
Obviously this is so difficult to use that most people would rather pay $10/month to get 1TB of storage that can be very easily accessed. Even if someone has 100TB of data and wants to back it up, I don't think they would do conversion to and from YouTube videos.
An interesting idea, but probably won't get much real world use.
Pirates will take advantage of any suitably easy to use storage. I think YouTube is probably a poor target these days, though - Google's Denial of Service can probably detect something like this in pretty short order.
You also run the risk of YouTube deleting your videos / banning your account. I’m sure they wouldn’t appreciate being used as a generic backup provider.
Nice, until Google introduces a new compression algorithm that says: hey this looks like noise, let's replace it by this other user's noise so we can save on storage costs.
I like the novelty of this project, but if you value your Google account I wouldn't try this out.
Google has been known to close accounts and "related" accounts for abuse (as defined by them). So even if you create another account, don't expect your main account to survive if there's any possible link between them.
They are the judge, jury and executioner, so eff around at your own peril.
$20 a month gives you "unlimited" storage at google. they've gladly taken my encrypted files for years now and I'm up to 80TB. i think it's more than reasonable to pay them for that type of service and be slightly above board (the account type i have says i need a minimum of 5 people but it's just me).
Which means that if, for whatever reason, they decide to close your account - say one of the pictures in those 80TB triggers something that looks like CSAM [1] - you are seriously up the creek.
Ditto if someone gets hold of your phone and changes the login on your account, or they decide to not let you in because something "looks suspicious".
You are brave. I hope, for your sake, you have a local backup.
How long does it take for you to download 80TB? From what I can see Google allows you to download 10TB per day but who knows when they will change that limit.
Previous average lifespan of a human being. Just needed a number to stop the analysis at. The one that comes bundled with the implication "Welp, I'm dead now" felt appropriate given that if you are dead, and the data is too hard to access, probate will likely be the end of your data storage foray. Any longer, and you're most certainly talking organizational scale preservation efforts.
Price per TB appears to have fallen below $8, so that's about $640 worth of storage. Basically, if you were to buy your own hard drives, it works out to roughly $27/mo amortized over two years.
This particular account, while loss-making for them, is not so by all that much.
A comparable Cloud Storage account on GCP with Coldline storage would be $320/month ($0.004/GB/month), or just $96/month for Archive storage ($0.0012/GB/month).
The actual cost to Google is probably < $80/month for this 80TB (most of the data is going to be stored in a version of archival storage, given the standard restriction of 10TB per day on export).
80TB is also a heavy outlier: given the typical available bandwidth today and the disk sizes commercially available to most users, it takes a lot of dedicated effort and time to upload this amount of data into the cloud.
Also, Google's personal storage pricing is not competitive for pure storage; Backblaze is only $7/month, for example. The higher price and value is derived from being able to integrate with other Google products and provide storage for them, like Gmail, Photos, etc.
A well chosen model has an AFR of well below 1%.
To get about, say, 100TB, you'd need a dozen drives or so with ZFS and a nice enclosure. It is unlikely that even one of them will fail in a given year, and with the redundancy you will not experience data loss.
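A quick check of that claim, assuming independent failures at a given annualized failure rate (AFR):

```python
def p_any_failure(n: int, afr: float) -> float:
    """Probability that at least one of n drives fails in a year,
    assuming independent failures at the given annualized rate."""
    return 1 - (1 - afr) ** n

# With 12 drives at a 1% AFR there is still roughly an 11% chance of
# at least one failure per year, which is why the ZFS redundancy
# matters, not just the low per-drive rate.
p = p_any_failure(12, 0.01)
```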
This is starting to change. India has a new law requiring social media companies to have a grievance officer and a formal grievance process that allows users to speak to an actual human. It lays out a set of valid reasons to suspend a user; companies cannot suspend or penalize a user for reasons not on the list, and must act in a fair manner as prescribed by law. If the grievance process fails, it can be appealed to a government office and then the courts.
I wrote something just like this with Discord, and I even got it to host full videos which you can play back in browser. It's a good backup service. [0]
I want to expand this into a fully modular service where you write payloads and scripts for various services, so when you upload a file it's spread out across many different providers. When you're downloading, you just go down the list, check what still exists, and verify the checksum. This should be stable for many years.
I plan to take a look into Facebook and see what can/can't be accessed there. I had this exact thought with YouTube and thought about using a pixel reader to extract the data. Same idea for different image hosting services like imgur.
I've observed that with any piece of technology where you're permitted to write / upload information and freely access it afterwards, someone will attempt to (ab)use it for file storage and write a blog article about it later :)
My favorite example of this was people storing files in "secret" subreddits by using posts and comments to store bytes. When they were later discovered by other users, the seemingly random strings sparked a huge conspiracy about their possible meaning.
However, you always have the problem that your unwilling host may remove your "files". I sometimes wonder about file storage using a textual output format that can't be distinguished from normal user interactions.
I remember when GMail was invite-only, and at the time they were offering quite a lot more storage for mail than anyone else, so people started using their GMail drafts to store files.
That was the first time I came across such a thing.
Someone even made an extension for Windows XP that allowed you to mount GMail as a storage volume.
> GMail Drive is a Shell Namespace Extension that creates a virtual filesystem around your Google Mail account, allowing you to use Gmail as a storage medium.
Writing the GmailFS HOWTO, and fixing a bug in the process, was my first exposure to the power of OSS. Looking back, I'm pretty sure this is what led me to pursue software engineering as a career!
> I sometimes wonder about file storage using a textual output format that can't be distinguished from normal user interactions.
You could use a reproducible LM (for instance using Bellard's NNCP as basis), and encode one bit in one word by taking the {first, second} most probable next word.
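A toy sketch of that rank-based scheme, with a hard-coded next-word table standing in for the reproducible language model (the table and words below are made up; a real system would use something like NNCP's predictions, identical on both ends):

```python
# Hide one bit per word: emit the model's most probable next word
# for a 0 bit, the second most probable for a 1 bit. The decoder,
# running the same model, recovers each bit from the word's rank.
NEXT = {  # made-up stand-in for a deterministic LM
    "the":   ["cat", "dog"],      # [most probable, second most probable]
    "cat":   ["sat", "slept"],
    "dog":   ["ran", "sat"],
    "sat":   ["the", "on"],
    "slept": ["the", "on"],
    "ran":   ["the", "on"],
    "on":    ["the", "cat"],
}

def encode(bits, start="the"):
    words, w = [start], start
    for b in bits:
        w = NEXT[w][b]          # pick by rank according to the bit
        words.append(w)
    return words

def decode(words):
    # each word's rank under the previous context is the hidden bit
    return [NEXT[prev].index(w) for prev, w in zip(words, words[1:])]
```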
This is fascinating! And the file transfer can be then fully disguised as a conversation, with a ChatGPT-like client and all. An unsuspecting user will see a chat bot; a specialized client app would be able to receive files by talking to it.
In this modern cloud-giant world it's abused for file storage yes. But I come from the more traditional web hosting world of the early 2000s and back then the general rule was that anything that could store information online would sooner or later be used to store porn.
This makes me think of Turing machines which store their own code inside themselves, which you can use for all kinds of interesting proofs. I wish I could find more about this.
> I sometimes wonder about file storage using a textual output format that can't be distinguished from normal user interactions.
I guess it depends on what noise-to-signal density you’re after.
With a long enough ChatGPT-generated output, no one would question a few out-of-place characters or even an emoji. With 3000+ different emojis to choose from, each one encodes more than a byte of data.
Another idea is using “they’re”, “their”, “there” as bits.
I vaguely recall some secretive company (perhaps Apple) using adjustment of spacing, capitalization, etc. to encode a unique serial number in messages sent by the CEO, which could then be used to trace leaks.
It's a plot point in Patriot Games, a 1987 Tom Clancy novel that introduced the term "canary trap" for this trick. He says he invented the term, but not the technique, which was already in use.
In a spat over the plot of Star Trek III (so, early 1980s), Harve Bennett distributed slightly different versions of the script, allowing him to track a leak back to Gene Roddenberry.
The book SpyCatcher says it was in routine use at MI-5, and you can find variations of it in lots of fiction too.
Pretty easy to do that, use a fixed point implementation of GPT(N) of whatever size you like and range code your data into the model probabilities. This also will achieve a close to rate optimal embedding-- allowing you to embed about as much data as the language model thinks the text has...
If you encrypt the data and include a checksum or other identifying bytes in the ciphertext you can even have unwitting human participants in the discussions and if their posts are context your embedded data will be credible replies. You just have to be sure that threading behavior doesn't make it impossible to give the decoder identical context.
One of the things I’ve successfully used YouTube for was video storage of my security camera system. Unlimited video storage with a simple app to watch them in case I need to check something out!
And it’s simple: camera uploads automatically via FTP, inotifywait script uploads to google!
All they'd have to do is limit the amount of private videos you're allowed to store. If your only option for storing unlimited security footage is to make it public, then people probably wouldn't do that.
Alternatively, if they're allowed to use the footage to train some AI that will help them take over the world, then maybe they want all your random footage for free.
Security cameras are usually low-framerate and compress highly anyway, due to not much happening between each frame, so I doubt it's going to be a significant cost in comparison to all the other, far more massive content which is also constantly being uploaded.
haha this is so cool, i made something similar https://punkjazz.org/scrambled-eggs/ a few years ago to explore transferring files directly through the camera so nobody can "see" what you download, because no packets go through the internet; i managed to do 10kbps or so
the modern qr readers are so fast and easy to use, it's unbelievable
Nice! It's such a neat way to transfer information :)
This guy extended the idea using fountain codes, which allows you to miss arbitrary frames and still recover the full message without waiting for the missed frames to re-appear:
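The peeling idea behind fountain codes can be sketched like this (a naive uniform degree distribution for brevity; real LT codes draw degrees from a robust soliton distribution, which makes decoding succeed with far fewer symbols):

```python
import random

def encode_symbol(blocks, seed):
    """One rateless symbol: the XOR of a pseudo-random subset of
    source blocks. Only the seed needs to travel with the value,
    since the receiver can recompute the subset from it."""
    rng = random.Random(seed)
    d = rng.randint(1, len(blocks))          # naive degree choice
    idxs = rng.sample(range(len(blocks)), d)
    val = 0
    for i in idxs:
        val ^= blocks[i]
    return set(idxs), val

def peel_decode(symbols, n):
    """Classic peeling: find a symbol with exactly one unknown block,
    solve it, substitute everywhere, repeat until stuck or done."""
    out = [None] * n
    progress = True
    while progress:
        progress = False
        for idxs, val in symbols:
            unknown = [i for i in idxs if out[i] is None]
            if len(unknown) == 1:
                acc = val
                for i in idxs:
                    if out[i] is not None:
                        acc ^= out[i]
                out[unknown[0]] = acc
                progress = True
    return out
```

The receiver just keeps collecting symbols, in any order and with arbitrary gaps, until peeling completes; that is exactly why missed video frames stop mattering.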
i was actually thinking about that, it could be even more cool now with modern text-to-speech and whisper and some funky word-based encoding with a huge dictionary, like:
teacher: 0b00010010101001, school: ...
and then the website can encode the data as a sentence and just text-to-speech it, and the receiver can use whisper to speech-to-text and decode
it will be the creepiest thing because it can be very steganographic and sound like a real sentence
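A minimal sketch of that word-dictionary idea (the four-word vocabulary here is made up and carries only 2 bits per word; a real 256-word list, chosen to be easy for speech recognition to distinguish, would carry a full byte per word):

```python
# One "symbol" per spoken word: TTS reads the sentence out, speech-to-
# text recovers the words, and decoding is a dictionary lookup.
WORDS = ["teacher", "school", "garden", "river"]  # made-up vocabulary
BITS = 2  # log2(len(WORDS)) bits per word

def encode(data: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in data)
    chunks = [bits[i:i + BITS] for i in range(0, len(bits), BITS)]
    return " ".join(WORDS[int(c, 2)] for c in chunks)

def decode(sentence: str) -> bytes:
    bits = "".join(f"{WORDS.index(w):0{BITS}b}" for w in sentence.split())
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```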
Video steganography might be a better approach and would be less likely to trigger account banning or claims of abuse by the hosters. The issue of avoiding data loss due to lossy compression algorithms seems to be an active area of research:
> "Moreover, most video-sharing channels transmit the steganographic video in a lossy way to reduce transmission bandwidth or storage space, such as YouTube and Twitter. . . Robust video steganography aims to send secret messages to the receiver through lossy channels without arousing any suspicions from the observer. Thus, the robustness against lossy channels, the security against steganalysis, and the embedding capacity are equally important."
I suppose in this project, the blocks of pixels are large enough to avoid data loss due to compression?
You could do this with any service which accepts user content. You could have a tumblr blog focused on "paranormal phenomena in white noise images" and fill it full of data embedded in images. If anyone ever asks, you just explain that, like many pattern illusions, not everyone can see the images contained within - try squinting, or covering the eye on the predominant side of your body, stand on your head, blah blah blah.
This is even easier, because jpg's ignore additional data past the end of the file. Post a low-res ~200kb jpg that has an additional ~20mb of data appended. It'll still render perfectly fine.
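The trick in miniature: JPEG decoders stop at the End-of-Image marker (FF D9), so any bytes appended after it are ignored when rendering. The image bytes in the test are a stand-in, not a real image, and this sketch assumes the stream contains no embedded thumbnail carrying its own EOI marker:

```python
EOI = b"\xff\xd9"  # JPEG End-of-Image marker; decoders stop here

def embed(jpeg: bytes, payload: bytes) -> bytes:
    """Append payload after the image; viewers still render it fine."""
    assert jpeg.endswith(EOI), "not a complete JPEG stream"
    return jpeg + payload

def extract(blob: bytes) -> bytes:
    """Everything after the first EOI is the hidden payload
    (assumes no embedded thumbnail has its own EOI)."""
    end = blob.index(EOI) + len(EOI)
    return blob[end:]
```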
You could do the same thing with PNGs and custom chunk types. Although in both cases you run the risk that some paranoid developer might filter out unexpected chunk types or additional data, so in both cases it would be best to put the data in the image payload.
The other consideration is that Tumblr was always very “creator” oriented and while they might produce thumbnails of various sizes the original image is still available and not mangled by resizing algorithms. Other free image hosts are going to crush that image down the maximum amount tolerable to the human eye. Google even does that for paid photo hosting.
I understand that the goal is to make the data survive video compression, but wouldn't it make sense to use at least some color information instead of entirely black and white pixels?
Chroma is lossier than luma in most common video codecs. AVC is 4:2:0 on YouTube. 4:2:0, quite confusingly, means that chroma is halved in both dimensions compared to luma (so one chroma pixel is congruent with four luma pixels). As well, most decoders will apply filtering on the chroma to upsample it to match the luma, meaning that your color boundaries are going to be indistinct at best, and you might even lose the original chroma values entirely in the process. You'd have to use multiple chroma pixels as one metapixel in order to increase resilience, which would diminish the capacity. With modern codecs, a monochrome signal seems better to use for actual data, although I could see it being useful to use chroma for metadata.
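A quick illustration of why per-pixel chroma data dies under 4:2:0, assuming a simple box filter for the downsample (real encoders vary in their filtering, but the resolution loss is the same):

```python
def subsample_420(chroma):
    """Naive 4:2:0 downsample: average each 2x2 block of a chroma
    plane, producing one chroma sample per four luma pixels."""
    h, w = len(chroma), len(chroma[0])
    return [[(chroma[y][x] + chroma[y][x + 1] +
              chroma[y + 1][x] + chroma[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# A per-pixel checkerboard of two chroma values averages to mush:
checker = [[0, 255], [255, 0]]
# subsample_420(checker) -> [[127]]: both original values are gone.

# The same two values as 2x2 metapixels survive intact:
meta = [[0, 0, 255, 255],
        [0, 0, 255, 255]]
# subsample_420(meta) -> [[0, 255]]
```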
Seems like it could benefit from forward error correction to defend against bit errors (this is how QR codes survive big chunks being partially obscured or replaced by logos, and also how CDs survive being scratched within certain limits).
You can choose how much correction you get, in terms of how many bit errors you can correct per n bits. And you need surprisingly few extra bits to get pretty great performance on a channel with a "reasonable" bit-error rate (like under 10% overhead). You can crank up the strength of the error correction if you anticipate a noisier channel.
QR codes have 4 levels of correction you can use depending on how robust you wish them to be. CDs and DVDs use two chained, fixed, levels to keep the decoders simple. CDs have 25% overhead, but their correction is very strong: they can correct 4000 bits in a row.
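For a concrete miniature of the idea, the classic Hamming(7,4) code stores 4 data bits in a 7-bit codeword and corrects any single flipped bit:

```python
def hamming74_encode(d):
    """4 data bits -> 7-bit codeword laid out as [p1,p2,d1,p3,d2,d3,d4],
    with each parity bit covering the standard position sets."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4   # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4   # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based index of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

Real-world schemes like the Reed-Solomon codes in QR codes and CDs work on symbols rather than single bits, which is what lets them correct long bursts of damage, but the syndrome idea is the same.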
There are two encoding modes, RGB and B/W. It uses a pixel-to-data width of 2x2, but says YouTube's compression algorithm is brutal, and one corrupted pixel already renders the whole thing corrupted.
I think the size of the effective chroma metapixels is more important than the range of values. You need to make them larger in order to keep the decoder from blending them together when upscaling the 4:2:0 chroma.
Now, if you're using a 4:4:4 format to do this, then you should be able to use smaller chroma metapixels (I still wouldn't use the full chroma resolution, though, unless you're using a high bitrate or a lossless codec). However, that risks data corruption if passed through a pipeline that downsamples the chroma.
This reminds me of a stupid idea I had: would it theoretically be possible to store data using the backbone of the Internet itself? You'd bounce packets (probably TCP) back and forth between two hosts with bytes that aren't actually written to a disk anywhere so they just exist as a stream until one end decides to copy a section for itself.
This isn't a new idea. It can be traced back to delay-line memory [1] and many thought experiments have been suggested to use a large network as such. Even some actual demos have been made [2][3].
Suckerpinch did a video on “harder drives” where he implemented a block storage device by storing data in ping packets. It’s one of my all-time favourite technical talks - his style is amazing, and he’s an incredible storyteller.
You’re effectively relying on two computers being up and running 24/7. It’d be twice as good an idea (which is still a very low number) to just store that data in RAM on a single device rather than rely on two.
This is bound to get you banned. I would do it a little bit more clever (with lower bitrate/throughput/storage sizes)...
Encode the data inside audio, preferably outside the human audible range, and then use a nice video of singing birds, or whales talking, and use the "hidden" frequencies to hide the data.
I don't know if Youtube has any filters that cut out frequencies, but this way they can't ban you, since you've uploaded a really nice personal video of your singing birds, instead of the conspicuous looking QR-like codes as in the OP ;-)
With any lossy audio compression algorithm, everything outside the human audible range is filtered away completely as a first step. That's compression 101.
Also there's much less bandwidth in the audio channel than the video channel, and then far less again if you're trying to hide a signal in another signal.
Do this at your own risk. I've done this with lidar data (which didn't need to be as precise as binary, which is what I'm seeing in this post), and it worked fine. 3 years later I revisited the project and it was broken, because YouTube had compressed the files in such a way that it made the lidar just inaccurate enough to be unusable. I can't imagine storing data in binary, where just one bit wrong screws everything.
I have many old videos that have lost their "HD" encoding, and now look like potato vision. I no longer (silly that I did) trust YouTube for video storage.
Nice work! I made a much worse variant of this years ago, with a “mosaic” mode[1]: whatever YouTube was doing for compression at the time handled multiple QRs tiled next to each other much better than it did a single large one.
Does YouTube let you store unlimited video content (real video, like screen recordings of your own work - no shady or sneaky stuff, nor any copyrighted material)?
With all videos marked private, so they are just "storage" for the account owner: no other users can access them, and YouTube cannot monetize them?
I had a good look into these sorts of technologies but the host almost always changes the file so it makes it impossible to retrieve the data hidden in the file.
You need a file hosting platform that guarantees not to change the uploaded file.
If you look at the example video, it doesn't depend on the video not being changed, but it does depend on a minimum level of quality. That is, as long as the video quality is high enough (720p in this case) to get back the original black and white pixels, you're fine. The data is not hidden, it's there in plain sight in the video.
It's described in the README. The video has 2x2 pixel blocks that are either black or white, so each one signifies a bit. So a 1920x1080 frame encodes 518,400 bits = 64.8 KB.
The assumption is that video compression won’t mess up those blocks beyond recognition, so you should retain the information as long as the rendered resolution and bitrate don’t drop too low.
Maybe this could be improved by e.g. using 32 colors instead of 2, and bumping the block size to 3x3 (for safety) which should yield ca 144KB per frame.
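The arithmetic behind those figures, as a quick sanity check (the 3x3 / 32-color parameters are the hypothetical ones from the comment above, not anything the README implements):

```python
import math

def capacity_bytes(width=1920, height=1080, block=2, levels=2):
    """Bytes per frame for square pixel blocks, each carrying one of
    `levels` distinguishable values (log2(levels) bits per block)."""
    blocks = (width // block) * (height // block)
    bits = blocks * math.log2(levels)
    return bits / 8

# README's mode: 2x2 black/white blocks at 1080p:
# (960 * 540) blocks * 1 bit = 518,400 bits = 64,800 bytes = 64.8 KB
bw = capacity_bytes(block=2, levels=2)

# Hypothetical mode: 3x3 blocks with 32 colors (5 bits per block):
# (640 * 360) blocks * 5 bits = 1,152,000 bits = 144,000 bytes = 144 KB
color = capacity_bytes(block=3, levels=32)
```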
The block size should honestly be tuned for the codec in use, chiefly to determine the best block size to fit with the codec's macroblock size. That's usually either 8x8, or with newer codecs 16x16. I feel like something like maybe 8x2 would be smart, and I like the idea of monochrome for resiliency, since chroma is downsampled. The fewer possible pixel combinations you have within a macroblock, the better the compression will probably end up being as well. And 8x2 would somewhat evoke the look of the old video backup systems as well, for the fun of the nostalgia of that.
It'd be cool to add a FUSE wrapper around this. At one point I had a POC for a few of these sorts of things going (not as cool as this project, just data stored to X free cloud store/metadata), and creating a redundant transparent FUSE wrapper was probably the next step. With multiple sources, you could even mux data between slow/unreliable sources (content hosts in e.g. russia or asia) to 'stripe' the data. And then, you could make these modular so that new sources could be onboarded easily...
People do this all the time with any web connected service that accepts data. People use open strings in AWS services, like lambda function names, to store arbitrary bits.
Using this can get your Google account and related IP addresses banned. Isn't this sort of vandalism? But why attack YouTube of all places? Do it to TikTok instead; they won't notice the difference (LOL). Normally I would've said "delete this", but today's political climate definitely demands more free space on the internet per individual, so...
This looks like a fun project to tinker around with. I have one small request. The pack file in the repo is 146MB. Is there a way to make that smaller on github?
Interesting to note how black and white pixels get compressed less and thus make for easier restoration. It would be interesting to generate a 512x512 QR code that changes every 3 frames to see how well that would work for recovering data.
I wonder if you could get a better pixels/bit ratio when using DCT/2DFFT based encoding since you'd still encode lower frequency data but it would be in a format that compression algorithms would also try to maintain.
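A toy 1D version of that idea (real codecs use 8x8 2D DCTs, and the amplitude and coefficient choices below are arbitrary): embed bits in the signs of low-frequency coefficients, then simulate compression by zeroing the high-frequency ones and check that the bits survive.

```python
import math

def dct(x):  # orthonormal DCT-II
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                for n in range(N)) * math.sqrt((1 if k == 0 else 2) / N)
            for k in range(N)]

def idct(X):  # inverse (DCT-III); exact roundtrip for the pair above
    N = len(X)
    return [sum(X[k] * math.sqrt((1 if k == 0 else 2) / N)
                * math.cos(math.pi * (n + 0.5) * k / N)
                for k in range(N))
            for n in range(N)]

def embed(bits, n=8, amp=50.0):
    """Put each bit in the sign of a low-frequency coefficient
    (skipping DC), then return the pixel-domain block."""
    X = [0.0] * n
    for i, b in enumerate(bits):
        X[1 + i] = amp if b else -amp
    return idct(X)

def recover(pixels, nbits):
    """Read the bits back from the coefficient signs."""
    X = dct(pixels)
    return [1 if X[1 + i] > 0 else 0 for i in range(nbits)]
```

Since lossy codecs spend their bits preserving exactly these low-frequency coefficients, data placed there should in principle survive transforms that destroy a naive pixel grid, at some cost in bits per pixel.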
Considering this might very well end up you losing your google account, it is a very necessary warning.
If anything, the author should be more clear about what happens if youtube gets mad: you might lose your google account along with access to mail, drive, photos etc
The warning is fine, the warning while playing Pollyanna is annoying and disingenuous. To put it in a more constructive way - the author should be proud of the hack and all the fun and turmoil it could cause.
I feel like this is overcomplicating things. You should be able to download the original video you uploaded instead of downloading a compressed version. I'm sure the uncompressed version still exists.
Ensuring that you can retrieve the data from the viewable video means this is also a way of transferring files, one that works even for people who can't download the original video.
A dumb yet interesting idea, but if you care about your data you shouldn't put it on Google, especially if you are abusing their service; and if you don't care, why even waste your time doing all this, except for fun?
i would've saved this "improvement" for my own project, but consider using colours instead of just black and white squares for higher data density! i am unsure how much compression would affect its effectiveness.
> A portion of the signal would be used for timing, metadata and error correction, so the program could tell you if the data was sufficiently damaged upon restore.
LGR has a video on the PC version from Danmere: https://youtu.be/TUS0Zv2APjU
Here's a video example of the Amiga industry's take on the idea: https://youtu.be/VcBY6PMH0Kg?t=573
Sony even did this in 1980 to record CD-quality PCM audio onto VHS tape. https://youtu.be/bnZFLzBO3yc