
Is there a good reason to back up raw disk images with block-level deduplication, rather than running file recovery to get filesets, and then doing file-level deduplication?

On the one hand, I can imagine "cryonically" preserving a disk image for a later, better filesystem recovery program to come around. (This "cryonic" approach would give even better results by preserving bit-level analogue flux recordings of the disk platters, rather than relying on the output of digital reads from the disk heads.)

On the other hand, the longer you leave these disk images as dead blobs of data, the more layers of legacy container+encoding formats you'll have to try to get your system emulating when you finally do want to pull the files off. One day your OS won't have drivers for reading e.g. FAT16, or zfs2, or ReiserFS2.11, nor will it be able to parse out the meaning of an MBR-partitioned disk. Reaching back through the Linux kernel archives for something old enough to understand those things will result in a kernel that won't boot your PC. You'll end up having to do something convoluted with qemu just to get your disk read.

Personally, I'd much rather throw out all the intermediate containers I can, as soon as I can: not just extracting files from the disk's filesystem, but further extracting files from any proprietary archive formats on the disk (using the extractor tools probably installed on the same disk), and even canonicalizing containers like AVI by remuxing them into modern extensible formats like MKV. The goal is to give a file-level deduplication process the best possible inputs to work with, the ones most likely to match: not just for space-saving, but because reducing "junk duplicates" helps greatly in actually finding anything in all that mess, let alone organizing it.
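
To make that concrete, the extract-and-remux workflow is mostly off-the-shelf tooling; roughly something like this, where the image name, loop device, and paths are purely illustrative:

    # Illustrative names only (disk.img, /dev/loop0, /mnt/old).
    sudo losetup -fP --show disk.img         # exposes partitions as /dev/loopNpM
    sudo mount -o ro /dev/loop0p1 /mnt/old   # mount the first partition read-only

    # Remux AVI -> MKV: swap the container without re-encoding the streams.
    ffmpeg -i /mnt/old/videos/holiday.avi -c copy holiday.mkv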




Interesting point! To be honest, I didn't consider that it might be more difficult to interpret the underlying filesystems and encodings in the future. My goal was to get a lossless (well, as much as possible) archive of the disks since my time and physical access was limited. It is hard to predict what I will want to access in these images over the scale of decades, so my thinking was to leave them as untouched as possible, since I don't know what information will be important.

That being said, a good guess would be that the most interesting data will be media files (especially old photos) and documents. For that data, your advice of collecting and re-encoding the files is wise. For the purpose of discovering the media files in these backups, I found my favored approach to be a brute force recursive search for file types. Exploring the original structure of the filesystems was interesting, but my intuition for where the valuable data was usually proved wrong.
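
For anyone curious, that brute-force search can be as simple as walking the mounted image and classifying files by detected content type rather than by path or extension; a rough sketch (the mount point is just an example):

    # Classify every file by sniffed MIME type and keep the media/documents.
    find /mnt/old-disk -type f -print0 \
      | xargs -0 file --mime-type \
      | grep -E 'image/|video/|application/pdf'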


You want to convert the data to an open format and keep that around in modern containers, where you can easily transfer it to a different container should that be necessary. BUT for many applications you also want to preserve the environment. For instance, if you write your thesis in LaTeX, whatever installation you have is unlikely to be replicable in 10 years, so keeping a minimal VM that can compile it is preferable. You would need to keep this VM runnable and upgrade it to newer formats as things progress.
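
For example (the disk and file names here are made up), keeping such a VM alive mostly means migrating its disk image to a current format now and then, and booting it occasionally to check that the toolchain inside still compiles:

    # Migrate the VM disk to a current format, then boot it for a smoke test.
    qemu-img convert -O qcow2 thesis-vm.vmdk thesis-vm.qcow2
    qemu-system-x86_64 -m 1024 -hda thesis-vm.qcow2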


I'm not sure I buy the absolute supremacy of open formats. They are good for emergency recovery (no depending on a company that just went kaput) but they often lack momentum. MP3 is not an open format, but I would argue it's one of the best choices for storing your music because it is everywhere. MP4 hasn't gained quite the same traction yet but we're headed that way.

Now, if you are a savvy programmer and you archive the (open) OGG spec, you could argue that you can write your own custom OGG -> MP2073 encoder at any time to recover your ancient music. But that's not most of us.

Another example that comes to mind is the many MS Office replacements I've used over the years. I have a small trove of old documents in old open formats used by StarOffice, OpenOffice, & others that are a pain in the butt to access and don't always render correctly, while my ancient .doc files from twenty years ago still open in two clicks.


I'd consider MP3 an open format, and Office documents are an exception: MS has gone above and beyond to keep them working. There are thousands of dead formats out there, unfortunately.


LaTeX is plain text markup so old LaTeX documents can usually just be read without processing. (The major exception is that graphics are hard to visualize as a sequence of draw commands!). Further, ordinary LaTeX from the last few decades can still be processed without difficulty. The older versions of LaTeX are still available online.

Of course, things are never as simple as they should be. LaTeX is a markup language that, through macro expansion, ends up expanding into TeX. Since TeX 3.0 in 1989, Knuth has attempted to keep the TeX system stable. TeX documents written since then should produce the same output, pixel for pixel, when run on the current version 3.14159265 (yes, the version number is converging to pi; there won't ever be a TeX 4.0).

Few people, however, produce documents in plain TeX--the LaTeX markup is so much more convenient than the lower level TeX. LaTeX has been slowly evolving and there are some backward compatibility issues, but they are minor. The first release of LaTeX seeing general use was described by Leslie Lamport (the creator of LaTeX) in his 1985 book[1]. That version of LaTeX, 2.09, can still be processed by today's LaTeX 2e in compatibility mode. LaTeX 3 is supposed to supersede LaTeX 2e someday, but it's not clear how many more years that will be.

Since LaTeX is open source, the distributions from TUG (the TeX Users Group) are easily obtained, and they have all the historical versions of LaTeX and TeX available.

This all makes LaTeX/TeX seem like one of the best ways to maintain a document's source for the future. A few tips:

- Fonts can be a problem because fonts evolve. Either use something like TeX's extensive collection of "built in" Computer Modern fonts or save the font files along with the source of anything that you might want to work on in twenty years.

- LaTeX has a wide range of very sophisticated third party extensions. Along with the document source, it would be a good idea to keep the contemporaneous versions of the extensions used by the document. (These extensions are just files of additional macros.)

- If one is only interested in the typeset results of a LaTeX document, use a LaTeX extension like pdfx (or Adobe Acrobat) to generate a PDF/A version of the document (LaTeX programs normally generate PDFs). PDF/A is a PDF specification (ISO 19005) that is intended for archival use and rendering far in the future (fonts are embedded in the output, etc.).

- Pandoc can convert LaTeX to a wide range of alternative formats (HTML etc.) with some success, depending on the target language and the document complexity (see the sketch below).
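
Rough examples of those last two tips (file names are placeholders, and pdfx also expects a small \jobname.xmpdata metadata file):

    # Typeset to PDF; with \usepackage[a-1b]{pdfx} in the preamble the output
    # aims to be PDF/A.
    pdflatex thesis.tex

    # Convert the LaTeX source itself to other formats with Pandoc.
    pandoc -s thesis.tex -o thesis.html
    pandoc -s thesis.tex -o thesis.docx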

[1] http://www.amazon.com/Latex-Document-Preparation-System-User...


Actually

    PDF/A is a PDF specification (ISO 19005) that is intended
    for archival use and rendering far in the future (fonts are
    embedded in the output, etc.).
Maybe they state that; however, unless there is actually a reader that works on the newest hardware, this is just wrong.

And the specification for PDF/A-1 through PDF/A-3 (a, b, u) is really, really long and hard to implement correctly. I wouldn't be surprised if 70% of the solutions (even paid ones) parse the spec incorrectly. In fact, I suspect that only Adobe Acrobat (Reader) would actually render a 100% correct PDF/A-3a file to the screen.

Actually, this is the (unofficial/official) spec:

    http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
It should be really close to the ISO one (http://www.iso.org/iso/home/store/catalogue_tc/catalogue_det...)


Unfortunately, that has not been my experience. Between various Linux distros, LaTeX environments, and 3rd party plugins, I've yet to be able to easily recreate an environment that produces similar output. If I struggle with it now, I have little hope that I can do it in 10 years - and I have more faith in VM containers.


I agree; you're right. It isn't easy. I wouldn't want to go back and figure out my TikZ/PGF diagrams if that package lost backwards compatibility. But at least there is hope that a straightforward LaTeX document can be read in the future. You can always run the LaTeX source through Pandoc or just read it as it stands as text with markup. It's better than having a WordPerfect or XYWrite document (I liked both of those better than Word way back when).

Your idea of a VM container is interesting--I hadn't thought of that, but what about the software to run the VM? Does VirtualBox stay backward compatible over the next 10 or 20 years? Maybe the way to go is Markdown, which is so easy to read even without processing it.


I don't expect it to be 20 years backwards compatible, but since I use VMs daily, I have a good idea of when to start looking for a way to convert the machine to a different more modern format.

Re: Markdown or similar formats like reStructuredText, I don't have any experience with advanced markup in them, but I think you'll have the same plugin problem as with LaTeX. The readability of the source isn't that important, since you'll probably have the rendered output as well, and that is, or should be, copy/pastable if needed.


Disk is cheap in online storage services, so it's "cheaper" in terms of time to dump the image and pull out whatever you need in the future.

> One day your OS won't have drivers for reading e.g. FAT16, or zfs2, or ReiserFS2.11, nor will it be able to parse out the meaning of an MBR-partitioned disk.

Maybe in 10-20 years.
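
E.g., dumping a whole disk before shipping it off to online storage is roughly this (the device name is a placeholder; double-check it before running):

    # conv=noerror,sync keeps going past bad sectors, padding them with zeros.
    sudo dd if=/dev/sdX of=old-disk.img bs=4M conv=noerror,sync status=progress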


Couldn't you just load them up in a VM anyway?


Yes, for reading these disk images, I've typically resorted to using VMs anyway, so I can experiment with old code in a sandbox. Actually, one project I've had in the back of my mind is attempting to boot some of those old Linux installs in a VM. Can you imagine re-opening a carbon copy of your old workspace from over a decade ago, right where you left it? I think the exercise would be thought-provoking.
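
If I get around to it, something along these lines seems like a reasonable starting point (the image name is made up, and it assumes the dd image still contains its MBR/bootloader):

    # Boot the raw image in a throwaway sandbox; -snapshot discards any writes.
    qemu-system-i386 -m 512 -drive file=old-linux.img,format=raw -snapshot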


Damn that would be interesting. I should make a backup of my current one and see how it changes year after year. Good idea!


Since I’m young, a decade ago I probably hadn’t discovered proper version control. I expect I’d find folders of neatly organised, timestamped ZIP files of my code.

Perhaps some things are best left in the past.


A few weeks ago I tried to copy a Windows 7 VM from VirtualBox to VMWare Player. I expected it to be easy, but it was actually difficult enough that I gave up on it after a few days.

If your backup plan involves mounting a disk image on a VM, I strongly advise you to test it before you need it :)
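
For what it's worth, one way to test ahead of time is to convert just the disk with qemu-img (file names are illustrative); even then, a Windows guest will often refuse to boot on the new hypervisor because the virtual storage controller and drivers differ:

    # Convert a VirtualBox VDI disk into a VMware-readable VMDK.
    qemu-img convert -f vdi -O vmdk win7.vdi win7.vmdk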


Emphatically agreed. Migrating VMs (IME especially Windows VMs) is a frustrating proposition.

Creating new Linux VMs expressly for the purpose of reading particular file formats is a much safer bet. It is unlikely that the ability to conveniently emulate a basic i386 system with block storage will go away in the next few decades. My assumption is that any formats I can read on Linux today will be readable in the future, as long as I have a copy of the source / binaries. This is why I included a copy of lrzip's git tree with my backups -- everything else is in a standard Ubuntu install.
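
So a worst-case restore would look roughly like this (paths are examples; the exact build steps are whatever the archived tree's README says, assumed here to be the usual autotools sequence):

    # Rebuild lrzip from the archived source tree, then decompress the image.
    cd lrzip && ./autogen.sh && ./configure && make && sudo make install
    lrzip -d ../backups/sda.img.lrz    # writes ../backups/sda.img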


Just curious, why would you choose MKV over MP4? MP4 has native HTML5 support, so I figured that would have more long-term viability. And as far as I know, they're both open formats (maybe MP4 isn't and is just free to use, I'm not sure).


> maybe MP4 isn't and is just free to use, I'm not sure

Correct. MP4 is not an open format.

From [1]:

> MPEG-4 contains patented technologies, the use of which requires licensing in countries that acknowledge software algorithm patents. Over two dozen companies claim to have patents covering MPEG-4. MPEG LA licenses patents required for MPEG-4 Part 2 Visual from a wide range of companies (audio is licensed separately) and lists all of its licensors and licensees on the site. New licenses for MPEG-4 System patents are under development and no new licenses are being offered while holders of its old MPEG-4 Systems license are still covered under the terms of that license for the patents listed (MPEG LA – Patent List).

> AT&T is trying to sue companies such as Apple Inc. over alleged MPEG-4 patent infringement. The terms of Apple's QuickTime 7 license for users describes in paragraph 14 the terms under Apple's existing MPEG-4 System Patent Portfolio license from MPEG LA.

[1]: https://en.wikipedia.org/wiki/MPEG-4#Licensing


Do you know of any good resources on storing/maintaining video files? I recently ripped over a TB of uncompressed video from old 8mm tapes and I'd like to preserve these memories digitally for as long as possible.


Lossless x264 (with FLAC audio) in MKV with a recent version of FFMPEG.
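
Something along these lines, with placeholder file names (-qp 0 makes x264 mathematically lossless, FLAC is lossless audio, and MKV holds both happily):

    ffmpeg -i capture-8mm.avi -c:v libx264 -preset veryslow -qp 0 -c:a flac archive.mkv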


Better store the source code of the exact version used for transcoding/playback and back it up properly too. You might need to re-create the binaries decades from now.
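
For example, roughly (paths and the source checkout are illustrative):

    # Record the exact build next to the output and snapshot the source tree.
    ffmpeg -version > archive.mkv.ffmpeg-version.txt
    git -C ffmpeg-src archive --format=tar.gz -o ffmpeg-src-snapshot.tar.gz HEAD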



