I need to copy 2000+ DVDs in 3 days. What are my options? (reddit.com)
393 points by huhtenberg on Dec 18, 2018 | 191 comments



The context -

> Maybe you should disclose your city or at least state/general location

Washington DC, specifically College Park, even more specifically The National Archives at College Park.

> What is the content of the discs?

Years ago in partnership with the National Archives Amazon.com digitized a tremendous amount of film for the National Archives, with the catch that the material cannot be freely disseminated online by the National Archives until Amazon breaks even on their digitization investment. It’s been years now and for a variety of reasons - many of which are Amazon’s fault - there still exist a solid number of discs that Amazon hasn’t sold even one of. (In part because amazon hasn’t even had them all available to sell. Pretty ridiculous).

The flipside of the catch is that the DVDs can be viewed and even copied for $0 on site by researchers. I am doing research there and I have a research pass. I was looking at many of these titles yesterday. It’s time to set these national treasures free.


I'm wondering how this already preserved and publicly available film content is considered to be in need of being "freed"? There's very clear information on rights, reproduction restrictions, and the preservation efforts that have already been made here: https://www.archives.gov/research/order/film-sound-video-dc....


I wonder what this material is. Are these commercial films, historical footage, or what?


My guess is they're the films described in the CustomFlix partnership announced here:

https://www.archives.gov/press/press-releases/2007/nr07-122....

For all the helpful advice offered both here and on Reddit about how to do this, I wish there had been more time asking if this was something the OP should be doing in the first place.

I completely understand the frustration if this is indeed the above set of films and it's been over eleven years since the digitization agreement and files have not yet been made freely available to the public.

That said, planning to use a researcher's pass to "set these treasures free" and blaming Amazon for not having generated more revenue from these films makes it sound like the OP thinks he knows better than the staff of the National Archives how to best care for these assets and that he knows better than the folks at Amazon how to turn a profit. To me, it sounds more than a little arrogant.

A big reason the National Archives enters into agreements like this with companies is that digitization, especially on their scale, is expensive. If it weren't for agreements like this, Amazon would only want to digitize the films for which they knew they could turn a profit and the vast majority of the collection would sit un-digitized and be at risk of loss. The tradeoff is between the immediacy of access vs the number of assets digitized and the team at NARA made the decision that it was better for the American people to have more assets digitized.

I'm worried that the next time NARA is in talks with someone about a digitization agreement (for example, if there's a large number of early jazz audio recordings on 1/4" reels and Spotify is interested in paying the cost of digitization in exchange for 2-3 years of exclusivity) that the company will point to this example and say "didn't you just let a researcher publish the entire collection Amazon digitized? How can you assure us the same thing won't happen with these recordings?" The result will be the National Archives clamping down on researcher access. I think that would be a net loss for everyone.


A reasoning that doesn't require (too much) arrogance: contracts and business rules hold despite changes in context. Plans change, people leave, things are forgotten. But the contract, and its limitations, go on.

Since Amazon was only looking to get its investment back, not make a profit, clearly the intent of both parties was to eventually have the material released to the public. Whoever on Amazon's side pushed the deal presumably thought the discs would sell, but that clearly didn't happen. It's much more likely this has been sitting on a backburner somewhere, left as some forgotten plan, than some kind of long-term strategy to induce sales.

Thus, by virtue of an ill-written contract, we've entered a situation that no one wants. Amazon no longer cares about it, the museum presumably prefers releasing it, and the public can only benefit from it.

By virtue of that same contract, there’s an escape hatch that might bring us back to a state where everyone is content.

It would make sense to exploit it.

Ofc, this is assuming Amazon doesn't care. But Amazon is a company, and the larger a company is, the less distinguishable it is from a government. And governments certainly have control over things they have collectively forgotten about, as a government only rarely operates as a single, like-minded, cooperative organism.

And Amazon is indeed a very large company. It would hardly be surprising for Amazon to not even be aware this contract still exists.


> The result will be the National Archives clamping down on researcher access.

Or maybe the National Archives will stop making deals where free distribution of the material is unlikely to happen for decades and will instead open up greater access to researchers.

Or they could make the same deals but put a deadline on them to prevent the exclusivity from sitting in limbo forever. That would at least give Amazon an incentive to distribute the material rather than just squatting on its monopoly.


> A big reason the National Archives enters into agreements like this with companies is that digitization, especially on their scale, is expensive. If it weren't for agreements like this, Amazon would only want to digitize the films for which they knew they could turn a profit and the vast majority of the collection would sit un-digitized and be at risk of loss.

And yet there are researchers willing to do this quickly and for free, so why do we even need Amazon in the loop?


The OP says they're public domain, so I'd guess it's films produced by the government.


...or past the copyright window.


> Years ago in partnership with the National Archives Amazon.com digitized a tremendous amount of film for the National Archives, with the catch that the material cannot be freely disseminated online by the National Archives until Amazon breaks even on their digitization investment. It’s been years now and for a variety of reasons - many of which are Amazon’s fault - there still exist a solid number of discs that Amazon hasn’t sold even one of. (In part because amazon hasn’t even had them all available to sell. Pretty ridiculous).

> The flipside of the catch is that the DVDs can be viewed and even copied for $0 on site by researchers. I am doing research there and I have have a research pass. I was looking at many of these titles yesterday. It’s time to set these national treasures free.

-- https://old.reddit.com/r/DataHoarder/comments/a6fkpm/i_have_...


That doesn’t answer what type of media is on the disks.


It says film. So, it could be movies or images. What else would be possible? Are you thinking of quickest ways to rip the DVDs if they are movies?


Yes, but "film" could refer to a lot of different content. Movies, TV shows, documentaries, commercials, interviews, news reports etc.

GP was just wondering out loud about the content itself, not the format of said content.


A little off-topic, but it's interesting to see the shift in meaning of the "reel of film" icon over the decades; a long time ago, it would signify any video, but now it's more associated with an actual cinema film.


So... it's already been 3 days, right?


Seems counter-intuitive to Amazon's own public dataset program / policy: https://aws.amazon.com/opendata/public-datasets/


>Democratize access to data by making it available for analysis on AWS.

>AWS evaluates applications to the AWS Public Dataset Program every three months.

>If we bring your dataset into the AWS Public Dataset Program, we will cover the costs of storage and data transfer for a period of two years

Yeah, I don't think that's public, more like a rental service. I wouldn't bother working with that unless I had free AWS access through corporate.


[dead]


No. And even if it was, millions of people can do the time, including programmers.


It's actually very similar to Aaron Swartz's plan of using academic access to copy and widely distribute papers. This researcher does not have a legal basis to copy 2000+ disks in order to free them from the national archives. At most he has a restricted right of access to use some of them in his academic research.

To test he could ask the national archive for permission to carry out his plan and observe how quickly his access is revoked.

-----

Research Room Rules.

The rules are intended to provide clear instructions to both staff and researchers on the research room:

  * registration process;
  * entry and exit procedures;
  * items authorized and not authorized for use in the research rooms;
  * pull requests;
  * records handling guidelines; and
  * copying and equipment approvals.
The rules will be available for researcher use in each facility or can be requested in writing from the facility managers listed below.

Archives II, College Park, MD – Michael Knight, michael.knight@nara.gov


[flagged]


Factual information isn’t always useful in another context.


Easy: get a bunch of DVD drives and a ~1 Gbps uplink. 2000 4.7GB single layer DVDs in three working days is about 870Mbps. Upload them as you rip to S3 and then send a pull request!

https://github.com/awslabs/open-data-registry/
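That back-of-the-envelope figure checks out. A quick sketch in Python (assuming 2,000 single-layer 4.7 GB discs and three 8-hour working days, as above):

  # Rough check of the uplink math above.
  discs = 2000
  disc_gb = 4.7                      # single-layer DVD capacity, GB
  seconds = 3 * 8 * 3600             # three 8-hour working days
  total_gbit = discs * disc_gb * 8
  print(f"{total_gbit / seconds:.2f} Gbps sustained")  # ~0.87 Gbps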


> Easy: get a bunch of DVD drives and a ~1 Gbps uplink.

Can you detail how one easily gets a 1 gigabit uplink installed temporarily in the National Archives?

I've had telcos drop fiber into the middle of fields in the middle of nowhere, so it's probably possible. But it feels like there's a lot of complicated bureaucracy to navigate, and the researcher OP is time-restricted.


Sneaker-shuttle the SSDs from the National Archives to your friend's apartment with the uplink.


sigh

No one got the joke. I'm suggesting that the DVDs should be submitted as an AWS public dataset, because Amazon is the company that digitized them in the first place, and the content is apparently public domain.


So the 3 days is for figuring out how to use git, then?


obligatory xkcd: https://xkcd.com/1597/

If anything is out of your few commands, just save work elsewhere, delete the project, and download a fresh copy.


I've, frankly, done that many times at this point... when your merge request gets too wonky, it's just easier. I do try to remember to squash, then rebase generally, though there are times I just forget or mess up.


When in doubt, "how does the team here handle these conflicts? At my old job we just did <made up thing> to solve it, I've forgotten the process here."


lmao


>In part because amazon hasn’t even had them all available to sell. Pretty ridiculous

Not ridiculous; that's a good lawyer preparing the deal.


What's "good" about deliberately keeping public domain content out of the public domain?


Not "good" as in "good vs. evil", but rather "good" as in "proficient at the skill of lawyering".


Not ridiculous, but very sad


I literally did this at a startup. We had eight eight-at-a-time Netflix accounts (DVD accounts). This was in 2009. We made a search engine for movies. Yeah, like Mr Skin. We ripped them six at a time. Took us about six months to do 4000 films. The DMCA violations would have been in the trillions.


“Hypothetically” you mean


Normally the Statute of limitations would apply, but it seems like US Law is becoming even more inscrutable in the last few years.

https://en.wikipedia.org/wiki/Statute_of_limitations


There's no such thing as "the" statute of limitations. There may or may not be a statute of limitations governing each possible legal action. I count at least three: criminal copyright violation (U.S. v. NoblePublius), civil claim for copyright infringement (MPAA v. NoblePublius), and breach of contract (Netflix v. NoblePublius). Possibly others, e.g., if the DMCA has a separate criminal penalty for circumventing the copy protection.


No need to use that word. The DMCA is basically toothless at this point and can outright be ignored. You get a notice that can be safely thrown away. It's basically "don't do that again". Then you do it again.

Honestly it's always been pretty easy to ignore it since its inception unless you are running a counterfeiting operation.


Some ISPs will cut you off after X repeat notices in under Y timeframe.


Why did a search engine for movies need the actual movie files, and not just the metadata? Or was it actually making the movies themselves available?


A guess would be more movie metadata than you can think of:

  - show me movies with Swiss Alps settings
  - show me movies with the quote "I'll be back"
  - tell me which part of Back to the Future shows me where they have hoverboards
  - actor/director/artist/crew: for my IMDB profile, tell me when my name is on screen (start credits, end credits)


I would guess it’s in order to build the hash/IDs/searchable dataset that can be used to match the movies. Think Shazam style fingerprints but for movies.

I was looking into this sort of thing recently for a project. Very interesting technology space with some genuine direct benefits coming out of the application of machine learning.


Correct


> Why did a search engine for movies need the actual movie files, and not just the metadata?

Someone's got to build that metadata. Especially back in 2009, you couldn't just download it from some public database. Databases have to be built.


I'm guessing something along the lines of Morbotron [1]. Not the actual use case for the screens/subs/etc. but more the data required to do either of them: frame by frame images for analysis with timestamps and subtitles.

1: https://morbotron.com/


We needed high quality video files to create the metadata. Machine vision wasn't good enough yet so we did it the old-fashioned way: interns and lots of typing in real time.


What was the bottleneck? Tray ejection time? Human operator sleep patterns?


Not getting caught by Netflix or Hollywood.


In USA wouldn't you get a research exemption?

Was the postal stream the limitation on throughput?


With even a 2-day turnaround, 8 accounts x 3 disks == 24 disks at a time... so, I would think so.

Aside: Re-encoding to a tighter format at that time wasn't too bad, but still relatively slow IIRC.


Just cracking the encryption on the DVD for any purpose is a DMCA violation


Judging from the date on the reddit post, it's a bit late for my comment (two days after the post), but the real issue is not how to duplicate so many DVDs, the issue is how to RIP those DVDs onto some cheap storage for later burning onto DVDs. So high data bandwidth between stacks of DVD players with lots of memory for caching (or however you cache DVD data streams). Fry's had a 4TB external hard drive for $89 this weekend. Maybe have a bunch of SSDs as the intake point for the DVDs, and then offload the ripped copies from the SSDs onto physical drives while you're changing out the DVDs. Would want to use the fastest interfaces available. I've no idea these days what the cool kids are using. (My first hard drive was a 5-1/4" full-height 10 MB MFM. Thought I would never need more storage than that. I think my second HD was 20 MB and used RLL.)

Organizing the DVDs by length would allow optimizing the loading/ripping process to assure minimum time lost waiting for the operator's hands to be free. This kind of planning makes for an interesting project.

It might make an interesting crowd-funded project, if it's reasonably easy to get a research permit. Plan it out, go in with the hardware, come out with the images. Use all the error correction opportunities you've got available.

Do a web search for "bulk dvd ripping" (without quotes) and you'll find lots and lots of discussion and advice, including some about building a dedicated DVD ripping rig. MakeMKV gets good press, in my very quick read of a few posts.

And there's always the option of crowd-funding to raise the exact amount needed to pay off the break-even for Amazon's investment. I can't imagine they'd fight back too hard when looking at a large check vs. a non-performing asset, unless Bezos personally never intended to let the footage go free.


A 16x DVD drive is around 21MB/s output. Any modern hard drive is likely to have 5-12x the sequential write performance as a 16x DVD drive. Shouldn't be any need for SSDs.


> Any modern hard drive is likely to have 5-12x the sequential write performance

The data rate falls to the floor as soon as the access pattern isn't sequential though, which it won't be if you are using multiple readers. While an OS might be bright enough to organise data flowing out of write buffers so it isn't as random as it could be, there is a limit to how far they will go with this, because they are general purpose OSs and optimising for multiple bulk streams will punish more interactive activity. If you have a tool that bypasses the OS cache and works in large enough blocks you might see better results, except if the write activity from each stream lines up, at which point this will make things worse.

Pulling the data off multiple DVD drives onto an SSD, swapping to another once near full so extraction can continue while its contents are dumped sequentially to cheaper-per-GB traditional drives, would probably be the way I'd suggest.

In fact, you could get away without swapping between two SSDs: the read activity pulling data from the SSD to a traditional drive is unlikely to have much effect on the write performance for the data coming off the DVDs unless you have a great many readers in one machine. If doing this all relatively manually, to reduce manual steps once a DVD copy is complete add it to a queue to be moved using something like https://en.wikipedia.org/wiki/TeraCopy so you don't have to worry about manually coordinating the SSD-to-cheaper copy operation to keep it sequential.

Assuming 15 minutes to read each disc (it is a long time since I pulled data off a DVD in bulk, so this is guesswork based on old memory of it taking a little more than 10 minutes to read a full DVD9 disc, rounded up to 15 to allow for manual process inefficiencies and some discs being slower to extract due to condition causing rereads, etc.) you are looking at wanting 21 or more drives constantly on the go to get the job done in 3 solid 8-hour days (2,000x15/3/8/60 = 20.8). Five laptops, each with an internal SSD (128GB+) to extract to, five DVD readers on USB3 to extract from, and a 4+TB spinning disk (also external) to finally write to, might do the job and have the space (2,000x8.5/5 = ~3.5TB output per laptop). You'll need a powered USB dock for each laptop instead of a passive hub, and you are going to want to add more of everything to allow for the possibility of device failures.

Of course significantly less resource is needed (or you get more contingency time (and/or spare kit to deal with failures) from the same resource) if most of the media is DVD5 and/or not full disks. I've assumed the initial three days is just for obtaining the content - I've not accounted for any other processing (such as indexing and transcoding) or further distribution.
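A rough sketch of that arithmetic in Python (the 15 minutes per disc, mostly-DVD9 sizing, and five-laptop split are all assumptions carried over from above):

  # Drive count and per-laptop storage for the estimate above.
  discs, minutes_per_disc = 2000, 15
  working_minutes = 3 * 8 * 60                   # three 8-hour days
  drives_needed = discs * minutes_per_disc / working_minutes
  per_laptop_tb = discs * 8.5 / 5 / 1000         # mostly full DVD9 discs assumed
  print(f"{drives_needed:.1f} drives running constantly")  # ~20.8
  print(f"{per_laptop_tb:.1f} TB written per laptop")      # ~3.4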


I've done this kind of stuff, and between OS buffering and making sure the ripping software is writing large blocks (say 4-32MB at a time), it's possible to run drives at basically full bandwidth with something less than a dozen streams. There is going to be more inner/outer track bandwidth variation than the perf falloff going from 1 to 6 streams with large blocks. There are a lot of reasons for this, but a lot has to do with data placement effectively combining multiple streams into writes to the same sequential track.

More interesting is that even "sequential" read/writes already have seek times built in, because HDs aren't spiral track, so head switching and track-to-track seeks (and the associated rotational latency and finding the servo track) are inherent in sequential IO perf. Most filesystem placement/schedulers aren't going to place 3 files being written at the same time on opposite sides of a disk, so those head-switch and track-to-track times see nearly immeasurable increases: the drive itself is buffering a large part of a track's worth of writes, and moving 3 tracks plus a head is basically the same as just moving a head.


There's an easy fix. Suppose Ubuntu or openSUSE Linux: run your install on an SSD of at least 128 GB, set a swapfile of at least 64 GB right on your root partition, and make sure it's mounted as swap (put it in /etc/fstab or do it manually each boot). Now attach and mount your high-capacity storage.

Just have your script queue up a few discs in /tmp before moving them to the mounted storage.

Pretty easy, and now you have multi-level buffering caches that Linux knows how to work with efficiently and that can nearly guarantee sequential writes.


> research permit

The government has to explicitly permit you to do research?

What the fuck is wrong with this world?!


It gives access to an archive containing a huge quantity of irreplaceable material. Here are the requirements: https://www.archives.gov/research/start/researcher-card.html


I manage the IT for a school. As we've moved more and more away from DVDs (thanks, Apple), teachers have been asking me to rip their old DVDs. Tried doing that, but a lot of the DVDs did not rip perfectly, so the ripped video ended up having a lot of corruption. I'm not entirely clear why software could not play the ripped DVDs and just deal with the jitter and corrupted data, but playing from a DVD drive could deal with it. It was weird. Googled a bit, couldn't find much information other than the fact that CDs and DVDs can deteriorate over time.

My question is not necessarily related to OP. But can anyone tell me why my ripped DVDs could not play in DVD software properly when there was jitter or corrupted data, but the disc could play properly in the DVD drive? When I got lucky, the ripped DVDs would play fine. Otherwise, they would play fine for a few minutes until they ran into the first bit of corruption. Confused the heck out of me.


That's generally caused by a combination of a cheap/bad high speed DVD-ROM drive and/or bad ripping software/options settings. What seems to happen is the ripping software is running a marginal drive/dvd at full speed and disabling the drive retry on error. Combined with all the soft ways these drives fail it results in a bad rip.

This used to happen to me a few years ago, and it was overwhelmingly just a sign of a marginal drive. For about 10 years, I was replacing drives about once a year. It doesn't happen as frequently since I started using bluray drives (usually LG) to rip DVDs and a piece of software designed for "piracy".

Anyway, two software things to try, are lower the ripping speed to 1x-2x, and make sure to leave the drive retry on bad sectors enabled. Most decent ripping software will have options to control the error correction/retry logic. The problem is some copy protection schemes leave bad sectors on the disks, and this will hang a lot of the "honest" ripping software that doesn't know how to deal with it.


If the OP has been using a low-quality DVD drive, I recommend using an Apple SuperDrive, since he mentioned Apple.

They may not have all the latest whiz-bang features, but they are rock solid, and are USB powered. I have two at home, and use them to rip DVDs, and have never had a failure in I don't know how many years.

New they run $79 from Apple. Or you can get a used one off shopgoodwill.com usually for around $15.


Problem happened with SuperDrives too though. :(


As I understand it, software playing a ripped DVD doesn't "know" when a data source is a ripped DVD and does not expect any corruption -- it will not seek over or ignore issues like a DVD player connected to a TV will. I've run into this issue a lot, but fortunately MakeMKV, my software of choice, will fail to rip DVDs when it is unable to make a perfect rip. Then, I know there is a problem before I begin ffmpeg compression.

The solution: a wet polisher for DVDs. Luckily, my local used-game shop polishes disks for $3 each. They use an [Elm Eco AutoSmart][1] and it makes the discs like-new. I have never seen a disc it could not fix. You'd have to ask/search around locally to find one near you... I think some libraries have these machines, too, to maintain their DVD collections.

[1]: https://www.elm-usa.com/products/eco-autosmart


If they're not copy-protected, try gddrescue; it logs what it's read, and it can resume. Try another drive if you can. And, I know it sounds odd, you can rub your finger against the side of your nose and then across any scratches on the shiny side.


When I was a kid I used screen capture software to do what HandBrake does today. Considering how low DVD quality is, it should be fine. This is incredibly time consuming though.


That sounds horrible. I'd rather tell everyone that I can't rip their DVDs! :(


Assuming you were ripping to video files, you might have better luck creating a disk image for the DVD, mounting it with something like Daemon Tools, and playing from there.


Tried this, same result.


The real question here is how the heck does Amazon have a deal with the National Archives that prevents that public body from distributing public domain content that it now holds in its archive? Is this typical for digitally archived content?


Ancestry.com has a similar agreement with the National Archive: https://www.archives.gov/files/digitization/pdf/ancestry-201...

Five year embargo on the National Archive releasing whatever Ancestry chooses to digitise and host on their proprietary database. This model of digitization appears to be the principal way that the records are digitized now: https://www.archives.gov/digitization/principles.html


Think the Archives have the film reels, strips, which you're free to go look at and copy whenever you want. Amazon paid to digitize them and put them on the DVDs, so Amazon gets to make the rules for the DVDs.

I assume the film is still around in boxes, so even if the DVD window never drops, someone is still free to go digitize them again, but given how fragile film is that's touch and go.


The National Archive doesn't have the resources to digitize the material, but Amazon does. If the National Archive isn't interested in having the material digitized for them, nobody is forcing them to do a deal.


I think the main takeaway here is that we (citizens) should be volunteering to digitize material for the National Archives.


National Archives had them undigitized. Amazon said it would digitize it and sell the digital copies until it recouped the investment to digitize them, and then National Archives could distribute the digital copies.

It's no different than if you went to the National Archives and scanned something; you would not be required to distribute the scanned version.

However, apparently part of the deal was that the National Archives would get a digital copy that they could lend to researchers on-site. So this individual wants to copy those digital versions, piggybacking off Amazon's efforts. He/She's probably allowed to since they are public domain.


It is typical when the digitization is done in exchange for such a window.

Archives preserve and disseminate. Budgets usually make this a challenge. They probably (rightly) view the preservation of content on decaying media as a big win.


I’d call a staffing company or a very active meetup group and ask for help. Everyone could meet in a conference room and write the contents locally to their machines. I’d create Pod groups of 5-6 people to assist each other with the physical parts of the task, organizing and labeling what’s completed etc.

Thinking 15 minutes per disc, you have 30k minutes of work. One person can get 4 done per hour, so with two 12-hour shifts you just need about 25 people plus their laptops.
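A quick sketch of that arithmetic in Python (the 4 discs per person-hour and the two 12-hour shifts are the assumptions above):

  # Staffing estimate for getting through 2,000 discs.
  discs = 2000
  person_hours = discs / 4           # 4 discs per person per hour (assumed)
  hours_available = 2 * 12           # two 12-hour shifts over the window
  people = person_hours / hours_available
  print(f"~{people:.0f} people")     # ~21; round up to ~25 for slack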

If you need the equipment it could be rented by someone like this that does it for trade shows https://meetingtomorrow.com/austin-computer-rentals


There are more and more “active” senior living communities that may likely have a trove of willing and interested volunteers.


I had a similar challenge many years ago. I worked at a computer lab and had access to about 25 Dell Optiplex 380s, and I had about 200 DVDs I wanted to rip. They were for a friend, who was compiling city council meeting videos for a county.

So, I slightly modified Knoppix and started a netboot server, and made a dvd ripping cluster.

I didn't end up doing exactly this, but it's worth trying DVD::Rip. It specifically has support for ripping clusters[1].

But it doesn't even have to be something as complicated as a cluster, just as long as you parallelize things. How many laptops are you allowed to bring with you? Could you bring five laptops and 10 DVD drives?

[1]: https://ubuntuforums.org/showthread.php?t=1217643


I had a similar challenge when the first iPod came out. I got one on launch day in late October 2001 for my wife as a Christmas present.

With two HP towers, each with two optical drives, I spent nights and weekends ripping her CD collection. I just barely got it all done before Christmas. She still has no idea how much work that was.


You need DVD readers, storage, and the ability to rip to disk images (try to not reencode or modify the bits on disk for archival purposes).

Maybe put out a call to Jason Scott/ArchiveTeam/ArchiveCorps to saddle up. ArchiveCorps (run by Jason) has a Slack team for such Call To Arms.

OP: Have you attempted to renew your research card with the National Archive to get more time?


Jason Scott has a nice podcast where he talks about a range of topics of interest to the HN community.


Jason Scott's podcast: http://ascii.textfiles.com/podcast


Give a call to Jason Scott at the internet archive? Get lots of friends to help?


3 days isn't enough time to do a quality job on all of them given "one shot".

Better to just prioritize some sample of discs and do the best you can-- consider it a pilot run. If you monitor everything you'll be able to use the pilot run to estimate and propose what it would take in terms of equipment, time and process to do the whole thing in a reasonable time.


I've done this with an old PC, 6 DVD drive bays and a bootable USB stick, ripping to network storage, did like 200 movies in a weekend.

That PC sits in a basement now, I think. If anyone lives in Milwaukee and wants to rip some DVDs over the holidays, get in touch, more than happy to loan the machine out.


Do DVDs offer better error correction than CDs? I know from ripping CDs it can take over an hour per CD to get an accurate rip because the program has to read the disk multiple times to get the correct data out.


Do you have context on this? I’ve been ripping/verifying old CD based games using raw cdrdao and can rip a CD with error correction in a few minutes. I often sell games and use the same method to verify the integrity of the game before sending it off.


There's a CD ripper called Exact Audio Copy that has some pretty heavy EC/misread mitigation and that, in its most accurate mode, can easily take an hour for a moderately scuffed-up disc.

It's pretty much the gold standard for bit-perfect CD rips amongst the torrent community.


Sounds strange.

I thought CDs are digital so whatever is received is exact by definition?


When you are playing audio from a CD and there are read errors, you get a worse listening experience if you wait for the CD to spin around a couple more times trying to read the data again, causing a pause in the audio, than if you just play the corrupted data block, causing a short glitch. So that is what CD drives do for audio CDs.

The entire format of an audio CD is different from a data CD; the audio tracks are not just files. At least early CD drives did the entire audio processing in the drive, and they even used to have a separate audio cable that directly connected the CD drive to the sound card.

All this means - or at least meant, I have not ripped CDs in ages - that you have to work pretty hard to get a bit-accurate rip of an audio CD, because a lot of the processing is done in the drive and the raw bit stream is not easily available to the operating system and applications. Especially error correction - which is best effort for audio CDs and does not guarantee the absence of errors - at least used to happen so early and transparently to the following processing steps that it was very hard to know what happened or did not happen to the data stream.


In addition to the other points made, when errors get through even the error correction that CDs have, for the same reason FLAC can compress audio by about 2:1 and MP3s can do even better, waveforms can be repaired or even outright made up for significant fractions of a second without being noticed by a human. (Half a second would be noticed, but you can forge, say, 1/16th of a second fairly successfully with some simple algorithms. If you get really unlucky and you miss one of the really important voice transients like the letter T versus V or B versus D, it might be noticed, but otherwise it's really easy to fill in a gap like that.)

DVDs do not have that characteristic. While the streams will be recoverable, they will also be very visibly broken if the visible data gets corrupted, and that's much harder to repair. We are much more sensitive to visual corruption in general, and the way the compression works can make even small errors loom large on the screen. (As the compression gets more sophisticated, this gets more true, which is why DVDs generally just have fairly confined block errors, but modern codec errors can cause significant corruption on the screen, followed by the corruption on the screen "moving" like the video was supposed to.)


You can distort a digital waveform to the point where a 1 is read as a 0... if you get a perfect read of the disc then sure. But for uncompressed audio, a couple of bits being off here and there won't make much of a difference for most uses. (I'd bet high-speed ripping amplifies this problem.)

Of course it's not 'bit-perfect' at that point, but that's why EAC does so many retries.


Ah, I'm aware of at least one copy protection system where a bit was set so that, if read multiple times, it would give multiple answers. If you ripped it, it would be read as a definite 1 or 0, but the loader relied on it being read differently on multiple runs.


CDs are a physical media storing digital data. The data is stored in microscopic pits on the disc surface that, like all physical media, can be subject to read errors, due to various imperfections. It has built-in error correction but it is not perfect, so that program reads the same thing multiple times to get statistical confidence that the data is correct.


Your rips are full of errors unless the drive you're using has C2 (many do), even then-- you've got a few errors.

The only way to get an accurate rip of audio data is to compare it to a known good source. This is why Accurate Rip exists...

Source: I've ripped 4000cds :)

EDIT: This applies only to audio tracks. Data tracks don't suffer from this issue.


>The only way to get an accurate rip of audio data is to compare it to a known good source

That's the only way to confirm you got an accurate rip. It says nothing about how to get the rip in the first place. In my experience, "full of errors" sounds exaggerated. Most CDs can be easily ripped in 'burst mode' (in EAC) or the equivalent in other software and you still get the same (exact CRC32) result from the AccurateRip DB with no C2 involved (I disabled it), as long as you set the offset correctly.

Disclaimer: I didn't rip 4000 CDs, only hundreds.


You are technically correct, which is the best kind of correct :-)


It's worth noting that those problems are entirely caused by drives acting like black boxes, not the CD format itself.


Sort of though. CD Audio is meant to be played back with a minimal buffer before the DAC and to be kept spinning at a very precise speed. It's more like a digital vinyl record than it is a file system, for instance.


There's enough error detection that if you can actually get access to what bits come off the laser you won't have any major problems making a perfect rip. Your whole-disk alignment might be off by a couple sectors, but that's about it.


Aside from various copy-protected CDs, neither data CDs nor DVDs have the problem that audio CDs have with rip accuracy.


Make sure and test the copies. I used to do bulk copies years ago and the media + writing speed would make a difference. Writing at full speed would technically work, but the disks were sometimes unreadable in some readers.

'Back in my day' I would chain FireWire drives together (better than USB/Hubs at the time)


With that time constraint I wouldn't bother re-burning to discs, just rip them to a HDD (or several) to save the data for later.


At 12 hours of labor per day, that's roughly one per minute.

So, at minimum you need one machine more than the number of minutes it takes each machine to rip the DVD. If you also have to burn a second set of DVDs, you will need at least one machine more than the number of minutes it takes to burn a DVD plus the aforementioned ripping machines.

If you're just storing the DVD images after ripping, that part can be at least partly automated: set up a machine as a central server (about 20 terabytes should do it if nothing goes wrong) and have each 'ripper' run a script to copy the image up after the rip is done (which hopefully it can do while also starting the next rip, if not you'll need even more machines).
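A minimal sketch of that per-ripper script in Python (the device path, mount point, and naming scheme are hypothetical placeholders; a real setup would want ddrescue-style retry logic):

  import pathlib, shutil, subprocess

  DEVICE = "/dev/sr0"                     # DVD drive on this ripping machine (assumed)
  SERVER = pathlib.Path("/mnt/central")   # NFS/SMB mount of the ~20 TB central server

  def rip_one(label: str) -> None:
      local = pathlib.Path(f"/tmp/{label}.iso")
      # Image the whole disc to local storage first...
      subprocess.run(["dd", f"if={DEVICE}", f"of={local}", "bs=1M"], check=True)
      # ...then push it to the central server and free local space for the next disc.
      shutil.copy2(local, SERVER / local.name)
      local.unlink()

  rip_one("nara_disc_0001")               # hypothetical disc label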


I wanted to do this but with Blu-rays. My goal was to build a library of every movie ever made that was worth seeing. It seemed like 2018 was a good time to do this since good movies have apparently stopped being made. Anyway, if you wanted to watch tons of good movies, you would normally end up paying tons of money to rent them from iTunes. And even then you only get to see each one once and you have to have an internet connection. And streaming services don't have even a fraction of the selection needed. But Netflix's mail DVD service seems to have every movie I can think of. So why not open a few Netflix accounts, order disks in the mail and just save all the disk images?

It seemed like a good idea until I looked into Blu-ray copy protection. Of course, I wanted my library to consist of only the highest quality and highest fidelity, so Blu-rays were called for. But Blu-ray copy protection is devious, ingenious and very effective. Each disk consists of two regions: a region that holds encrypted movie data and a region that holds a key. It is illegal for players to be sold that read the key and then forward it to a user-facing interface like a computer. Players must only read the key in order to use it internally to decrypt the movie data. This stops all legitimate entities from selling players that reveal the key to the user.

But what about the illegitimate entities that might want to sell modified players that provide the key? Or just publish keys online? Well, the key on the disk is itself actually encrypted. And it is encrypted in such a way that multiple keys can decrypt it. Blu-ray players come with special hardware that is flashed with a key at the factory. This hardware uses that key to decrypt the Blu-ray's key. In the event that a key is compromised and published online, or used widely in any way, that key is revoked and all Blu-rays from that point onward contain keys that cannot be decrypted with the compromised hardware key. Instead, a newer key is used. This new key is still able to decrypt all the old Blu-ray keys as well as all the new ones. This effectively defeats people publishing keys online.

It's ingenious in that the people who conceived it realized that the only time key compromise is a problem is when those keys are disseminated widely, and that when keys are disseminated widely they are easy for authorities to detect. If you want to get perfect rips of any Blu-ray you might come across, you are forced to go through the pain of probing the hardware yourself to get that key, which is quite difficult. There's no way around it.


I’ve ripped every Blu-Ray I’ve purchased easily with MakeMKV, probably a couple hundred. I dunno, but someone is making it seem easy....


Like I said, MakeMKV (which is indeed the idiomatic tool for the job) will never be a reliable way to rip every Blu-ray. For my purposes, it was very important to be able to rip absolutely anything I came across. If you've got some Blu-rays and want to try it out then fine. It will probably work if they are older movies and were "pressed" long ago. The Blu-ray copy protection scheme also has the quality of making legitimate players obsolete if they were the source of stolen keys. So even a legitimate older player might not be able to read a new Blu-ray.


> The Blu-ray copy protection scheme also has the quality of making legitimate players obsolete if they were the source of stolen keys. So even a legitimate older player might not be able to read a new Blu-ray.

This is the reason why I never bought a single player or disk; Blu-ray and HDMI/HDCP copy protection went way too far (especially with the chain-of-custody nonsense). In the end, if the industry wants to fuck legitimate customers like that, fine, I just torrent everything; this way only a single person/group has to figure out how to rip the disk or broadcast, just once. There is no reason whatsoever to feel bad about it, they did it to themselves; no informed customer should ever buy into that shit. It's self-defense, I'm not buying a TV for thousands for the copy protection to brick it eventually.


MakeMKV seems to have processing keys dating from late 2018 (at time of writing) so, ignoring BD+, it should be able to decrypt every disc pressed before then.


You may be right, but that doesn't reflect my experience. Most of the BRs I rip are just barely released (pressed) and I have a 100% success rate with MakeMKV. They all literally worked on the first try, even new releases. Maybe I'm just lucky.

The only problem I’ve had is when they try to make it hard by putting a gazillion titles on the disk to make it hard for you to figure out which is the right one. That sucks, but is not insurmountable.


This has been my experience too. I've never had makemkv fail to open something. I suppose that if the dude who runs it ever gets hit by a bus or something, then it might not be useful going forward from then, but presumably somebody else would take up the work.


As long as you dump the encrypted key, you can reliably trust that there will eventually be a way to decrypt it later.

> In the event that a key is compromised and published online, or used widely in any way, that key is revoked and all Blu-rays from that point onward contain keys that cannot be decrypted with the compromised hardware key.

If someone merely provides a title key decryption API, is there any way to figure out which device key they're using?


Wow I had not actually thought of that. Hosting that service would cost money, unlike releasing keys on pastebin, and any attempt to do something like this, especially if monetized, would meet considerable retaliation from Blu-ray people. So I guess that’s why you don’t see it.

Getting keys from your hardware is a hassle and I didn’t want to wait for a decryption solution later.


Would that service really take more than a raspberry pi on tor?


Hmm... web/bittorrent over tor, with update file references. Just have the swarm on a given directory, update to the latest "full" swarm every day/release with the same directory for the keys available.


Yeah (DoS factor).


Just hold on to the discs you can't rip. Eventually enough keys will be leaked to decrypt them. Blurays will likely die out when internet is fast enough to stream high quality video and at that point you will be able to rip them all when a key is leaked.


Internet is already fast enough to stream high quality movies. The reason you can only access a comprehensive list of movies thru iTunes or disks is legal, not technological.


Not at BR quality. Well, it may be fast enough in sections, but it isn't economical to send out thousands of BR quality streams, so they don't.

Newer codecs help with that, yes. However, streaming will not be as high quality as a "station wagon full of tapes" i.e. local, in the near future.


Easily at BR quality with a high speed connection — it's only 20-30 mbps — but as you say, consumers don't care enough to pay what it would cost to stream at higher bitrates.


Really? Most people still use internet slower than 10 Mb/s and treat a computer as an enemy that works against them.


Source? I don't know of any ISP that still offers 10 Mbit service.


Come out to Western Montana. CenturyLink would only offer 10mbps/1mbps, of which I was lucky to get about half. They let me upgrade last month to 25/2, of which, I am lucky if a speed test results in 10/1.5. I'll take it as it is quite literally the best I can get. I recently paid Spectrum around $5k to extend a line to my house, but I'll have to wait until spring for it and its 400/25 (I am counting the days!).


Haha, come to Australia. That's faster than the connection my whole office at work shares. I don't know anyone who could stream a bluray which is usually about 40gb for an hour of video.


https://o2.cz

My top speed is 8 Mb/s. This is the most prominent provider in my country. My plan is 20 Mb/s, but that has literally never happened. A lot of people in my country use wifi providers with outdated equipment, giving them around 10 Mb/s as well. And lastly, check the page on average speeds of the Internet on Wikipedia.

BTW the absolute majority of customers are sharing their bandwidth and the infrastructure is built with that in mind. If everyone streamed 4K videos, the actual speed would go down massively - of course since the only solution would be to rebuild the infrastructure, that is not going to happen, instead they will lower the speed or introduce FUP.

You can see Czech Republic pretty high in that list - that's because it's average speed, not median. I have a friend that has a 600 Mb/s plan that's 3 times cheaper than mine, but that's not the norm - and I guess it's similar all around the world, either you're lucky... or not.


I am a half hour from a major metro. I would gladly pay another $80/mo to have 1.5 Mbps on a semi-reliable basis. You take a lot for granted. Our infrastructure sucks outside the suburban bubble.


Wow I'd forgotten how lucky I am right now--that's what I pay for gigabit fiber. Thanks for the perspective.


Pretty standard in the UK outside of major towns/cities.


Not that uncommon outside of major towns/cities either - I was getting barely 12Mbps on ADSL in inner London, and that only 600m from an exchange.


ADSL speeds aren't great, but fibre is available almost everywhere in cities in the UK...


True. Although even on fibre, I only get 25Mbps peak. Often it's hovering around 18-20Mbps (which is not quite enough for 4K streaming, IIRC.)


Of course that's fake fibre aka fibre to the green cabinet in your street. The UK allows that to be marketed as fibre despite the large difference in performance to a real fibre connection.


Having since upgraded to real fibre (I saw the man splice it into the router connector with my own eyes), the difference is staggering - now I regularly get 500Mbps+ up and down with not infrequent 800Mbps+.


And even that is surely not just for you, but shared.


Why not just read the unencrypted data that the Blu-ray players give you, though?


Like, ripping the HDMI output? You're limited to realtime speed for that, which is going to make the process take ten times longer.


Plus, critically, that signal is the result of the player messing around with the data. It might do all kinds of stupid processing. You don't know what quality of decoder they've put in the player. I guess there are ways to find out, so it's not a dealbreaker, but it's just another thing that complicates stuff. But overall I'd say that finding a player with really good decoding and capturing the resulting signal is your best option for the scenario I outlined. It wouldn't make a difference if it took 2 minutes or two hours because the window between getting a movie and mailing it back is way more than two hours.


> But overall I’d say that finding a player with really good decoding and capturing the resulting signal is your best option for the scenario I outlined.

Exactly.


Also HDCP becomes an issue.


You can get HDMI splitters on Amazon that will get rid of HDCP.


I've ripped a few of my own BR discs, it hasn't been an insurmountable problem.


You make it sound like you cracked the disk encryption or lifted keys off your player. Or did you just try out keys you found online?


Nope, just used Handbrake. It uses whatever library that handles BR decoding and key management. I believe sometimes a key hasn't yet been leaked on newer discs, however if you wait a few months it will have.


My hands are bleeding just thinking about the logistics of opening and closing 2000+ DVD cases, and loading/unloading that many DVDs, over the span of three days. Ouch.


My tip would be: find a nearby game company, especially a publisher. They tend to have large numbers of extra burning towers from the pre-USB-drive days. 2000 one-to-one copies is a bit more difficult than 2000 copies of one disc, though; I'd bet there's a copy function on some of them.


You need to find one of those FireWire Sony 200-disc DVD changers. Used to have one hooked up to my HTPC. Just a standard DVD writer in there.

Maybe you could reach out and borrow one?

From a quick google, it's the VGP-XL1B/VGP-XL1B2/VGP-XL1B3


There is a company in Utah called "VidAngel" which I believe had built up a fairly sophisticated DVD ripping/streaming operation until they got sued by Disney.

They are still around but have since switched to a streaming only model. You could reach out and see if they still have their DVD ripping infrastructure, I'm sure they would love to help.


Look for a local gaming cafe and rent all of their machines.


One PC with 8 DVD R/W Drives.

I used to own one in the 90s. I would do work for churches and conferences. I could get a thousand done in a few days. They used to have automatic feeder drives for producing more. No one sells them anymore, it appears.


40 TB of HDD and 10 (high speed) DVD readers with two guys should get the job done in one day. I wonder though if Amazon did use DRM? That would be an issue.


Contact archive.org, this is what they do.


Does anyone know why they can't just network the file system?

If he has network access, then it might be faster to just upload the entire set directly to another drive over the internet, since most universities have fairly good connections; that shouldn't be any slower than trying to write them all to DVDs.


Get several cheap laptops with DVD drives. Install Ubuntu. Buy a bunch of USB HDs. Use DVD::Rip to extract the VOBs onto the USB drives. At some point later, convert the VOB data to ISOs. Burn to DVD as time permits.


Complete speculation here and I have no idea if it's actually possible, but couldn't you take a high resolution photograph of the DVD and then use software to turn the image data back in to DVD data? Maybe with a couple of different colored light sources at obtuse angles to the disc? It's obviously not a practical solution but it'd be really fast if you could make it work.

It's definitely possible with phonograph records.


Maybe if you shot a laser at the DVD and imaged 1 bit at a time. And then maybe you could rotate the disc while you do this, so you can get a lot of bits in rapid succession.


I don't usually come to Hacker News to laugh hysterically, but thank you for that great start to the day.


Well, that sounds like it would never work!

(Yes, I know.)


Bravo, sir.


Just think about the amount of data stored on a DVD.

It's about 4 GB, which means that if you got a perfect image of just the data of the DVD (one pixel imaging each bit) the size of the picture would also be 4 GB. That means, as far as I understand, a 4 gigapixel image AT BEST, assuming you were able to get every single pixel to capture one bit of data on the DVD. In reality, since DVDs are circular and images are square you're looking at images that would be more like 6 gigapixel if my math is correct.

Today, available cameras are at best 50-megapixels, which means you need at the very minimum about 100 photos of each DVD, but in reality, there is no way you can ever get one bit per pixel, because to begin with the bits are not aligned in rows unlike the camera's pixels, so you're most probably looking at ten times that.

It's an interesting idea but I'm pretty sure it's completely impractical.

Edit: I realised I counted 4 GB, but I should actually have used 32 Gb as the starting number, as what matters here are the bits, not the bytes; so add another 8x to the number of photos you need.
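Redoing that arithmetic in Python (treating 4.7 GB as the payload and one captured bit per camera pixel, both simplifications):

  # Pixel-count arithmetic for photographing a DVD's data surface.
  disc_bits = 4.7e9 * 8              # ~3.8e10 bits on a single-layer disc
  camera_px = 50e6                   # ~50-megapixel still camera (assumed)
  photos = disc_bits / camera_px
  print(f"{disc_bits / 1e9:.0f} gigapixels minimum, ~{photos:.0f} photos per disc")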


Unhelpful to GP.

Pied Piper could do it. Would use their “middle out” compression algorithm - anyone with a 5.2 Weissman score could do this pretty easily.


Consider that you'd be trying to photograph 100 billion bits of pit and land. To capture it properly you might need something like a terapixel image. Nothing can do that in one photo, or a short series of photos.


No. DVDs are much higher density than phonograph records.


If I recall correctly, even for vinyl, "high resolution" in the context of demonstrations that had decently high quality audio reproduction meant scanning over the record with a microscope, with resolution significantly beyond what would be possible with a normal camera.


[citation needed]


Think about it -- a vinyl can hold ~22 minutes of music per side, and a DVD-R can hold ~120 minutes of video per side, in a much smaller area.


Just a rough calculation:

Each pit on the DVD represents 1 bit but needs at least 1 pixel to store. You need 8 bits to store a pixel. So you have a write amplification of at least 8 when taking a picture of a DVD.


That's why I suggested using a couple of light sources - you could take an image that's lower resolution than 1 pixel per bit and still resolve the image data to the DVD data based on the color. Eg 10 would be a different color to 01.

It's not even remotely practical but I thought it was an interesting idea. Judging by the way the votes are yoyo'ing up and down, about half of HN readers agree. :)


I'd assumed some optical magnification would be required in combination with multiple photos per disc.

Then software would have to sort out the mess.


Come on, you can calculate that without a calculator.


You don't even need to do the maths. Just looking at a vinyl would give the game away since you can see grooves with your naked eye (as a DJ in an earlier life, we used to use that as a visual clue for when to queue up the next record, swap the basslines, etc).

Not to mention how albums would often be spread over two records whereas the same album could fit on a single audio CD (let alone DVD).

I'm guessing the GP has never owned any vinyl. I could forgive someone questioning the density of records if he's not familiar with the tech.


> I'm guessing the GP has never owned any vinyl. I could forgive someone questioning the density of records if he's not familiar with the tech.

I'm old enough to know about vinyl. That's not really the point though - the principle is the same, just with much higher information density. With a suitably high resolution camera, or set of cameras, or video, and some serious magnification it should be possible to photograph a DVD and play back the data from the image. I find that quite interesting, or at least entertaining, to think about. I guess the HN readers who're downvoting the post don't.

Something to consider - technically we can already do it - it's how DVD players work, albeit with a single coherent light source and only reading one bit at a time.


> I'm old enough to know about vinyl. That's not really the point though

I was talking about user U2EF1 (the "citation needed" comment). I didn't have any particular issue with your comment aside the impracticality of it with current consumer hardware (which I'll address below). But as a concept it's an interesting point.

> the principle is the same, just with much higher information density.

The problem is in the detail. Even with gramophone records, you'd get a very low success rate with a consumer camera. Plus you'd need multiple macro pictures and a precise way to stitch them together. At which point it would be quicker to play the record while recording it in Audacity (or similar) in real time. So you're talking less than 1x record speed and lower success rates to boot; and that's just low-density vinyl. When talking about DVDs you'd need to improve the operation by several orders of magnitude in terms of accuracy and resolution.

So I think your comment was interesting from a conceptual point but we're a long long long way off having that level of detail in consumer photographic devices.

> Something to consider - technically we can already do it - it's how DVD players work, albeit with a single coherent light source and only reading one bit at a time.

I think it's a little disingenuous having a laser reading binary reflections sequentially compared to a digital sensor detecting literally billions of analogue reflections (because a camera isn't just detecting the existence or absence of light) in parallel and then forming a precise sequence of digital bits from that. The technologies are completely different, the scales are completely different, the concept is completely different. They're just not equatable.

> I find that quite interesting, or at least entertaining, to think about. I guess the HN readers who're downvoting the post don't.

It wasn't me who downvoted you but if it's any consolation, I've gotten downvoted for factually accurate comments before (let alone impractical suggestions). You just have to remind yourself that it happens occasionally and usually the positive outweighs the negative. :)


This is actually an interesting idea in some ways, but there's such an enormous difference in scale between the problem and your solution that it's completely impossible. Like the bit with the guy trying to jump across the English Channel: https://youtu.be/Qvk2wNWmB20?t=47s


This sounds like a fun DIY project to attempt.


There are many old SCSI and Firewire DVD changers available. Just gotta ask the right people.


700 per day. About 30 per hour. Each DVD is what, 4 GB? Maybe 10 laptops with SSD drives or external hard drives would be enough. And lots of caffeine!


More granularly, about 2 minutes per disc. A 20x drive can read one disc in around 3 minutes, so a few of those running in parallel should be enough. (This is supposing all the discs can be read at full speed, which is not necessarily the case in reality; but increasing the parallelism to compensate can be done easily.)


Surely that could be pumped up: write a script to auto-grab the video files, rename them to $dvd_name(n) and place them in a folder. Deal with combining those files later (when you have more time).

Note: it's been a long time since I have tried to pull DVD files, but if my (rusty) memory serves, certain types of DVD readers will just let you copy the VOBs.


https://github.com/automatic-ripping-machine/automatic-rippi... can be modified fairly easily to just copy the disc image and not attempt any encoding.


You’d want a block level image of the DVD for archival purposes, such as an ISO or BIN format.

Using dd (Linux) or Disk Utility (Mac) makes this straightforward.


Though it's probably not the case here, there are some discs with copy protection that specifically trips up block-level copies. The drive will continually error and misread sectors. I'm not sure what open-source tooling can be used to circumvent; I always end up falling back to AnyDVD.



  dd if=/(name of drive) bs=1 of=/(path on the target drive, including the filename) conv=notrunc

Create a directory and a file on the target machine first, with a name like /myfile/thisfile. Then, to create a bit-by-bit copy, use a command something like this:

  dd if=/dev/cdrom bs=1 of=/mnt/sdc/myfile/thisfile.iso conv=notrunc


I'd use gddrescue - it logs what it's successfully read, so you can try multiple drives. And there should be some way of reading the DVD volume title from a script.


Why set the block size to 1? Isn't that just going to cause a ridiculous amount of syscalls?


Yah, and be incredibly slow as well. It should be at a minimum 32K which is the DVD block size, but even larger would be better (but may interleave badly with IO to other devices).


It should be 2048, which is the DVD sector size.


Because of ECC it doesn't read a single sector at a time. And actually even 32K is not large enough for good performance, you should go higher, maybe 1MB.


That does not deal with DRM


Does it need to? From the source material it sounds like there should not be any DRM on it, but even if there is, does DRM prevent a bitwise copy of the media? If not, you can deal with DRM after making the copy.


It was mentioned elsewhere - some drives will randomly flip bits in order to make bitwise copies impossible if you just start copying. Software that can actually work correctly with the CD/DVD drive is necessary.



