
Aaron Swartz, cofounder of Reddit and inventor of RSS and Markdown, was hounded to death by an overzealous prosecutor for downloading articles from JSTOR, with the intent to learn from them. He was charged with over a million dollars in fines and could have faced 35 years in prison.

He and Sam Altman were in the same YC class. OpenAI is doing the same thing at a larger scale, and their technology actually reproduces and distributes copyrighted material. It's shameful that they are making claims that they aren't infringing creator's rights when they have scraped the entire internet.

https://flaminghydra.com/sam-altman-and-aaron-swartz-saw-the...

https://en.wikipedia.org/wiki/Aaron_Swartz




I'm familiar with Aaron Swartz's case, and that is actually why I phrased it as "books". In any case, while tragic, Swartz wasn't prosecuted for copyright infringement, but rather for wire fraud and computer fraud due to the manner in which he bypassed protections in MIT's network and the JSTOR API. This wouldn't have been an issue if he downloaded the articles from a source that freely shared them, like sci-hub.


It would be incredibly naive to assume that the scraping done for these models did not at any point circumvent protections.

The fundamental contention is that both accessed, saved and distributed material that they didn't have a "right" to access, save, and distribute. One was made a billionaire for it and another was driven to suicide. It's not tragic, it's societal malpractice.


Will what OpenAI & others are doing serve as precedent for Alexandra Elbakyan of Sci-Hub and avenge Aaron?

Cynically, I imagine it will not but I hope that it could.


You could argue that they are avenging him in doing exactly what he did, or worse, and not being punished for it. They are establishing precedent.


I'm responding specifically to this sentence:

> It's shameful that they are making claims that they aren't infringing creator's rights when they have scraped the entire internet.

Scraping the Internet is generally very different from piracy. You are given a limited right to that data when you access it, and you can make local copies. If further use does something sufficiently non-copying, then creator rights aren't being infringed.


Can you compress the internet including copyrighted material and then sell access to it?

At what percentage of lossy compression it becomes infringement?


> Can you compress the internet including copyrighted material and then sell access to it?

Define access?

If you mean sending out the compressed copy, then generally no, at least for the things people normally call compression.

If you want to run a search engine, then you should be fine.

> At what percentage of lossy compression it becomes infringement?

It would have to be very very lossy.

But some AI stuff is. For example there are image models with fewer parameters than source images. Those are, by and large, not able to store enough data to infringe with. (Copying can creep in with images that have multiple versions, but that's a small sliver of the data.)


Commercial audio generation models were caught reproducing parts of copyrighted music in a distorted and low-quality form. This is not "learning", just "imitating".

Also, as I understand it, they didn't even buy the CDs with music for training; they got it somewhere else. Why don't the organizations that prosecute people for downloading a movie want to look into whether it is OK to build a business on illegal copies of copyrighted works?


I said "some" for a reason.


When you identify where the infringing party has stored the source material in their artifact.{zip,pdf,safetensor,connectome,etc}. In ML, this discovery stage is called "mechanistic interpretability", and in humans it's called "illegal."


It's not that clear cut. Since they're talking about taking lossy compression to the limit, there are ways to go so lossy that you're no longer infringing even if you can point exactly at where it's stored.

Like CliffsNotes.


It was overzealous prosecution of the breaking into a closet to wire up some ethernet cables to gain access to the materials

Not the downloading with intent

And apparently the most controversial take on this community is the observation that many people would have done the trial, plea and time, regardless of how overzealous the prosecution was


> breaking into a closet

"The closet's door was kept unlocked, according to press reports"

When's the last time a kid with no record, a research fellow at Harvard, got threatened with 35 years for a simple B&E?


They threaten

It's the plea or sentencing where that stuff gets taken into account for a reduction to community service


I'm glad you still have that much faith in the system. That's much more faith than I have in the system (and more faith than I had in the system back then, too).


Wasn’t John Gruber the inventor of Markdown?


> for downloading articles from JSTOR, with the intent to learn from them

For context, according to sources, he downloaded 4.8 million articles.


Maybe he was about to train an LLM on them /s


35 years is a press release sentence. The way DOJ calculates sentences when they write press releases ignores the alleged facts of the particular case and just uses for each charge the theoretically maximum possible sentence that someone could get for that charge.

To actually get that maximum typically requires things like the person is a repeat offender, drug dealing was involved, people were physically harmed, it involved organized crime, it involved terrorism, a large amount of money was involved, or other things that make it an unusually big and serious crime.

The DOJ knows exactly what they are alleging the defendant did. They could easily look at the various factors that affect sentencing for the charge, see which apply to that case, and come up with a realistic number, but that doesn't sound as impressive in the press release.

Another thing that inflates the numbers in the press releases is that defendants are often charged with several related charges. For many crimes there are groups of related charges that for sentencing get merged. If you are charged with say 3 charges from the same group and convicted on all you are only sentenced for whichever one of them has the longest sentence.

If you've got 3 charges from such a group in the press release the DOJ might just take the completely bogus maximum for each as described above and just add those 3 together.

Here's a good article on DOJ's ridiculous sentence numbers [1].

Here are a couple of articles from an expert in this area of law that look specifically at what Swartz was charged with and what kind of sentence he was actually looking at [2][3].

Why do you think Swartz was downloading the articles to learn from them? As far as I've seen, no one knows for sure what he was intending.

If he wanted to learn from JSTOR articles he could have downloaded them using the JSTOR account he had through his research fellowship at Harvard. Why go to MIT and use their public JSTOR WiFi access, and then when that was cut off hide a computer in a wiring closet hooked into their ethernet?

I've seen claims that what he wanted to do was meta-research about scientific publishing as a whole, which could explain why he needed to download more than he could with his normal JSTOR account from Harvard, but again, why do that using MIT's public WiFi access? JSTOR has granted more direct access to large amounts of data for such research. Did he talk to them first to try to get access that way?

[1] https://web.archive.org/web/20230107080107/https://www.popeh...

[2] https://volokh.com/2013/01/14/aaron-swartz-charges/

[3] https://volokh.com/2013/01/16/the-criminal-charges-against-a...


He might have wanted other people to have access to the knowledge, and for free. In comparison, AI companies want to sell access to the knowledge they got by scraping copyrighted works.



