Three hundred and sixty years of United States caselaw (case.law)
187 points by crunchiebones on Oct 29, 2018 | 58 comments



Hi! I'm a developer on this project. Happy to answer questions for the next hour or so, and I'll check back tomorrow as well.

This is a first public release and the Harvard Library Innovation Lab is a small team, so please look at the site as just the beginning! In particular we're starting by targeting developers with an API and bulk data downloads. More user friendly features like a front-end browser, PDF scans, ngram browser, etc. are on the roadmap, but we're hoping other developers will step in and build some great tools as well.


I'm assuming that as a university library team you're familiar with the Blacklight ecosystem of library search tools?

A single Rails dev and I modified the current Blacklight codebase to handle PDFs (and other office docs) with minimal effort a few months ago. I think doing the same when you get to PDF handling and a web UI would be a completely valid starting point.

I'll save you the trouble of figuring out which is the only existing feasible open source library to sift through hundreds of thousands of PDF pages... QPDF is it.


Thanks for this suggestion! We'll definitely kick Blacklight around as an option. We have a complex data model (cases, opinions, parties, citations, courts, jurisdictions, volumes ...), and tens of millions of pages, and weird access restrictions, and multiple output formats (text/html/xml/pdf), so whatever we end up with will have to be pretty custom.


I worked for the British Medical Journal a while ago which had 120 years of articles digitised, along with several different medical databases, with around a dozen different applications utilising this content in various ways. It was basically all stored internally as XML using the Documentum/xDB CMS/XML database stack, with lots of XSLT used to generate content for different apps and in different formats.

Not at all the easiest system to work with, but it does allow you to handle a lot of structured data such that you need the minimum amount of customisation possible - which is still a lot!


If the project is truly "free" - build it in a way that can be passed down through generations: easily indexed with semantic data and without proprietary logins or fiddle-some navigation. archive.org would be a good long term home or some other long lasting and perpetually funded service, beyond what Harvard would spend its billions of alumni bucks on.


Definitely! I'll sleep better when we have all the bulk data shipped to everyone who wants it. That will happen by March 2024, or earlier for any states that switch to official digital publishing.

(I mean, Harvard isn't a bad home for this -- I work in a building with books that predate the printing press, and I work on stuff like Creative Commons-licensed forkable textbooks. Libraries are cool places. But Harvard definitely shouldn't be the only place that preserves this data set.)

As far as preservation-friendly formats go, our bulk data download format is xzipped jsonlines, which is tuned for NLP (highly compressed, parseable in a few lines of python with low memory requirements) rather than preservation:

https://case.law/bulk/download/
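
As a rough illustration of the "few lines of python" point (a sketch, not the project's own code), an xz-compressed JSON Lines file can be streamed one record at a time with only the standard library. The file name and the `id`/`name` fields below are invented stand-ins; the real bulk files will have their own layout and schema.

```python
import json
import lzma
import tempfile
from pathlib import Path


def iter_cases(path):
    """Yield one parsed JSON object per line of an xz-compressed JSON Lines file."""
    # lzma.open in text mode decompresses incrementally, so only the
    # current line needs to be held in memory.
    with lzma.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def write_sample(path):
    """Write a tiny stand-in file; the field names are hypothetical."""
    records = [
        {"id": 1, "name": "Example v. Sample"},
        {"id": 2, "name": "Sample v. Example"},
    ]
    with lzma.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.jsonl.xz"
        write_sample(sample)
        for case in iter_cases(sample):
            print(case["id"], case["name"])
```

Because `iter_cases` is a generator, memory use stays roughly constant regardless of how large the compressed file is.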

Internally we have a preservation format where each volume is stored as a bagit bag containing METS XML for the OCR and case-level data, plus color and black and white images of each case. These are much harder to work with, so it's not a focus to share them right now, but we can definitely share if someone makes a case for it.


No kidding. After looking at this project a bit longer, I'm much less optimistic about it. Frankly, it really depresses me that they require research agreements to download all the information.

Developers of this project: Please make the information freely available in a way that doesn't require agreements with a giant for-profit company.

For now, I'm convinced that this project is nothing more than a veiled advertisement for LexisNexis.


Unless you can provide them with the information, it's not the developers who are making the rules. There's a reason Lexis is as profitable as they are.


My reading of this is that it's something they're contractually bound to for the near term but not forever:

> Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024.

Hopefully this means no logins, also, but that's less clear.

The blame on this should really fall on the courts that allow private companies to paywall access to the rule of law. At least some states have started to do it right:

> Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.


> Hopefully this means no logins, also, but that's less clear.

The only things we have behind logins are what we are contractually required to, yes. This gets pretty fine-grained -- if you do a logged-out search across jurisdictions, requesting full text, the json contains error fields for the specific fields we aren't allowed to share without a login yet.


Awesome, thanks for the clarification!


Hi! This is so exciting! Quick question: how does this compare with the Courtlistener/free law project data? (And are there any plans to work with them?)


Great question! We love Mike Lissner and have talked with him about connections between our projects. We'll definitely work with him if we get a chance.

Benefits of our data set: we're more complete, being a census of all known volumes of official caselaw back to the beginning; we're easier to work with for data processing, since all of our data, across centuries and states, is in one consistently structured format; we have the page images, meaning if there's any question of accuracy, we can check the final authority.

Benefits of FLP (and there may be others I don't know): they're updated in realtime from scanning court websites, so they'll stay up to date in a way we won't; their scraped text for modern cases doesn't have OCR errors; their site is much more featureful.

At this point I see our strength as being a complete/consistently-formatted/authoritative data set of printed cases, which leaves lots of room for other caselaw databases with complementary goals.


Thanks! I have a half-completed python library to access their api [1] --- maybe this will be good motivation to finish it and add yours, see if there's a useful way to query both at the same time in some circumstances.

[1] https://github.com/paultopia/lawpy


Do you have any documentation that outlines the meaning of all the metadata fields? The one I'm curious about is the "whitelisted" value on the jurisdiction, such as https://api.case.law/v1/jurisdictions/us/.


We do have some docs, though suggestions are welcome! Here's the definition of whitelisted:

https://case.law/api/#def-whitelisted


Clicking Washington on the map fails, with a 400 error.

It looks like the link from the map is using "wa" for Washington, when it should be "wash".

I'm curious now. Some states do use a 2 letter code, some use 3, and some use 4. Why didn't you use the same naming format for all of them?


Thanks, I passed on your bug report!

For the jurisdiction slugs we use the standard legal citation abbreviations for each state:

https://law.resource.org/pub/us/code/blue/IndigoBook.html#T1...

This has the advantage of matching the citations to cases, like "123 Wash. 456".
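
To make the slug/citation match concrete, here is a small sketch that splits a citation such as "123 Wash. 456" into volume, reporter, and page, then guesses a slug from the reporter abbreviation. The lowercase-and-drop-periods rule in `guess_slug` is an assumption generalized from the single "wash" example in this thread, not the site's documented rule, and it will not hold for every reporter.

```python
import re

# volume, reporter abbreviation, first page: e.g. "123 Wash. 456"
CITATION = re.compile(r"(\d+)\s+(.+?)\s+(\d+)")


def parse_citation(cite):
    """Split a 'volume reporter page' citation into its three parts."""
    match = CITATION.fullmatch(cite.strip())
    if match is None:
        raise ValueError(f"not a volume/reporter/page citation: {cite!r}")
    volume, reporter, page = match.groups()
    return int(volume), reporter, int(page)


def guess_slug(reporter):
    """Hypothetical helper: derive a slug guess from a reporter abbreviation."""
    # Drop a trailing series marker such as " 2d" ("Wash. 2d" -> "Wash.").
    base = re.sub(r"\s+\d+d$", "", reporter)
    # Assumed rule: lowercase and remove periods ("Wash." -> "wash").
    return base.replace(".", "").strip().lower()


if __name__ == "__main__":
    volume, reporter, page = parse_citation("123 Wash. 456")
    print(volume, reporter, page, guess_slug(reporter))
```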


Ha...you posted this 11 minutes ago as of right now. 12 minutes ago I found a similar link citing Bluebook abbreviations and posted a "never mind, I figured it out" note (and mentioned I felt silly for not noticing it right away--I went to law school in Washington and so spent three years seeing "Wash." on the spines of case reports).

It took me 12 minutes to post because I tossed in some explanations for people not familiar with legal citations. As soon as I posted, I saw yours, and deleted mine as redundant.

There is one thing, though, that I discovered working on that post and am curious about now. I picked a case at random to use as an example, Peterson v. City of Seattle, 316 P.2d 904, 51 Wash. 2d 187 (1957).

Here's a link to the Washington Supreme Court's opinion:

https://law.justia.com/cases/washington/supreme-court/1957/3...

Within the opinion they cite the case as 51 Wn.2d 187 (1957) at the top. Inside the opinion they cite some Washington cases as Wn and some as Wash. I could see no obvious pattern as to which they pick.

Anyone happen to know offhand what determines which form they use?


omg the limericks. Pls tell me how many there are. I need to know what I'm getting myself into...

<clicks "New Rhyme!" button>

...

> It is expressly laid down in Bull

> I refer to the receipt in full.

> The company first.

> Decision reversed.

> Did they state they would replace your wool?


The limericks are absolute gold.

> Brown with his papers in his trunk.

> Kilburn laid down on the top bunk.

> The court reconvened.

> The State intervened.

> She thought perhaps Kristin was drunk.


They're randomly generated. You'll be in for the long haul.


Very interesting! What scanner, resolution and tools for OCR did you use?


How long will it take to replicate Lex Machina? :D


Is caselaw data an artifact originating from the public dollars which funded the whole show to begin with?

Why isn't this data available for free to any American citizen who pays taxes? In the form of a torrent, the distribution cost is negligible. A duplication fee seemed reasonable back when replicating vast amounts of information involved massive amounts of paper and toner. But today... I don't understand.

I'm curious enough to have asked this same question on Quora [0].

YMMV.

[0] https://www.quora.com/unanswered/How-can-LexisNexis-own-the-...


Caselaw data is public domain, but it's very expensive to digitize -- partly because it's mostly stored on paper, and partly because it's mixed together with copyrighted material.

For this project we had to scan 40,000 volumes of caselaw. We used a high speed scanner at the Harvard Law Library, and went through about 40 million pages at a rate of 500,000 pages a week over a couple of years. The pages then had to be redacted of copyrighted material like headnotes inserted by private publishers, since courts typically don't publish the cases themselves, and those redactions had to be checked by humans.

That work was funded by a startup, Ravel, which is why we ended up with temporary limits on commercial use of the data. No later than March 2024, however, it will all be fully available for bulk download by anyone in the world. If necessary we'll set up a torrent. :)

(Hopefully earlier! For any state that starts officially publishing its caselaw in digital form, we can immediately release their caselaw back to the beginning, as we have already for Illinois and Arkansas.)


One note that seems relevant to the grandparent comment: Ravel is now owned by LexisNexis.


> Is caselaw data an artifact originating from the public dollars which funded the whole show to begin with?

Yes and no. It’s important to realize that the US courts (1) are a distributed system comprising hundreds of autonomous courts; and (2) predate the internet, photocopiers, telephones, telegraph, a large centralized federal government, and indeed the federal government itself.

Today court opinions are published as PDFs on courts’ websites. But back in the day, they were published as slip opinions stored in the clerk’s office of each individual court. Private companies like West undertook to collect cases from all these hundreds of courts and publish them in books called “reporters.” Back then (and even today) that meant sending someone out to hundreds of courts to collect and copy the decisions. They not only published the opinions, they organized everything within a comprehensive ontology of their own creation, and added their own annotations.

When computers were invented a century later, these publishers were well placed to digitize their collections and offer access to them over pre-internet electronic systems. Then, of course, those moved to the Internet.

Even collecting these cases together on a going-forward basis is no easy task. As noted above, the courts are decentralized, by design, even within the federal system. Just getting the decisions from hundreds of courts and uploading them would be an expensive endeavor. Nothing stops someone from undertaking this: court decisions themselves cannot be copyrighted, and you’re free to go to any court and ask to copy published decisions.


I worked for one of the first few web-based services. West Law was the de facto standard then (2000/2001). They even sued us and claimed that page numbers used in citations were proprietary information.

To answer your question re: its origins being publicly funded, no. When there was no internet database to connect to, people bought the books from a print publisher (either at huge maintenance cost or sparingly at the expense of not knowing what was current). The print publishers bore the cost and reaped the profits of consolidating all of this data.

To answer the next obvious question: yes, there are probably people in prison or not depending on whether a small town law library bought the updates from West Law in a timely fashion 20 years ago.


It is available. Go to any law school library, look on the shelves.

But that doesn't mean that private entities who digitize it through their own labor are required to give you their resulting digital datasets.

All the more reason to be thankful that these folk are doing exactly that, and giving us tools to get at the data.


"Case Law" has a specific meaning for those of us who are not versed in the jargon:

https://en.wikipedia.org/wiki/Case_law


The Norwegian equivalent of this project got sued for copyright infringement of supreme court verdicts: http://www.wiumlie.no/2018/rettspraksis/10-22-returns



"The agreement with our project partner, Ravel, requires us to limit access to the full text of cases to no more than 500 cases per person, per day. This limitation does not apply to researchers who agree to certain restrictions on use and redistribution. Nor does this restriction apply to cases issued in jurisdictions that make their newly issued cases freely available online in an authoritative, citable, machine-readable format. We call these whitelisted jurisdictions. Currently, Illinois and Arkansas are the only whitelisted jurisdictions."

TLDR: 500 cases per day, but looks like you can buy access from Ravel [1].

[1] https://home.ravellaw.com/


Yep, Ravel will absolutely negotiate commercial licenses. No later than 2024, the entire corpus will be available for free for all use cases, including commercial.


Needs a simple JavaScript or similar client (REPL?) baked into the page.

The API appears powerful, but there are lots of free alternatives that don’t require a comfort level with shell scripting or REST calls.


Project dev here. We’re hoping to have something simple available within a week or two.

If you’re just looking for a nice case browsing interface, in the interim, you should check out Ravel’s site.


I'm not aware of a free alternative for all US published caselaw. Could you cite?


google.com/scholar

I also really like casemine.com.

I have no idea how incomplete those archives are but I doubt they’re missing much I’d be looking for.


Well, google scholar certainly does not have all published opinions, far from it. It is not an adequate research tool, just a starting point. I'm not familiar with casemine, will check it out.

EDIT: casemine appears to be neither free, nor to have all published US decisions as far as I can tell.


Consider github.

German and French law is already there.

https://github.com/bundestag/gesetze

https://github.com/steeve/france.code-civil


If the creators are listening in - do you have any plans to include the text of the court docs as well?


The full text of the cases is available. For example, here is a query for the full text of Illinois cases, which are available without a login:

https://api.case.law/v1/cases/?jurisdiction=ill&full_case=tr...

Full text for cases other than Illinois and Arkansas is limited to 500 cases per person per day, which is why we don't include it in API results by default.
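
For anyone scripting against a per-day cap like this, one option is to track the budget locally so a batch job stops before it runs out. This is only a client-side bookkeeping sketch: the `DailyQuota` class is hypothetical, and it says nothing about how the server actually counts or enforces the 500-case limit.

```python
from datetime import date


class DailyQuota:
    """Client-side counter for a per-day request budget (hypothetical helper)."""

    def __init__(self, limit=500):
        self.limit = limit
        self._day = None
        self._used = 0

    def try_spend(self, today=None):
        """Count one full-text request; return False if today's budget is used up."""
        today = today or date.today()
        if today != self._day:
            # First request of a new day: start counting from zero again.
            self._day = today
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True

    @property
    def remaining(self):
        """Requests left for the most recently seen day."""
        return self.limit - self._used


if __name__ == "__main__":
    quota = DailyQuota(limit=500)
    if quota.try_spend():
        print("ok to request one full-text case;", quota.remaining, "left today")
```

Cases from whitelisted jurisdictions would not need to be counted against such a budget, since the limit does not apply to them.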


Gotcha! I've been looking for something like this for a while now. Thanks for making it!


Do you have, or know of, anything like this for federal courts (bankruptcy, district, appellate, supreme)?


project member here ... this site includes access to federal cases, e.g. https://api.case.law/v1/cases/?cite=&name_abbreviation=&juri...


I always wondered if any of the cases directly contradict each other.


Can you elaborate on the notion of US case law, before the US existed? I don't mean this to be snarky, I assume this was intentional. Is this referring to cases prior to the US's founding, that the US then adopted as relevant legal precedent?


Good question! This refers to caselaw from courts that predate the United States, such as Maryland and Massachusetts:

https://api.case.law/v1/cases/

State courts didn't come into existence with the US Constitution -- the Massachusetts Supreme Judicial Court, for example, dates back to 1692, and those precedents are still "good" in some sense though unlikely to be cited.

We don't have English precedents, unfortunately, as guessed by some sibling comments.


The US inherited the common law of England. There's usually something more relevant now, but you'll still sometimes see citations of old English cases or the Magna Carta.


The US inherited the common law of England.


Mnn 360 years of United States coleslaw.


"Coleslaw" is an anglicization of the Dutch "koolsla" or cabbage salad. Its existence in the US would likely date back to the first Dutch colony in North America: New Netherland, which was first settled in the 1620s. As such, American coleslaw has a history of roughly 390 years and antedates American caselaw by a few decades.


You got me curious -- our first case with the modern term "coleslaw" is from 1935, in Rhode Island, and begins ominously with "The plaintiff in this case was injured by swallowing two small pieces of wire concealed in an order of beef stew, bread and coleslaw which she had purchased and was eating at defendant’s lunch counter."

https://api.case.law/v1/cases/?search=coleslaw&full_case=tru...

Apologies that the link requires a login to view the full text, and for various other shortcomings in the current browsing experience. Also apologies to anyone who actually reads the coleslaw case -- caselaw is a scary place.


Coleslaw caselaw could go on Bob Loblaw's law blog.


This is the comment I came to this thread hoping to see.


As irrelevant as this comment is, this is EXACTLY what my mind thought about the subject line too. If you hadn't posted it, I might have. Power on!


This sort of comment isn't really in the spirit of this forum.



