Hi! I'm a developer on this project. Happy to answer questions for the next hour or so, and I'll check back tomorrow as well.
This is a first public release and the Harvard Library Innovation Lab is a small team, so please look at the site as just the beginning! In particular we're starting by targeting developers with an API and bulk data downloads. More user friendly features like a front-end browser, PDF scans, ngram browser, etc. are on the roadmap, but we're hoping other developers will step in and build some great tools as well.
I'm assuming that as a university library team you're familiar with the Blacklight ecosystem of library search tools?
A single Rails dev and I modified the current Blacklight codebase to handle PDFs (and other office docs) with minimal effort a few months ago. I think that would be a completely valid starting point when you get to PDF handling and a web UI.
I'll save you the trouble of figuring out which is the only existing feasible open source library to sift through hundreds of thousands of PDF pages... QPDF is it.
Thanks for this suggestion! We'll definitely kick Blacklight around as an option. We have a complex data model (cases, opinions, parties, citations, courts, jurisdictions, volumes ...), and tens of millions of pages, and weird access restrictions, and multiple output formats (text/html/xml/pdf), so whatever we end up with will have to be pretty custom.
I worked for the British Medical Journal a while ago which had 120 years of articles digitised, along with several different medical databases, with around a dozen different applications utilising this content in various ways. It was basically all stored internally as XML using the Documentum/xDB CMS/XML database stack, with lots of XSLT used to generate content for different apps and in different formats.
Not at all the easiest system to work with, but it does allow you to handle a lot of structured data such that you need the minimum amount of customisation possible - which is still a lot!
If the project is truly "free" - build it in a way that can be passed down through generations: easily indexed with semantic data and without proprietary logins or fiddle-some navigation. archive.org, or some other long-lasting and perpetually funded service, would be a good long-term home, beyond what Harvard would spend its billions of alumni bucks on.
Definitely! I'll sleep better when we have all the bulk data shipped to everyone who wants it. That will happen by March 2024, or earlier for any states that switch to official digital publishing.
(I mean, Harvard isn't a bad home for this -- I work in a building with books that predate the printing press, and I work on stuff like Creative Commons-licensed forkable textbooks. Libraries are cool places. But Harvard definitely shouldn't be the only place that preserves this data set.)
As far as preservation-friendly formats, our bulk data download format is xzipped jsonlines, which is tuned for NLP (highly compressed, parseable in a few lines of python with low memory requirements) rather than preservation.
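Something like this is enough to stream it without loading the whole file (a minimal sketch -- the filename and field names here are illustrative, not the actual schema):

```python
import json
import lzma

# Stream the xz-compressed jsonlines file one case at a time, so memory
# use stays flat no matter how large the file is.
# (The filename and field names are illustrative, not the actual schema.)
with lzma.open("cases.jsonl.xz", "rt", encoding="utf-8") as f:
    for line in f:
        case = json.loads(line)  # each line is one complete case object
        print(case.get("name"), case.get("decision_date"))
```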
Internally we have a preservation format where each volume is stored as a bagit bag containing METS XML for the OCR and case-level data, plus color and black and white images of each case. These are much harder to work with, so it's not a focus to share them right now, but we can definitely share if someone makes a case for it.
No kidding. After looking at this project for a bit longer, I'm much less optimistic about it. Frankly, it really depresses me that they require research agreements to download all the information.
Developers of this project: Please make the information freely available in a way that doesn't require agreements with a giant for-profit company.
For now, I'm convinced that this project is nothing more than a veiled advertisement for LexisNexis.
Unless you can provide them with the information, it's not the developers who are making the rules. There's a reason Lexis is as profitable as it is.
My reading of this is that it's something they're contractually bound to for the near term but not forever:
> Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024.
Hopefully this means no logins, also, but that's less clear.
The blame on this should really fall on the courts that allow private companies to paywall access to the rule of law. At least some states have started to do it right:
> Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.
> Hopefully this means no logins, also, but that's less clear.
The only things we have behind logins are what we are contractually required to, yes. This gets pretty fine-grained -- if you do a logged-out search across jurisdictions, requesting full text, the json contains error fields for the specific fields we aren't allowed to share without a login yet.
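Roughly what that looks like from the client side, as a sketch -- the full_case parameter and the casebody/status field names below are from memory, so treat them as assumptions and check the api.case.law docs for the current schema:

```python
import requests

# Logged-out search across jurisdictions, requesting full case text.
# Restricted cases come back with an error marker instead of the text.
# (Parameter and field names are assumptions; check the API docs.)
resp = requests.get(
    "https://api.case.law/v1/cases/",
    params={"search": "coleslaw", "full_case": "true"},
)
for case in resp.json()["results"]:
    body = case.get("casebody", {})
    if body.get("status") == "ok":
        print(case["id"], "full text included")
    else:
        print(case["id"], "restricted:", body.get("status"))
```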
Hi! This is so exciting! Quick question: how does this compare with the Courtlistener/free law project data? (And are there any plans to work with them?)
Great question! We love Mike Lissner and have talked with him about connections between our projects. We'll definitely work with him if we get a chance.
Benefits of our data set: we're more complete, being a census of all known volumes of official caselaw back to the beginning; we're easier to work with for data processing, since all of our data, across centuries and states, is in one consistently structured format; we have the page images, meaning if there's any question of accuracy, we can check the final authority.
Benefits of FLP (and there may be others I don't know): they're updated in realtime from scanning court websites, so they'll stay up to date in a way we won't; their scraped text for modern cases doesn't have OCR errors; their site is much more featureful.
At this point I see our strength as being a complete/consistently-formatted/authoritative data set of printed cases, which leaves lots of room for other caselaw databases with complementary goals.
Thanks! I have a half-completed python library to access their api [1] --- maybe this will be good motivation to finish it and add yours, see if there's a useful way to query both at the same time in some circumstances.
Do you have any documentation that outlines the meaning of all the metadata fields? The one I'm curious about is the "whitelisted" value on the jurisdiction, such as https://api.case.law/v1/jurisdictions/us/.
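For context, this is roughly how I'm poking at it (a minimal sketch with no pagination handling; the field names are just what I see in the responses):

```python
import requests

# List jurisdictions and the "whitelisted" flag I'm asking about.
# (No pagination handling; field names are just what appears in the response.)
resp = requests.get("https://api.case.law/v1/jurisdictions/")
for j in resp.json()["results"]:
    print(j.get("name_long", j.get("name")), "whitelisted:", j.get("whitelisted"))
```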
Ha...you posted this 11 minutes ago as of right now. 12 minutes ago I found a similar link citing Bluebook abbreviations and posted a "never mind, I figured it out" note (and mentioned I felt silly for not noticing it right away -- I went to law school in Washington and so spent three years seeing "Wash." on the spines of case reports).
It took me 12 minutes to post because I tossed in some explanations for people not familiar with legal citations. As soon as I posted, I saw yours, and deleted mine as redundant.
There is one thing, though, that I discovered working on that post and am curious about now. I picked a case at random to use as an example, Peterson v. City of Seattle, 316 P.2d 904, 51 Wash. 2d 187 (1957).
Here's a link to the Washington Supreme Court's opinion:
At the top of the opinion they cite the case as 51 Wn.2d 187 (1957). Within the opinion they cite some Washington cases as Wn. and some as Wash. I could see no obvious pattern as to which they pick.
Anyone happen to know offhand what determines which form they use?
Is caselaw data an artifact originating from the public dollars which funded the whole show to begin with?
Why isn't this data available for free to any American citizen who pays taxes? In the form of a torrent, the distribution cost is negligible. A duplication fee seemed reasonable back when replicating vast amounts of information involved massive amounts of paper and toner. But today... I don't understand.
I'm curious enough to have asked this same question on Quora [0].
Caselaw data is public domain, but it's very expensive to digitize -- partly because it's mostly stored on paper, and partly because it's mixed together with copyrighted material.
For this project we had to scan 40,000 volumes of caselaw. We used a high speed scanner at the Harvard Law Library, and went through about 40 million pages at a rate of 500,000 pages a week over a couple of years. The pages then had to be redacted of copyrighted material like headnotes inserted by private publishers, since courts typically don't publish the cases themselves, and those redactions had to be checked by humans.
That work was funded by a startup, Ravel, which is why we ended up with temporary limits on commercial use of the data. No later than March 2024, however, it will all be fully available for bulk download by anyone in the world. If necessary we'll set up a torrent. :)
(Hopefully earlier! For any state that starts officially publishing its caselaw in digital form, we can immediately release their caselaw back to the beginning, as we have already for Illinois and Arkansas.)
> Is caselaw data an artifact originating from the public dollars which funded the whole show to begin with?
Yes and no. It’s important to realize that the US courts (1) are a distributed system comprising hundreds of autonomous courts; and (2) predate the internet, photocopiers, telephones, telegraph, a large centralized federal government, and indeed the federal government itself.
Today court opinions are published as PDFs on courts’ websites. But back in the day, they were published as slip opinions stored in the clerk’s office of each individual court. Private companies like West undertook to collect cases from all these hundreds of courts and publish them in books called “reporters.” Back then (and even today) that meant sending someone out to hundreds of courts to collect and copy the decisions. They not only published the opinions, they organized everything within a comprehensive ontology of their own creation, and added their own annotations.
When computers were invented a century later, these publishers were well placed to digitize their collections and offer access to them over pre-internet electronic systems. Then, of course, those moved to the Internet.
Even collecting these cases together on a going-forward basis is no easy task. As noted above, the courts are decentralized, by design, even within the federal system. Just getting the decisions from hundreds of courts and uploading them would be an expensive endeavor. Nothing stops someone from undertaking this—court decisions themselves cannot be copyrighted and you’re free to go to any court and ask to copy published decisions.
I worked for one of the first few web-based services. Westlaw was the de facto standard then (2000/2001). They even sued us and claimed that page numbers used in citations were proprietary information.
To answer your question re: its origins being publicly funded, no. When there was no internet database to connect to, people bought the books from a print publisher (either at huge maintenance cost or sparingly at the expense of not knowing what was current). The print publishers bore the cost and reaped the profits of consolidating all of this data.
To answer the next obvious question: yes, there are probably people in prison or not depending on whether a small-town law library bought the updates from Westlaw in a timely fashion 20 years ago.
"The agreement with our project partner, Ravel, requires us to limit access to the full text of cases to no more than 500 cases per person, per day. This limitation does not apply to researchers who agree to certain restrictions on use and redistribution. Nor does this restriction apply to cases issued in jurisdictions that make their newly issued cases freely available online in an authoritative, citable, machine-readable format. We call these whitelisted jurisdictions. Currently, Illinois and Arkansas are the only whitelisted jurisdictions."
TLDR: 500 cases per day, but looks like you can buy access from Ravel [1].
Yep— Ravel will absolutely negotiate commercial licenses. No later than 2024, the entire corpus will be available for free for all use cases, including commercial.
Well, google scholar certainly does not have all published opinions, far from it. It is not an adequate research tool, just a starting point. I'm not familiar with casemine, will check it out.
EDIT: casemine appears to be neither free, nor to have all published US decisions as far as I can tell.
Full text for cases other than Illinois and Arkansas is limited to 500 cases per person per day, which is why we don't include it in API results by default.
Can you elaborate on the notion of US case law, before the US existed? I don't mean this to be snarky, I assume this was intentional. Is this referring to cases prior to the US's founding, that the US then adopted as relevant legal precedent?
State courts didn't come into existence with the US Constitution -- the Massachusetts Supreme Judicial Court, for example, dates back to 1692, and those precedents are still "good" in some sense though unlikely to be cited.
We don't have English precedents, unfortunately, as guessed by some sibling comments.
The US inherited the common law of England. There's usually something more relevant now, but you'll still sometimes see citations of old English cases or the Magna Carta.
"Coleslaw" is an anglicization of the Dutch "koolsla" or cabbage salad. Its existence in the US would likely date back to the first Dutch colony in North America: New Netherland, which was first settled in the 1620s. As such, American coleslaw has a history of roughly 390 years and antedates American caselaw by a few decades.
You got me curious -- our first case with the modern term "coleslaw" is from 1935, in Rhode Island, and begins ominously with "The plaintiff in this case was injured by swallowing two small pieces of wire concealed in an order of beef stew, bread and coleslaw which she had purchased and was eating at defendant’s lunch counter."
Apologies that the link requires a login to view the full text, and for various other shortcomings in the current browsing experience. Also apologies to anyone who actually reads the coleslaw case -- caselaw is a scary place.