Hi! I'm a developer on this project. Happy to answer questions for the next hour or so, and I'll check back tomorrow as well.
This is a first public release and the Harvard Library Innovation Lab is a small team, so please look at the site as just the beginning! In particular we're starting by targeting developers with an API and bulk data downloads. More user friendly features like a front-end browser, PDF scans, ngram browser, etc. are on the roadmap, but we're hoping other developers will step in and build some great tools as well.
I'm assuming that as a university library team you're familiar with the Blacklight ecosystem of library search tools?
A single Rails dev and I modified the current Blacklight codebase to handle PDFs (and other office docs) with minimal effort a few months ago. I think doing the same when you get to PDF handling and a web UI would be a completely valid starting point.
I'll save you the trouble of figuring out which open source library is actually feasible for sifting through hundreds of thousands of PDF pages: QPDF is it.
Thanks for this suggestion! We'll definitely kick Blacklight around as an option. We have a complex data model (cases, opinions, parties, citations, courts, jurisdictions, volumes ...), and tens of millions of pages, and weird access restrictions, and multiple output formats (text/html/xml/pdf), so whatever we end up with will have to be pretty custom.
I worked for the British Medical Journal a while ago which had 120 years of articles digitised, along with several different medical databases, with around a dozen different applications utilising this content in various ways. It was basically all stored internally as XML using the Documentum/xDB CMS/XML database stack, with lots of XSLT used to generate content for different apps and in different formats.
Not at all the easiest system to work with, but it does let you handle a lot of structured data with the minimum customisation possible - which is still a lot!
If the project is truly "free" - build it in a way that can be passed down through generations: easily indexed with semantic data and without proprietary logins or fiddle-some navigation. archive.org would be a good long term home or some other long lasting and perpetually funded service, beyond what Harvard would spend its billions of alumni bucks on.
Definitely! I'll sleep better when we have all the bulk data shipped to everyone who wants it. That will happen by March 2024, or earlier for any states that switch to official digital publishing.
(I mean, Harvard isn't a bad home for this -- I work in a building with books that predate the printing press, and I work on stuff like Creative Commons-licensed forkable textbooks. Libraries are cool places. But Harvard definitely shouldn't be the only place that preserves this data set.)
As far as preservation-friendly formats, our bulk data download format is xzipped jsonlines, which is tuned for NLP (highly compressed, parseable in a few lines of python with low memory requirements) rather than preservation:
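Since a few lines of Python really do suffice, here's a minimal sketch of streaming an xzipped jsonlines file with only the standard library. The field names in the demo record are made up for illustration; check the actual bulk-data schema for the real ones.

```python
import json
import lzma

def iter_cases(path):
    """Stream records from an xz-compressed JSON-lines file, one dict at a
    time, without loading the whole file into memory."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Build a tiny synthetic file so the example is self-contained
# (the real files are much larger, with different fields).
with lzma.open("cases.jsonl.xz", "wt", encoding="utf-8") as f:
    f.write(json.dumps({"name": "Peterson v. City of Seattle",
                        "year": 1957}) + "\n")

for case in iter_cases("cases.jsonl.xz"):
    print(case["name"])
```

Because `lzma.open` decompresses lazily and the generator yields one record at a time, memory use stays flat no matter how large the dump is.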
Internally we have a preservation format where each volume is stored as a bagit bag containing METS XML for the OCR and case-level data, plus color and black and white images of each case. These are much harder to work with, so it's not a focus to share them right now, but we can definitely share if someone makes a case for it.
No kidding. After looking at this for a bit longer, I'm much less optimistic about the project. Frankly, it really depresses me that they require research agreements to download the full data.
Developers of this project: Please make the information freely available in a way that doesn't require agreements with a giant for-profit company.
For now, I'm convinced that this project is nothing more than a veiled advertisement for Lexis-Nexis.
Unless you can provide the information yourself, it's not the developers who are making the rules. There's a reason Lexis is as profitable as it is.
My reading of this is that it's something they're contractually bound to for the near term but not forever:
> Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024.
Hopefully this means no logins, also, but that's less clear.
The blame on this should really fall on the courts that allow private companies to paywall access to the rule of law. At least some states have started to do it right:
> Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.
> Hopefully this means no logins, also, but that's less clear.
The only things we have behind logins are what we are contractually required to, yes. This gets pretty fine-grained -- if you do a logged-out search across jurisdictions, requesting full text, the json contains error fields for the specific fields we aren't allowed to share without a login yet.
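A client can detect those per-field restrictions by checking the casebody status before using the text. The response shape below is a guess for illustration (the status strings and field names may not match the live API exactly; check the API docs):

```python
# Hypothetical shape of a logged-out full-text search response.
# Restricted jurisdictions come back with an error status in the
# casebody instead of the text itself.
response = {
    "results": [
        {
            "name": "Some Restricted Case",
            "jurisdiction": "us",
            "casebody": {"status": "error_auth_required", "data": None},
        },
        {
            "name": "Some Open Case",
            "jurisdiction": "ill",
            "casebody": {"status": "ok", "data": {"text": "..."}},
        },
    ]
}

def readable_cases(resp):
    """Keep only results whose full text was actually returned."""
    return [r for r in resp["results"]
            if r["casebody"]["status"] == "ok"]

for case in readable_cases(response):
    print(case["name"])
```

The point is just that access limits are surfaced in-band, per result, so a client can degrade gracefully rather than failing the whole search.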
Hi! This is so exciting! Quick question: how does this compare with the Courtlistener/free law project data? (And are there any plans to work with them?)
Great question! We love Mike Lissner and have talked with him about connections between our projects. We'll definitely work with him if we get a chance.
Benefits of our data set: we're more complete, being a census of all known volumes of official caselaw back to the beginning; we're easier to work with for data processing, since all of our data, across centuries and states, is in one consistently structured format; we have the page images, meaning if there's any question of accuracy, we can check the final authority.
Benefits of FLP (and there may be others I don't know): they're updated in realtime from scanning court websites, so they'll stay up to date in a way we won't; their scraped text for modern cases doesn't have OCR errors; their site is much more featureful.
At this point I see our strength as being a complete/consistently-formatted/authoritative data set of printed cases, which leaves lots of room for other caselaw databases with complementary goals.
Thanks! I have a half-completed python library to access their api [1] --- maybe this will be good motivation to finish it and add yours, see if there's a useful way to query both at the same time in some circumstances.
Do you have any documentation that outlines the meaning of all the metadata fields? The one I'm curious about is the "whitelisted" value on the jurisdiction, such as https://api.case.law/v1/jurisdictions/us/.
Ha...you posted this 11 minutes ago as of right now. 12 minutes ago I found a similar link citing Bluebook abbreviations and posted a "never mind, I figured it out" note (and mentioned I felt silly for not noticing it right away--I went to law school in Washington and so spent three years seeing "Wash." on the spines of case reports).
It took me 12 minutes to post because I tossed in some explanations for people not familiar with legal citations. As soon as I posted, I saw yours, and deleted mine as redundant.
There is one thing, though, that I discovered working on that post and am curious about now. I picked a case at random to use as an example, Peterson v. City of Seattle, 316 P.2d 904, 51 Wash. 2d 187 (1957).
Here's a link to the Washington Supreme Court's opinion:
At the top of the opinion they cite the case as 51 Wn.2d 187 (1957). Within the opinion they cite some Washington cases as Wn. and some as Wash. I could see no obvious pattern as to which they pick.
Anyone happen to know offhand what determines which form they use?