Hi! I'm a developer on this project. Happy to answer questions for the next hour or so, and I'll check back tomorrow as well.
This is a first public release and the Harvard Library Innovation Lab is a small team, so please look at the site as just the beginning! In particular we're starting by targeting developers with an API and bulk data downloads. More user friendly features like a front-end browser, PDF scans, ngram browser, etc. are on the roadmap, but we're hoping other developers will step in and build some great tools as well.
I'm assuming that as a university library team you're familiar with the Blacklight ecosystem of library search tools?
A single Rails dev and I modified the current Blacklight codebase to handle PDFs (and other office docs) with minimal effort a few months ago. I think doing the same when you get to PDF handling and a web UI would be a completely valid starting point.
I'll save you the trouble of figuring out which open source library is actually feasible for sifting through hundreds of thousands of PDF pages: QPDF is it.
Thanks for this suggestion! We'll definitely kick Blacklight around as an option. We have a complex data model (cases, opinions, parties, citations, courts, jurisdictions, volumes ...), and tens of millions of pages, and weird access restrictions, and multiple output formats (text/html/xml/pdf), so whatever we end up with will have to be pretty custom.
I worked for the British Medical Journal a while ago which had 120 years of articles digitised, along with several different medical databases, with around a dozen different applications utilising this content in various ways. It was basically all stored internally as XML using the Documentum/xDB CMS/XML database stack, with lots of XSLT used to generate content for different apps and in different formats.
Not at all the easiest system to work with, but it does let you handle a lot of structured data with the minimum customisation possible - which is still a lot!
If the project is truly "free" - build it in a way that can be passed down through generations: easily indexed with semantic data and without proprietary logins or fiddle-some navigation. archive.org would be a good long term home or some other long lasting and perpetually funded service, beyond what Harvard would spend its billions of alumni bucks on.
Definitely! I'll sleep better when we have all the bulk data shipped to everyone who wants it. That will happen by March 2024, or earlier for any states that switch to official digital publishing.
(I mean, Harvard isn't a bad home for this -- I work in a building with books that predate the printing press, and I work on stuff like Creative Commons-licensed forkable textbooks. Libraries are cool places. But Harvard definitely shouldn't be the only place that preserves this data set.)
As far as preservation-friendly formats, our bulk data download format is xzipped jsonlines, which is tuned for NLP (highly compressed, parseable in a few lines of python with low memory requirements) rather than preservation:
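Since a few lines of Python really do suffice, here's a minimal sketch of streaming an xzipped jsonlines file with only the standard library. The field names in the demo record are made up for illustration; check the actual bulk-data schema for the real ones.

```python
import json
import lzma

def iter_cases(path):
    """Stream records from an xz-compressed JSON-lines file, one dict at a
    time, without loading the whole file into memory."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Build a tiny synthetic file so the example is self-contained
# (the real files are much larger, with different fields).
with lzma.open("cases.jsonl.xz", "wt", encoding="utf-8") as f:
    f.write(json.dumps({"name": "Peterson v. City of Seattle",
                        "year": 1957}) + "\n")

for case in iter_cases("cases.jsonl.xz"):
    print(case["name"])
```

Because `lzma.open` decompresses lazily and the generator yields one record at a time, memory use stays flat no matter how large the dump is.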
Internally we have a preservation format where each volume is stored as a bagit bag containing METS XML for the OCR and case-level data, plus color and black and white images of each case. These are much harder to work with, so it's not a focus to share them right now, but we can definitely share if someone makes a case for it.
No kidding. After looking at this for a bit longer, I'm much less optimistic about the project. Frankly, it really depresses me that they require research agreements to download the full data.
Developers of this project: Please make the information freely available in a way that doesn't require agreements with a giant for-profit company.
For now, I'm convinced that this project is nothing more than a veiled advertisement for Lexis-Nexis.
Unless you can provide the information yourself, it's not the developers who are making the rules. There's a reason Lexis is as profitable as it is.
My reading of this is that it's something they're contractually bound to for the near term but not forever:
> Access limitations on full text and bulk data are a component of Harvard’s collaboration agreement with Ravel Law, Inc. (now part of Lexis-Nexis). These limitations will end, at the latest, in March of 2024.
Hopefully this means no logins, also, but that's less clear.
The blame on this should really fall on the courts that allow private companies to paywall access to the rule of law. At least some states have started to do it right:
> Once a jurisdiction transitions from print-first publishing to digital-first publishing, these limitations cease. Thus far, Illinois and Arkansas have made this important and positive shift and, as a result, all historical cases from these jurisdictions are freely available to the public without restriction. We hope many other jurisdictions will follow their example soon.
> Hopefully this means no logins, also, but that's less clear.
The only things we have behind logins are what we are contractually required to, yes. This gets pretty fine-grained -- if you do a logged-out search across jurisdictions, requesting full text, the json contains error fields for the specific fields we aren't allowed to share without a login yet.
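A client can detect those per-field restrictions by checking the casebody status before using the text. The response shape below is a guess for illustration (the status strings and field names may not match the live API exactly; check the API docs):

```python
# Hypothetical shape of a logged-out full-text search response.
# Restricted jurisdictions come back with an error status in the
# casebody instead of the text itself.
response = {
    "results": [
        {
            "name": "Some Restricted Case",
            "jurisdiction": "us",
            "casebody": {"status": "error_auth_required", "data": None},
        },
        {
            "name": "Some Open Case",
            "jurisdiction": "ill",
            "casebody": {"status": "ok", "data": {"text": "..."}},
        },
    ]
}

def readable_cases(resp):
    """Keep only results whose full text was actually returned."""
    return [r for r in resp["results"]
            if r["casebody"]["status"] == "ok"]

for case in readable_cases(response):
    print(case["name"])
```

The point is just that access limits are surfaced in-band, per result, so a client can degrade gracefully rather than failing the whole search.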
Hi! This is so exciting! Quick question: how does this compare with the Courtlistener/free law project data? (And are there any plans to work with them?)
Great question! We love Mike Lissner and have talked with him about connections between our projects. We'll definitely work with him if we get a chance.
Benefits of our data set: we're more complete, being a census of all known volumes of official caselaw back to the beginning; we're easier to work with for data processing, since all of our data, across centuries and states, is in one consistently structured format; we have the page images, meaning if there's any question of accuracy, we can check the final authority.
Benefits of FLP (and there may be others I don't know): they're updated in realtime from scanning court websites, so they'll stay up to date in a way we won't; their scraped text for modern cases doesn't have OCR errors; their site is much more featureful.
At this point I see our strength as being a complete/consistently-formatted/authoritative data set of printed cases, which leaves lots of room for other caselaw databases with complementary goals.
Thanks! I have a half-completed python library to access their api [1] --- maybe this will be good motivation to finish it and add yours, see if there's a useful way to query both at the same time in some circumstances.
Do you have any documentation that outlines the meaning of all the metadata fields? The one I'm curious about is the "whitelisted" value on the jurisdiction, such as https://api.case.law/v1/jurisdictions/us/.
Ha...you posted this 11 minutes ago as of right now. 12 minutes ago I found a similar link citing Bluebook abbreviations and posted a "never mind, I figured it out" note (and mentioned I felt silly for not noticing it right away--I went to law school in Washington and so spent three years seeing "Wash." on the spines of case reports).
It took me 12 minutes to post because I tossed in some explanations for people not familiar with legal citations. As soon as I posted, I saw yours, and deleted mine as redundant.
There is one thing, though, that I discovered working on that post and am curious about now. I picked a case at random to use as an example, Peterson v. City of Seattle, 316 P.2d 904, 51 Wash. 2d 187 (1957).
Here's a link to the Washington Supreme Court's opinion:
At the top of the opinion they cite the case as 51 Wn.2d 187 (1957). Within the opinion they cite some Washington cases as Wn. and some as Wash. I could see no obvious pattern as to which they pick.
Anyone happen to know offhand what determines which form they use?