1.5 million pages of ancient texts to be made accessible online (arstechnica.com)
87 points by iProject on April 15, 2012 | 10 comments



Soon, historians will need to become experts at processing big data so that they can make new discoveries.

Protip for the college kids: If you're a computer science person who really likes history, take courses that will teach you how to process large datasets. Then you can be on the leading edge of historical research!


I had this thought as a History student twenty years ago, though I was already interested in programming. I wrote up several proposals for my entrepreneurship class on using computing to automate many of the more difficult and time-consuming tasks that come with huge amounts of data. At the time, all the information was only available in printed format or microfiche. We still used card catalogs (paper index cards in long thin boxes) to find books and huge sets of books to find articles (for those too young to know the pre-interweb days).

The issues faced at the time were cost (there were perhaps two or three scanners on campus, that I knew about, and we were a leading computer science school) and revenue generation. I am quite glad these long-overdue issues are now being addressed, but I do worry history students now aren't learning the solid research skills we developed and acquired twenty years ago, with basic research being so much easier now. Though being able to conduct that volume of research in a much shorter timeframe is very favourable.


The tip is of course valid and pro, and I'd recommend the same, but it's already being done, under machine translation. Also, in this area "big data" loses its meaning, as you don't really need traditional databases; you just process raw text. There are literally thousands of people researching how to intelligently select and process this data.


It's not just machine translation; it's image processing / cleanup (to handle huge amounts of data for multispectral imaging and figure out how to combine it into sets of false-color images that people can read), optical character recognition (for ancient handwriting in weird writing systems), system-level programming to run the scanners, etc. There's a big ol' book on this, "Rome Wasn't Digitized in a Day": http://www.clir.org/pubs/reports/pub150/pub150.pdf BYU (which I attend and where I work) has done a huge amount of work in this field: http://maxwellinstitute.byu.edu/about/cpart.php
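
If you're curious what that false-color combination step boils down to, here's a rough sketch in Python (the band choices, filenames, and scaling are all made up, and real multispectral work also involves registering the bands to each other):

    # Rough sketch: stack three single-band scans of the same page into a
    # false-color RGB composite. Filenames and band-to-channel mapping are
    # illustrative only.
    import numpy as np
    from PIL import Image

    def load_band(path):
        # each band is a grayscale scan of the same page under one wavelength
        return np.asarray(Image.open(path).convert("F"))

    def normalize(band):
        lo, hi = band.min(), band.max()
        return (band - lo) / (hi - lo + 1e-9)

    # e.g. infrared -> red, visible -> green, ultraviolet -> blue
    r = normalize(load_band("page01_ir.tif"))
    g = normalize(load_band("page01_vis.tif"))
    b = normalize(load_band("page01_uv.tif"))

    composite = np.dstack([r, g, b])
    Image.fromarray((composite * 255).astype(np.uint8)).save("page01_falsecolor.png")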

A few years ago I was writing web applications to support transcription of images of medieval documents in Old French, avoiding close-to-insurmountable OCR problems by using grad students, but that still requires segmenting images properly. The LDS church does similar stuff on a very large scale to digitize genealogical records. It makes research a whole lot easier, but there's still plenty of room for improvement; image maps don't always reliably match up with the fields you're trying to read/transcribe on images of documents, and that's kind of a pain.


What we need here are true eyeballs to read the scripts.

I do medieval and renaissance dance reconstruction and dance performance. Having just been to an event, I took a class on the Dances of the Gresley Manuscript.

Well, what is this manuscript? It isn't a dance treatise, or anything of the sort. Gresley was a law student in the 1530s-1550s (we know this from later court cases involving a lawyer named Gresley). These dance instructions come from the margins of his law book.

He wrote in musical notation, dance notation, and descriptive words. He even left words that have no meaning in the dance community. We have to deduce what he meant by a multitude of methods, none of which we can guarantee.

But back to the topic of OCR... How do these document scanners and OCR systems plan to decipher this kind of source, written in the margins?


    But back to the topic of OCR... How do these document scanners and OCR systems plan to decipher this kind of source, written in the margins?
I have no idea; probably they don't, yet. Everything I've worked on uses students' eyeballs to do the actual character recognition, so I'm not deeply familiar with the state of the art. I do know that OCR is mainly used for documents that have a well-defined structure where you can make an image map identifying different semantic fields, and the contextual field information allows for much more intelligent OCR; it's not so good for big blocks of paragraph text.

When you get to figuring out stuff scrawled in the margins, there are image pre-processing techniques that can identify regions of handwriting and then normalize it by rotation and scaling, but I'm pretty sure a complete solution is still in the realm of stuff considered AI (because, of course, once you know how to do it reliably, it becomes machine learning or pattern matching or something like that and no one calls it AI anymore).
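
To give a flavor of that kind of pre-processing, here's a minimal OpenCV sketch that looks for blobs of ink and roughly deskews each crop by its dominant angle. The thresholds and kernel sizes are guesses, and OpenCV's angle conventions vary by version, so treat it as the shape of the idea rather than a working marginalia detector:

    # Minimal sketch: find handwriting-ish regions and roughly deskew them.
    import cv2

    img = cv2.imread("margin_scan.png", cv2.IMREAD_GRAYSCALE)

    # binarize so ink is white on black
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # smear strokes together so a line of writing becomes one blob
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
    blobs = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h < 500:  # skip specks
            continue
        # estimate the blob's rotation and counter-rotate the crop
        (_, _), (_, _), angle = cv2.minAreaRect(c)
        if angle > 45:
            angle -= 90
        crop = img[y:y + h, x:x + w]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        deskewed = cv2.warpAffine(crop, M, (w, h), borderValue=255)
        cv2.imwrite("region_%d_%d.png" % (x, y), deskewed)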


Does anyone have a lead on a framework or service that can handle the storage, display and metadata management for assets of this type online? On a low (non-institutional) budget?

My family has a large number (thousands to tens of thousands) of photographs, sketchbooks and other historical documents that we're in the course of digitising, partly so that they are not lost to posterity but also so that we can share them with other branches of the family.

At the moment we have thousands of scans in a massive Dropbox account, but it's becoming unmanageable very quickly, and it only allows minimal metadata storage. (We have some concerns about the quality of the scans, but for the moment it's good enough.)

Apologies for the slightly unrelated post but I've been mulling this for a while and may see if I can hack something together if nothing is out there already.


I'm working on a personal photo album project (personal as in "for my use only") that kind of matches this. It took me a little under five hours to put together a simple Rails site that lives on Heroku, handles the uploads and thumbnailing, sticks the whole thing on S3, and handles display. Right now the only metadata is "name" and "description", but it's just a Postgres database, so you can add whatever you want.

Here's the basic app if you want to poke around. There's almost literally nothing to it:

https://github.com/peterkeen/phytos
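
If you just want the gist of that flow without reading the code, it's roughly the below. This is sketched in Python with boto3 and SQLite rather than the Rails/Postgres app linked above, and the bucket, table, and field names are invented:

    # Rough shape of the idea: push each scan to S3 under a content-addressed
    # key and record whatever metadata you like alongside it. Not the linked app.
    import hashlib
    import sqlite3
    import boto3

    s3 = boto3.client("s3")
    db = sqlite3.connect("photos.db")
    db.execute("""CREATE TABLE IF NOT EXISTS photos
                  (key TEXT PRIMARY KEY, name TEXT, description TEXT)""")

    def add_photo(path, name, description=""):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        key = "scans/%s.jpg" % digest  # dedupes re-uploads of the same scan
        s3.upload_file(path, "family-archive-example-bucket", key)
        db.execute("INSERT OR REPLACE INTO photos VALUES (?, ?, ?)",
                   (key, name, description))
        db.commit()

    add_photo("scan_0001.jpg", "Sketchbook, page 1", "Pencil sketches, undated")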


Omeka is a free and open source web application used by libraries, museums and archives for online display of collections: http://omeka.org/


My first Pascal program should be in there somewhere.



