Character (not) Recognition

The great thing about experiments is that you never know how they’re going to turn out. Remix the Manuscript was sparked by a data visualization question: how should we display transcriptions of marginal notes? At some point someone suggested that it would be good to have a transcription of the text itself. The PDF file of the edition seemed like a good shortcut to jump-start the process. Just one problem: the Middle English characters thorn (þ) and yogh (ȝ) don’t convert smoothly into recognizable characters.

The process of turning pictures of letters into searchable letters (Optical Character Recognition, OCR) became the first project with a tangible output.

People who study born-analog documents have to grapple with any number of technical translations before they have any digital data to work with. The printed book might be digitized, but the page images have to be run through OCR before the text can be searched. If the letters are fuzzy or imprecise for any reason, that conversion results in lots of unreadable characters. In the case of manuscripts, from any time period, the conversion of idiosyncratic handwriting into searchable characters is still largely out of technical reach. In short, in the humanities the data are not there to be found: they have to be made.
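To make "lots of unreadable characters" a little more concrete, here is a minimal sketch of one way to estimate how much of an OCR result looks like debris. It is not the method used in this project: the function names, the list of accepted punctuation, and the sample strings are all assumptions made up for illustration. The only fixed facts are the Unicode code points for thorn (U+00FE, U+00DE) and yogh (U+021D, U+021C).

```python
import unicodedata

# Characters we would expect in a Middle English transcription beyond the
# basic Latin alphabet: thorn and yogh, lower and upper case.
EXPECTED_EXTRA = {"\u00fe", "\u00de", "\u021d", "\u021c"}  # þ Þ ȝ Ȝ

# Hypothetical choice of punctuation to accept; adjust for a real edition.
EXPECTED_PUNCTUATION = ".,;:'-"


def is_expected(ch: str) -> bool:
    """Return True if the character is plausible in a Middle English text."""
    if ch in EXPECTED_EXTRA:
        return True
    return ch.isascii() and (
        ch.isalpha() or ch.isspace() or ch in EXPECTED_PUNCTUATION
    )


def suspicious_ratio(text: str) -> float:
    """Share of characters (0.0 to 1.0) that look like OCR debris."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text if not is_expected(ch))
    return bad / len(text)


def describe(ch: str) -> str:
    """Show a character's code point and its official Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


if __name__ == "__main__":
    # Made-up samples: a clean reading and a garbled one in which the
    # thorn has been replaced by stray symbols.
    print(suspicious_ratio("þe kyng"))
    print(suspicious_ratio("|>e kyng"))
    print(describe("ȝ"))
```

A score like this only flags characters that fall outside an expected alphabet. It cannot catch a thorn misread as an ordinary "p" or "b", which is one reason a human reader is still needed to check the result.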

Several major research projects are tackling these issues, including eMOP: The Early Modern OCR Project and the Monk System.

We’re working on a detailed description of our ventures in Middle English OCR. Stay tuned.

Meanwhile, here’s a snapshot of the automated “Full Text” conversion for The Chronicles of England (Brut) on archive.org:

OCR from Archive.org

 
