Working With Transkribus: Machine Learning Applied in Transcription

By Brian Guo ’25

After initially experimenting with a few tools like FromThePage and TPEN, I found that online transcription is still largely manual. With the recent public attention on ChatGPT, I went looking for a tool that incorporates machine learning into transcription. Remix’s 2015 wishlist, under “Transcribe”, asked: “Can a manuscript page become machine-readable?” It looks like that question has been answered by Transkribus. Transkribus automatically generates recognitions of handwritten or printed text based on a chosen model. Users can either use a public model (as of August 12, 2023, there are 122 models available) or a private model that they train themselves.

A screenshot from Transkribus

Automatic recognition option on The Brut

Screenshot from Transkribus

Public models on Transkribus

For a text like The Brut, it is more beneficial for the user to train a private model, as Transkribus does not currently have a public model suitable for Middle English. I tried a public model for English on a page of The Brut. This is the result:

Screenshot from Transkribus

Obviously, this is not very trustworthy, mainly because the model is ill-suited to the text. People use Transkribus to transcribe a great deal of modern text, so it is not exactly aimed at medieval manuscripts or Middle English. For this tool to work best on The Brut, one would need to create a model very specific to it.

Screenshot from Transkribus

The text box can be manually edited by the user, just as on a website for manual transcription. In the upper right-hand corner there is an option to mark the transcribed page as “ground truth”. If a user wants to train a model for a specific kind of writing, “ground truth” pages are the ones they feed to the tool as the raw material from which it learns. On the “training” page, we can see:

Screenshot from Transkribus

Transkribus is good for larger projects because it requires more than 20 pages of “ground truth” to give users satisfactory results. It could potentially be used for The Brut as long as there are 20-70 pages of transcribed text available for the machine to learn from. The text is also over 200 pages long, which means it could benefit from the tool.

A free account has about 500 pages of text-recognition credits for handwritten texts, which is enough to cover the entire Brut Chronicle. For a text like this, however, some expertise is required before the tool can be applied neatly. My experience transcribing a piece of modern handwritten text this term (see part 2.0) showed me that transcription is a very time-consuming process, even for just four pages of modern English handwriting. Transcribing 20 pages of medieval writing well enough to be fed to the machine as ground truth is unrealistic in the time that I have. Transkribus is good for dedicated personnel in professional and specialized projects, and it would take a long time, and perhaps collaboration between many scholars, for it to be effective.

This can also be seen in its price:

For people who want to use Transkribus for more than one manuscript, credits for 500 pages of written text are obviously not enough. Credit purchase works in two ways: on-demand or annual subscription. Among the subscription options, 300 pages annually costs 19.9€ /Year, but 500 pages is roughly triple the cost, at 59€ /Year. The cost per page increases drastically, and dedicated scholars, I imagine, would need a higher number of pages.

Screenshot from Transkribus

Transkribus subscription options

The most expensive subscription that Transkribus offers is 30,000 pages a year, at 5184€ /Year. Transkribus is generally very expensive, and it is questionable whether the tool would benefit the average user that much. I imagine that a study group or a library could share a subscription and use the tool collaboratively to make it worthwhile.
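To make the pricing comparison concrete, here is a minimal Python sketch that turns the three subscription figures quoted in this post into a cost per page. The tier list and the helper name `cost_per_page` are my own illustration, not anything Transkribus provides, and the prices are only the ones I saw on the day I checked.

```python
# Subscription tiers as quoted in this post:
# (pages per year, price in euros per year).
TIERS = [(300, 19.9), (500, 59.0), (30000, 5184.0)]


def cost_per_page(pages, price_eur):
    """Return the price of a single page, in euros, for one tier."""
    return price_eur / pages


# Map each tier's page allowance to its per-page cost.
per_page = {pages: cost_per_page(pages, price) for pages, price in TIERS}

for pages, cost in per_page.items():
    print(f"{pages:>6} pages/year: {cost:.3f} EUR per page")
```

On these figures the per-page cost works out to roughly 0.07€, 0.12€ and 0.17€ for the three tiers, so each page costs more, not less, as the subscription grows.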
