New Audacious Research Project, In Codice Ratio, Bets on AI and OCR To Make Sense of Handwritten Texts in Vatican's Secret Archives (theatlantic.com)

Posted by msmash on Monday April 30, 2018 @12:30PM from the how-about-that dept.

A new project untangles the handwritten texts in one of the world's largest historical collections. From a report: The Vatican Secret Archives is one of the grandest historical collections in the world. It's also one of the most useless. The grandeur is obvious. Located within the Vatican's walls, next door to the Apostolic Library and just north of the Sistine Chapel, the VSA houses 53 linear miles of shelving dating back more than 12 centuries. That said, the VSA isn't much use to modern scholars, because it's so inaccessible. Of those 53 miles, just a few millimeters' worth of pages have been scanned and made available online. Even fewer pages have been transcribed into computer text and made searchable. If you want to peruse anything else, you have to apply for special access, schlep all the way to Rome, and go through every page by hand.

But a new project could change all that. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time. If successful, the technology could also open up untold numbers of other documents at historical archives around the world.

Text measured in miles? by Ecuador · 2018-04-30 12:56 · Score: 4, Insightful

OK, so now the text is measured in miles? What lunacy is this?
I mean, it is the ONE article where Libraries of Congress would actually be a valid unit!

--
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
Re:Artificial Intelligence by ShanghaiBill · 2018-04-30 13:18 · Score: 5, Insightful

Think about it: it's only going to be as good as your training data.
Obvious counter-example: AlphaGo Zero, which used NO human training data (it learned purely from self-play), and was far better at its assigned task than any of its programmers.
There is no reason to believe that an AI's capabilities are inherently limited by training data. Nor is there any reason to believe that it can't surpass the abilities of its creators. That makes as little sense as saying children can never be more intelligent than their parents.
Easier said than done by azcoyote · 2018-04-30 15:05 · Score: 4, Interesting

This sounds like a great idea, but it's likely to be extraordinarily complicated. Not only does handwriting differ from age to age, culture to culture, and place to place (just try reading 20th-century German Sütterlin), but many medieval manuscripts utilize complex systems of abbreviations called sigla. Interpreting these can be very complicated because they are heavily context-dependent. One symbol can mean several different things. For example, a cross through a p can mean per, prae, or pro. A line over some letters can signify that any number of letters in between have been left out. Just try figuring out what this inscription says: here.
Such abbreviations were probably expected to be relatively simple for a human reader to decipher, both because the reader interprets the text while deciphering the symbols and because the original audience would have had a better sense of how a particular community tended to use abbreviations.
The task is not impossible for a computer, though. In most cases there is a limited number of words that could be signified by an abbreviation, and it is possible to determine which word is most likely intended according to immediate context. However, that would require the machine to have a grasp of Latin grammar, and even then not everything is going to follow perfect rules. There is so much potential interpretation involved. The AI component here does help with this inasmuch as it uses statistical data to optimize recognition, but it's still likely to run into many difficulties.
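To make the "most likely word according to immediate context" idea concrete, here is a minimal Python sketch. It is not what In Codice Ratio does; the `<p-cross>` token, the `EXPANSIONS` table, and the bigram counts are all made up for illustration, and a real system would need a proper Latin corpus and far more context than one following word.

```python
from collections import Counter

# Hypothetical table: one abbreviation sign and the words it may stand for
# (the crossed p mentioned above).
EXPANSIONS = {"<p-cross>": ["per", "prae", "pro"]}

# Toy bigram counts (word, next word). The numbers are invented for the example.
BIGRAMS = Counter({
    ("pro", "nobis"): 9,
    ("per", "dominum"): 12,
    ("prae", "omnibus"): 4,
    ("per", "nobis"): 1,
})


def expand(tokens):
    """Replace each known abbreviation with the expansion that most often
    precedes the following token in the toy bigram table."""
    out = []
    for i, tok in enumerate(tokens):
        options = EXPANSIONS.get(tok)
        if options is None:
            out.append(tok)  # ordinary word, keep it unchanged
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Counter returns 0 for unseen pairs; ties fall back to the first option.
        out.append(max(options, key=lambda w: BIGRAMS[(w, nxt)]))
    return out


print(expand(["ora", "<p-cross>", "nobis"]))
print(expand(["<p-cross>", "dominum"]))
```

The same sign comes out as a different word depending on its neighbour, which is the whole point; grammar awareness would be a further layer on top of this kind of counting.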
The main innovation in TFA, as I see it, is that it responds to one of the major problems of reading old Carolingian minuscule. The letters are bunched together and there are times when you cannot be sure whether you are looking at two i's or a u, for example. The two can look exactly the same, not even just similar. The software in question attempts to handle this by recognizing individual penstrokes. Although I am not sure that this is 100% better than the older approach mentioned (recognizing whole words at a time), it does show significant promise because of its combination with AI. Perhaps some day it will be able to note, for example, that a certain author always strokes the i in a certain way. However, I'm sure there are going to be plenty of hurdles before getting to that point.
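The "two i's or a u" problem can be sketched as a small search, assuming (my simplification, not the project's actual model) that each of these letters is just a fixed number of identical vertical strokes. The `MINIMS` table, the function names, and the two-word lexicon are hypothetical.

```python
# Assumed stroke counts per letter; real scripts are messier than this.
MINIMS = {"i": 1, "u": 2, "n": 2, "m": 3}


def readings(strokes):
    """Return every letter sequence whose stroke counts add up to `strokes`."""
    if strokes == 0:
        return [""]
    found = []
    for letter, cost in MINIMS.items():
        if cost <= strokes:
            found.extend(letter + rest for rest in readings(strokes - cost))
    return found


def plausible(prefix, strokes, suffix, lexicon):
    """Keep only the readings that form a known word with the legible letters."""
    candidates = (prefix + r + suffix for r in readings(strokes))
    return sorted(w for w in candidates if w in lexicon)


print(readings(2))                                   # two strokes: ii, u, or n
print(plausible("fil", 2, "", {"filii", "filum"}))   # only one survives the lexicon
```

Two strokes already give three readings and the count grows quickly with longer runs, so some kind of word list or language model has to do the pruning.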

--
Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
1. Re:Easier said than done by Orgasmatron · 2018-04-30 17:29 · Score: 4, Informative
  
  Cuneiform texts have similar problems, and translation is a tedious process. I'm hopeful that new systems can help automate the process, but I'm not holding my breath waiting for it.
  For a hint at the problems, cuneiform was used for thousands of years, across several languages.
  In the early days, it was very terse, writing just the key words that would allow a literate native speaker of the language to reconstruct the real sentence. You would have a sentence written as "(picture of a man) (picture of a house) (picture of a noun that sounds like the verb to-build)". The reader would be expected to know that the intended sentence was something like "Lugale-e-mundu" or "The King built the house" and infer from the context (for example stamped on the still-wet bricks) that it meant "The king ordered the construction of these houses" or whatever.
  Over time, the symbols were pared down from little drawings to simplified figures, to abstract representations, to a couple of strokes that carry very little similarity to the original drawings.
  At the same time, the scribes got really inventive with the symbols. A written symbol could mean the noun that it once resembled, or it could mean a verb that sounds similar to the noun, or it could be a syllable, or it could be a marker to indicate that the next or previous stuff was a proper name, or the name of a deity.
  Additionally, symbols multiplied. They ended up with dozens of symbols for the "e" sound, for example, with different meanings. So you could have two sentences with different meanings that sounded exactly the same, but they could be written with exact symbols, or with generic symbols.
  To make things even more fun, Sumerian died out as a spoken language long before it faded as a written language. So, the scribes lost confidence in their writing and started gradually writing everything out longhand. This actually turned out to be fantastic for us, because it let us see the structure of the spoken language in ways that were completely hidden in older writings.
  And other cultures with completely different unrelated languages started using the writing system. So you might find a tablet that you can't translate because it is Akkadian written phonetically, for example. Even worse, it could be written as if it were Sumerian, so the structure would make sense, but the names wouldn't.
  That is actually how the Sumerian language was re-discovered ~1200 years after it died completely. There was a language still living enough for scholars to know what it sounded like, and ancient clay tablets written by people who had spoken that language centuries before.
  The scholars noticed two things. First, there was a huge pile of those tablets that were completely incomprehensible. And second, that the ones they could read showed a writing system that was a hilariously bad fit for the language being written. Like the subject, object and verb order was SVO when spoken, but SOV when written, and the written language was full of markers that were not present in the spoken language, and the markers in the spoken language were completely absent in the writing system.
  Eventually, they figured out that they were looking at several different languages, and they were able to reconstruct Sumerian from that mess.
  Anyhow, to process one of these tablets, you need to examine the strokes in the clay and match them to symbols. Then you take a wild guess at which language you think it might be and see if you can find a meaningful translation in that language. If not, you go back to pick a different language and try again. And again, and again.
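The guess-a-language-and-retry loop above looks roughly like this in Python. Everything here is a toy assumption: the glossary entries are simplified ASCII transliterations included only for illustration, the 0.6 threshold is arbitrary, and real work starts from signs with many possible values rather than from clean words.

```python
# Toy glossaries, assumed for illustration only.
GLOSSARIES = {
    "sumerian": {"lugal": "king", "e2": "house", "du3": "build"},
    "akkadian": {"sharrum": "king", "bitum": "house"},
}


def try_language(words, glossary):
    """Return (fraction of words found in the glossary, their glosses)."""
    hits = [glossary[w] for w in words if w in glossary]
    coverage = len(hits) / len(words) if words else 0.0
    return coverage, hits


def best_guess(words, threshold=0.6):
    """Try every candidate language and keep the best one above the threshold.
    Returns None when nothing fits, i.e. go back and guess again."""
    best = None
    for language, glossary in GLOSSARIES.items():
        coverage, glosses = try_language(words, glossary)
        if coverage >= threshold and (best is None or coverage > best[1]):
            best = (language, coverage, glosses)
    return best


print(best_guess(["lugal", "e2", "du3"]))
print(best_guess(["sharrum", "bitum", "unknown"]))
print(best_guess(["unknown"]))
```

A tablet of Akkadian written as if it were Sumerian would score well on structure and badly on names, which a coverage number alone would not reveal.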
  Because of the tedium of doing this by hand, and the very short supply of people who know these languages and can do the work, our museums quite literally have tons of these tablets that have never been translated.
  Other ancient writings face a similar problem. We have more of them in storage than we know what to do with. It was big news a few weeks ago that a 1500-year old C
  
  --
  See that "Preview" button?