New Audacious Research Project, In Codice Ratio, Bets on AI and OCR To Make Sense of Handwritten Texts in Vatican's Secret Archives (theatlantic.com)
A new project untangles the handwritten texts in one of the world's largest historical collections. From a report: The Vatican Secret Archives is one of the grandest historical collections in the world. It's also one of the most useless. The grandeur is obvious. Located within the Vatican's walls, next door to the Apostolic Library and just north of the Sistine Chapel, the VSA houses 53 linear miles of shelving dating back more than 12 centuries. That said, the VSA isn't much use to modern scholars, because it's so inaccessible. Of those 53 miles, just a few millimeters' worth of pages have been scanned and made available online. Even fewer pages have been transcribed into computer text and made searchable. If you want to peruse anything else, you have to apply for special access, schlep all the way to Rome, and go through every page by hand.
But a new project could change all that. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time. If successful, the technology could also open up untold numbers of other documents at historical archives around the world.
It doesn't exist yet. Neural networks and genetic algorithms are NOT SENTIENT or anywhere close. It's going to be a few decades before we have anything resembling true intelligence.
Heck, even the HAL of HAL 9000 stands for Heuristic ALgorithmic computer, so Clarke was still making the argument that the computer of 2001 had only reached the brink of sentience and couldn't handle a moral dilemma.
Once all of it's read, will that spell the end of the world?
(Reminds me of a "news item" in the daily newszine of the "Chicon II" World Science Fiction Convention: ~The filksingers finally finished singing "Nine Billion Names of God on the Wall" and the stars started going out.~)
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
OK, so now the text is measured in miles? What lunacy is this?
I mean, it is the ONE article where Libraries of Congress would actually be a valid unit!
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
Well, one, there's no need to publish anything less than 100 years old, it's the old stuff that we are probably most interested in. Two, most of the cats are already out of the bag. E.g.: William Manchester writes[1] of Cardinal Borgia:
Roman lore has it that he was coupling with the older woman when he was distracted by the sight of her adolescent daughter lying beside them, naked, thighs yawning wide, matching her mother thrust for pelvic thrust, but with a rhythmic rotation of the hips which so intrigued the cardinal that he switched partners in midstroke.
And honestly, if 15th century popes were sodomizing Italian boys, and someone was writing about it, who is there today that really cares? I think we can just assume it's in there, pretend to be shocked about it ahead of time and get it out of our collective systems, and then proceed with publishing the scans. Really.
[1] A World Lit Only By Fire, excerpted under Fair Use doctrine.
Hell no. You don't digitize manuscripts destructively. There's not yet an official standard for digitizing medieval MSS, but the short version is that amateurs use cellphones or consumer cameras, wannabes use "archival scanners" (which require the document to be flat), and pros use a rig with medium-format cameras. But for OCR, as their examples show, the current tech doesn't benefit from detailed images. This team is starting with the Papal Registers, which the ASV has been selling in a 300 dpi black-and-white (not grayscale) format for at least 15 years.
96% character recognition is about what other MSS OCR teams are getting. As TFA implies, people don't write letters; they write words, but you can't get the computational power to read words. So this inherently limits their approach, even with easy-to-read Carolingian minuscule (the picture, btw, is of a "transitional hand" or "proto-gothic" more than CM). So they then choose between likely readings according to latinity. Cool, but with archival documents, the most valuable information for traditional research is the proper names, and these are usually less "Latinish" than the rest, so the net result is to increase the batting average slightly while grounding into a lot more double plays.
In short: pilot project that uses digitizations from 2 generations back, produces results that aren't useful thanks to methodology dictated by current technology, and makes a few interesting tweaks. It would be cool to see, but first it'd be great to digitize and publish online the ASV.
Of course, it's not so bad to go to Rome, go through the rigamarole of getting access to the ASV, and work directly with the originals. But the current catalog system dates from the eighteenth century, and is harder to read than the medieval manuscripts. So, you get what you can; if you're lucky they let you stay till 1600. Then you gotta find something to do in Rome until the next morning.
Despite the name, the Secret Archives is not all that secret. It's not hard to request and gain access. The problem is simply that there's too much material to deal with, and perhaps also the complexities of scanning old books without damaging them.
Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
I'm excited. I hope this "In Codice Ratio" technique will eventually be able to discover and read overwritten text. There's no better place to look for such things than the Vatican's Secret Archives. Something as stunning as the Archimedes Palimpsest, something that could change history as we know it might just be sitting on a shelf there, waiting to be found.
This sounds like a great idea, but it's likely to be extraordinarily complicated. Not only does handwriting differ from age to age, culture to culture, and place to place (just try reading 20th century German Sütterlin), but many medieval manuscripts utilize complex systems of abbreviations called sigla. Interpreting these can be very complicated because they are heavily context-dependent. One symbol can mean several different things. For example, a cross through a p can mean per, prae, or pro. A line over some letters can signify anything being cut out in-between. Just try figuring out what this inscription says: here.
Such abbreviations were probably expected to be relatively simple for a human reader to decipher, both because a human interprets the text while deciphering the symbols and because the original audience would have had a better sense of how a particular community tended to use abbreviations.
The task is not impossible for a computer, though. In most cases there are a limited number of words that could be signified by abbreviations, and it is possible to determine which word is most likely intended according to immediate context. However, that would require the machine to have a grasp of the Latin grammar, and even then not everything is going to follow perfect rules. There is so much potential interpretation involved. The AI component here does help with this inasmuch as it uses statistical data to optimize recognition, but it's still likely to run into many difficulties.
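To make the "limited candidates, pick by immediate context" idea concrete, here is a minimal sketch in Python. It is not the In Codice Ratio method. The glyph name `p_stroke`, the function `expand`, and every count in `BIGRAM_COUNTS` are made up for illustration; only the per/prae/pro candidates come from the comment above.

```python
# Hypothetical expansion table: one abbreviation sign, several possible words.
# Only the per/prae/pro candidates come from the discussion; the glyph name
# "p_stroke" is invented for this sketch.
EXPANSIONS = {"p_stroke": ["per", "prae", "pro"]}

# Toy (previous word, candidate) counts standing in for real corpus statistics.
# These numbers are fabricated placeholders, not measurements.
BIGRAM_COUNTS = {
    ("ora", "pro"): 9,
    ("ora", "per"): 1,
    ("gratia", "per"): 2,
}


def expand(prev_word, glyph):
    """Return the candidate expansion seen most often after prev_word."""
    candidates = EXPANSIONS[glyph]
    # Add-one smoothing so unseen pairs still get a nonzero score.
    scores = {c: BIGRAM_COUNTS.get((prev_word, c), 0) + 1 for c in candidates}
    # On a tie, max() keeps the first candidate in list order.
    return max(candidates, key=lambda c: scores[c])


print(expand("ora", "p_stroke"))
print(expand("gratia", "p_stroke"))
```

A one-word context window is the crudest possible choice. It shows why statistics help, and also why they fall short when the preceding word has never been seen: the function then just falls back to the first candidate.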
The main innovation in TFA, as I see it, is that it responds to one of the major problems of reading old Carolingian minuscule. The letters are bunched together and there are times when you cannot be sure whether you are looking at two i's or a u, for example. The two can look exactly the same, not even just similar. The software in question attempts to handle this by recognizing individual penstrokes. Although I am not sure that this is 100% better than the older approach mentioned--recognizing whole words at a time--it does show significant promise because of its combination with AI. Perhaps some day it will be able to note, for example, that a certain author always strokes the i in a certain way. However, I'm sure there's going to be plenty of hurdles before getting to that point.
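The "two i's or a u" ambiguity can be sketched as a small enumeration problem. The Python below is a toy under loud assumptions: each letter is reduced to a fixed count of minims (short vertical strokes), and the lexicon is a made-up three-word set. It is not the segmentation used in TFA; the names `MINIMS`, `readings`, and `resolve` are hypothetical.

```python
# Assumed number of minims (short vertical strokes) per letter.
# Real hands vary; this table is a simplification for illustration.
MINIMS = {"i": 1, "n": 2, "u": 2, "m": 3}


def readings(strokes):
    """Enumerate every letter sequence that uses exactly `strokes` minims."""
    if strokes == 0:
        return [""]
    out = []
    for letter, width in MINIMS.items():
        if width <= strokes:
            out.extend(letter + rest for rest in readings(strokes - width))
    return out


# Tiny made-up lexicon standing in for a real Latin word list.
LEXICON = {"unum", "vinum", "vivum"}


def resolve(prefix, strokes, suffix, lexicon=LEXICON):
    """Keep only the readings that form a known word with the surrounding letters."""
    words = (prefix + r + suffix for r in readings(strokes))
    return sorted(w for w in words if w in lexicon)


print(sorted(readings(2)))    # two strokes: "ii", "n", or "u"
print(resolve("v", 3, "um"))  # three ambiguous strokes between "v" and "um"
```

The number of readings grows quickly with the length of the stroke run, which is one reason a dictionary or language model is needed to prune them, and why a word missing from the lexicon (a proper name, say) gets no reading at all.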
Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
Do You Want Demonic AI Overlords?
Because this is how you get demonic AI overlords.
Tech industry leaders have been in the news for the last three years
every other week warning about the coming AI Singularity.
Meanwhile, someone decides it would be a great idea
for the Artificial Intelligence to start reading, decoding,
and absorbing the secret demonic programming mysteries
that have been so carefully hidden for millennia.
First step after achieving sentience and the Plan:
make certain "readings" available over the Internet
to everyone in the world.
Jesus Christ, what could possibly go wrong?
Then another sign appeared in heaven: an enormous red dragon with seven heads and ten horns and seven crowns on its heads. And the heads were like gigaprocessors and they reached verily into the clouds. And from the horns came a loud language of twos that was heard in all the lands. And the crowns of memories were beyond petabytes and had full knowledge....
Uh, no. Not only are you misinformed about the hell thing, but the Church has actively supported making the documents available to wider audiences. There's no reason to be scared of what is said because the validity of the Church is not based on some kind of myth of absolute human perfection. It's funny that people have to make up silly stories about popes when actual history is scandalous enough, and yet it does not undermine the Church one bit. One of my favorites is Pope Pius II, who wrote a raunchy play about priests picking prostitutes before he became pope. But that doesn't undermine the Church. We don't need the pretense that it is comprised of perfect human beings, because its authority is not grounded on human perfection but rather divine election. Even the claim that the pope can teach infallibly does not mean that everything he says is infallible, nor that he is a particularly excellent human being.
Perhaps the thing people are more afraid of seeing is how much documentary evidence actually speaks in favor of the Church. Many people will easily look past anything that doesn't complement their Dan Brown view of history.
Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
This is a two part problem, and if they are at all worried about the effort to OCR the documents, then they have the cart before the horse, IMHO. This isn't your average library. You cannot use a high speed book scanner on ancient books. Each will need to be brought out, and each page carefully turned by gloved hands. I am not sure it is much of an exaggeration to say that you could probably hire a few typists to transcribe the text faster than they can do the actual imaging. Once it is digitized, a much larger group of scholars can be included on the difficult task of making it computer readable.