New Audacious Research Project, In Codice Ratio, Bets on AI and OCR To Make Sense of Handwritten Texts in Vatican's Secret Archives (theatlantic.com)
A new project untangles the handwritten texts in one of the world's largest historical collections. From a report: The Vatican Secret Archives is one of the grandest historical collections in the world. It's also one of the most useless. The grandeur is obvious. Located within the Vatican's walls, next door to the Apostolic Library and just north of the Sistine Chapel, the VSA houses 53 linear miles of shelving dating back more than 12 centuries. That said, the VSA isn't much use to modern scholars, because it's so inaccessible. Of those 53 miles, just a few millimeters' worth of pages have been scanned and made available online. Even fewer pages have been transcribed into computer text and made searchable. If you want to peruse anything else, you have to apply for special access, schlep all the way to Rome, and go through every page by hand.
But a new project could change all that. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time. If successful, the technology could also open up untold numbers of other documents at historical archives around the world.
It doesn't exist yet. Neural networks and genetic algorithms are NOT SENTIENT or anywhere close. It's going to be a few decades before we have anything resembling true intelligence.
Heck, even the HAL of HAL 9000 stands for Heuristic ALgorithmic computer, so Clarke was still making the argument that the computer of 2001 had only reached the brink of sentience and couldn't handle a moral dilemma.
If restricted access to these documents is the problem then OCR can do little to help. At the same time OCR is not a requirement for granting public access, just scan and publish the images. Having an imperfect OCR is more of a hindrance than help.
Once all of it's read, will that spell the end of the world?
I'll be long gone before Mickey Mouse is free.
So is this another project like Google's, to rip the covers and bindings off everything in sight, toss it all in a sheetfeeder to make sub-optimal scans and then pulp it all?
If everyone knows they exist
But honestly, isn't it long past time to open them up to scholarly research?
Scan them first. (How long do we estimate it will take?) Then start with the transcriptions, with or without OCR and deep learning.
And let's stop kidding ourselves: there is no AI. AI is a campy buzzword that the hipsters throw around because they think it makes them look kewl when they use it. But it really just makes them sound stupid.
OK, so now the text is measured in miles? What lunacy is this?
I mean, it is the ONE article where Libraries of Congress would actually be a valid unit!
Violence is the last refuge of the incompetent. Polar Scope Align for iOS
The secret service of Vatican City, named Opus Dei, is never going to allow it.
By default some priests have to check everything for censorship, there's so much stuff in there they don't want people to know.
Step 2, involving OCR and AI etcetera, is a separate step. It could be done multiple times, refining the quality of the results as the technology develops, and augmented with human checking, intervention in difficult spots.
The Catholic Church were more or less rulers at one point. Less priests and more kings. There's bound to be no shortage of dirt in there. And Catholicism has been getting beat up lately as it is. That's why we got a Pope who openly questions the reality of Hell. A vast library full of texts nobody ever thought would be read by the common rabble wouldn't exactly improve their standing. In this case the Truth won't set them free.
Hi! I make Firefox Plug-ins. Check 'em out @ https://addons.mozilla.org/en-US/firefox/addon/youtube-mp3-podcaster/
This is ground breaking. No one has ever used NN to decipher handwritten text before. I know I didn't back in 2005. Truly amazing!
Depends on how endemic...paedophilia and general sex abuse was amongst the higher ranks of Catholicism and how much of these 'secret archives' document such activities, or the skills needed to commit the abuse while remaining publicly untarnished.
If someone could just go in and do a high-resolution scan of all the pages, wouldn't they have -- and then couldn't conceivably anyone try their OCR technology on it?
Artificial intelligence isn't sentience. Here's another surprise - I'm going to get my Master's from Georgia Tech, but I haven't yet decided between two different programs - Artificial Intelligence, or Machine Learning. They are two different degrees, covering different topics.
The English Oxford Living Dictionary gives this definition of artificial intelligence:
"The theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages."
Despite the name, the Secret Archives is not all that secret. It's not hard to request and gain access. The problem is simply that there's too much material to deal with, and perhaps also the complexities of scanning old books without damaging them.
Incipiamus, fratres, servire Domino Deo, quia hucusque vix vel parum in nullo profecimus.
I'm excited. I hope this "In Codice Ratio" technique will eventually be able to discover and read overwritten text. There's no better place to look for such things than the Vatican's Secret Archives. Something as stunning as the Archimedes Palimpsest, something that could change history as we know it might just be sitting on a shelf there, waiting to be found.
This sounds like a great idea, but it's likely to be extraordinarily complicated. Not only does handwriting differ from age to age, culture to culture, and place to place (just try reading 20th century German Sütterlin), but many medieval manuscripts utilize complex systems of abbreviations called sigla. Interpreting these can be very complicated because they are heavily context-dependent. One symbol can mean several different things. For example, a cross through a p can mean per, prae, or pro. A line over some letters can signify anything being cut out in-between. Just try figuring out what this inscription says: here.
Reading such abbreviations was probably expected to be relatively simple for the human brain to decipher both because the human actually interprets the text while deciphering symbols and because the original audience would have a better sense of how a particular community tended to use abbreviations.
The task is not impossible for a computer, though. In most cases there are a limited number of words that could be signified by abbreviations, and it is possible to determine which word is most likely intended according to immediate context. However, that would require the machine to have a grasp of the Latin grammar, and even then not everything is going to follow perfect rules. There is so much potential interpretation involved. The AI component here does help with this inasmuch as it uses statistical data to optimize recognition, but it's still likely to run into many difficulties.
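To make the "most likely word according to immediate context" idea concrete, here is a minimal sketch in Python. It is not how In Codice Ratio works; it is only a toy that picks an expansion for the crossed p by looking at the neighbouring words. The token `<p-cross>`, the `EXPANSIONS` table, and the bigram counts are all made up for illustration and are not measured from any real corpus.

```python
from collections import Counter

# Hypothetical table: abbreviation sign -> candidate expansions
# (per / prae / pro, as in the crossed-p example above).
EXPANSIONS = {"<p-cross>": ["per", "prae", "pro"]}

# Toy bigram counts, invented purely for illustration.
BIGRAMS = Counter({
    ("per", "omnia"): 5,
    ("pro", "nobis"): 7,
    ("prae", "omnibus"): 3,
    ("ora", "pro"): 6,
})

def expand(tokens):
    """Replace each abbreviation sign with the candidate that best fits its neighbours."""
    out = []
    for i, tok in enumerate(tokens):
        if tok not in EXPANSIONS:
            out.append(tok)
            continue
        prev = out[-1] if out else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None

        def score(cand):
            # Sum of counts for (previous word, candidate) and (candidate, next word).
            return BIGRAMS[(prev, cand)] + BIGRAMS[(cand, nxt)]

        # On a tie, max() keeps the first candidate listed in EXPANSIONS.
        out.append(max(EXPANSIONS[tok], key=score))
    return out

print(expand(["ora", "<p-cross>", "nobis"]))
print(expand(["<p-cross>", "omnia"]))
```

A real system would need far more than bigram counts (grammar, scribal habits, regional conventions), which is exactly where the difficulties mentioned above come in.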
The main innovation in TFA, as I see it, is that it responds to one of the major problems of reading old Carolingian minuscule. The letters are bunched together and there are times when you cannot be sure whether you are looking at two i's or a u, for example. The two can look exactly the same, not even just similar. The software in question attempts to handle this by recognizing individual penstrokes. Although I am not sure that this is 100% better than the older approach mentioned--recognizing whole words at a time--it does show significant promise because of its combination with AI. Perhaps some day it will be able to note, for example, that a certain author always strokes the i in a certain way. However, I'm sure there's going to be plenty of hurdles before getting to that point.
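The "two i's or a u" problem can be shown with a small Python sketch. This is only an illustration of the ambiguity, not the project's actual segmentation method. It assumes, as a simplification, that i, n, u, and m are written as 1, 2, 2, and 3 identical vertical strokes (minims), enumerates every letter string a run of strokes could spell, and then keeps only the readings found in a word list. The function names and the tiny lexicon are hypothetical.

```python
# Letters made purely of minims, with their stroke counts (a simplifying assumption).
MINIM_LETTERS = {"i": 1, "n": 2, "u": 2, "m": 3}

def readings(n_minims):
    """Return every letter string that could be written with n_minims strokes."""
    if n_minims == 0:
        return [""]
    results = []
    for letter, width in MINIM_LETTERS.items():
        if width <= n_minims:
            results.extend(letter + rest for rest in readings(n_minims - width))
    return results

def plausible(prefix, n_minims, suffix, lexicon):
    """Keep only the readings that form a known word between prefix and suffix."""
    candidates = (prefix + r + suffix for r in readings(n_minims))
    return sorted(w for w in candidates if w in lexicon)

# Two strokes can be read as "ii", "n", or "u".
print(readings(2))

# "v" + three strokes + "o": only one of the six readings is in the toy lexicon.
print(plausible("v", 3, "o", {"vino", "vivo"}))
```

The number of possible readings grows quickly with the length of the stroke run, so some kind of lexical or statistical filter is needed to make stroke-level recognition usable.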
Do You Want Demonic AI Overlords?
Because this is how you get demonic AI overlords.
Tech industry leaders have been in the news for the last three years
every other week warning about the coming AI Singularity.
Meanwhile, someone decides it would be a great idea
for the Artificial Intelligence to start reading, decoding,
and absorbing the secret demonic programming mysteries
that have been so carefully hidden for millennia.
First step after achieving sentience and the Plan:
make certain "readings" available over the Internet
to everyone in the world.
Jesus Christ, what could possibly go wrong?
Then another sign appeared in heaven: an enormous red dragon with seven heads and ten horns and seven crowns on its heads. And the heads were like gigaprocessors and they reached verily into the clouds. And from the horns came a loud language of twos that was heard in all the lands. And the crowns of memories were beyond petabytes and had full knowledge....
The Venice Time Machine project has been doing this for quite some time. They even scan the books without opening them...
https://vtm.epfl.ch/
This is a two part problem, and if they are at all worried about the effort to OCR the documents, then they have the cart before the horse, IMHO. This isn't your average library. You cannot use a high speed book scanner on ancient books. Each will need to be brought out, and each page carefully turned by gloved hands. I am not sure it is much of an exaggeration to say that you could probably hire a few typists to transcribe the text faster than they can do the actual imaging. Once it is digitized, a much larger group of scholars can be included on the difficult task of making it computer readable.
The headline writing here has gone so downhill lately that I need some AI to be able to parse this nonsense. Seriously, writing headlines is journalism 101.
As an obvious requirement, the teams sent out to harvest this data would need to be equipped with something a little more advanced than your typical desktop scanner. Right now, when dealing with ancient texts, scans are done in the visual range, UV and IR (full-spectrum imaging), with more specialized scans (such as x-ray, x-ray fluorescence and hyperspectral imaging) being done in very few places. The Lazarus Project already has a portable multispectral scanning setup, but they don't do any of the x-ray or gamma ray imaging stuff. There are many texts which are too fragile or too precious to be transported to a European or North American university, so the ability to image in x-ray, thermal IR and gamma rays would be pretty important.
I need a wheelchair van for my son. Help me get the word out. https://www.gofundme.com/wheelchair-van-for-jj
Your post, and the post before, are fantastic examples of why I still come here for the comments. Thank you.
"goodbye and hello, as always" ~Prince Corwin, from Zelazny's Amber series