How Would You Archive Mounds of Genealogy Data?

Posted by Cliff on Monday July 11, 2005 @02:44AM from the storing-your-family-tree dept.

dexter riley asks: "Hello, all. My mother, a librarian, historian and genealogist for over twenty years, died about a year ago. She left a huge amount of genealogy information, culled from books, magazines, and the internet, mostly in the form of typewritten, photocopied, and printed pages. My main goals are: Preservation - converting the documents into a compact format that can be easily copied and transferred to others; and Indexing - making it possible for someone else to easily find the documents referring to a particular person, family, place, or document type (like land, marriage, military, birth or death records). To this end, I would like to convert her work into a format that can be stored digitally and scanned for keywords, to make it easier for others to use this information for their genealogy projects later on. What tools do you recommend for handling a project of this size?

I'd estimate there are at least 10,000 pages of documents in all. Much of it is organized by binder into family groups, but a lot of it is unorganized, loose paper. Besides being an irreplaceable resource for any future genealogists in my family, there are other researchers working on related lines that may find some part of this data useful. At the very least, I would like the satisfaction of keeping some part of her work from being lost for a few years more.

Here's a general list of things that I've determined I would need:

Scanners: What flatbed scanners would you recommend for fast, high-resolution scanning of documents?
Image formats: What lossless image formats would you scan your original documents into?
OCR software: Although OCR is not perfect, would you recommend using it to allow keyword searching of the original documents? If so, which software would you suggest?
Document Indexing: In addition to OCR, are there other tools (document tags?) that you would use to help classify and organize images and other digital documents?
File organization software: Ultimately, many thousands of text and image files will be generated. Since I don't want to just convert a paper mess into a digital mess, what tools would you use to organize related image and text files?

Did I miss anything in the above list? Any suggestions you all might have would be hugely welcomed."

18 of 73 comments

That is SOME work... by McSnarf · 2005-07-11 02:48 · Score: 4, Informative

1. Check the Document Management Continuum!
http://www.archivebuilders.com/whitepapers/index.html
2. Get two reasonable scanners that work with whatever software you choose. One with a document feeder (can be monochrome). Modern office MFPs work fine. The other one is a cheap color flatbed scanner for anything the big one won't process.
3. Doc prep and indexing will take much longer than the scanning and, unlike OCR, are a lot of manual labour. Expect a couple of weeks, minimum, especially if you haven't got an indexing scheme in place.
4. Use TIFF G4 and PDF (OCRed text over the images).
5. Profit.
Dead tree by keesh · 2005-07-11 02:53 · Score: 2, Insightful

Dead tree lasts longer than computer media. Ask anyone who has ever tried to get data off twenty year old tapes...
1. Re:Dead tree by bill_mcgonigle · 2005-07-11 03:03 · Score: 4, Insightful
  
  Ask anyone who has ever tried to get data off twenty year old tapes...
  
  After I give them a dopeslap for not keeping their data current.
  
  I mean, my first Mac had a 40MB hard drive, but I still have all the data from it - it's become easier and easier each generation to copy all my old data forward.
  
  Granted, there's always the odd lost-tape found behind a cabinet, but that's someone who didn't have a good data retention plan in place and didn't care about that data too much.
  
  There will always be a need for forensic recovery, but compared with just a few years ago, almost all the casual users I know keep all their data on a hard drive. The floppies and ZIP's are gone. Some of it is on CD-R, but that's the new backup media, not current storage.
  
  Now getting them to do a good backup so I don't have to go rescue their drives with dd_rescue - somebody let me know how to do that!
  
  --
  My God, it's Full of Source!
  OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
2. Re:Dead tree by 4of12 · 2005-07-11 06:53 · Score: 3, Insightful
  Dead tree lasts longer than computer media.
  
  Frightful.
  
  I was looking at 100-year-old newspapers from a small town in AL about 5 years ago. Yellowed, brittle, crumbly. Half of those old newspapers were useless to amateur pawed genealogists who were slowly contributing to their demise by even attempting to turn the pages.
  
  Then, there are the tons of valuable paper records that get passed to random descendants of record keepers (e.g., Grandpa has a bunch of records of marriages, births, baptisms from some old church from 60 years ago that doesn't exist).
  
  If you make dead tree records I'd recommend making multiple copies that get distributed to different people in different places, preferably on acid-free paper.
  
  The biggest enemies of genealogical records, IMHO, are
  
  uncaring descendants chucking out a bunch of "junk",
  
  their antecedents who never even bother to tell them or, better, write down who the hell is in those old photos, etc., and
  
  the odd house fire that consumes everything that can burn.
  --
  "Provided by the management for your protection."
Organization by poopdeville · 2005-07-11 02:54 · Score: 4, Insightful

Are you sure there's actually a mess? Since your mom was a librarian, it seems to me that she would know how to organize this information. Go through it and make sure the information isn't already structured before you start messing around.
You might also want to ask the guys from rotten.com if they'll let you see the code behind the NNDB.

--
After all, I am strangely colored.
OCR software: ABBYY by mindaktiviti · 2005-07-11 02:57 · Score: 2, Informative

ABBYY FineReader is great OCR software. However, it's not free, and I believe it is Windows-only.
Mormons can help by HowlinMad · 2005-07-11 02:58 · Score: 5, Informative

The Mormon faith believes in tracing humans back to Adam and Eve. They have a huge genealogy library in Salt Lake City. There are several programs available that you can put the information in, and submit it to them. They will keep it forever, and others can research it as well.

--
Great Linux Site
1. Re:Mormons can help by HowlinMad · 2005-07-11 06:19 · Score: 2, Informative
  
  You would be very surprised. My mother is big into this stuff. You can put in any kind of information about a person. They want it. When looking for ancestors, that kind of information is very handy, as it tells the story of where they were, where they went, etc. That helps you to uncover where to look next. You can include it all: pictures, land purchases, you name it.
  
  --
  Great Linux Site
2. Re:Mormons can help by menscher · 2005-07-11 06:23 · Score: 2, Interesting
  
  Although I'm sure the Mormons would welcome a fixed, formalized family tree, most of the information I have is far lower-level than they might be interested in. There's a little data like, "Jehod begat Ezekial begat Fred", but most of it is like "John Smith owned 12 acres in Norfolk County in 1728."
  Although they might not be able to take that level of detail into their central database, I think they would welcome info like that at a more local level. At the local levels, they often have small libraries of genealogical data relevant to that specific locality. This way, a researcher can simply contact the local family history library and look up any information they might want.
  
  Definitely look up The Church of Jesus Christ of Latter-day Saints in your phonebook and give them a call. Most areas have a local genealogy expert that can help you.
Mormons? by bill_mcgonigle · 2005-07-11 02:58 · Score: 3, Informative

Maybe it's best to consider outsourcing it. Groups like the Mormons do this kind of work as one of their missions.

This link may or may not be useful.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Google by Phillup · 2005-07-11 03:13 · Score: 3, Interesting

Give it to Google, let them do it.

You'll have it forever, and anyone will be able to pull it up.

(Too bad they don't offer this service... yet.)

--

--Phillip

Can you say BIRTH TAX
Tools by ka9dgx · 2005-07-11 03:16 · Score: 2, Interesting

Image formats: It appears that TIFF is currently the gold standard in terms of archival storage of documents. JPEG2000 will be the way to go, once it becomes commonplace.
Document Indexing/File Organization: A Wiki is the proper tool for this job, in my opinion. It makes it very easy to edit, and hyperlinking is instinctive. You can easily attach documents to pages, and you can usually export the whole thing as a directory tree. Most Wiki software also keeps track of all of the versions of a page, so you can worry less about making bad mistakes.

I've used both MoinMoin, which is a traditional web-based Wiki, and WikidPad, which is an IDE-like environment for Windows that does Wiki-like things. Both of these programs are open source, Python-based applications.

You also might want to check out ThumbsPlus by Cerious Software, which stores thumbnails of images in a database (including SQL backends), along with keywords and user fields. It can help you as well.

--Mike--
Fort Wayne Library Genealogy by anhyzer_mush · 2005-07-11 04:05 · Score: 2, Informative

The Fort Wayne, Indiana library has an amazing genealogy section. From the website:
The Fred J. Reynolds Historical Genealogy Department of the Allen County Public Library was organized in 1961 by the library director for whom it was named. The department's renowned collection contains more than 300,000 printed volumes and 314,000 items of microfilm and microfiche. This collection grows daily through department purchases and donations from appreciative genealogists and historians. Because of the collection's size and continuous growth, the information in the following holdings summary will necessarily be brief and representative in nature.

Perhaps they would be interested in her collection. At the very least, they may be willing to help you with your project. Here's the website:

http://www.acpl.lib.in.us/genealogy/
In response to your questions... by Shimdaddy · 2005-07-11 04:20 · Score: 2, Interesting

* Scanners: I would go with something basic. I'm a debater for a high school squad, and last year, when we decided to digitize literally thousands of pages of evidence, we used an HP 5550. It's great, cheap (only $300), and USB 2.0, which is the only thing I would call an absolute requirement.
* Image formats: I would use TIFF. The files can be huge at higher resolutions, but scanning at 1-bit (black and white) seems to keep things under control. You could also try Adobe's PDF, but then you are locked in with Adobe.
* OCR software: If your copies are clean, I would say go for OCR, but don't let it replace the images of the pages. OCR can now capture 99.9%-ish of the text, but it loses all the formatting. So scan in the text for searching, but keep the images around for viewing.
* Document Indexing: I would just index them by date, and use the date as the filename.
* File organization software: PaperPort is absolutely great for this task. You can "stack" images together and put them into colored folders, and conversion between formats is just drag and drop. I would highly recommend it; it's on version 9 or 10 by now.
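
For what it's worth, the date-as-filename idea could look something like the sketch below. It assumes "date" means the date of the original document rather than the scan date, and the function name `scan_filename` and the sequence-number suffix are made up for illustration. Putting the year first (ISO style) means a plain alphabetical sort of the folder is also a chronological sort.

```python
from datetime import date

def scan_filename(doc_date, sequence, ext="tif"):
    """Build a name like 1728-11-30_0002.tif (hypothetical naming scheme)."""
    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as text.
    return f"{doc_date.isoformat()}_{sequence:04d}.{ext}"

# Made-up document dates, purely for illustration.
names = [
    scan_filename(date(1961, 5, 2), 1),
    scan_filename(date(1728, 11, 30), 2),
    scan_filename(date(1905, 1, 15), 3),
]

for name in sorted(names):
    print(name)
```

Undated pages would need some fallback (a placeholder prefix, say), which this sketch doesn't attempt.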
Google says search, don't sort! by jimbro2k · 2005-07-11 05:24 · Score: 2, Interesting

Meaning that you don't necessarily need to organize the data, just be able to search it quickly.
If you agree with that philosophy, then, after you have it all in ASCII, just do a full text index of the data (which makes sense if the data is rarely or never updated) and it is quick to pull out anything you need.
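
A rough sketch of what such a full-text index might look like, assuming the OCR output is already sitting in plain text, one string per scanned page. The filenames and sample text are invented, and `build_index` and `search` are just illustrative names, not any particular tool's API. A real project would more likely use an existing search engine, but the idea is the same: map each word to the pages that contain it.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return sorted names of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical OCR output keyed by scan filename.
docs = {
    "page_0001.txt": "John Smith owned 12 acres in Norfolk County in 1728.",
    "page_0002.txt": "Marriage record: Mary Smith, Norfolk County.",
    "page_0003.txt": "Military service record for Fred Jones.",
}

index = build_index(docs)
print(search(index, "smith norfolk"))
```

Since the index is built once and the data rarely changes, lookups stay fast no matter how disorganized the underlying files are. OCR errors will cause misses, though, so exact-word matching like this is only a starting point.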

--
There is not nearly enough love in the world, but there is far too much trust.
Fire. by Chess_the_cat · 2005-07-11 05:30 · Score: 2, Funny

And plenty of it!

--
Support the First Amendment. Read at -1
First things first by GCP · 2005-07-11 06:55 · Score: 2, Informative

Others are suggesting answers to your scanning needs, and your mother may have already done this, but just in case:

The most important thing you can do with a pile of data is turn it into useful information. I know that's what you're trying to do. In the case of genealogical data, the most important product is not just an electronic version of paper sources, but something built from those sources: an electronic family tree with facts (birth date/place, death date/place/cause, graduations, marriage date/place, immigration, military service, etc.) attached to each person and SOURCES attached to each fact, and all of it exported to GEDCOM format.

For each source, good software will let you include transcripts (not scans but transcribed excerpts that aren't too big) along with the bibliographic reference that tells others where they could find the information for themselves.

I probably have as much of this sort of data as you have, but most of it has been "processed" by extraction. That leaves the rest as backup that I seldom need to refer to. Since 99% of the information (by VALUE, not by bits) has been extracted into a very useful form, the rest is referred to so rarely that paper is just fine (as long as there is another copy stored elsewhere).

I'm not saying that having electronic copies of all of the original paper sources wouldn't be better. It would, but mostly for ease of backup. It's just that the most valuable thing to do (in my opinion) is the extraction and assembly of a really rich information structure (family tree, facts, sources, transcriptions of sections of sources) in a standard interchange format (GEDCOM).

After doing so, the additional value in having electronic, searchable copies of the original documents gets much smaller, so if I were you I'd make sure the information extraction project was done first before embarking on the scan/OCR/indexing project. After you do the former, you may decide that the latter isn't worth the effort.
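
To make the "facts with sources attached" idea concrete, here is a minimal sketch that writes one person and one source as GEDCOM-style lines. The person, dates, and source title are invented, the helper names are hypothetical, and this only produces a fragment: a real GEDCOM file also needs a header and trailer record, which genealogy software writes for you on export.

```python
def person_to_gedcom(xref, given, surname, facts):
    """Return GEDCOM-style lines for one person.

    facts is a list of (tag, date, place, source_xref) tuples,
    where tag is an event tag such as "BIRT" or "DEAT".
    """
    lines = [f"0 @{xref}@ INDI", f"1 NAME {given} /{surname}/"]
    for tag, when, place, source_xref in facts:
        lines.append(f"1 {tag}")
        if when:
            lines.append(f"2 DATE {when}")
        if place:
            lines.append(f"2 PLAC {place}")
        if source_xref:
            # Each fact points back at the source record that supports it.
            lines.append(f"2 SOUR @{source_xref}@")
    return lines

def source_to_gedcom(xref, title):
    """Return GEDCOM-style lines for one source record."""
    return [f"0 @{xref}@ SOUR", f"1 TITL {title}"]

# Invented example data, for illustration only.
facts = [
    ("BIRT", "12 MAR 1701", "Norfolk County", "S1"),
    ("DEAT", "1760", "", "S1"),
]
lines = person_to_gedcom("I1", "John", "Smith", facts)
lines += source_to_gedcom("S1", "Hypothetical county records, binder 3")

print("\n".join(lines))
```

The point is that the source reference travels with each individual fact, so someone else reading the file can see where every date and place came from.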

--
"Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
GEDCOM by D.A.+Zollinger · 2005-07-11 07:19 · Score: 2, Informative

A lot of what you are looking for has already been figured out by others who do not want to be locked into a single format, vendor, or system. There are many genealogy programs out there, such as Gramps or Brother's Keeper, that use a standardized file format, GEDCOM, which allows information stored in those specific programs to be transferred to other programs. This allows for easy upgrades in software, as well as the possibility of moving from one package to another, as the information can be archived to the GEDCOM file and then read in again once the new software has been installed.
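
Part of why GEDCOM moves between programs so easily is that it is plain text: each line is a level number, an optional cross-reference ID, a tag, and an optional value. A small sketch of reading it, assuming well-formed input and ignoring details such as character encodings and continuation lines (the sample records and function names are made up):

```python
import re

# level, optional @XREF@, tag, optional value
LINE_RE = re.compile(r"^(\d+) (?:(@[^@]+@) )?(\S+)(?: (.*))?$")

def parse_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""

def individual_names(lines):
    """Return {xref: name} for every INDI record, using its first NAME line."""
    result = {}
    current = None
    for raw in lines:
        if not raw.strip():
            continue
        level, xref, tag, value = parse_line(raw)
        if level == 0:
            # A level-0 line starts a new record; only track individuals.
            current = xref if tag == "INDI" else None
        elif level == 1 and tag == "NAME" and current:
            # Surnames are wrapped in slashes; drop them for display.
            result.setdefault(current, value.replace("/", "").strip())
    return result

# Invented sample records.
sample = """0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 12 MAR 1701
0 @I2@ INDI
1 NAME Mary /Jones/
0 @S1@ SOUR
1 TITL Hypothetical county records
0 TRLR""".splitlines()

print(individual_names(sample))
```

Any program that can read lines like these can pick up where the last one left off, which is the whole appeal.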

As for hardware and other software, I would suggest you use what is familiar to you, or what is compatible with the software package you have chosen.

--
I haven't lost my mind!
It is backed up on disk...somewhere...