Slashdot Mirror


How Would You Archive Mounds of Genealogy Data?

dexter riley asks: "Hello, all. My mother, a librarian, historian and genealogist for over twenty years, died about a year ago. She left a huge amount of genealogy information, culled from books, magazines, and the internet, mostly in the form of typewritten, photocopied, and printed pages. My main goals are: Preservation - converting the documents into a compact format that can be easily copied and transferred to others; and Indexing - making it possible for someone else to easily find the documents referring to a particular person, family, place, or document type (like land, marriage, military, birth or death records). To this end, I would like to convert her work into a format that can be stored digitally and scanned for keywords, to make it easier for others to use this information for their genealogy projects later on. What tools do you recommend for handling a project of this size? I'd estimate there are at least 10,000 pages of documents in all. Much of it is organized by binder into family groups, but a lot of it is unorganized, loose paper. Besides being an irreplaceable resource for any future genealogists in my family, there are other researchers working on related lines that may find some part of this data useful. At the very least, I would like the satisfaction of keeping some part of her work from being lost for a few years more.

Here's a general list of things that I've determined I would need:
  • Scanners: What flatbed scanners would you recommend for fast, high-resolution scanning of documents?
  • Image formats: What lossless image formats would you scan your original documents into?
  • OCR software: Although OCR is not perfect, would you recommend using it to allow keyword searching to the original document? If so, which software would you suggest?
  • Document Indexing: In addition to OCR, are there other tools (document tags?) that you would use to help classify and organize images and other digital documents?
  • File organization software: Ultimately, many thousands of text and image files will be generated. Since I don't want to just convert a paper mess into a digital mess, what tools would you use to organize related image and text files?
Did I miss anything in the above list? Any suggestions you all might have would be hugely welcomed."

73 comments

  1. That is SOME work... by McSnarf · · Score: 4, Informative

    1. Check the Document Management Continuum !
    http://www.archivebuilders.com/whitepapers/index.html
    2. Get two reasonable scanners that work with whatever software you choose. One with a document feeder (can be monochrome). Modern office MFPs work fine. The other one is a cheap flat bed scanner with color for anything the big one won't process.
    3. Doc prep and indexing will take much longer than the scanning and, unlike OCR, are a lot of manual labour. Expect a couple of weeks, minimum, especially if you haven't got an indexing scheme in place.
    4. Use TIFF G4 and PDF (OCRed text over the images).
    5. Profit.

  2. iPod?? by SimianOverlord · · Score: 0, Troll

    I find the most convenient method of carrying reams of data around is my iPod. All you need to do is scan in all your documents and use it like you would any other storage device. The advantages of this are:

    1) You can also listen to music and
    2) You could convert your genealogy data into music notes, record them into mp3 files or aac, and listen to them. If you developed enough facility with this music -language (musuage) you could listen, on the hoof and answer questions relatives may have in real time.

    Perhaps Aunt Nora may approach you at a BBQ and ask you about your mother's brother-in-law's second cousin's purported relationship to Henry VII. One quick spin of the patented iPod wheel later, and you're listening to that relationship aurally and giving her a running commentary of that side of the family, whilst thoughtfully munching on a burnt sausage roll. I can see big things with this approach.

    --
    Meine Schwester ist sehr, sehr reizvoll - Nietzsche
  3. Dead tree by keesh · · Score: 2, Insightful

    Dead tree lasts longer than computer media. Ask anyone who has ever tried to get data off twenty year old tapes...

    1. Re:Dead tree by bill_mcgonigle · · Score: 4, Insightful

      Ask anyone who has ever tried to get data off twenty year old tapes...

      After I give them a dopeslap for not keeping their data current.

      I mean, my first Mac had a 40MB hard drive, but I still have all the data from it - it's become easier and easier each generation to copy all my old data forward.

      Granted, there's always the odd lost-tape found behind a cabinet, but that's someone who didn't have a good data retention plan in place and didn't care about that data too much.

      There will always be a need for forensic recovery, but compared with just a few years ago, almost all the casual users I know keep all their data on a hard drive. The floppies and ZIP's are gone. Some of it is on CD-R, but that's the new backup media, not current storage.

      Now getting them to do a good backup so I don't have to go rescue their drives with dd_rescue - somebody let me know how to do that!

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    2. Re:Dead tree by angst_ridden_hipster · · Score: 1

      After I give them a dopeslap for not keeping their data current.

      When your starting point is a Mac that has a 40MB hard drive, your perspective is understandable.

      Some of us have/had data on old 8" floppies or 5.25" floppies, back from the days when a 5MB hard drive was $2,000 and the size of a Dell server.

      It wasn't so easy to keep data current in those days. If you were lucky, you had a serial port, and could transfer to a later generation machine. But disk formats were not standardized the way they are today: just try reading a hard-sectored floppy on a drive that doesn't understand 'em.

      Unlike today, there wasn't a situation of three or four physical interface standards that have great backwards compatibility. Often, drive interfaces were customized to specific make/models, e.g., a Kaypro II, or a TRS-80 Model III, and wouldn't work with anything else.

      (If you or anyone knows a cheap service to grab data off of 20-year-old hard-sectored, 40-track, DSDD 5.25" floppies, let me know!)

      --
      Eloi, Eloi, lema sabachtani?
      www.fogbound.net
    3. Re:Dead tree by bill_mcgonigle · · Score: 1

      When your starting point is a Mac that has a 40MB hard drive, your perspective is understandable.

      For me it's not, but going forward the easy-to-read data model is standard. So _today_ you have no good excuse for keeping your data in a non-digital manner. Heck, I can hook up an IDE, SATA or SCSI drive via PCI controller, USB controller, or Firewire controller all with ease. Tell that to the MFM drive I still have sitting in the corner to dump.

      (If you or anyone knows a cheap service to grab data off of 20-year-old hard-sectored, 40-track, DSDD 5.25" floppies, let me know!)

      Too expensive if you have any volume. I just bought a CatWeasel Mark IV floppy controller which can read a thousand different formats and has linux drivers. It's still in the box but I've heard great reviews. $150 for a floppy controller sounds crazy by today's standards, but I've paid that for an IBMPCXT controller (with realtime clock!), and counting inflation and the Euro exchange rate it's not too bad.

      --
      My God, it's Full of Source!
      OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
    4. Re:Dead tree by 4of12 · · Score: 3, Insightful
      Dead tree lasts longer than computer media.

      Frightful.

      I was looking at 100-year-old newspapers from a small town in AL about 5 years ago. Yellowed, brittle, crumbly. Half of those old newspapers were useless to amateur pawed genealogists who were slowly contributing to their demise by even attempting to turn the pages.

      Then, there's the tons of valuable paper records that get passed to random descendants of record keepers (e.g., Grandpa has a bunch of records of marriages, births, baptisms from some old church from 60 years ago that no longer exists).

      If you make dead tree records I'd recommend making multiple copies that get distributed to different people in different places, preferably on acid-free paper.

      The biggest enemies of genealogical records, IMHO, are

      • uncaring descendants chucking out a bunch of "junk",
      • their antecedents who never even bother to tell them or, better, write down who the hell is in those old photos, etc., and
      • the odd house fire that consumes everything that can burn.
      --
      "Provided by the management for your protection."
    5. Re:Dead tree by studerby · · Score: 1
      The biggest enemies of genealogical records, IMHO, are:

      Good choices, I'd add:

      • the jerk who has a complaint with the court/tax collector/local officials/adjacent land owner and burns down the courthouse
      • the jerk who steals original records for his "collection"
      • the record keeper who stores "that old stuff" in abysmal conditions, e.g. the basement that floods every 20 years...
      --

      .sig generation error:468(3)

    6. Re:Dead tree by Bombcar · · Score: 1

      Believe me, 20 year tapes are easy, if you used the QIC format. Just give Tandberg Data a call.

  4. Organization by poopdeville · · Score: 4, Insightful
    Are you sure there's actually a mess? Since your mom was a librarian, it seems to me that she would know how to organize this information. Go through it and make sure the information isn't structured before you start messing around.

    You might also want to ask the guys from rotten.com if they'll let you see the code behind the nndb.

    --
    After all, I am strangely colored.
    1. Re:Organization by poopdeville · · Score: 1

      ...see the code behind the nndb.

      --
      After all, I am strangely colored.
    2. Re:Organization by dexter+riley · · Score: 1

      Good point! But although Mom was a librarian by profession, she was a packrat by nature. A lot of the information is organized in binders by family, but there is a lot of looseleaf stuff in bins that doesn't have any apparent structure. Much of it may even be scrap paper, but it's not a call that I'm qualified to make at this stage of the project.

      I definitely want to find a document management software/structure that will let me maintain what order exists, while making it possible to add structure to the unorganized documents later on. I'd never seen the NNDB before; your comment and others makes me think that a system like it, or some other sort of wiki, might be very useful for linking the documents together.

      Thanks,
      Dexter Riley

  5. An Intern by bhima · · Score: 1
    You expect her to do all of this herself... quick if you don't have a younger sister get an intern!

    Oh... You will probably find a wiki sort of thing the easiest to deal with (sort of a DB front end for dummies)

    --
    Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.
    1. Re:An Intern by mcmonkey · · Score: 1
      You expect her to do all of this herself

      Of course he doesn't expect her to do it all herself. He'll get her to help.

  6. Mormoms can help by HowlinMad · · Score: 5, Informative

    The Mormon faith believes in tracing humans back to Adam and Eve. They have a huge genealogy library in Salt Lake City. There are several programs available that you can put the information in, and submit it to them. They will keep it forever, and others can research it as well.

    1. Re:Mormoms can help by Anonymous Coward · · Score: 1, Interesting

      Not to turn this into a discussion on religion and whatnot, but it's probably worth mentioning that any dead relatives you turn over to the Mormons will most likely be posthumously baptized into their faith. Being agnostic myself, I don't care one way or another, but some people do. I have a friend who is more into genealogy than I am who does research at one of the local LDS centers. He claims he is hounded every time he goes there for his information (but won't give it for this very reason).

      This article: http://archives.cnn.com/2002/US/West/12/10/baptizing.the.dead.ap/
      talks about a problem Jewish descendants of holocaust victims have with said baptisms.

    2. Re:Mormoms can help by dexter+riley · · Score: 1

      Although I'm sure the Mormons would welcome a fixed, formalized family tree, most of the information I have is far lower-level than they might be interested in. There's a little data like, "Jehod begat Ezekial begat Fred", but most of it is like "John Smith owned 12 acres in Norfolk County in 1728." Information that by itself doesn't provide a definitive lineage, but is more like a circumstantial lead that a private investigator might follow; information that might help someone whose ancestor was married to someone who lived in Norfolk County in the 1700's, but they don't know exactly who it was...

      That's part of the reason that being able to index each document would be useful. If someone working on the Moores in 1820 in Michigan could type a keyword or two and pull up a document that would be useful to them, it would be a really nifty thing.

      So, I plan to offer the organized information to anyone interested in it, but I think there's a lot of classification that should be done first. Besides, when I think of the Mormons completing their genealogy project, I'm reminded of the Arthur C. Clarke story "The Ten Billion Names of God." I would prefer that the stars not start blinking out, one by one...

    3. Re:Mormoms can help by HowlinMad · · Score: 2, Informative

      You would be very surprised. My mother is big into this stuff. You can put in any kind of information about a person. They want it. When looking for ancestors, that kind of information is very handy, as it tells the story of where they were, where they went, etc. That helps you to uncover where to look next. You can include it all: pictures, land purchases, you name it.

    4. Re:Mormoms can help by menscher · · Score: 2, Interesting
      Although I'm sure the Mormons would welcome a fixed, formalized family tree, most of the information I have is far lower-level than they might be interested in. There's a little data like, "Jehod begat Ezekial begat Fred", but most of it is like "John Smith owned 12 acres in Norfolk County in 1728."

      Although they might not be able to take that level of detail into their central database, I think they would welcome info like that at a more local level. At the local levels, they often have small libraries of genealogical data relevant to that specific locality. This way, a researcher can simply contact the local family history library and look up any information they might want.

      Definitely look up The Church of Jesus Christ of Latter-day Saints in your phonebook and give them a call. Most areas have a local genealogy expert that can help you.

    5. Re:Mormoms can help by dexter+riley · · Score: 1

      I will definitely do so! Thanks, guys! -Dex

    6. Re:Mormoms can help by stupid_is · · Score: 1
      If you're not particularly keen on porting family history into a church's docs, and you're based in the right country, you might want to try the genesreunited sites. They can use the same format (GEDCOM) as the Mormons (who wrote it) and also allow searching around for matches in other people's trees.

      --
      -- Intelligence is soluble in alcohol
  7. Mormons? by bill_mcgonigle · · Score: 3, Informative

    Maybe it's best to consider outsourcing it. Groups like the Mormons do this kind of work as one of their missions.

    This link may or may not be useful.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  8. Family history organization by beemishboy · · Score: 1

    The LDS church does a lot of family history and genealogy work. They have family history centers around the world staffed with volunteers that work to help people preserve their genealogies and family history. Locations can be found here and other resources that might be helpful are found at Cyndi's List which compiles a list of family history resources.

    1. Re:Family history organization by Anonymous Coward · · Score: 1, Informative

      This is a good idea as long as you don't mind them baptising all your dead relatives.

    2. Re:Family history organization by Rude+Turnip · · Score: 1

      Does it really matter? If their ritual means nothing to you, then why should you care about any "mumbo jumbo" they might mutter?

    3. Re:Family history organization by Suppafly · · Score: 1

      This is a good idea as long as you don't mind them baptising all your dead relatives.

      I'm kinda interested in how that works, do you have any links?

    4. Re:Family history organization by Intrigued · · Score: 1
      See if this is what you are looking for.

      It is from the LDS Church site because I prefer to find out information from the perspective of the believers in an openminded way than to entertain every outraged misconception of the populace.
      (read: I'll never find out what a Toyota owner loves about his car from a Ford dealership)

      LDS baptisms for the dead

    5. Re:Family history organization by Suppafly · · Score: 1

      thank you.

      from reading that site, it doesn't seem entirely bad.. more like one of those things people mention without actually explaining to get a rise out of people.

  9. Google by Phillup · · Score: 3, Interesting

    Give it to google, let them do it.

    You'll have it forever, and anyone will be able to pull it up.

    (Too bad they don't offer this service... yet.)

    --

    --Phillip

    Can you say BIRTH TAX
    1. Re:Google by Anonymous Coward · · Score: 0

      I heard that Google's servers are mostly in California. This implies that when the Big Quake finally hits, you can kiss all that data (and the rest of Silicon Valley) good-bye. The fault that flattened Frisco back in 1906 is almost at its hundredth anniversary of being locked-and-straining. So, I'm going to invest ONLY in companies smart enough to keep a large percentage of their infrastructure out of there.

    2. Re:Google by Noksagt · · Score: 1

      Once the data is scanned, letting Google do the indexing (either online, through Desktop Search, or through a Search Appliance) isn't a bad idea.

      If there is significant enough value, pawning off some of the work to Google or the Internet Archive or something similar isn't a bad idea. In particular, many libraries and Universities already do this kind of work.

  10. Tools by ka9dgx · · Score: 2, Interesting
    Image formats: It appears that TIF is currently the gold standard in terms of archival storage of documents. JPEG2000 will be the way to go, once it becomes commonplace.

    Document Indexing/File Organization: A Wiki is the proper tool for this job, in my opinion. It makes it very easy to edit, and hyperlinking is instinctive. You can easily attach documents to pages, and you can usually export the whole thing as a directory tree. Most Wiki software also keeps track of all of the versions of a page, so you can worry less about making bad mistakes.

    I've used both MoinMoin, which is a traditional web based Wiki, and WikidPad, which is an IDE-like environment for Windows that does Wiki-like things. Both of these programs are open source, Python based applications.

    You also might want to check out ThumbsPlus by Cerious Software, which stores thumbnails of images in a database (including SQL backends), along with keywords and user fields. It can help you as well.

    --Mike--

    1. Re:Tools by petermgreen · · Score: 1

      what variant of TIFF exactly do they recommend and why do they recommend it over options like png?

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  11. Scanner throughput - tests by Anonymous Coward · · Score: 0

    Is there anywhere that gives the performance of scanners?

    I've about 2500 pages to scan and need scan times for
    75 dpi
    150 dpi
    300 dpi (my preferred scan for image output)
    600 dpi (my preferred scan for OCR - text is about 7-point serifed newspaper font and therefore hard to OCR at low resolution)
    1200 dpi

    1. Re:Scanner throughput - tests by McSnarf · · Score: 1

      Simplex or duplex ? 2500 is a huge amount of paper - so you will either need a dedicated scanner with a HUGE intray or make a number of batches, which is the recommended method.
      Scanning 2500 sheets at 600dpi with an MFP that my current employer builds (no names, sorry :) ) would take about two hours, max. However, at a compression ratio of 1:20, the resulting scans would come to about four gigs of scan data at 8 bits greyscale...
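
      For anyone who wants to check that figure or try other resolutions, here is a back-of-the-envelope sketch. It assumes US Letter pages (8.5 x 11 inches), which the post does not state, and the function name is made up for illustration.

```python
def scan_size_bytes(pages, dpi, bits_per_pixel=8, compression_ratio=20,
                    width_in=8.5, height_in=11.0):
    """Estimate the total compressed size of a batch of scanned pages.

    The page dimensions default to US Letter, which is an assumption.
    """
    pixels_per_page = (width_in * dpi) * (height_in * dpi)
    raw_bytes_per_page = pixels_per_page * bits_per_pixel / 8
    return pages * raw_bytes_per_page / compression_ratio


if __name__ == "__main__":
    total = scan_size_bytes(pages=2500, dpi=600)
    print(f"{total / 1e9:.1f} GB")  # about 4.2 GB, matching "about four gigs"
```

      Dropping to 300 dpi cuts the total by a factor of four, since the pixel count scales with the square of the resolution.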

    2. Re:Scanner throughput - tests by hpavc · · Score: 1

      Use someone's ImageRunner

      --
      members are seeing something, your seeing an ad
  12. Storage? by VernonNemitz · · Score: 1

    You didn't indicate any question about what archival-quality digital storage medium might be best. Remember CD rot? Magnetic tape/floppy wear & tear? Hard drive crashes? That's why my personal choice is magneto-optical. Available in 5.25" and 3.5" sizes, the disks are removable like floppies and have protective shells. The disk itself is mostly polycarbonate ("Lexan") and the data layer can only be altered if the temperature rises near the Curie Point for the media (protect from fire, obviously!). They use a laser to get the heat to let data be written, inside a changing magnetic field (and a lower-power laser setting to read the data). Geologists studying the history of the Earth's magnetic field can find useful data saved/fixed in lava flows from millions of years ago. Data retention doesn't get much better than that (currently actually guaranteed for 40 or 50 years, but every few years when I check, the number goes up). So far each new higher-capacity generation of magneto-optical drive is backward compatible with all previous generations of disks, so if your drive dies, you can replace it and still access your data. Also, most disks are not WORM types (write once read many); they can be cleanly rewritten any number of times, as needed. [Ummm...a final note, the 3.5" drives only write on one side of a disk, max current capacity about 2.3GB, while the 5.25" drives are 4 or 5 times as expensive, but can write on either side of a disk (which must be manually flipped) and max capacity is over 9GB.]

  13. Fort Wayne Library Genealogy by anhyzer_mush · · Score: 2, Informative
    The Fort Wayne, Indiana library has an amazing genealogy section. From the website:

    The Fred J. Reynolds Historical Genealogy Department of the Allen County Public Library was organized in 1961 by the library director for whom it was named. The department's renowned collection contains more than 300,000 printed volumes and 314,000 items of microfilm and microfiche. This collection grows daily through department purchases and donations from appreciative genealogists and historians. Because of the collection's size and continuous growth, the information in the following holdings summary will necessarily be brief and representative in nature.

    Perhaps they would be interested in her collection. At the very least, they may be willing to help you with your project. Here's the website:

    http://www.acpl.lib.in.us/genealogy/

  14. In response to your questions... by Shimdaddy · · Score: 2, Interesting

    • Scanners: I would go with something basic. I'm a debater for a high school squad, and last year when we decided to digitize literally thousands of pages of evidence, we used one of these: the HP 5550. It's great, cheap (only $300), and USB 2.0, which is the only thing I would say is an absolute requirement.
    • Image formats: I would use TIFF. The files can be huge at higher resolutions, but scanning at 1-bit (black and white) seems to keep things under control. You could also try Adobe's PDF, but then you are locked in with Adobe.
    • OCR software: If your copies are clean, I would say go for OCR, but don't let it replace the images of the page. That's because OCR can now capture 99.9%ish of the text, but it loses all the formatting. So scan in the text for searching, but keep the images around for viewing.
    • Document Indexing: I would just index them by date, which I would make the filenames.
    • File organization software: PaperPort is absolutely great for this task: you can "stack" images together, put them into colored folders, and convert between formats with drag and drop. I would highly recommend using it; it's on version 9 or 10 by now.

  15. Paper, then plain textfiles, then ... by waterbear · · Score: 1

    Other posters have already suggested you check whether the papers are already in some kind of useful order. Whatever you do, preserve them. To you, they are the originals and points of reference against which all errors fall to be compared.

    As for digital format, I suggest you consider plain (ASCII) text. It's likely to outlast more complex formats.

    I've tried some specialised genealogy software, and it looks like a mixed blessing. It usually forces the data into pre-ordained types, and quite a lot of data don't fit the types and get distorted in the effort to fit them in, generating eventual errors. (For example, suppose the paper data are silent on whether a family member was married or not. The data gets entered into software that renders 'no marriage data' as 'unmarried'. An instant source of error and misunderstanding.) Keep it simple. Narrative text documents are easily searchable.

    -wb-

    1. Re:Paper, then plain textfiles, then ... by Anonymous Coward · · Score: 0

      The problem is getting it converted to ascii. My company implemented a document scanning and indexing solution.
      We realized early on that OCR was not really reliable, so we scan to .tif images, which are entered into a database with all pertinent information. Oracle was our choice, but I'm sure MySQL can handle 10,000 documents. but before

      The real chore is the indexing. Often it is very difficult to make sense of someone else's organization. I would tackle the clearly marked groups of documents first. Then try to categorize the rest. Again, expect to make mistakes. Most likely the loose docs were ones that she didn't get around to organizing herself. That is where you pick up where she left off.

    2. Re:Paper, then plain textfiles, then ... by petermgreen · · Score: 1

      it may be an idea to ocr it anyway purely for search purposes but i agree OCR is nowhere near reliable enough to justify throwing away scans.

      --
      note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register
  16. Google says search, don't sort! by jimbro2k · · Score: 2, Interesting

    Meaning that you don't necessarily need to organize the data, just be able to search it quickly.
    If you agree with that philosophy, then, after you have it all in ASCII, just do a full text index of the data (which makes sense if the data is rarely or never updated) and it is quick to pull out anything you need.
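
    As a rough illustration of what a full text index looks like, here is a minimal inverted-index sketch. The file names and sample text are hypothetical stand-ins for OCR output, and a real search tool would add stemming, ranking and tolerance for OCR errors.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    # Hypothetical OCR output keyed by file name.
    docs = {
        "land_0001.txt": "John Smith owned 12 acres in Norfolk County in 1728.",
        "marriage_0042.txt": "Moore family marriage record, Michigan, 1820.",
    }
    idx = build_index(docs)
    print(search(idx, "norfolk smith"))
```

    The index is built once and each query is just a few set intersections, which is why this works well for data that is rarely or never updated.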

    --
    There is not nearly enough love in the world, but there is far too much trust.
  17. Fire. by Chess_the_cat · · Score: 2, Funny

    And plenty of it!

    --
    Support the First Amendment. Read at -1
  18. File Organization by angst_ridden_hipster · · Score: 1

    For Genealogy data, the de-facto standard file format is GEDCOM. Originally created by the LDS Church (the Mormons), virtually any decent genealogy program will support the format.

    While it's a somewhat ugly text-based, flat-file format, it does permit organization of information in ways that will be useful to genealogists and researchers.

    --
    Eloi, Eloi, lema sabachtani?
    www.fogbound.net
    1. Re:File Organization by Chelloveck · · Score: 1
      While it's a somewhat ugly text-based, flat-file format, it does permit organization of information in ways that will be useful to genealogists and researchers.

      Even better, pretty much any genealogy software can read and write GEDCOM files. Think of GEDCOM as the CSV of the genealogy world. It even has rudimentary facilities for storing multimedia files in an encoding similar to Base64. I don't know if the various software packages commonly support it, though. (My wife's the family genealogist. I just occasionally do data conversion for her.)
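
      To give a feel for how simple the text format is, here is a small sketch that splits GEDCOM lines into their parts. Each line is a level number, an optional @xref@ identifier, a tag, and an optional value. The sample record is made up for illustration, and this does not handle the full specification (CONT/CONC continuation lines, character sets and so on).

```python
import re

# level, optional @xref@, tag, optional value
GEDCOM_LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


def parse_gedcom_line(line):
    """Split one GEDCOM line into (level, xref, tag, value)."""
    match = GEDCOM_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""


# A made-up individual record, for illustration only.
SAMPLE = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1728
2 PLAC Norfolk County
"""

if __name__ == "__main__":
    for raw in SAMPLE.splitlines():
        print(parse_gedcom_line(raw))
```

      The level numbers carry the nesting: the DATE and PLAC lines at level 2 belong to the BIRT event at level 1, which belongs to the individual at level 0.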

      --
      Chelloveck
      I give up on debugging. From now on, SIGSEGV is a feature.
    2. Re:File Organization by angst_ridden_hipster · · Score: 1

      I don't know if the various software packages commonly support it, though.

      I think many of the recent ones do. (Of course, there are still the products that don't really support GEDCOM at all, but they're the minority.)

      Then there are a lot of hybrid solutions (like used by phpGedView, for example) which will support only "external" media.

      I suspect that, eventually, they will all support the embedded media.

      --
      Eloi, Eloi, lema sabachtani?
      www.fogbound.net
  19. Cyndi's List by Gryphn · · Score: 1

    http://cyndislist.com/ is absolutely the best place to start looking for anything related to genealogy.

    --
    Fantasy and superstition should be used for entertainment purposes only.
  20. Familysearch = Mormons by Anonymous Coward · · Score: 0

    Familysearch is run by the Mormons. In the world of genealogy you can't get away from working with our resources, we have the best resources available, so any serious genealogist has to get over their anti-Mormon xenophobia. We will be glad to help anybody with their genealogical needs, whatever their intentions are.

  21. Here's what I do by malachid69 · · Score: 1

    Although I am sure that there are better models now, this is the scanner I am still using:
    http://avdcs.com/prod/cards/fb1200.html

    Although you won't usually print anything at higher than 300 (ink jet) or 600 (laser) dpi, for archival purposes I recommend going ahead and scanning at full-color 1200dpi from within Photoshop or something.

    I would make sure to back up all your originals with a DVD DL (Dual Layer) drive, as you can now get the drives for about $50 and they can use CDRW, DVD+/-RW or DVD DL disks as your space/income allows.

    Storage really should be done inside some kind of database, even if it is a file-oriented database, instead of you trying to manage the hierarchy directly. This would also likely add the benefit of making searching and indexing easier.
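
    A minimal sketch of that idea, using SQLite from the Python standard library. The table layout, column names and file paths are assumptions for illustration, not a recommendation of any particular product; the scanned images stay on disk and the database only records where they are and how they are tagged.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY,
            image_path TEXT NOT NULL UNIQUE,
            doc_type TEXT,
            ocr_text TEXT
        );
        CREATE TABLE IF NOT EXISTS tags (
            document_id INTEGER NOT NULL REFERENCES documents(id),
            tag TEXT NOT NULL,
            PRIMARY KEY (document_id, tag)
        );
    """)
    return conn


def add_document(conn, image_path, doc_type, ocr_text, tags=()):
    """Record one scanned page and its tags; returns the new document id."""
    cur = conn.execute(
        "INSERT INTO documents (image_path, doc_type, ocr_text) VALUES (?, ?, ?)",
        (image_path, doc_type, ocr_text),
    )
    doc_id = cur.lastrowid
    # Tags are stored lowercase; duplicates after lowercasing are ignored.
    conn.executemany(
        "INSERT OR IGNORE INTO tags (document_id, tag) VALUES (?, ?)",
        [(doc_id, t.lower()) for t in tags],
    )
    conn.commit()
    return doc_id


def find_by_tag(conn, tag):
    """Return the image paths of all documents carrying the given tag."""
    rows = conn.execute(
        "SELECT d.image_path FROM documents d "
        "JOIN tags t ON t.document_id = d.id "
        "WHERE t.tag = ? ORDER BY d.image_path",
        (tag.lower(),),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_archive()
    add_document(conn, "scans/binder1/0001.tif", "land",
                 "John Smith owned 12 acres in Norfolk County in 1728.",
                 tags=["Smith", "Norfolk County"])
    print(find_by_tag(conn, "smith"))
```

    Because tags live in their own table, loose pages can be entered untagged now and classified later without touching the image files.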

    --
    http://www.google.com/profiles/malachid
  22. First things first by GCP · · Score: 2, Informative

    Others are suggesting answers to your scanning needs, and your mother may have already done this, but just in case:

    The most important thing you can do with a pile of data is turn it into useful information. I know that's what you're trying to do. In the case of genealogical data, the most important is not just an electronic version of paper sources, but something built from those sources: an electronic family tree with facts (birth date/place, death date/place/cause, graduations, marriage date/place, immigration, military service, etc.) attached to each person and SOURCES attached to each fact, and all of it exported to GEDCOM format.

    For each source, good software will let you include transcripts (not scans but transcribed excerpts that aren't too big) along with the bibliographic reference that tells others where they could find the information for themselves.

    I probably have as much of this sort of data as you have, but most of it has been "processed" by extraction. That leaves the rest as backup that I seldom need to refer to. Since 99% of the information (by VALUE, not by bits) has been extracted into a very useful form, the rest is referred to so rarely that paper is just fine (as long as there is another copy stored elsewhere).

    I'm not saying that having electronic copies of all of the original paper sources wouldn't be better. It would, but mostly for ease of backup. It's just that the most valuable thing to do (in my opinion) is the extraction and assembly of a really rich information structure (family tree, facts, sources, transcriptions of sections of sources) in a standard interchange format (GEDCOM).
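
    To make the "person, facts, sources" structure concrete, here is a minimal sketch in Python that writes one person, one birth fact, and one cited source as GEDCOM-style lines. The function name, the sample values, and the "EXAMPLE" program name are all made up for illustration, and the header is simplified; a strict validator may want more header records (a submitter, for instance), so treat this as a rough picture rather than a complete export.

```python
def gedcom_person_with_source(given, surname, birth_date, birth_place,
                              source_title, page, transcript):
    """Return GEDCOM-style lines for one person, one fact, and one source.

    Simplified sketch: real genealogy software writes a fuller header
    and many more records.
    """
    return [
        "0 HEAD",
        "1 SOUR EXAMPLE",              # hypothetical name of the exporting program
        "1 GEDC",
        "2 VERS 5.5.1",
        "2 FORM LINEAGE-LINKED",
        "1 CHAR UTF-8",
        "0 @I1@ INDI",                 # the person
        f"1 NAME {given} /{surname}/",
        "1 BIRT",                      # a fact attached to the person
        f"2 DATE {birth_date}",
        f"2 PLAC {birth_place}",
        "2 SOUR @S1@",                 # the fact cites source S1
        f"3 PAGE {page}",              # where in the source to look
        "3 DATA",
        f"4 TEXT {transcript}",        # short transcribed excerpt, not a scan
        "0 @S1@ SOUR",                 # the source record itself
        f"1 TITL {source_title}",
        "0 TRLR",
    ]


if __name__ == "__main__":
    lines = gedcom_person_with_source(
        "John", "Smith", "1 JAN 1900", "Springfield",
        "County birth register (hypothetical)", "p. 12",
        "John, son of William Smith, born 1 Jan 1900",
    )
    print("\n".join(lines))
```

    The point of the layout is that the citation hangs off the fact rather than off the person, so anyone reading the file can see which document supports which claim.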

    After doing so, the additional value in having electronic, searchable copies of the original documents gets much smaller, so if I were you I'd make sure the information extraction project was done first before embarking on the scan/OCR/indexing project. After you do the former, you may decide that the latter isn't worth the effort.

    --
    "Those who have never entered upon scientific pursuits know not a tithe of the poetry by which they are surrounded."
    1. Re:First things first by fw_dude · · Score: 1

      The one problem with GEDCOM is it does not currently support lots of notes and other extraneous information. My father uses a program called Legacy Family Tree. (http://www.legacyfamilytree.com/)

      This allows you to attach notes to each individual, and thus as much extra information as you want. You can also attach images, scanned documents, pictures, etc.

      The program is for Windows, but is very inexpensive (~$20.00).

      This would allow you to organize all the information with all relations and family tree information as well.

      Legacy does support import/export with the GEDCOM file format as well.

  23. GEDCOM by D.A.+Zollinger · · Score: 2, Informative

    A lot of what you are looking for has already been figured out by others who do not want to be locked into a single format, vendor, or system. There are many genealogy programs out there, such as Gramps or Brothers Keeper, that use a standardized file format, GEDCOM, which allows information stored in those specific programs to be transferred to other programs. This allows for easy upgrades in software, as well as the possibility of moving from one package to another, as the information can be archived to the GEDCOM file and then read in again once the new software has been installed.

    As for hardware and other software, I would suggest you use what is familiar to you, or is compatible to the software package you have chosen.

    --
    I haven't lost my mind!
    It is backed up on disk...somewhere...
  24. Put it on the web. by LWATCDR · · Score: 1

    http://www.familysearch.org/
    It is run by the Church of Jesus Christ of Latter-Day Saints. Future access will pretty much be a given and it will help others with their family history.

    --
    See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
  25. GEDCOM by sydbarrett74 · · Score: 1

    Make sure this wealth of information ends up in GEDCOM, which is pretty much the de facto standard for exchanging genealogical data electronically. Any genealogy software worth its salt can read from and write to this format.

    --
    'He who has to break a thing to find out what it is, has left the path of wisdom.' -- Gandalf to Saruman
  26. more specs for consideration by Anonymous Coward · · Score: 0

    I am trying to scan in:

    a. large format 12 inch by 14 inch newspaper.
    b. 7 point text
    c. Yellowed newsprint background (leads to fuzziness around letters as well as extra distortion in jpeg)
    d. Printed on both sides

    I scan in 600 dpi black and white with a custom threshold (to compensate for the yellow background).

    The main problems are:
    1. A 600dpi scan takes about 2 minutes per scan
    2. Each page is larger than the scanner and therefore requires 2 scans
    3. The newspaper cannot be sheet fed because it would easily tear (even if it could be sheet fed)

    Alternatives I've experimented with:
    1. Lower resolution scans. The image comes out decent quality but OCR quality goes down significantly (from 95% of words correct down to 80% of words correct - this is a big deal considering that most of the incorrect words are single characters like '1/4', '1/2', '2/3', and that they are much more important than other words in the article).

    2. Document copy stand (a tripod mounted camera with dedicated external lighting - no flash, fixed focal length, connected to a computer).

    I tried with a 3 megapixel camera (2100 x 1500 resolution) or about 7 inches by 5 inches at 300dpi.

    The true resolution of the camera for a flat page of text is much lower than that of the scanner. The camera distorts the image via jpeg compression, lens distortion, and internal processing. This means that you have to take about 50% more dpi to get an image close to what the scanner generates.

    This means that the digital camera needs to be shot at 400 to 500 dpi to be effective. This lowers the page size of a 3 megapixel camera from 7 inches by 5 inches down to 5 by 3.5 inches.

    A higher resolution camera that approaches 300 dpi is well beyond my budget. 8mp is about 240dpi and $1000+
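
    The coverage arithmetic above can be sketched in a few lines of Python. The function names are mine, and the shot count assumes no overlap between frames, which is optimistic if the tiles have to be stitched afterwards.

```python
import math


def coverage_inches(width_px, height_px, dpi):
    """Area of paper one frame covers at a given dots-per-inch."""
    return width_px / dpi, height_px / dpi


def shots_needed(page_w_in, page_h_in, width_px, height_px, dpi):
    """Frames needed to tile a page, assuming no overlap between frames."""
    frame_w, frame_h = coverage_inches(width_px, height_px, dpi)
    return math.ceil(page_w_in / frame_w) * math.ceil(page_h_in / frame_h)


if __name__ == "__main__":
    # The 3 megapixel camera from the post: 2100 x 1500 pixels.
    for dpi in (300, 400, 500):
        w, h = coverage_inches(2100, 1500, dpi)
        print(f"{dpi} dpi: {w:.2f} x {h:.2f} inches per shot")
```

    At 300 dpi this gives 7.00 x 5.00 inches per shot, at 400 dpi 5.25 x 3.75, and at 500 dpi 4.20 x 3.00, which brackets the "5 by 3.5 inches" figure quoted above.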

    1. Re:more specs for consideration by McSnarf · · Score: 1

      Yellowed paper can be scanned even with automated document feeders. The (sorry, have to put a product here, but will use one no longer on sale :) ) Sharp's AR-M350 for instance will gently pull the original and scan both sides at once, without bending. (A MFP that some people buy for that feature :) )
      While scanning is a very profitable thing to do professionally, it might still pay to shop around for a professional scanning service. If you do the indexing and preparation yourself, cost will be significantly less.

    2. Re:more specs for consideration by tonsofpcs · · Score: 1

      If you go the opto-digital imaging device route [what you call a camera], and it is directly connected to a PC, why use JPEG? You can use RAW data and download it directly with most newer 'digital cameras'. Also, cross polarize. Set a polarizer on the camera and one on an off-'camera' light at 90 degrees to it.

  27. Have it photographed by CXI · · Score: 1

    If you care about this paperwork at all, have it photographed to 16mm or 35mm roll film. Some companies will generate a digital version as they create the film, or you can pay someone to scan the film afterwards. Film is the only reliable long term storage media that allows you to reduce storage space requirements. It also does not suffer from "tech rot" as all you need is a light and a lens to view it. From there you can build your digital collection safe in the knowledge that if you keep your film in a fireproof, waterproof box it will outlive you.

  28. DOCUMENT Scanners by Noksagt · · Score: 1

    I know you asked for flatbed scanners, but if you are seriously going to feed in 10,000s of pages, an actual document scanner would be better. Even the document feeders available on many flatbeds are suboptimal--they are slow and jam all of the time. Scanners which are meant to be used by workgroups to actually scan in documents typically don't curl the paper over itself, so are both quicker & jam less. But they are more expensive.

    I am extremely happy with my Canon DR-2080C. Note: It is the only piece of hardware I've bought, knowing that it won't work with Linux. I ran Windows SPECIFICALLY to use this document scanner. It looks like it has been discontinued & the DR-2050C is the model to get now. Looks like it does larger documents, which is nice. These do duplex scans in one pass, so you can get about 40 sides (so 20 2-sided pages) per minute. These will probably set you back ~$600 new.

    If you have more money to spend, there are even better document scanners available.

    Go ahead and spend a hundred or more on a good quality flatbed for anything that needs better than 600dpi or is awkwardly sized/bound, but a document scanner will save SO much time.

  29. File Archival by Noksagt · · Score: 1

    Archive it the same way most organizations still do: RAIDed (possibly networked) hard disks with (offsite) backup to tape. Probably still the cheapest & one of the most dependable ways to store all of this.

    As for organization, just come up with a simple, sane way to name things (like /smith/binder1/page1.tiff and /loosepage/page3.pdf; TIFF and PDF being two damn-fine formats for you to use, TIFF for lossless archives & PDF for the graphics+OCR-ed text) & rely on your indexing to actually make this more usable (knowing that if indexing ever fails, you will be no worse off than you are now until you rebuild the index).
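
    A naming scheme like that is easy to enforce with a tiny helper, so every scan lands in a predictable place. This is only a sketch: the function name is made up, and the zero-padded page numbers are my own addition (so files sort correctly), not part of the scheme described above.

```python
from pathlib import PurePosixPath


def archive_path(page, family=None, binder=None, ext="tiff"):
    """Build a predictable relative path for one scanned page.

    Pages that belong to no family binder go under 'loosepage'.
    Zero-padding the page number is an assumption made here so that
    page0002 sorts before page0010.
    """
    name = f"page{page:04d}.{ext}"
    if family is None:
        return PurePosixPath("loosepage") / name
    return PurePosixPath(family.lower()) / f"binder{binder}" / name


if __name__ == "__main__":
    print(archive_path(1, family="Smith", binder=1))   # smith/binder1/page0001.tiff
    print(archive_path(3, ext="pdf"))                  # loosepage/page0003.pdf
```

    Because the path itself carries the family and binder, the collection stays browsable even with no index at all.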

    I would personally write a minimalist webapp to navigate through everything, but I'm sure there's already relatively good software to do this with genealogical data.

  30. Don't, unless you have the same goals by digitect · · Score: 1

    I think the best way for you to use this information is to first develop an interest in genealogy yourself. Genealogy is an art as much as a science. What you see as only mountains of raw data the genealogist sees as potential relationships between any one piece and any other. Some examples:

    Census Forms. If you come across a series of census forms, how are you going to organize them? Are they by year? By location? By primary line? By associated families? (Which may be the only connection between the primary line.) Will you collect census record images with transcriptions?

    Photographs. Should these be organized by date or by content? Which family dictates the primary? What about wedding photos where two or more families appear together? At what point are images of a son no longer associated with his parents? When he has a family? What about images of four generations of extended family at an event? What about pictures of houses, horses, trees, or fields? What about a booklet of photographs taken on the same roll of film, isn't it more valuable to have them in this series so you can connect all the people within?

    Existing Genealogies. I have numerous genealogical references for the same family. They don't always match. Which one is correct? What if both have known errors that the other gets correct? What about how they fork into non-duplicate portions?

    Cemetery Records. Are these organized by place or by family? Should images of the tombstones be included with them or somewhere else? Should a century-old record be accepted as is or should the known errors be fixed first? What about eight editions of the same record, is the latest always correct?

    The best description I have ever read about genealogy is that it is like a court case in which you are trying to assemble facts and proof to make an argument. The only problem is that the same data can be used in several cases, and each individual is its own case. Data proving one fact may disprove another. It is the entire assembly, through the skill of its interpreter, that gives the structure real usefulness.

    In your case, your mother probably had stacks of information about certain topics. You are likely to come along and re-organize it, not according to the cases she was trying to prove, but some arbitrary method. I know if I died, nobody on the planet would be able to make sense of the folders and files I have, since it is organized primarily by my interests, my cases, my explorations, some mostly finished, some barely started. It is basically impossible to organize physical data in many-to-many relationships. But what would honor my efforts most would be for my heirs not to come along behind and "organize" it, but to embrace what I was trying to do and extend it. Re-organization to further the research would be a blessing, de-organization into tiny un-connected pieces of data is the very thing the genealogist is working against.

    The last important idea I can offer is that your mother's information isn't as important to another genealogist as it was to her. Research is always based on primary data, the actual physical record or statement like a birth certificate, family Bible, or tombstone. Good researchers are always attempting to connect primary sources. Swimming through someone else's secondary data is helpful only to the point that it directs one to the instigator. I usually ignore others' works these days unless they are published in book form. There are so many errors, false trails, mis-statements and lies that it is generally not as worth the time to back-check someone's statements as it would be to simply find the source yourself.

    If you are somehow able to perfectly scan and digitally abstract all this data, create the perfect database that connects it all together, and automate a little robot to go retrieve the physical item on command, you will have something to sell. Until then, you are doomed to swim in mountains of information like the rest of us, occasionally noting that one item bears a familiar handwriting resemblance to another seemingly unconnected document. ;)

    --
    There is no need to use a SlashDot sig for SEO...
  31. Hot on the heels... by thegrassyknowl · · Score: 1
    into a format that can be stored digitally

    Hot on the heels of this... I believe the linked thread resolved that digital formats are doomed to fail because they are unreliable over long term (ie > 5 or 10 years). Anywho, the thread might help you with the storage and backup aspect :)

    --
    I drink to make other people interesting!
    1. Re:Hot on the heels... by thegrassyknowl · · Score: 1

      *grrrr* Forgot the link:

      LINK

      --
      I drink to make other people interesting!
    2. Re:Hot on the heels... by Chuq · · Score: 1

      You're confusing media and format. A particular piece of digital storage media (eg. a hard disk, or a DVD) may not be reliable over 5 years. That's why most people are smart enough to keep copies of valuable data on more than one physical piece of media. (ie. multiple DVD backups, or IDE hard disks in a RAID formation).

      Once a newer/better/more reliable storage medium comes around, it takes an hour or two (usually) to transfer the data to the new media.
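
      Whenever you copy the archive to new media, it is worth confirming that the copy is bit-for-bit identical before trusting it. A minimal sketch, using only Python's standard hashlib; the function names are made up for illustration:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(original) == sha256_of(copy)
```

      Storing the hashes in a plain text file next to the scans also lets you re-check an old disk years later without needing a second copy at hand.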

      Formats however are fine. GIFs, TIFFs and JPGs have been around 15-25 years (not all that sure myself) and if they ever get superseded, we can be sure there will be a basic conversion program. Plain text files will be readable forever. GEDCOM has been a standard for a while and again, if updated, older files will be readable or convertible for some time to come.

      --
      - Chuq
  32. Your sig is beautifully ironic. nt by Anonymous Coward · · Score: 0

    .eet nE

  33. Proof oriented information gathering by Frans+Faase · · Score: 1
    The best description I have ever read about genealogy is that it is like a court case in which you are trying to assemble facts and proof to make an argument.

    This is very true, but there are few genealogists who work according to this principle, and even fewer software packages that support this way of working. It seems that most programs are only about recording the conclusions of your research.

    Even the GEDCOM format is not really designed according to these principles. A good genealogy program should start by describing documents and sources, and derive facts from them. A birth certificate and a marriage certificate that mention a person with the same name and birthdate do not prove that these are the same person, especially if they are from a large city, like New York. Yet, in most cases, you would make such an assumption. That means that every combination of "facts" should be recorded, and maybe a certainty factor should be added. If the person was born in a small village, the chance that there are two persons with the same name and birthdate is much lower.
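
    As a rough illustration of what such a program might store, here is a small Python sketch in which the documents are kept as they are and the "same person" conclusion is a separate record with a certainty attached. Everything in it is an assumption for the sake of the example: the class names, the list of large places, and especially the 0.6 and 0.9 values, which are arbitrary placeholders and not a calibrated model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str      # which document this came from
    name: str
    birthdate: str
    place: str


@dataclass(frozen=True)
class IdentityClaim:
    first: Record
    second: Record
    certainty: float  # 0.0 to 1.0, illustrative only
    reason: str


# Hypothetical list; a real program would need population data.
LARGE_PLACES = frozenset({"New York"})


def claim_same_person(first, second):
    """Return a hedged claim that two records describe one person, or None."""
    if first.name != second.name or first.birthdate != second.birthdate:
        return None
    crowded = first.place in LARGE_PLACES or second.place in LARGE_PLACES
    certainty = 0.6 if crowded else 0.9   # placeholder numbers
    reason = "same name and birthdate" + (
        ", but in a large city" if crowded else ", in a small place"
    )
    return IdentityClaim(first, second, certainty, reason)
```

    Keeping the claim separate from the records means a later researcher can reject the link without losing either document.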

    1. Re:Proof oriented information gathering by digitect · · Score: 1

      Very interesting thought. I occasionally pass along thoughts to the developers of GRAMPS. I wonder how an existing/traditional app could be modified towards this better methodology without stranding any of the traditional users.

      --
      There is no need to use a SlashDot sig for SEO...
  34. Digital preservation by mattpalmer1086 · · Score: 1

    Digital preservation is a pretty hard task, mostly because we haven't yet collectively acknowledged that our society now relies on a digital memory that is extremely fragile.

    Given that this problem is quite hard to solve in the long term (although much easier if just for the short term), it would probably be better to donate the material to an organisation with the resources and longevity to secure it.

    Over the long term, you will have to migrate storage media every few years. You will also have to migrate file formats as software and standards become obsolete, unless you want to try emulation as a digital preservation technique, although most organisations in the field are going down the format migration route.

    As far as document formats go, OCRing to PDF, or OpenOffice might be your best bet, as these formats are widely readable.

    You could check out lizardtech's wavelet document format (www.lizardtech.com). It produces very small file sizes (e.g. 50Kb from a huge scan), has built-in indexing of text, and even has an open source toolkit, although the open source version doesn't do the OCR indexing.

    The Japanese archives are using this format to archive many of their documents, and we have explored it at the UK National Archives. The downside, of course, is it's not a very widely used format, so tool support will be patchy, but if you can roll your own solution, it may be perfect. 10,000 documents at 50Kb each - only 500Mb.

  35. GEDCOM by N8F8 · · Score: 1

    GEDCOM is the standard file format for genealogy information. There are plenty of products that import or use this format. GEDCOM was developed by the LDS Church (Mormons). They also have a free program to manage your genealogical data called Personal Ancestral File. Personally I use PHPGEDView to manage my family genealogy data (view here)

    --
    "God fights on the side with the best artillery." - Napoleon, Marshal of France - speaking truth to power
  36. You need Document Management by TakeArms · · Score: 1

    I have struggled with this same issue with having lots of (physical and digital) documents & photographs, and I want a way to store and track it. I have multiple "buckets" of info to store, from genealogy to projects to general tech dump areas.

    I use ScanSoft's (www.scansoft.com) PaperPort ($100), which is about as cheap as you can get for document management on a decent level... it has basic OCR software built-in, although they sell OmniPage for about $150 (I think), for better OCR capabilities.

    There are other companies that sell big iron solutions, like www.isysusa.com and www.onbase.com, but they are not cheap.

    To be useful as a research resource, what you should expect to get out of a decent document management system is the ability to query by keywords which are statically tagged to the files, index ability of filename and contents (OCR'ed text contents a big plus), and the ability to categorize everything.
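
    Those three abilities (static tags, filename indexing, and OCR'ed contents) are simple enough to sketch in Python if you want to see what a document management system is doing under the hood. This is a toy in-memory index with made-up names, not how PaperPort or any other product works internally.

```python
import re
from collections import defaultdict


class DocumentIndex:
    """Toy index: search by word (from filename or OCR text) and by tag."""

    def __init__(self):
        self._words = defaultdict(set)   # word -> set of paths
        self._tags = defaultdict(set)    # tag  -> set of paths

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, path, ocr_text="", tags=()):
        for word in self._tokens(path) + self._tokens(ocr_text):
            self._words[word].add(path)
        for tag in tags:
            self._tags[tag.lower()].add(path)

    def search(self, word=None, tag=None):
        """Return sorted paths matching the word and/or the tag."""
        results = None
        if word is not None:
            results = set(self._words.get(word.lower(), set()))
        if tag is not None:
            tagged = self._tags.get(tag.lower(), set())
            results = set(tagged) if results is None else results & tagged
        return sorted(results or [])


if __name__ == "__main__":
    index = DocumentIndex()
    index.add("smith/binder1/page1.tiff",
              ocr_text="Marriage of John Smith and Mary Jones",
              tags=["marriage"])
    index.add("loosepage/page3.pdf",
              ocr_text="Land deed, Jones farm",
              tags=["land"])
    print(index.search(word="jones"))
    print(index.search(word="jones", tag="land"))
```

    Since OCR will miss words, the hand-applied tags are what keep a badly scanned page findable.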

    It really depends on what level of querying you want, how fast and well you want it OCR'ed, and how much you want to pay.

    Good luck!

    TakeArms

  37. Digital preservation issues, GEDCOM, etc. by TheLoneGundam · · Score: 1

    Many have pointed out that digital formats can become obsolete - the main thing to remember is that digital data can MORE EASILY be moved to new media, file formats, whatever. So the data doesn't become obsolete, at all (for example, old data on Apple II => csv => Access => MySQL is technologically trivial).

    GEDCOM is a good format to use for the genealogical data, IF you have the time to index it that way in some program and output to GEDCOM. If not, just preserve it for some family member who wants to do the genealogy part. I can recommend Family Tree for Windows as a reasonably cheap genealogy program that also supports GEDCOM, and often comes in bundles that include lots of government records that help the genealogist.

    As for TIFF/non-TIFF issues... the important part there is to make sure the image is readable, and hopefully readable when you zoom in - so that means the more resolution the better. I don't do genealogy but my wife's family does, and I can't tell you how many times they hold a magnifying glass up to some old picture because they're nosy about what's in it (like - is that the December 1942 issue of Life Magazine ol' Uncle Duffy is holding? That woulda been just after cousin Maybelle was born, ah reckon). So, image quality is important to the historian/genealogist coming behind you.