Domain: fromoldbooks.org
Stories and comments across the archive that link to fromoldbooks.org.
Comments · 24
-
Re:17th Century?
The images are better than average for Project Gutenberg. On my own site I generally scan at 2400dpi, http://www.fromoldbooks.org/ - although people have to ask me for the high resolution images. For one thing, a 2 gigabyte image can crash people's Web browsers :-)
Project Gutenberg has always been really sloppy with metadata - identifying exactly which edition of a work was transcribed (and which impression), describing its physical characteristics and so forth. They seem to be improving a little, slowly.
Google Books on the other hand has always been really bad with images and with the OCR. For some books I've had some luck making a "majority edition" by taking the text when Google scanned the same book multiple times. It turns out to be almost impossible to do that with images, unfortunately.
As I understand it, Google's method of scanning books also means fold-out or large-size illustrations tend to get lost altogether.
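The "majority edition" idea mentioned above can be sketched in a few lines. This is only an illustration, not the method actually used on the site: it assumes the transcriptions have already been aligned word for word (the hard part with real OCR), and the function name and sample strings are made up.

```python
from collections import Counter

def majority_text(witnesses):
    """Pick the most common token at each position across OCR witnesses.

    Assumption: every witness is already aligned, so token i means the
    same word in each transcription. Real OCR output needs an alignment
    step first, which this sketch does not attempt.
    """
    token_lists = [w.split() for w in witnesses]
    length = len(token_lists[0])
    if any(len(tokens) != length for tokens in token_lists):
        raise ValueError("witnesses must be aligned to the same token count")
    voted = []
    for column in zip(*token_lists):
        # most_common(1) gives the top token; ties go to the first one seen
        voted.append(Counter(column).most_common(1)[0][0])
    return " ".join(voted)

# Hypothetical readings of the same line from three separate scans
scans = [
    "was born at Great Hadbam in Hertfordshire",
    "was born at Great Hadham in Hertfordshire",
    "was bom at Great Hadham in Hertfordshire",
]
print(majority_text(scans))
```

Each scan here has one error, but no two scans share the same error, so the vote recovers a clean line. The same trick does not carry over to images, as noted above.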
-
Re:These works were written between 40 - 60 years
"should" - either take it up with your representative (congress if you're in the US) or be aware that civil disobedience carries penalties.
At least some of these works are in fact in circulation, by the way. See the original article; there are stories that were first published in magazines and then in books.
60 years isn't actually very long as copyright laws go (sadly) - when I'm researching images for my Web site, http://www.fromoldbooks.org/, I frequently find images over 100 years old that are still in copyright. Sometimes even older.
As for "lost to the world," well, I agree, but note that there are "dark archives" (e.g. at the Library of Congress in the USA) where items are held until such time as copyright expires.
A difficulty with copyright law is that it's the publishers who make the money, and hence have the most representation at governmental levels. I'd guess that with wider representation, copyright terms could be simplified and shortened. However, in the US, you also have to remember the Disney Laws. Protectionism and corruption.
-
Re:We used to call them "Service Bureaus"
There's no reason you couldn't do this with a book from the 1830s; on http://words.fromoldbooks.org/ I have text from 18th century books that have been scanned like this, and on http://www.fromoldbooks.org/ some considerably older books.
It turns out that there are interesting old books too, you'll be pleased to know, although the further back you go, the more likely you are to find a book in Latin. Well, until you get far enough back that scrolls are common, and then Greek and Hebrew/Aramaic are common :D
In an antiquarian book fair I was once offered Etruscan tablets, probably the most seriously antiquarian "books" I've seen for sale :-)
-
Simplify the Law
I run a Website for images (mostly) and text scanned from old books. When Google books started I thought at first I could just give up, but it turns out that the quality is so low for Google books that http://www.fromoldbooks.org/ and other sites like it continue to perform a valuable service.
I have had to spend a lot of time researching copyright law. I started out believing Wikipedia, hah! And there are tons of Web sites with myths about copyright, e.g. that anything published before 1923 anywhere in the world is out of copyright in the US. Did you know that the UK copyright act has an exception specifically for works created in a hovercraft? Or that anonymous works have different copyright terms than ones that are credited, but e.g. if the name of a photographer becomes known (or knowable through any public means) after publication, it gets the longer term? And there's no central registry.
We're all getting screwed out of our heritage when a private corporation can control the world's library. To stop this, copyright law must be made simpler, and there must be online searchable registries. Copyright must eventually be harmonized between all countries, since digital information knows no borders. But it must be harmonized in such a way that some currently copyrighted works fall out of copyright, and as few as possible works that are out of copyright are placed back into copyright. The difficulty is that in corrupt regimes like the US, companies can pay politicians for their election campaigns, and hence special interests dominate politics. And I have no idea how to end that corruption, of course.
-
What about the Quality?
The quality of scans from Google Books seems very low to me; much lower than I'd use on my own Web site, http://www.fromoldbooks.org/ - it's not uncommon for pages to be missed, and in one 19th century mechanics textbook I was looking at, the low scan resolution meant that most of the line drawings and diagrams vanished entirely.
It's obvious to me that the Google work will need to be done again by people who care about the content. Note, by the way, that most flat-bed scanners destroy the binding of books, although some people are now using e.g. a Canon 5D full-frame-sensor SLR camera instead. For illustrations, some simple mathematics (and experience) shows 600dpi to be an absolute minimum for a scan of an engraving, with fine steel engravings needing at least 1200dpi (I used 2400) in order to prevent the lines from being aliased into a blotchy gray. This is much higher than the Canon SLR gives, but Betterlight have a 500 megapixel backend to a medium-format professional camera that would give enough resolution for a good digital facsimile, and e.g. the University of Wisconsin uses that sort of equipment. But it's much slower and hence more costly, and the files are huge.
Here's a fragment of text from a Google book I've been working with:
ALLEN ^Anthony), an English lawyer and antiquaiy,
was born at Great Hadbam in Hertfordshire, about the end
of the seventeenth century, and was edu<?^ted at ffton;
whence he went to King's college, Cambridge, and took
his bachelor's degree in 1707, and his master's in 1711.
He afterwards studied law, was ciiJI^d.to' the bar, and bjr
the influence of Arthur OnsloW^ speaker of the house of
commons, became a roaster in chancery. His reputation
as a lawyer was inconsiderable, Jbiut he was Esteemed a good
classical scholar, and a man of Wit: and -convivial habits.
The version I have at http://words.fromoldbooks.org/Chalmers-Biography/a/allen-anthony.html (I am still working on these) is based on scanning done at the University of Toronto, combined with four other digitisations, including two apparently independent ones by Google, both of the quality demonstrated here.
It might turn out that it would have been less work to have scanned this 32-volume encyclopedia myself (I have a copy) and do the OCR with commercial software that works 1,000 times better than Google's, but, for reprints, the important thing is the quality of the scanned images, not the OCR - and there too, the Google scans are really sucky.
-
Re:Lame
Actually, I just read the article. It looks like similar results could be gotten if the photos of people's faces were bump-mapped on top of a wire-frame face modeled from the Da Vinci notebooks ... but I suppose using a 2D "warp field" sounds much cooler. I wonder if any of the Eigenfaces are in the "beautiful" set.
-
another focus
In my spare time I scan old books, put the pictures online, and sometimes also make XML transcriptions, e.g. of the dictionaries of thieving slang.
I tend to use technologies from work too, but for me that makes work more interesting and more relevant to my life at the same time as making the spare time project move forward.
The site makes money from ads (a little) and I sell the high-res images on stock sites (although I also give them away free on request, or for the cost of shipping and so forth).
-
How I do it...
I run fromoldbooks.org, a Web site devoted to scanned pictures and text from old books -- some more than 500 years old.
I use an Epson Expression 10000XL flatbed scanner (A3+ size, approx 12x17.5", with colour calibration), Linux xsane and gimp, for most of the images, but this does involve damaging the binding of thicker books. I scan wood engravings usually at 2400dpi, but modern screened pictures at only 1200dpi or sometimes even lower. The idea that you only need to scan at twice your print resolution assumes (1) you know what printer you'll use 10 years from now, (2) that once you scale down by more than 50% there's no visible difference (false). For colour you will need to do some descreening, which will generally involve something like an 11 to 17 pixel radius gaussian blur followed by a sharpen.
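To get a feel for why files at these resolutions are so large, here is a rough back-of-envelope calculation. It is only a sketch: it uses the approximate 12x17.5" bed size and 2400dpi mentioned above, counts uncompressed bytes only (a saved PNG will usually be smaller), and the function name is made up.

```python
def scan_size(width_in, height_in, dpi, bytes_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for a flatbed scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel

# Full bed at 2400dpi; 1 byte/pixel = 8-bit greyscale, 3 = 24-bit colour
for label, bpp in (("greyscale", 1), ("colour", 3)):
    w, h, size = scan_size(12, 17.5, 2400, bpp)
    print(f"{label}: {w} x {h} pixels, about {size / 1e9:.1f} GB uncompressed")
```

A full-bed scan at 2400dpi comes out at over a billion pixels, which is why a couple of large external drives fill up quickly.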
I also use a Canon 450D (Digital Rebel) camera on a tripod, with a 50mm f/1.8 lens (you can get the lens for around $75 to $100 in US or Canada, less if used) and a remote control; use the mirror lockup function of the camera and the remote to minimise camera shake. I point the camera at the open book.
In either case if there are significant amounts of text I then use ABBYY FineReader OCR; the open source OCR programs (and most of the other commercial programs) are a waste of time by comparison, or at least that was true 2 years ago when I was last researching this.
Go and buy a couple of large USB external disk drives, e.g. 500GBytes or more, and also write DVD backups frequently. Use a consistent naming scheme; I use a separate directory (folder) for each book or magazine, and I include the page number in the filename, together with -raw for the original scan and -cleaned for the processed version. I use PNG to save the files because it's lossless, an open standard, and widely supported; I'd suggest avoiding GIF (not enough colours), TIFF (portability problems) or JPEG (lossy).
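A naming scheme like that is easy to automate. The helper below is a hypothetical sketch, not the script actually used: the three-digit zero padding, the position of the -raw/-cleaned suffix, and the example book and page names are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

def scan_path(book, page, title, stage):
    """Build a path like <book>/<page>-<title>-<stage>.png.

    One directory per book, page number first so files sort in page
    order, and a suffix saying whether this is the raw or cleaned scan.
    """
    if stage not in ("raw", "cleaned"):
        raise ValueError("stage must be 'raw' or 'cleaned'")
    return PurePosixPath(book) / f"{page:03d}-{title}-{stage}.png"

print(scan_path("Ball-Sussex", 86, "Pevensey-Castle", "raw"))
print(scan_path("Ball-Sussex", 86, "Pevensey-Castle", "cleaned"))
```

Zero-padding the page number keeps a plain directory listing in page order, which helps when checking for missed pages.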
Obviously if you want to put the magazines on the Web you'll need permission; in my case I am usually digitising out-of-copyright books, although copyright laws have changed since I started, and also my understanding of copyright has changed. E.g. I started out believing Wikipedia :-)
It can be a big project, but a lot of fun!
-
Re:Funny
-
Re:What about my bandwidth?
the webmaster still gets credit for the 'view'
For many ads these days, the web publisher gets paid only for clicks.
With Google, the ratio of page views ("impressions" as they call them) to clicks determines the amount paid.
So if you download the ads from my Web site and don't ever click on them, you are actually lowering my revenue very slightly. It's better not to download ads if you aren't going to click on them.
I've worked hard (e.g. on http://www.fromoldbooks.org/ ) to try and have ads that are not too obtrusive, but when you're running an image Web site the bandwidth costs can be fairly high, and very few people will click on donate buttons.
Maybe it just means that some classes of Web site will go away, or will have to reinvent themselves, e.g. as "member-only pay to download images".
Liam
-
Quality as well as quantity, please
The books I've looked at have been scanned at a resolution that's more or less adequate for OCR, but isn't really adequate for reproducing fine woodcuts, and is hopeless at metal engravings. I've found from my work on fromoldbooks.org that anything less than 1200 dpi generally produces pretty poor results for images, so that, for example, you can't read the signatures of the artist and engraver, still less compare engraving styles. It would be sort of like having a paraphrase of the text instead of the actual words.
It does, of course, vary a lot depending on the style of image. Bold illustrations for children's books, for example, do better at, say, 800dpi greyscale or colour. Fine steel engravings with lines at, say, less than a tenth of a degree from horizontal (they were done by hand after all) and that come out only a couple of pixels wide even at 1200dpi just turn into gray mush with weird banding artefacts until you go to a higher resolution (I use 2400dpi). There's a widely-cited study indicating that an "ultra-high" scan resolution of 400dpi is more than sufficient, based on an extremely small sample of images.
The damage that's done by poor quality digitization is that it makes it harder to justify doing a better job in the future.
-
Please do a better job, not just a bigger job
Most of these people focus on English-language books printed in the 19th and early 20th centuries, because (1) it's usually easy to determine copyright status, and (2) if you go earlier you get the tall "s" (ſ in UTF-8) which no OCR program today seems able to handle, so the scanning cost is increased.
Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.
There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail, and indeed, compare for example the Alice in Wonderland pictures on my own site with the Project Gutenberg ones. You can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.
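The back-of-envelope Nyquist argument can be written down as a tiny calculation. This is a sketch under stated assumptions: it treats an engraving as evenly spaced lines and gaps, and the line widths used are made-up examples rather than measurements from any particular book.

```python
def nyquist_min_dpi(line_width_in, gap_width_in):
    """Lowest sampling rate (dots per inch) that can represent a pattern
    of lines and gaps at all: two samples per line-plus-gap period."""
    period_in = line_width_in + gap_width_in
    return 2.0 / period_in

# Hypothetical engraving lines with gaps of equal width
for width in (1 / 600, 1 / 1200):
    dpi = nyquist_min_dpi(width, width)
    print(f"lines {width:.5f} in wide need at least {dpi:.0f} dpi")
```

This gives only a floor. Lines that are not aligned with the pixel grid, and the wish to see the shape of a line rather than merely its presence, push the practical figure well above it, which fits the experience described above.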
Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site where people can submit corrections. Maybe use a wiki for the transcription?
Liam
-
Maybe we could host some?
I experimentally put a couple of pages of sheet music on fromoldbooks.org yesterday. I'm not sure how useful they are, but I'm contemplating adding a lot more out-of-copyright sheet music.
I'd be willing to host good quality scans from other people, too, but it has to be demonstrably out of copyright -- I'm not interested in "legal loopholes" here. I'd suggest using 1200dpi greyscale and then adjusting "curves" to make a clear, sharp image. Both the music and the typeset score must be out of copyright, as well as the lyrics. In the US and Canada this is generally easy to determine, but for music produced in other countries it can be arbitrarily difficult; anything printed before 1820 or so is pretty safe though.
This doesn't really help the original poster very much unless I happen to have some specific piece of music, of course!
-
I've tried this... stick with a scanner for now.
I routinely scan pages of old books (and other documents) for my Web site, from old books. I use an Epson Expression 10000 XL, which, as someone else noted, isn't cheap, but it does A3/11x17/tabloid at 2800dpi. At 400dpi grayscale it can scan a regular page in a few seconds.
I've also used a Casio Exilim camera to photograph pages.
The way that it's done for archival purposes is to have a mount that holds a book and also holds a medium-format camera about four feet away. To get good resolution for OCR you'll need something that's about an 11 megapixel camera or more, for a full page at (say) 7x10 inches of actual text. Hugin and ptstitcher and friends, the panorama tools, include software to correct for lens distortion. Phase One sells a camera mount (in Canada you can get it from Vistek), together with their 40 megapixel back end for a medium-format camera. Or you could make a suitable mount yourself. The trick is that it holds the book open half-way (or less, using mirrors) so that you don't get as much page distortion. Holding the book and the camera rock steady is absolutely necessary if you are photographing text.
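The megapixel figures above translate into an effective scanning resolution roughly as follows. This is an idealised sketch: it assumes the 7x10 inch text area fills the frame exactly, and it ignores aspect-ratio mismatch, lens softness and sensor interpolation, all of which lower the real figure. The function names are made up.

```python
import math

def effective_dpi(megapixels, page_width_in, page_height_in):
    """Approximate dots per inch when a sensor covers the page exactly."""
    pixels = megapixels * 1_000_000
    return math.sqrt(pixels / (page_width_in * page_height_in))

def megapixels_needed(dpi, page_width_in, page_height_in):
    """Sensor size needed to cover the page at the given resolution."""
    return (dpi * page_width_in) * (dpi * page_height_in) / 1_000_000

for mp in (11, 40):
    print(f"{mp} MP over 7x10 in: about {effective_dpi(mp, 7, 10):.0f} dpi")
print(f"1200 dpi over 7x10 in needs about {megapixels_needed(1200, 7, 10):.1f} MP")
```

On these assumptions an 11 megapixel camera lands close to 400dpi, which is enough for OCR, while even a 40 megapixel back falls well short of 1200dpi for the same page.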
For small items like a cheque (say), use a flatbed scanner, and scan at 400dpi grayscale. Project Gutenberg's guidelines are outdated (they use 300dpi black and white as I recall) and don't get such good results. If you go much higher than 400dpi, the OCR software starts having tantrums at you and the quality may actually degrade.
The best OCR software on the market today as far as I can tell is ABBYY FineReader. I tried several, and found this had, for example, at least two orders of magnitude fewer errors than the GNU OCR package. You should expect errors, though, especially in digits.
Frankly I'd go with a scanner just because they're designed for this application, and you have less hassle. Transferring images from the camera to the computer twenty minutes after taking the photo means you need to keep a separate log of where each photo came from, or you'll muddle them up. I save images with filenames like Ball-Sussex/086-Pevensey-Castle.png so that the page number is in the filename. And the image quality with even a low-end scanner is much higher than you can get in practice with a camera without an elaborate set-up, and it is more reliable: it comes out right every time regardless of lighting, camera settings, wobbly hands, etc.
Having said all that, I do photograph pages sometimes to make manual transcriptions. Afterwards I do careful proof-reading against the original.
Liam
-
What legacy does the domain name have?
Is there a holding page there now? If so, what Google pagerank does it have? If it has a pagerank of 0, and does not show up in Google searches for text that's on the page, don't go near it: if it's in what Google calls a "bad neighbourhood", they won't list it or let you use AdSense ads on it until you've demonstrated in some way to them that it has changed. E.g. perhaps if it had porn ads on it, or used to be part of a "black-hat SEO link ring".
On the other hand, if it has pagerank of, say, 5 or more, or if there's some reason why you're set on that name, go ahead. Yes, once you get a few decent incoming links you'll get even a new domain to have good search results, especially with a simple (but syntactically correct) "home page" that doesn't rely on Flash, and has some content. Buying the domain name is either getting you a tiny jump start, or is getting something easy for people to remember.
If $1500 is a lot of money to you, offer them $500 and stop buying those expensive shoes :-) A startup is unlikely to get far in under two years (on average in most Western countries), so you need to have funds to go without other income and pay all your bills for at least that long.
You can try downloading the Google toolbar for Firefox (there's a link on my Website and thousands of others, or go to google.com) and it'll show you the pagerank of sites you visit. The numbers go from 0 (it goes last in search results or is not in the index) to 10 (pretty much always on top of the results; there are only a very few "10" sites).
Google pagerank isn't the only metric, especially as Microsoft starts to enter the search market, but right now it's at least half of it, and if you don't show up on the top half of the first page of google results (once you have some content) you'll be invisible to much of the world. Do a search for (1) your proposed company name, including with likely spelling mistakes, and (2) for the products, e.g. "lemon flavour socks", and then look at the first ten or so results in some detail. Are they high pagerank sites on Google? If they're, say, 7 or higher, you'll have a hard time displacing them. If they're large companies, or in a lucrative business (e.g. lawyers looking for people with scars from lemon picking seeking workers' compensation!) you won't compete well. Look also at who is sponsoring ads.
Of course, you can repeat this on Yahoo and MSN.
If you pick a name that has very few matches, but is memorable, $1500 is not very much to pay.
Some other useful Google-related links:
- Google AdSense advertising programme;
- Webmaster Eyes to see the Google rank of individual links on a page;
- an unofficial google adsense FAQ
I'm saying all this because it seems to me that more people follow links or use search engines than type stuff speculatively into the address bar.
-
First think about what you need
As with most things, it's not really that one package is "better" than another so much as that one might be more useful to you at any given time.
I use my own package when a Web site is smaller (say, below a million hits per month) because I would rather sample some actual sessions and see where people went and what they were searching for than get an overview. If you see people are searching for Argyle Socks and are finding your page about the Duke of Argyll, you might want to add an extra page and link to it, "if you were looking for...".
The statistic you most want is the things people looked for that might have reached your Web site and didn't, and that's the one you can't easily find!
For a site getting under 1,000 hits per day, look at the server logs in detail at least once a week, and make navigation easier, add more content where it looks promising, think about why some areas don't get traffic, etc etc.
When you're getting 10,000 hits/day, unless most of them are for graphics, the data can become overwhelming. And if you're over 100,000 hits per day you probably need to go to the sorts of reports that give you a very broad overview.
A link checker and a 404 report can be useful -- Cool URIs don't change!
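A 404 report of the kind mentioned above can be pulled out of the server logs with a short script. This is a generic sketch, not hololog: it assumes the server writes Common Log Format lines, and the sample log entries and paths are invented.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) \S+')

def count_404s(lines):
    """Return (path, hits) pairs for requests that got a 404, most frequent first."""
    missing = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "404":
            missing[match.group(1)] += 1
    return missing.most_common()

# Made-up log lines for illustration
sample = [
    '192.0.2.1 - - [10/Oct/2006:13:55:36 -0400] "GET /old-page.html HTTP/1.1" 404 512',
    '192.0.2.2 - - [10/Oct/2006:13:56:01 -0400] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.3 - - [10/Oct/2006:13:57:12 -0400] "GET /old-page.html HTTP/1.1" 404 512',
]
for path, hits in count_404s(sample):
    print(hits, path)
```

Paths that show up repeatedly in this list are good candidates for a redirect or a restored page.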
Oh -- for anyone interested, although I do have hololog set up on, for example, my words and pictures from old books Web site (in a private directory, sorry), the sourceforge page doesn't have a download, mea culpa. If it looks useful to anyone I've shared copies of "hololog" in the past. It could do with some cleaning up, alas!
Liam
-
Re:Try using XML and XSL
I used XML Query templates for a project recently (there are quite a few implementations of XQuery, both proprietary and open source).
I found that development was fast (although I already knew both XQuery and XML in general) and that once my XML Query expressions passed through the compiler (static type checking is your friend) they mostly worked first time, so that I had something working in an hour and a reasonably robust site within a day.
There's a page about the image search in particular at http://www.fromoldbooks.org/Search/about.html although it's fairly high level.
Using XML as an integration language, and either XQuery or XSLT to generate HTML where needed, really does help to enforce separation. An advantage of XQuery is that the implementations are getting pretty fast, and are beginning to use indexes for multiple files or database access.
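To illustrate the separation idea for readers who don't know XQuery, here is a rough stand-in written in Python rather than XQuery. It is not the code behind the image search: the element names, ids and sample data are all hypothetical. The point is only that the data lives in XML and a separate small function turns it into HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue data; the real site's XML is not shown here
CATALOGUE = """<images>
  <image id="castle-01"><title>Pevensey Castle</title><source>Ball-Sussex</source></image>
  <image id="alice-03"><title>The White Rabbit</title><source>Alice</source></image>
</images>"""

def render_list(xml_text):
    """Turn the XML catalogue into an HTML list, leaving the data untouched."""
    root = ET.fromstring(xml_text)
    ul = ET.Element("ul")
    for image in root.findall("image"):
        li = ET.SubElement(ul, "li", id=image.get("id"))
        li.text = f"{image.findtext('title')} (from {image.findtext('source')})"
    return ET.tostring(ul, encoding="unicode")

print(render_list(CATALOGUE))
```

Changing the page layout means editing only the rendering function, and changing the catalogue means editing only the XML, which is the separation described above.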
Disclaimer: I work full-time for W3C, and we publish both XML and XQuery :-)
-
Google ad rates
I get about $200 to $300 per month with Google AdSense (the terms and conditions let me say that but not give you the click-through rate) from http://www.fromoldbooks.org/ -- it's been rising slightly each month over the past year. Since it's an image site, there's a relatively high bandwidth use, and this does pay for the hosting.
It's a trade-off. The Google text-only ads are not too distracting, and are relatively well targeted so they might actually be interesting. I've tried other advertising programmes, but those were best so far.
In many ways, like you, I'd rather not have ads at all. But the site needs to cover its costs; I couldn't afford to run it otherwise.
The people who say, ask the community, if it's community-run, are on the right track. Of course, most of the people clicking on the ads will likely be visitors not part of the community, and the members will quickly learn to ignore the ads, as long as they are not too disruptive.
Google AdSense is easier than having a shopping cart that accepts credit card payments for membership, and you don't have the trust issues. But if you already accept payments over SSL, you should consider "no ad" subscriptions. You could also consider saying that anyone who has been registered more than 3 months (say), or who has more than 6 gigapoints, or posts more than 30 times a day, or however you mark More Valued Contributors, doesn't need to see ads. They are busily making pages for you that will have ads on them and bring in revenue, so that's enough. And that way you encourage participation without charging anyone.
-
It depends on your needs, as always
For my collection of images scanned from antiquarian books I am now using an Epson E10000 3200dpi scanner that does A3+ (18"x12" roughly) and am very happy with it. I generally scan in Windows because the Linux Sane interface doesn't know how to focus the lens.
For your little sister you might want something rugged, depending on how little she is :-) For sheet music, though, larger than letter size is worth considering: there are several A3/tabloid scanners around. You will need at least 300dpi (native, not interpolated) for OCR, and possibly higher.
A USB interface is the simplest, although if you have firewire on your computer that may be faster.
For graphic art work you need to be able to do colour calibration. For OCR, you probably will use grayscale most of the time. You can get some good solid greyscale sheet-fed scanners on ebay pretty cheaply, although make sure they're in your area: I wouldn't trust the shipping.
As others have said, look for TWAIN, and for scanners that work on multiple operating systems.
If you do a lot of scanning you'll need extra hard disk storage and a way to back it up, such as a DVD writer or a tape drive.
-
some pointers (Linux, Windows)
Every now and then a map I drew about 15 years ago of a fragment of a quasi-mediaeval European town shows up. I was fed up of American maps of "mediaeval cities" in which there were perfectly square city blocks with a FedEx drop-off box on every corner.
Pro Fantasy used some of my pictures and plans of castles from my pictures and texts from old books Web site, so I link back to them, but as far as I can tell their products are for Microsoft Windows. They gave me a free Castles program, but I didn't try it under WINE.
On Linux today I'd probably look at using either Grass (a fairly complex GIS program for the hard-core enthusiast) or a vector-based drawing program such as Inkscape.
It's useful to have a drawing program that handles layers (Inkscape does these days), and a vector-based rather than bitmap program is good because (1) the maps print OK, (2) when you ditch that old 640x480 screen and go for 24,000 x 9,000 pixels :-) you can still find the map, and (3) you can zoom without it getting blocky, and (4) you can edit it later.
If you insist on using a raster/bitmap program like GIMP, use a separate layer for everything and keep text layers as text for as long as possible, so you can edit them. Maps with spelling errors look really stupid. Plus it's neat to be able to go back and add detail during the campaign.
If you give the players a copy of the map file, export it to a bitmap first, with the layers containing your own notes well hidden! Or first save the file, then carefully delete the layers you don't want them to see, and then save a copy under a different name and send that. But that process is error-prone especially if you're tired.
I sometimes gave players incorrect maps, e.g. badly remembered or done with "poor cartography", and they'd end up piecing the truth up from the obvious contradictions. E.g. one had an entire country whose existence was censored :-)
There are a number of clip-art fonts around with map symbols. Some are commercial (I'm sure you respect commercial licences, since you want the GPL to be respected, right?) such as Adobe's Carta, but there are some free ones too. There are also some low-cost fonts especially for making RPG maps by David Nalle at Fontcraft's Scriptorium. I think they have some non-free software for Microsoft Windows too.
An alternative to clip art fonts is to make your own symbol library, e.g. by drawing pointy mountains and so forth with a pencil, colouring them with crayons, and scanning the result before and after adding colour. You could then trace these in a program like Inkscape, too.
Liam
-
Images
I've been working on scanning images from antiquarian books for a few years, and recently started opening the process up so others can help out. The current state is at Pictures from old books; the new collaborative site will be fromoldbooks.org (since there are textual transcriptions as well as images), probably in a month or so.