Developing a Niche Online-Content Indexing System?
tebee writes "One of my hobbies has benefited for 20
years or so from the existence of an online index to all magazine
articles on the subject since the 1930s. It lets you list the
articles in any particular magazine or search for an article by
keyword, title or author, refining the search if necessary by
magazine and/or date. Unfortunately the firm that hosts the
index has recently pulled it from their website, citing security
worries and incompatibilities with the rest of their e-commerce
website: the heart of the system is a 20-year-old DOS program! They
have no plans to replace it as the original data is in an unknown
format. So we are talking about putting together
a team to build an open-source replacement for this – probably using
PHP and MySQL. The governing body for the hobby has agreed to host
this and we are in negotiations to try to get the original data. We
hope that, with volunteers crowd-sourcing the conversion, we will be
able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more.
tebee continues:
"It occurs to me that there could be
existing open-source projects that do roughly what we want to do —
maybe something indexing academic papers. But two days of trawling
through script sites and googling has not produced any results.
Remember that here we only point to the original article; we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!
So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"
If this isn't what you have in mind, please elaborate.
You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe, then follow the links on the archived page. If you vary the URL a bit, you might recover even more of the missing data.
If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").
A bibliography indexer would probably be a better choice. Two good free ones are Refbase and Aigaion. Both are targeted mainly at databases of scientific literature, so they might need some tweaking for this purpose.
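To make the "structured query" point concrete, here is a rough sketch of that kind of lookup against a plain relational table. It uses Python's built-in sqlite3 module purely for illustration; the table layout, the column names, the `search_articles` helper and the sample rows are all made up for this example and are not taken from the original index or from either tool mentioned above.

```python
import sqlite3

# Hypothetical sample rows: (magazine, title, author, year).
SAMPLE_ROWS = [
    ("Magazine A", "Building a foo bridge", "J. Smith", 1955),
    ("Magazine A", "Foo signals explained", "A. Jones", 1970),
    ("Magazine B", "A foo layout", "J. Smith", 1960),
    ("Magazine A", "Scenery basics", "B. Brown", 1958),
]


def make_index(rows):
    """Create an in-memory database with one assumed 'articles' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        "magazine TEXT, title TEXT, author TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", rows)
    return conn


def search_articles(conn, keyword, magazine=None, year_from=None, year_to=None):
    """Find articles whose title contains keyword, with optional refinements."""
    sql = "SELECT magazine, title, author, year FROM articles WHERE title LIKE ?"
    params = ["%" + keyword + "%"]
    # Each refinement is added only if the caller asked for it.
    if magazine is not None:
        sql += " AND magazine = ?"
        params.append(magazine)
    if year_from is not None:
        sql += " AND year >= ?"
        params.append(year_from)
    if year_to is not None:
        sql += " AND year <= ?"
        params.append(year_to)
    sql += " ORDER BY year, title"
    return conn.execute(sql, params).fetchall()


conn = make_index(SAMPLE_ROWS)
# "All articles in Magazine A with 'foo' in the title, published 1950 to 1966"
print(search_articles(conn, "foo", magazine="Magazine A",
                      year_from=1950, year_to=1966))
```

With the sample rows above, only the 1955 article matches. The same query shape should carry over to MySQL with little change, since it relies on nothing beyond LIKE, equality and range comparisons.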
10 PRINT CHR$(205.5+RND(1)); : GOTO 10