Slashdot Mirror


Developing a Niche Online-Content Indexing System?

tebee writes "One of my hobbies has benefited for 20 years or so from the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index have recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it, as the original data is in an unknown format. So we are talking about putting together a team to build an open-source replacement for this, probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by having volunteers crowd-source the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do, maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article; we don't have its text online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google Books agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"

9 of 134 comments

  1. Ask Pubmed guys by mapkinase · · Score: 2, Interesting

    Ask the guys behind PubMed:

    http://www.ncbi.nlm.nih.gov/pubmed

    It is a database of scientific articles in the fields of medicine and biology.

    NCBI has the most generous software licensing possible: the code is absolutely free, with absolutely no restrictions on distributing, changing, selling, or even closing it. All because we, the taxpayers, have already paid for it.

    I am surprised none of them has reacted yet; I am sure they read /.

    --
    I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
  2. Re:Sphinx or Lucene by martin-boundary · · Score: 3, Interesting
    Even if you have only the title/author, you're still indexing text. Think of a tiny little text file containing two or three lines: title, author, keywords. You'll need a volunteer to type this in. Then you dump those files in a directory and run an indexer.

    If this isn't what you have in mind, please elaborate.
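
A minimal sketch of the workflow described in this comment: one tiny text file per article, dropped into a directory, then indexed. It uses a hand-rolled inverted index in plain Python rather than Sphinx or Lucene, so it only illustrates the idea. The file names, the `title:`/`author:`/`keywords:` layout, and the sample records are all made up for illustration.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Made-up sample records: one tiny file per article (title, author, keywords).
SAMPLE_RECORDS = {
    "0001.txt": "title: Restoring an Old Lathe\nauthor: A. Example\nkeywords: workshop, tools\n",
    "0002.txt": "title: Wiring a Control Panel\nauthor: B. Sample\nkeywords: electrical\n",
    "0003.txt": "title: Tools for the Small Workshop\nauthor: B. Sample\nkeywords: workshop, lathe\n",
}


def build_index(directory):
    """Map each lowercase word to the set of record file names containing it."""
    index = defaultdict(set)
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(path.name)
    return index


def search(index, query):
    """Return the sorted file names that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


def demo():
    """Write the sample records to a temporary directory, index and query them."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in SAMPLE_RECORDS.items():
            (Path(tmp) / name).write_text(body, encoding="utf-8")
        index = build_index(tmp)
        return search(index, "lathe"), search(index, "sample workshop")


if __name__ == "__main__":
    print(demo())
```

Note that this treats every word alike, including the field labels, so it cannot tell a title match from an author match; a real indexer would keep the fields separate.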

  3. Drupal, hands down. by Beltanin · · Score: 2, Interesting

    Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff on article indexing and related topics. Some handy resources are below; there are far more, as this was just a one-minute search of the conference site alone:

    http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module
    http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about
    http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore
    http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health

  4. Wayback by martin-boundary · · Score: 3, Interesting

    You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe, then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.

  5. Re:Sphinx or Lucene by Trepidity · · Score: 3, Interesting

    If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").

    A bibliography indexer would probably be a better choice. Two good free ones are refbase and Aigaion. Both are aimed mainly at databases of scientific literature, though, so they might need some tweaking for this purpose.
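
To make the structured-query example in this comment concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not how refbase or Aigaion work internally; the `articles` table, its columns, and the sample rows are assumptions invented for illustration.

```python
import sqlite3

# Made-up sample rows: (magazine, title, author, year).
SAMPLE_ROWS = [
    ("Magazine A", "Foo for Beginners", "A. Example", 1952),
    ("Magazine A", "Advanced foo Techniques", "B. Sample", 1970),
    ("Magazine B", "More Foo", "A. Example", 1960),
    ("Magazine A", "Something Else", "C. Person", 1955),
    ("Magazine A", "Foo Revisited", "B. Sample", 1966),
]


def make_db():
    """Create an in-memory database holding the hypothetical articles table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE articles (
               id INTEGER PRIMARY KEY,
               magazine TEXT NOT NULL,
               title TEXT NOT NULL,
               author TEXT,
               year INTEGER NOT NULL
           )"""
    )
    conn.executemany(
        "INSERT INTO articles (magazine, title, author, year) VALUES (?, ?, ?, ?)",
        SAMPLE_ROWS,
    )
    return conn


def find_articles(conn, magazine, title_word, first_year, last_year):
    """All articles in one magazine whose title contains title_word, by year range."""
    # Substring match; '%' or '_' inside title_word would act as wildcards here.
    pattern = "%" + title_word + "%"
    rows = conn.execute(
        "SELECT title, year FROM articles "
        "WHERE magazine = ? AND title LIKE ? AND year BETWEEN ? AND ? "
        "ORDER BY year, title",
        (magazine, pattern, first_year, last_year),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = make_db()
    for title, year in find_articles(conn, "Magazine A", "foo", 1950, 1966):
        print(year, title)
```

The point of the sketch is that magazine, title, and date are separate columns, so each can be constrained independently. SQLite's `LIKE` is case-insensitive for ASCII by default, which is why `foo` also matches `Foo`.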

  6. Re:Sphinx or Lucene by martin-boundary · · Score: 2, Interesting
    Yes, I was mainly trying to point out that his problem is still conceptually a text indexing problem even if he doesn't have the text of the articles. A scientific bibliography database can be a good choice: some journals have arcane numbering systems, so these databases should be able to cope with a magazine collection.

    As someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.

  7. hoarding == massive replication by martin-boundary · · Score: 2, Interesting

    Short term, it's true that this can eat some bandwidth, but long term it's the solution to the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.

  8. Re:Sphinx or Lucene by rs79 · · Score: 2, Interesting

    I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.

    Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.

    --
    Need Mercedes parts ?
  9. Re:Just migrate it to VMware or KVM by b4dc0d3r · · Score: 2, Interesting

    If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.

    Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.

    Gimme just the DOS program at least, I'll get you the format.
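
A common first step when reverse-engineering an unknown data file is to look for runs of readable text and note their offsets. The sketch below does that in Python. It assumes the index stores titles and authors as plain ASCII, which is unverified, and the sample bytes are made up for illustration; they are not taken from the real data.

```python
import re


def printable_runs(data, min_length=4):
    """Return (offset, text) for each run of printable ASCII of at least min_length bytes."""
    pattern = re.compile(rb"[ -~]{%d,}" % min_length)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


# Made-up bytes standing in for a fragment of an unknown record format.
SAMPLE_BLOB = b"\x00\x01Some Title\x00\x07\x00A. Example\x001952\xff\xfeab\x00"

if __name__ == "__main__":
    for offset, text in printable_runs(SAMPLE_BLOB):
        print(f"{offset:6d}  {text}")
```

Regular spacing between the reported offsets would hint at fixed-length records, while the bytes between runs may turn out to be length prefixes or field separators.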