Slashdot Mirror


Developing a Niche Online-Content Indexing System?

tebee writes "One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index have recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it as the original data is in an unknown format. So we are talking about putting together a team to build a open source replacement for this – probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do — maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"

4 of 134 comments (clear)

  1. Try Ruby on Rails by olyar · · Score: 4, Funny

    I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.

    The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.

    (Please Note: This post is a joke, and not an attempt to start a flame war).

    --
    Custom, hands-free Linux installs. Instalinux
    1. Re:Try Ruby on Rails by greg1104 · · Score: 4, Funny

      It's data for model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.

  2. Re:Sphinx or Lucene by tebee · · Score: 4, Informative

    Yes, you did misunderstand.

    We do not have the full text of the article online , all we have is its title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of the article, but this is a hobby(model railroading) where many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines.

    All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.

    It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead tree) publisher is this field has also discouraged development in this direction.

    --
    N.B. this user is far too lazy to write a witty and intelligent sig.
  3. Re:It would help by tebee · · Score: 4, Informative

    OK the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe but was removed , without warning, last week so there is not a lot to see.

    --
    N.B. this user is far too lazy to write a witty and intelligent sig.