Developing a Niche Online-Content Indexing System?

Posted by timothy on Saturday July 17, 2010 @09:04AM from the we-had-to-index-individual-clay-tablets dept.

tebee writes "One of my hobbies has benefited for 20 years or so from the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index has recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it as the original data is in an unknown format. So we are talking about putting together a team to build an open-source replacement for this, probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that, with volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do, maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"

Sphinx or Lucene by Anonymous Coward · 2010-07-17 09:09 · Score: 3, Informative

Or did I misunderstand the question?
1. Re:Sphinx or Lucene by tebee · 2010-07-17 09:46 · Score: 4, Informative
  
  Yes, you did misunderstand.
  We do not have the full text of the article online; all we have is its title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of the article, but this is a hobby (model railroading) where many clubs and individuals have vast libraries, often spanning 5 or 6 decades of monthly magazines.
  All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.
  It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead tree) publisher in this field has also discouraged development in this direction.
  
  --
  N.B. this user is far too lazy to write a witty and intelligent sig.
2. Re:Sphinx or Lucene by OrangeCatholic · 2010-07-17 10:26 · Score: 3, Informative
  
  So let me get this straight: This is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delimited or whatever).
  Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.
  These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.
  For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which is either success or failure.
  The edit, delete, and retrieve scripts are similarly simple. All you need is a linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
  Now this is where it gets interesting:
  >many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines
  Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the real useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).
  Then you need a third table, mr_article_sources, which is real simple: it just matches up the rows in the article list to the rows in the source list. Its columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).
  Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:
  row_id  article_row_id  source_row_id
  01      19              05
  02      19              07
  03      19              09
  When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).
  Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
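
A minimal sketch of the three-table layout described in the comment above, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table and column names follow the comment; the author, keywords and phone numbers are placeholder values, and `sources_for_article` and `search_articles` are hypothetical helper names, not part of any existing project.

```python
import sqlite3

# SQLite in memory stands in for MySQL here (an assumption made so the
# sketch is self-contained). Table names follow the comment above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mr_article_list (
    row_id   INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    author   TEXT,
    keywords TEXT              -- comma-separated, as suggested above
);
CREATE TABLE mr_sources (
    row_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    phone  TEXT
);
CREATE TABLE mr_article_sources (
    row_id         INTEGER PRIMARY KEY,
    article_row_id INTEGER NOT NULL REFERENCES mr_article_list(row_id),
    source_row_id  INTEGER NOT NULL REFERENCES mr_sources(row_id)
);
""")

# Sample rows from the comment; author, keywords and phones are placeholders.
conn.execute(
    "INSERT INTO mr_article_list VALUES (?, ?, ?, ?)",
    (19,
     "How to shoot your electric engine off the tracks in under three seconds",
     "placeholder author",
     "electric,speed"),
)
conn.executemany(
    "INSERT INTO mr_sources VALUES (?, ?, ?)",
    [(5, "Milwaukee Railroad Club", "555-0105"),
     (7, "San Jose Railroad Surfers", "555-0107"),
     (9, "Bill Gates Private Book Collection", "555-0109")],
)
conn.executemany(
    "INSERT INTO mr_article_sources VALUES (?, ?, ?)",
    [(1, 19, 5), (2, 19, 7), (3, 19, 9)],
)


def sources_for_article(db, article_id):
    """Return (name, phone) for every source that holds the given article."""
    return db.execute(
        """
        SELECT s.name, s.phone
        FROM mr_article_sources AS x
        JOIN mr_sources AS s ON s.row_id = x.source_row_id
        WHERE x.article_row_id = ?
        ORDER BY s.row_id
        """,
        (article_id,),
    ).fetchall()


def search_articles(db, term):
    """Substring match on title, author or keywords (parameterized query)."""
    like = f"%{term}%"
    return db.execute(
        "SELECT row_id, title FROM mr_article_list "
        "WHERE title LIKE ? OR author LIKE ? OR keywords LIKE ?",
        (like, like, like),
    ).fetchall()


for name, phone in sources_for_article(conn, 19):
    print(name, phone)
```

The cross-index lookup is a single JOIN rather than two separate queries, and the search uses bound parameters instead of pasting form values into the SQL string. A real version would probably also want magazine and issue-date columns so that searches can be refined the way the original index allowed.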
Just migrate it to VMware or KVM by RobiOne · 2010-07-17 09:13 · Score: 3, Informative

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

--
-- Robi
put the data online if you can by Anonymous Coward · 2010-07-17 09:17 · Score: 1, Informative

There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.
Yeah, having some kind of online query interface can be useful and you should certainly implement one if you can. But much more important is the actual data. Make a zip file for download, no SQL or PHP needed. The SQL and PHP can be done later.
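
As a rough illustration of the "zip file for download" idea, here is a small Python sketch that writes the index rows to a CSV file inside a zip archive. The column names, the `articles.csv` file name and the `build_index_zip` helper are assumptions made for this example; the real index's fields may differ.

```python
import csv
import io
import zipfile


def build_index_zip(rows, zip_target):
    """Write article rows to articles.csv inside a zip archive.

    rows       -- iterable of (title, author, keywords, magazine, issue_date)
    zip_target -- a file path or a writable binary file object
    """
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    # Hypothetical column set; adjust to whatever the recovered data holds.
    writer.writerow(["title", "author", "keywords", "magazine", "issue_date"])
    writer.writerows(rows)
    with zipfile.ZipFile(zip_target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("articles.csv", buf.getvalue())
```

Calling `build_index_zip(rows, "index.zip")` on a schedule and serving the result as a static file would let anyone grab the whole index without touching the query interface.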
The binary file shouldn't be hard to read by bartonski · 2010-07-17 09:27 · Score: 2, Informative

I would run the unix commands 'file' (you might get lucky and get a file type that it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack' function found in Perl or Ruby to extract data, which you can load into a database.
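
The same inspect-then-unpack approach can be sketched in Python, whose `struct` module plays the role of Perl's `pack`/`unpack`. The record layout below (40-byte title, 20-byte author, 16-bit year, 8-bit month, little-endian) and the cp437 text encoding are pure assumptions for illustration; the real file's format is unknown and would have to be worked out from a hex dump first.

```python
import re
import struct


def ascii_strings(data, min_len=4):
    """Rough equivalent of `strings`: runs of printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]


# Hypothetical fixed-length record: NOT the real file's layout.
RECORD = struct.Struct("<40s20sHB")


def parse_records(data):
    """Unpack back-to-back fixed-length records into dictionaries.

    len(data) must be a whole multiple of RECORD.size.
    """
    records = []
    for title, author, year, month in RECORD.iter_unpack(data):
        records.append({
            # Strip NUL/space padding; cp437 is a guess for a DOS-era file.
            "title": title.rstrip(b" \x00").decode("cp437"),
            "author": author.rstrip(b" \x00").decode("cp437"),
            "year": year,
            "month": month,
        })
    return records
```

In practice the output of `ascii_strings` on the first few kilobytes is a quick way to see whether titles and authors sit at regular offsets, which would suggest fixed-length records like the one assumed here.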
Re:It would help by tebee · 2010-07-17 09:51 · Score: 4, Informative

OK, the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe but was removed, without warning, last week, so there is not a lot to see.

--
N.B. this user is far too lazy to write a witty and intelligent sig.