Developing a Niche Online-Content Indexing System?

← Back to Stories (view on slashdot.org)

Developing a Niche Online-Content Indexing System?

Posted by timothy on Saturday July 17, 2010 @09:04AM from the we-had-to-index-individual-clay-tablets dept.

tebee writes "One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index have recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it as the original data is in an unknown format. So we are talking about putting together a team to build a open source replacement for this – probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do — maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"

26 of 134 comments (clear)

Min score:

Reason:

Sort:

Sphinx or Lucene by Anonymous Coward · 2010-07-17 09:09 · Score: 3, Informative

Or did I misunderstand the question?
1. Re:Sphinx or Lucene by tebee · 2010-07-17 09:46 · Score: 4, Informative
  
  Yes, you did misunderstand.
  We do not have the full text of the article online , all we have is its title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of the article, but this is a hobby(model railroading) where many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines.
  All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.
  It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead tree) publisher is this field has also discouraged development in this direction.
  
  --
  N.B. this user is far too lazy to write a witty and intelligent sig.
2. Re:Sphinx or Lucene by martin-boundary · 2010-07-17 10:13 · Score: 3, Interesting
  
  Even if you have only the title/author, you're still indexing text. Think of a tiny little text file containing two or three lines: title, author, keywords. You'll need a volunteer to type this in. Then you dump those files in a directory and run an indexer.
  If this isn't what you have in mind, please elaborate.
3. Re:Sphinx or Lucene by OrangeCatholic · 2010-07-17 10:26 · Score: 3, Informative
  
  So let me get this straight: This is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delineated or whatever).
  Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.
  These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.
  For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which is either success or failure.
  The edit, delete, and retrieve scripts are similarly simple. All you need is a linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
  Now this is where it gets interesting:
  >many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines
  Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the real useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).
  Then you need a third table, mr_article_sources, which is real simple, it just matches up the rows in the article list to the rows in the source list. It's columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).
  Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:
  01 19 05
  02 19 07
  03 19 09
  When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).
  Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
4. Re:Sphinx or Lucene by Trepidity · 2010-07-17 10:46 · Score: 3, Interesting
  
  If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").
  A bibliography indexer would probably be a better choice. Two good free ones are Refbase or Aigaion. Both are targeted mainly at databases of scientific literature, so might need some tweaking for this purpose, though.
  
  --
  10 PRINT CHR$(205.5+RND(1)); : GOTO 10
5. Re:Sphinx or Lucene by martin-boundary · 2010-07-17 10:53 · Score: 2, Interesting
  
  Yes, I was mainly trying to point out that his problem is still conceptually a text indexing problem even if he doesn't have the text of the articles. A scientific bibliography database can be a good choice, as some journals can have arcane numbering systems, so they should be able to cope with a magazine collection.
  Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.
6. Re:Sphinx or Lucene by rs79 · 2010-07-17 12:24 · Score: 2, Interesting
  
  I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.
  Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.
  
  --
  Need Mercedes parts ?
It would help by Xamusk · 2010-07-17 09:10 · Score: 3, Insightful

if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.
1. Re:It would help by beakerMeep · 2010-07-17 09:34 · Score: 3, Funny
  
  Maybe it's the type of magazines that people used to read "for the articles?"
  
  --
  meep
2. Re:It would help by bsDaemon · 2010-07-17 09:47 · Score: 2, Funny
  
  I'm pretty sure porn indexing isn't niche... or a hobby. Its the true reason Google exists.
3. Re:It would help by tebee · 2010-07-17 09:51 · Score: 4, Informative
  
  OK the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe but was removed , without warning, last week so there is not a lot to see.
  
  --
  N.B. this user is far too lazy to write a witty and intelligent sig.
Developing a Niche Online-Content Indexing System? by omar.sahal · 2010-07-17 09:13 · Score: 3, Insightful

I don't know if this would be helpful, but the people of Wikipedia must know a far amount about running crowed sourced sites. Even if you can't talk with the higher ups there would be contributors who would know about best practices. Also when you deal with people they would be a lot more helpful if they benefit from helping you.
Just migrate it to VMware or KVM by RobiOne · 2010-07-17 09:13 · Score: 3, Informative

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

--
-- Robi
1. Re:Just migrate it to VMware or KVM by OzPeter · 2010-07-17 09:37 · Score: 2
  
  Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.
  That assumes that the original data is available to the OP. It may be that it is not.
  
  --
  I am Slashdot. Are you Slashdot as well?
2. Re:Just migrate it to VMware or KVM by b4dc0d3r · 2010-07-17 12:51 · Score: 2, Interesting
  
  If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.
  Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.
  Gimme just the DOS program at elast, I'll get you the format.
The binary file shouldn't be hard to read by bartonski · 2010-07-17 09:27 · Score: 2, Informative

I would run the unix commands 'file' (you might get lucky and get a file type that it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack' function found in Perl or Ruby to extract data, which you can load into a database.
Try Ruby on Rails by olyar · 2010-07-17 09:28 · Score: 4, Funny

I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.
The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.
(Please Note: This post is a joke, and not an attempt to start a flame war).

--
Custom, hands-free Linux installs. Instalinux
1. Re:Try Ruby on Rails by greg1104 · 2010-07-17 13:43 · Score: 4, Funny
  
  It's data for model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.
Ask Pubmed guys by mapkinase · 2010-07-17 09:57 · Score: 2, Interesting

Ask guys behind the Pubmed
http://www.ncbi.nlm.nih.gov/pubmed
The database of scientific articles in the field of medicine and biology.
NCBI has the most generous software code licensing that is possible: the code is absolutely free, absolutely no restriction for distributing, changing, selling, even closing it. All because we, taxpayers, paid for it already.
I am surprised none of them reacted yet, I am sure they read ./

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
And a thousand Mac Fanbois ... by rueger · 2010-07-17 10:06 · Score: 2, Funny

... leap up and shout "Filemaker Pro! Cause it's so shiny and pretty!"

Oh, the number of times that I've heard that refrain... shudder ...

--
Three Squirrels
Drupal, hands down. by Beltanin · 2010-07-17 10:20 · Score: 2, Interesting

Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff related to article indexing, etc. Some handy resources, but there are far more, this was just a 1m search based on the conference alone... http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module , http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about , http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore , http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health
Wayback by martin-boundary · 2010-07-17 10:30 · Score: 3, Interesting

You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe, then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.
File format, not the implementation details by frisket · 2010-07-17 10:44 · Score: 2, Insightful

It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.
hoarding == massive replication by martin-boundary · 2010-07-17 11:21 · Score: 2, Interesting

Short term it's true that can eat some bandwith, but long term that's the solution of the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.
Using a Howitzer to Hunt Squirrels by salesgeek · 2010-07-17 12:22 · Score: 2, Insightful

Lots of people here are recommending using tools that are built for very large scale projects. Based on the fact you have a DOS based system that likely used a pretty common library for storing the data (something like c-tree, btrieve, a dbase library or simply saving binary data using whatever language the app was written in), using any RDBMS like MySQL or even SQLite probably would do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load that the DOS app could.
Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.

--
-- $G
No, no, NO! by RichiH · 2010-07-17 19:32 · Score: 3, Insightful

Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even the first normalized form.
Read http://en.wikipedia.org/wiki/Database_normalization
You want to define a table "tags" or something with id, article_id, name, comment. Make the combination of id, parent_id, name unique.
* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL
Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many artcles are tagged electric? SELECT count (1) FROM article_tags WHERE name = "electric";
Give me a list of all article relating to foo and bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar".
etc pp.
If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:
http://en.wikipedia.org/wiki/Transitive_closure
Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and and pretty much everything else.
One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST heed this.
http://en.wikipedia.org/wiki/SQL_injection
Richard
PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.