Slashdot Mirror


Developing a Niche Online-Content Indexing System?

tebee writes "One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index have recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it as the original data is in an unknown format. So we are talking about putting together a team to build a open source replacement for this – probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do — maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"

5 of 134 comments (clear)

  1. It would help by Xamusk · · Score: 3, Insightful

    if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.

  2. Developing a Niche Online-Content Indexing System? by omar.sahal · · Score: 3, Insightful

    I don't know if this would be helpful, but the people of Wikipedia must know a far amount about running crowed sourced sites. Even if you can't talk with the higher ups there would be contributors who would know about best practices. Also when you deal with people they would be a lot more helpful if they benefit from helping you.

  3. File format, not the implementation details by frisket · · Score: 2, Insightful

    It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.

  4. Using a Howitzer to Hunt Squirrels by salesgeek · · Score: 2, Insightful

    Lots of people here are recommending using tools that are built for very large scale projects. Based on the fact you have a DOS based system that likely used a pretty common library for storing the data (something like c-tree, btrieve, a dbase library or simply saving binary data using whatever language the app was written in), using any RDBMS like MySQL or even SQLite probably would do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load that the DOS app could.

    Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.

    --
    -- $G
  5. No, no, NO! by RichiH · · Score: 3, Insightful

    Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even the first normalized form.

    Read http://en.wikipedia.org/wiki/Database_normalization

    You want to define a table "tags" or something with id, article_id, name, comment. Make the combination of id, parent_id, name unique.

    * id is on auto-increase, not NULL
    * article_id is a foreign key to the id of the article, not NULL
    * name is the name of the tag, not NULL
    * comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL

    Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
    How many artcles are tagged electric? SELECT count (1) FROM article_tags WHERE name = "electric";
    Give me a list of all article relating to foo and bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar".
    etc pp.

    If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:

    http://en.wikipedia.org/wiki/Transitive_closure

    Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and and pretty much everything else.

    One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST heed this.

    http://en.wikipedia.org/wiki/SQL_injection

    Richard

    PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.