Slashdot Mirror


Developing a Niche Online-Content Indexing System?

tebee writes "One of my hobbies has benefited for 20 years or so from the existence of an online index to all magazine articles on the subject since the 1930s. It lets you list the articles in any particular magazine or search for an article by keyword, title or author, refining the search if necessary by magazine and/or date. Unfortunately the firm which hosts the index has recently pulled it from their website, citing security worries and incompatibilities with the rest of their e-commerce website: the heart of the system is a 20-year-old DOS program! They have no plans to replace it as the original data is in an unknown format. So we are talking about putting together a team to build an open-source replacement for this, probably using PHP and MySQL. The governing body for the hobby has agreed to host this and we are in negotiations to try and get the original data. We hope that by volunteers crowd-sourcing the conversion, we will be able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more. tebee continues: "It occurs to me that there could be existing open-source projects that do roughly what we want to do, maybe something indexing academic papers. But two days of trawling through script sites and googling has not produced any results.

Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!

So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"

100 of 134 comments

  1. Sphinx or Lucene by Anonymous Coward · · Score: 3, Informative

    Or did I misunderstand the question?

    1. Re:Sphinx or Lucene by tebee · · Score: 4, Informative

      Yes, you did misunderstand.

      We do not have the full text of the article online; all we have is its title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of the article, but this is a hobby (model railroading) where many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines.

      All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.

      It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead tree) publisher in this field has also discouraged development in this direction.

      --
      N.B. this user is far too lazy to write a witty and intelligent sig.
    2. Re:Sphinx or Lucene by martin-boundary · · Score: 3, Interesting
      Even if you have only the title/author, you're still indexing text. Think of a tiny little text file containing two or three lines: title, author, keywords. You'll need a volunteer to type this in. Then you dump those files in a directory and run an indexer.
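
      As a rough illustration of that idea, here is a minimal Python sketch of a toy inverted index over title/author/keyword records. It is not how Sphinx or Lucene work internally; the sample records and the function names build_index and search are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample records; each stands in for one tiny text file
# holding a title, an author, and hand-entered keywords.
RECORDS = [
    {"id": 1, "title": "Building a Timber Trestle", "author": "J. Smith",
     "keywords": "bridges scenery"},
    {"id": 2, "title": "Wiring a Reversing Loop", "author": "A. Jones",
     "keywords": "electrical track"},
    {"id": 3, "title": "Scenery for a Trestle Scene", "author": "J. Smith",
     "keywords": "scenery bridges"},
]


def build_index(records):
    """Map each lowercased word to the set of record ids containing it."""
    index = defaultdict(set)
    for rec in records:
        text = " ".join((rec["title"], rec["author"], rec["keywords"]))
        for word in text.lower().split():
            index[word.strip(".,")].add(rec["id"])
    return index


def search(index, query):
    """Return the ids of records that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


if __name__ == "__main__":
    idx = build_index(RECORDS)
    print(sorted(search(idx, "trestle scenery")))
```

      With the sample data, a search for "trestle scenery" matches only the two records that contain both words.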

      If this isn't what you have in mind, please elaborate.

    3. Re:Sphinx or Lucene by hsmyers · · Score: 1

      Not being up to speed on current open source that might prevent premature wheel re-invention, my answer would be 'No'. That said, I don't see any particular trouble with the project itself. If I understand correctly, you have bare-bones bibliographic information that you want to create an on-line index of. The notion of PHP and MySQL seems sound, although I suspect that Perl would work as well if not better, depending on the knowledge of your volunteer talent. I expose my bias here when I point out that text analysis is a particular strength of Perl. I'm currently involved with a project that does an enormous amount of semantic analysis which might be used to create key words on the fly, for instance. Now that I think of it, there is no particular reason that the work couldn't be multi-lingual for that matter, leveraging the programmer base at your disposal. Continuing to think about it, I'd love to help. Reply to me at gmail.com if you are interested...

    4. Re:Sphinx or Lucene by symes · · Score: 1

      They have no plans to replace it as the original data is in an unknown format.

      Well there aren't that many obvious candidates... any of these look familiar?

    5. Re:Sphinx or Lucene by OrangeCatholic · · Score: 3, Informative

      So let me get this straight: This is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delimited or whatever).

      Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.

      These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.

      For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which is either success or failure.

      The edit, delete, and retrieve scripts are similarly simple. All you need is a linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
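
      For illustration, the add/edit/delete/retrieve scripts boil down to something like the following. This is a hedged sketch in Python with SQLite standing in for PHP and MySQL; the table name mr_article_list comes from the description above, while the function names are hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the single article table (SQLite used as a stand-in)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mr_article_list (
               row_id   INTEGER PRIMARY KEY,
               title    TEXT NOT NULL,
               author   TEXT,
               keywords TEXT  -- comma-delimited list in one column
           )"""
    )
    return conn


def add_article(conn, title, author, keywords):
    """Insert one article and return its new row_id."""
    cur = conn.execute(
        "INSERT INTO mr_article_list (title, author, keywords) VALUES (?, ?, ?)",
        (title, author, keywords),
    )
    conn.commit()
    return cur.lastrowid


def edit_article(conn, row_id, title, author, keywords):
    """Overwrite the fields of an existing article."""
    conn.execute(
        "UPDATE mr_article_list SET title = ?, author = ?, keywords = ? "
        "WHERE row_id = ?",
        (title, author, keywords, row_id),
    )
    conn.commit()


def delete_article(conn, row_id):
    """Remove one article by row_id."""
    conn.execute("DELETE FROM mr_article_list WHERE row_id = ?", (row_id,))
    conn.commit()


def find_articles(conn, term):
    """Return rows whose title, author, or keywords contain the term."""
    like = f"%{term}%"
    return conn.execute(
        "SELECT row_id, title, author, keywords FROM mr_article_list "
        "WHERE title LIKE ? OR author LIKE ? OR keywords LIKE ? "
        "ORDER BY row_id",
        (like, like, like),
    ).fetchall()
```

      The values are passed as query parameters rather than pasted into the SQL string, which is the part a real web form handler most needs to get right.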

      Now this is where it gets interesting:

      >many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines

      Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the real useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).

      Then you need a third table, mr_article_sources, which is real simple, it just matches up the rows in the article list to the rows in the source list. Its columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).

      Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:

      row_id  article_row_id  source_row_id
      01      19              05
      02      19              07
      03      19              09

      When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).

      Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
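
      A minimal sketch of that three-table lookup, again in Python with SQLite as a stand-in for MySQL. The table and column names follow the description above; the phone numbers are placeholders and the function names are hypothetical.

```python
import sqlite3


def build_example_db():
    """Create the three tables and load the article #19 example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE mr_article_list (row_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE mr_sources (
            row_id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
        CREATE TABLE mr_article_sources (
            row_id         INTEGER PRIMARY KEY,
            article_row_id INTEGER REFERENCES mr_article_list(row_id),
            source_row_id  INTEGER REFERENCES mr_sources(row_id)
        );
        """
    )
    conn.execute(
        "INSERT INTO mr_article_list VALUES (19, ?)",
        ("How to shoot your electric engine off the tracks "
         "in under three seconds.",),
    )
    conn.executemany(
        "INSERT INTO mr_sources VALUES (?, ?, ?)",
        [
            (5, "Milwaukee Railroad Club", "555-0105"),          # placeholder phone
            (7, "San Jose Railroad Surfers", "555-0107"),        # placeholder phone
            (9, "Bill Gates Private Book Collection", "555-0109"),  # placeholder phone
        ],
    )
    conn.executemany(
        "INSERT INTO mr_article_sources VALUES (?, ?, ?)",
        [(1, 19, 5), (2, 19, 7), (3, 19, 9)],
    )
    return conn


def sources_for_article(conn, article_id):
    """Join the cross-index table to the source table for one article."""
    return conn.execute(
        """
        SELECT s.name, s.phone
        FROM mr_article_sources AS x
        JOIN mr_sources AS s ON s.row_id = x.source_row_id
        WHERE x.article_row_id = ?
        ORDER BY s.row_id
        """,
        (article_id,),
    ).fetchall()
```

      Calling sources_for_article(conn, 19) on the example database returns the three clubs from the cross-index rows above, in source id order.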

    6. Re:Sphinx or Lucene by hsmyers · · Score: 1

      Just noticed a thread on Hacker News on http://www.gotapi.com/html which might be of interest...

    7. Re:Sphinx or Lucene by tolan-b · · Score: 1

      I think you should still have a look at Sphinx and Lucene. You can put whatever data you want into them, in whatever schema you want (at least with Lucene, I believe with Sphinx too). You can then easily create a UI as a front end and let the indexing engine do the hard work of slicing and dicing by your criteria. I believe the Zend Framework library has a Lucene API.

      Also if you do manage to go fulltext later then it'll mean less work.

    8. Re:Sphinx or Lucene by Trepidity · · Score: 3, Interesting

      If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").

      A bibliography indexer would probably be a better choice. Two good free ones are Refbase or Aigaion. Both are targeted mainly at databases of scientific literature, so might need some tweaking for this purpose, though.

    9. Re:Sphinx or Lucene by martin-boundary · · Score: 2, Interesting
      Yes, I was mainly trying to point out that his problem is still conceptually a text indexing problem even if he doesn't have the text of the articles. A scientific bibliography database can be a good choice, as some journals can have arcane numbering systems, so they should be able to cope with a magazine collection.

      Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.

    10. Re:Sphinx or Lucene by rs79 · · Score: 2, Interesting

      I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.

      Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.

      --
      Need Mercedes parts ?
    11. Re:Sphinx or Lucene by banjo+D · · Score: 1

      I don't know about Sphinx but I agree that Lucene could be a good solution, for the reasons tolan-b lists. I work on a digital library cataloging project that indexes its metadata with Lucene. We use PHP to generate the user-facing website, which queries our Lucene index via a Solr server. We do have a highly structured metadata schema and we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966" (which somebody in another comment suggested is not easy to do with Lucene, but in our experience was very easy). And adding a Solr server on top makes it easy to include features like faceted search.

    12. Re:Sphinx or Lucene by sporkboy · · Score: 1

      The standard /. IANAL applies here, but I'm pretty sure that if you have legal access to the copyrighted text (i.e. you or someone you know owns a copy of the magazine) then it is ok to create a derivative work for the purposes of searching that work. This is the loophole that Google (name your favorite search engine here) uses, and they go so far as to offer cached versions of some sites.

      Lucene, or a more friendly wrapper around it like Solr, has the option of creating a search index based on an original text from which the original content cannot be extracted (indexed=true, stored=false on a field), so that would seem to cover the case of finding an article without violating the rights of the author or the publisher.

      As for not having the text online, I'd suggest scraping the archive sites in the process of building your search index; it's pretty hard to search something that isn't digitized.

      Best of luck, as this sounds like a worthwhile project. I do think that the volume of data you're discussing would fit easily in a Solr instance that would consume very modest amounts of server resources to operate.

    13. Re:Sphinx or Lucene by Jherico · · Score: 1

      Solr in front of Lucene is a perfectly reasonable way to index highly structured information and allows structured queries.

      --

      Jherico

      What can the average user can do to ensure his security? "Nothing, you're screwed"

    14. Re:Sphinx or Lucene by OrangeCatholic · · Score: 1

      It's really not a text indexing problem, unless you are going to throw out rdbms and use a flat text file.

      If you will use relational database, then it is a 3-table problem at most. Articles, sources, and articles to sources. If you can join those, you have the core of a classic content management system.

      From what I gather, they haven't even gotten that far. It is just a master index of articles that are available (which point to nothing in particular), so it is a 1-table problem.

      For 1-table problems I generally use Excel.

    15. Re:Sphinx or Lucene by OrangeCatholic · · Score: 1

      >we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966"

      SELECT * FROM banjo_articles WHERE title LIKE '%foo%' AND date BETWEEN '1950-01-01' AND '1966-12-31'

      You're bragging that your "system" has a single line of code?

      I've seen selects ten or twenty lines long, with multiple joins, and joins and selects within joins. Granted it's not fast, but it works, and it takes all of an hour (or less) to write such a query.

    16. Re:Sphinx or Lucene by AmiMoJo · · Score: 1

      Couldn't you just scan and OCR the magazines and then use that data to compile a searchable database? You could even supply extracts from the articles where the search terms appear, similar to what Google does. By not presenting the full text of the article I think you would be on safe ground copyright-wise.

      It would also be a good way of archiving old paper documents which can degrade over time. I'm not sure what copyright terms are in your country but some of those magazines might be in the public domain anyway now. Even if they are not the copyright owner might have disappeared by now.

      --
      const int one = 65536; (Silvermoon, Texture.cs)
      SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
    17. Re:Sphinx or Lucene by Unequivocal · · Score: 1

      Check out: http://xtf.sourceforge.net/

      I think it uses Lucene on the backend. It's designed to map meta-data sources to meta-data outputs via XSL templates. I talked with some of the developers recently and it sounds reasonable. If your inputs are binary then it's probably not much help but for XML-like inputs it might give you some of the capabilities you're looking for. HTH

  2. It would help by Xamusk · · Score: 3, Insightful

    if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.

    1. Re:It would help by beakerMeep · · Score: 3, Funny

      Maybe it's the type of magazines that people used to read "for the articles?"

      --
      meep
    2. Re:It would help by bsDaemon · · Score: 2, Funny

      I'm pretty sure porn indexing isn't niche... or a hobby. It's the true reason Google exists.

    3. Re:It would help by tebee · · Score: 4, Informative

      OK, the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe but was removed, without warning, last week, so there is not a lot to see.

      --
      N.B. this user is far too lazy to write a witty and intelligent sig.
    4. Re:It would help by ZERO1ZERO · · Score: 1

      It's not a web page it's a DOS program, hence the ask slashdot....

    5. Re:It would help by tebee · · Score: 1

      It's a DOS program that runs on the server, rather like a CGI script. Its output is a web page.

      It is a bit of a throwback to the dawn of the web, when people thought up innovative ways to do things.

      --
      N.B. this user is far too lazy to write a witty and intelligent sig.
    6. Re:It would help by BitterOak · · Score: 1

      if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.

      Maybe it's the type of magazines that people used to read "for the articles?"

      And that's precisely the type of magazine that would catch the interest of the Slashdot crowd.

      --
      If I can be modded down for being a troll, can I be modded up for being an orc, or a balrog?
    7. Re:It would help by Bungie · · Score: 1

      CGI works by having the server execute the program (passing the data to it through environment variables, STDIN or the command line) and then retrieving the page's complete HTML code from STDOUT. You can use any executable file that uses STDIN/STDOUT in this manner, as long as it is located in a specified location (like cgi-bin). On Windows this would be any exe, com, pif, bat or cmd file, and the extension must be there for the operating system to determine that it is an executable. On Linux you can use any file that has +x permissions, compiled binaries or scripts with a shebang line at the beginning, so it can have any extension you want (or none) for a CGI.
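
      To make the mechanism concrete, here is a minimal sketch of a CGI-style program in Python. It assumes a GET request, for which the server places the query in the QUERY_STRING environment variable; the form field name "keyword" and the function name render_page are hypothetical.

```python
import html
import os
import sys
from urllib.parse import parse_qs


def render_page(query_string):
    """Build the complete CGI response: header, blank line, HTML body."""
    params = parse_qs(query_string)
    # "keyword" is a hypothetical form field used only for illustration.
    keyword = params.get("keyword", [""])[0]
    body = "<html><body><p>You searched for: {}</p></body></html>".format(
        html.escape(keyword)
    )
    return "Content-Type: text/html\n\n" + body


if __name__ == "__main__":
    # The web server starts this process once per request, sets
    # QUERY_STRING for GET requests, and relays whatever we write
    # to STDOUT back to the browser.
    sys.stdout.write(render_page(os.environ.get("QUERY_STRING", "")))
```

      The one-process-per-request model shown here is the overhead that the later embedded approaches avoid.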

      People used to write a lot of CGI applications in Perl because of its text processing capabilities, but there were many CGIs that were compiled programs (written in languages like C). At one point Microsoft was really pushing the idea of easily writing CGI applications in Visual Basic and hosting them with IIS.

      CGI fell out of popularity in favor of embedded scripting like PHP and ASP, which have much less overhead (they don't have to create a new process to service every user request and wait for its output) and are much less complex for people to use (they don't require special directories or permissions).

      --
      The clash of honour calls, to stand when others fall.
  3. Developing a Niche Online-Content Indexing System? by omar.sahal · · Score: 3, Insightful

    I don't know if this would be helpful, but the people of Wikipedia must know a fair amount about running crowd-sourced sites. Even if you can't talk with the higher ups there would be contributors who would know about best practices. Also when you deal with people they would be a lot more helpful if they benefit from helping you.

  4. Just migrate it to VMware or KVM by RobiOne · · Score: 3, Informative

    Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

    --
    -- Robi
    1. Re:Just migrate it to VMware or KVM by pspahn · · Score: 1

      This could work and allows you enough time to not come up with something lame.

      --
      Someone flopped a steamer in the gene pool.
    2. Re:Just migrate it to VMware or KVM by OzPeter · · Score: 2

      Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

      That assumes that the original data is available to the OP. It may be that it is not.

      --
      I am Slashdot. Are you Slashdot as well?
    3. Re:Just migrate it to VMware or KVM by Threni · · Score: 1

      > That assumes that the original data is available to the OP. It may be that it is not.

      If only the article in some way made this clear.

      "we are in negotiations to try and get the original data."

      Oh, it does.

    4. Re:Just migrate it to VMware or KVM by tebee · · Score: 1

      As of now it is not available.

      We are putting pressure on the current owners to make it available, as they have suffered a certain amount of bad publicity over this, but so far to no avail. They did purchase the program for real money 10 years ago, but the fact that they are unable to run it should indicate to them it has little or no value now.

      My thoughts have been on the lines of running it on some old PC hanging off some ADSL line with dynamic DNS but virtualization may be a better idea. Does anyone offer virtual private servers that run DOS?

      --
      N.B. this user is far too lazy to write a witty and intelligent sig.
    5. Re:Just migrate it to VMware or KVM by OzPeter · · Score: 1

      "we are in negotiations to try and get the original data."

      In other words the OP does not have the data. And from the OP's reply below it may be that they never get it.

      --
      I am Slashdot. Are you Slashdot as well?
    6. Re:Just migrate it to VMware or KVM by OrangeCatholic · · Score: 1

      Well the program and the data are two different things. At least to me they are.

      All you need to do is run the program once, get a dump of the entire article list, and import it into your new MySQL table.
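
      A minimal sketch of the import step, in Python with SQLite standing in for MySQL. It assumes the dump can be obtained as CSV with a header row; the column names, the sample rows, and the function name import_dump are all hypothetical, since the real export format is unknown.

```python
import csv
import io
import sqlite3

# Hypothetical dump format: a header row, then one article per line.
SAMPLE_DUMP = """title,author,magazine,issue_date
Building a Timber Trestle,J. Smith,Example Modeler,1954-03
Wiring a Reversing Loop,A. Jones,Example Modeler,1961-11
"""


def import_dump(conn, dump_file):
    """Load a CSV dump into an articles table; return the rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "row_id INTEGER PRIMARY KEY, title TEXT, author TEXT, "
        "magazine TEXT, issue_date TEXT)"
    )
    rows = [
        (r["title"], r["author"], r["magazine"], r["issue_date"])
        for r in csv.DictReader(dump_file)
    ]
    conn.executemany(
        "INSERT INTO articles (title, author, magazine, issue_date) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(import_dump(db, io.StringIO(SAMPLE_DUMP)), "articles imported")
```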

      And running the program requires, what, DOS? Come on. Forget the web, that's out of the picture now with regards to the old, expired system. You just need ONE copy of the data and you can re-build the web interface yourself with php.

      It sounds to me like the data is proprietary and they are being stingy with it. But what other use they have for it, I don't know. You could have all the private libraries index their own collections, and collate the results, but something tells me that would require an extensive level of participation.

    7. Re:Just migrate it to VMware or KVM by b4dc0d3r · · Score: 2, Interesting

      If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.

      Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.

      Gimme just the DOS program at least, I'll get you the format.

    8. Re:Just migrate it to VMware or KVM by commanderfoxtrot · · Score: 1

      Mod parent up!

      Nothing necessarily wrong with the DOS program anyway- if it works, why break it?

      You should be able to run it pretty easily with either a virtual machine or an emulator- you can then look at extracting from it the data and migrating it to a flashier site. Sticking with the DOS program sounds like the simpler solution for now.

      --
      http://blog.grcm.net/
  5. put the data online if you can by Anonymous Coward · · Score: 1, Informative

    There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.

    Yeah, having some kind of online query interface can be useful and you should certainly implement one if you can. But much more important is the actual data. Make a zip file for download, no SQL or PHP needed. The SQL and PHP can be done later.

    1. Re:put the data online if you can by martin-boundary · · Score: 1

      Very true. In fact, making the data available for download also solves the problem of bandwidth bills. After the initial bunch of people have downloaded their own copy, they can serve it from other websites, thus sharing the load.

  6. The binary file shouldn't be hard to read by bartonski · · Score: 2, Informative

    I would run the unix commands 'file' (you might get lucky and get a file type that it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack' function found in Perl or Ruby to extract data, which you can load into a database.
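
    Once the hex dump suggests a layout, the extraction step might look like the sketch below, which uses Python's struct module in the same role as 'pack'/'unpack' in Perl or Ruby. The record layout here (40-byte title, 20-byte author, 2-byte year, 1-byte month) is purely hypothetical, as is the guess that the text is in DOS code page 437.

```python
import struct

# Hypothetical fixed-width record layout, for illustration only:
# 40-byte title, 20-byte author, 2-byte little-endian year, 1-byte month.
RECORD = struct.Struct("<40s20sHB")


def decode_text(raw):
    """Strip NUL/space padding and decode as DOS code page 437 (a guess)."""
    return raw.rstrip(b"\x00 ").decode("cp437")


def parse_records(blob):
    """Yield (title, author, year, month) tuples from a blob of records."""
    # Ignore any trailing partial record so iter_unpack does not raise.
    usable = len(blob) - len(blob) % RECORD.size
    for title, author, year, month in RECORD.iter_unpack(blob[:usable]):
        yield decode_text(title), decode_text(author), year, month


if __name__ == "__main__":
    sample = RECORD.pack(b"Building a Timber Trestle", b"J. Smith", 1954, 3)
    print(list(parse_records(sample)))
```

    The resulting tuples can then be loaded into whatever database the project settles on.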

  7. Try Ruby on Rails by olyar · · Score: 4, Funny

    I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.

    The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.

    (Please Note: This post is a joke, and not an attempt to start a flame war).

    --
    Custom, hands-free Linux installs. Instalinux
    1. Re:Try Ruby on Rails by greg1104 · · Score: 4, Funny

      It's data for model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.

  8. That is a data conversion project by mrmeval · · Score: 1

    You could write a custom program that would scrape the data from a website you set up to allow that program to run stand-alone, or you could figure out what the data format is and write a program to convert that.

    If you want to recreate the data from scratch then you'd need to set up a website your group would access and enter data. That would be crowd sourcing but you'd probably want something specific to your needs but using easily maintainable code.

    As others have stated you could use virtualization. Inside the virtual machine you may even be able to run a LAMP stack and run the DOS program with dosbox running as an unprivileged user: http://www.dosbox.com/ http://www.virtualbox.org/ http://www.vmware.com/

    I would only consider the virtual solution a stopgap until you could get the database translated to something maintainable or recreate the data.

    --
    I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
  9. Screen Scrape the Site by mbone · · Score: 1

    See if you can get access to the site again, and screen scrape it. That should not be too hard (search for all articles beginning with "A", then "B", etc.). Then, it should be straightforward to enter it into MySQL or your database of choice.

    (It is just possible the search functionality is still there, with just the HTML being taken down. The WayBack Machine could be your friend here...)

    1. Re:Screen Scrape the Site by tebee · · Score: 1

      If you could scrape the site, I would have done it years ago. Unfortunately the programmer built anti-scraping technology into the program to "protect his data". If you issue too many sequential requests it locks your IP out, permanently! I discovered this about 8 years ago when I was doing some manual scraping and it did it to me.

      If you look at the site (http://index.mrmag.com/) on the Wayback Machine you can see the strange error you get: it locked that out too!

      --
      N.B. this user is far too lazy to write a witty and intelligent sig.
    2. Re:Screen Scrape the Site by PerformanceDude · · Score: 1
      Tebee,

      My company has some pretty sophisticated data transformation tools that we use in forensics. You can connect with me via the /. friends system if you manage to get hold of the source data. We may be able to return it to you in something simple like CSV and then from there things should be easy.

      Not promising a result but happy to at least take a look

      --
      Meus subcriptio est nocens Latin quoniam bardus populus reputo is sanus callidus
  10. Ask Pubmed guys by mapkinase · · Score: 2, Interesting

    Ask guys behind the Pubmed

    http://www.ncbi.nlm.nih.gov/pubmed

    The database of scientific articles in the field of medicine and biology.

    NCBI has the most generous software code licensing that is possible: the code is absolutely free, absolutely no restriction for distributing, changing, selling, even closing it. All because we, taxpayers, paid for it already.

    I am surprised none of them has reacted yet; I am sure they read /.

    --
    I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
    1. Re:Ask Pubmed guys by GumphMaster · · Score: 1

      Or perhaps the NASA Astrophysics Data System http://adswww.harvard.edu/

      --
      Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
  11. And a thousand Mac Fanbois ... by rueger · · Score: 2, Funny

    ... leap up and shout "Filemaker Pro! Cause it's so shiny and pretty!"

    Oh, the number of times that I've heard that refrain... shudder ...

    1. Re:And a thousand Mac Fanbois ... by h4rr4r · · Score: 1

      Eww, the people responsible for that thing need to be led into the street and shot.

      Until quite recently you could not even talk SQL to it.

    2. Re:And a thousand Mac Fanbois ... by arcsimm · · Score: 1

      You know, I spent a semester of my life working for a department at my university that kept all of its operating information in FileMaker Pro databases. Of course, there were two of them, most of the data in one was replicated in the other, and if you actually wanted to *do* anything with the data in either, like have it show up on the departmental calendar or mailing list, you had to manually copy and paste it into still other databases. For most of that semester, my job was basically to function as a $10.00/hr database interface. Had I stayed on there any longer, my superiors would have probably showed up to work one day and discovered that all of their Filemaker DBs had mysteriously migrated into Postgres during the night...

    3. Re:And a thousand Mac Fanbois ... by Bungie · · Score: 1

      I have also spent a long time dealing with FileMaker and it can be a huge PITA. Be thankful you didn't have to maintain a FileMaker Pro Server or web server for many people!

      It is very easy for non-tech-savvy people to use it to build a bunch of databases and start using them, which is cool. The problem is that the databases have a very simple design and most people don't even know how to set up a relationship between two fields. They just drag and drop fields onto a form and let FileMaker figure out how to store and share the data.

      Those databases then tend to evolve, and as they get more complex they are harder to manage using the simple interface that FileMaker Pro tries to provide. One person's quick inventory tracking database suddenly becomes a massive asset database used by the whole company years down the road, and you're left struggling to keep it running. FileMaker Pro and Lotus Domino are the worst for this kind of thing.

      IIRC there are a few ways to extract the data from FileMaker Pro databases. There is an ODBC driver that comes with the FileMaker Pro client (at least it did back in the 3.x and 4.x days). That would be the easiest way to extract the data for other applications to use. FileMaker Pro 4.0 used to also come with a web server plugin that would use CDML to generate dynamic web pages from the database (of course Claris HomePage was the best tool to build CDML apps at the time).

      --
      The clash of honour calls, to stand when others fall.
    4. Re:And a thousand Mac Fanbois ... by BiggerIsBetter · · Score: 1

      Strangely enough though, only had one customer with it on Mac. The rest have been fools running PC version under Windows... when they already have office installed with Access... or even an SQL Server on the network. ?!!?

      If you can show me a way to publish databases to the web that's as quick and easy as FileMaker Pro, I'd love to hear about it.

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
  12. Drupal, hands down. by Beltanin · · Score: 2, Interesting

    Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff related to article indexing, etc. Some handy resources, but there are far more, this was just a 1m search based on the conference alone... http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module , http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about , http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore , http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health

    1. Re:Drupal, hands down. by SpzToid · · Score: 1

      mod up seriously. Knowing what I know about Drupal + Solr, along with these fantastic examples, this is informative, truly.

      --
      You can't be ahead of the curve, if you're stuck in a loop.
  13. Built in to mySQL by Salamanders · · Score: 1

    MySQL 5's Fulltext index with the "natural language search" option might do everything you need with almost no overhead. That, plus PHP's PDO to connect to the database, and I think you might be done. How much data are we talking, anyhow? 10,000 magazine articles or less?
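
    To give a feel for what a fulltext index does under the hood, here is a minimal Python sketch of an inverted index with a naive "count the matching words" ranking. It is an illustration only and not MySQL's FULLTEXT implementation; the sample titles and the function names are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(articles):
    """Map each word to the set of article ids whose title contains it."""
    index = defaultdict(set)
    for article_id, title in articles.items():
        for word in tokenize(title):
            index[word].add(article_id)
    return index

def search(index, query):
    """Return article ids ranked by how many query words they match."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for article_id in index.get(word, ()):
            scores[article_id] += 1
    # Highest score first; ties broken by article id for stable output.
    return sorted(scores, key=lambda a: (-scores[a], a))

# Hypothetical sample data, not taken from the real index.
ARTICLES = {
    1: "Building a timber trestle bridge",
    2: "Wiring a reversing loop",
    3: "Weathering a steel bridge",
}
INDEX = build_index(ARTICLES)
```

    With a data set in the tens of thousands of rows, even something this simple would probably respond instantly, which supports the point that a built-in database index is likely enough.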

  14. Wayback by martin-boundary · · Score: 3, Interesting

    You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe, then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.

    1. Re:Wayback by Cylix · · Score: 1

      Definitely an easy re-write.

      Just going to be painful to re-enter all that data if they can't use the original binary blob.

      Long time ago I had a programming segment regarding binary blobs. Basically, unknown data structures within a binary. Provided they used no encryption it should be relatively painless to extract the data. It was trivial then and now I'm way better.

      --
      "You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
    2. Re:Wayback by Hognoxious · · Score: 1

      How will that help? As far as I understand, the pages are created on the fly, so without the "engine" behind them you won't get anything.

      --
      Confucius say, "Find worm in apple - bad. Find half a worm - worse."
  15. File format, not the implementation details by frisket · · Score: 2, Insightful

    It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.

    1. Re:File format, not the implementation details by jgrahn · · Score: 1

      It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.

      You captured the main point in my refer/BibTeX posting better than I did. Thanks.

      More than once I've had to salvage important data from obsolete database file formats. One instance was rare bird sightings in my area in the 1980s -- they had been painfully typed in over the years, but by 2005 they were sitting in RapidFile format on a half-dead 286 in someone's basement. People generally don't think about data preservation these days.

      Another benefit of using text file formats (surely the only preservable ones, if you think in decades) is that they are easily handled by version control tools, thus easy to author in a distributed effort, easy to audit for changes, and so on.

  16. Re:Silly question by Cylix · · Score: 1

    Bad idea.

    It's a bad idea for the same reason they don't want to host a DOS executable anymore.

    Even if for some strange reason the text could not be retrieved from a binary blob (which is not likely), the application still works today.

    A single command line wild card search would re-dump the text which could be parsed and stored in a simple database.

    --
    "You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
  17. IMO it shouldnt be hard to re-parse the data by Wookie_CD · · Score: 1

    If you're talking, like most of the commenters above, about retrieving the data from the server through tm.exe, then this does become an exercise in scraping. wget has builtin recursive-fetching capabilities and if you can access a complete index that would be a logical starting point. With my background, if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql. I'd read the data file(s) looking for textual content in a linked structure, and the rest is just research and a bit of perl work (or php etc, if you prefer). Once you figure out which table structure would contain the data, and you come up with a conversion which will put the data into an importable format, the job's almost done and you just need to bring in or write a CMS to access it. I have source code which would go towards some individual bits of a project like this, contact me if you like. Good luck...
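
    As a first pass at "reading the data file(s) looking for textual content", something like the Unix strings tool is often enough. The Python sketch below pulls runs of printable ASCII out of a binary blob together with their byte offsets. The sample bytes are invented, and it assumes the text is stored as plain, unencrypted ASCII, which nobody in this thread has confirmed.

```python
import re

def printable_runs(blob, min_len=4):
    """Yield (offset, text) for each run of printable ASCII bytes in blob."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(blob):
        yield match.start(), match.group().decode("ascii")

# Invented stand-in for a record in an unknown binary format.
SAMPLE = b"\x00\x01Trestle bridge\x00\x07\x00J. Smith\xff\x02ab\x00"

for offset, text in printable_runs(SAMPLE):
    print(offset, text)
```

    The offsets matter more than the strings themselves: regular spacing between runs usually hints at fixed-width records, which is the first clue to the table structure.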

  18. hyperestraier by sugarmotor · · Score: 1

    Take a look at http://hyperestraier.sourceforge.net/ ... there might be something newer by the same author, Mikio Hirabayashi

    Extracting the text from whatever files you have would be a separate step.

    --
    http://stephan.sugarmotor.org
  19. Data hoarders by tepples · · Score: 1

    It also introduces the problem of people who download the whole data set just to collect it, with no intention of accessing the vast majority of it or serving it up to someone else.

    1. Re:Data hoarders by quickOnTheUptake · · Score: 1

      Just stick it on bittorrent, if there is a big demand.
      Realistically, though, I doubt the database is very large (moreover, I doubt there are all that many people who would want this data). I mean, if you are indexing 50 magazines, over 100 years, with an average of 10 articles in each one, that's 50k articles. Let's say each article has 200B of data, that's, what? ~10 meg uncompressed?
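
      Spelling the estimate out as a trivial Python sketch (the variable names are mine, and it reads the figures as ten articles per magazine per year):

```python
# Figures taken from the estimate above; all of them are rough guesses.
magazines = 50
years = 100
articles_per_magazine_year = 10
bytes_per_article = 200

total_articles = magazines * years * articles_per_magazine_year
total_bytes = total_articles * bytes_per_article
print(total_articles, total_bytes / 1_000_000)  # 50000 10.0
```

      So the whole index is on the order of 10 MB uncompressed, which is small by any measure.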

      --
      Mod points: Guaranteed to remove your sense of humor.
      Side effects may include gullibility and temporary retardation
  20. B& by tepples · · Score: 1

    wget has builtin recursive-fetching capabilities

    Which will get the IP address of the machine running the scraper permanently banned. See the post above.

    if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql

    It's likely that the raw data is encrypted. Based on the comments so far, I see no reliable indication of what country tebee operates from or whether that country has a DMCA-alike.

    1. Re:B& by Wookie_CD · · Score: 1

      This is all a bit academic until the content owner either agrees to reopen web access to a conversion team, or releases the source data for analysis.

    2. Re:B& by b4dc0d3r · · Score: 1

      It's not academic if we can show the poster some sort of very simple wiki-like CMS where people with 6 decades of back issues might volunteer to enter/edit information. If everyone were organized, 100 people could enter the data in a weekend. Allowing time to edit and refine keywords, without copying the actual content, would add some time. And the backend database could end up more valuable than the original.

      Scraping the data isn't possible, getting the data looks unlikely. So you recreate it. Have people claim an issue, and enter the data. People with few issues will claim the ones they have so people with more comprehensive coverage can focus on what no one else has. Bonus is, if no one else is interested, no one bothers to enter what they know, so the project self-immolates.

  21. hoarding == massive replication by martin-boundary · · Score: 2, Interesting

    Short term it's true that can eat some bandwidth, but long term that's the solution to the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.

  22. Re:Developing a Niche Online-Content Indexing Syst by Tablizer · · Score: 1

    WikiPedia's search stinks in my opinion. It's gotten better of late, but still not the Gold Standard by any stretch.

  23. Fancy that by Anonymous Coward · · Score: 1, Funny

    > One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. [...] The governing body for the hobby has agreed to host this

    Huh, I didn't realize that porn had a governing body.

  24. It's a library catalog. by oneiros27 · · Score: 1

    Don't ask generic nerds -- ask library nerds : code4lib . They have a pretty active mailing list.

    Also, there's oss4lib which is specifically for open source software, but I haven't seen much activity on their list in a while, and I think most of us are on both lists. (there's also a few cataloging specific lists, but they get to be all library-sciencey, with discussions of RDA and FRBR and cataloging aggregates).

    --
    Build it, and they will come^Hplain.
    1. Re:It's a library catalog. by dangitman · · Score: 1

      Denholm: It's settled. I've got a good feeling about you Jen and they need a new manager.
      Jen: Fantastic! So, the people I'll be working with, what are they like?
      Denholm: Standard nerds!

      [Note: Not to be confused with standards nerds]

      --
      ... and then they built the supercollider.
  25. Using a Howitzer to Hunt Squirrels by salesgeek · · Score: 2, Insightful

    Lots of people here are recommending using tools that are built for very large scale projects. Based on the fact you have a DOS-based system that likely used a pretty common library for storing the data (something like c-tree, btrieve, a dbase library or simply saving binary data using whatever language the app was written in), using any RDBMS like MySQL or even SQLite probably would do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load than the DOS app could.

    Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.

    --
    -- $G
  26. This is the ModelRR mag. database by codeaholic · · Score: 1
    The description suggests that this is the Model RR magazine DB. Checking Kalmbach (the company that hosted it), shows that, indeed, it is off line. (http://index.mrmag.com/) The DB was a very simple (by today's standards) index of articles.

    As many posters have said, it should be easy (for a programmer) to pull the data from the DB -- if you can get the original data files from Kalmbach. The data was not complex, and 80's DBs tended to have simple file formats. As many suggested, a C++, Java, Python or other script can pull the data out and dump it to XML, MySQL, CSV files, etc. From there, it is easy to rehost it wherever needed.

    My suggestion is to simply replicate the old (very dated, but simple) UI: both for searching and for data entry. That can be done very easily in PHP & MySQL. These tools are readily available on any web host making the task fairly simple (for someone familiar with these tools.) It also means that the site's webmaster should know what needs to be done to secure the app.

    Getting a straight replacement up validates the whole process, and restores the existing functionality. Only at that point should you consider extending the system, perhaps using many of the good ideas noted above. An obvious extension is to license the full text of articles to provide a full-text index (rather than just hand-entered keywords as in the current system.) Perhaps provide links to publishers who sell them online. Lots of ways to go.

    Good luck. As a user of the DB, I'd love to see it back online & better than ever!

    1. Re:This is the ModelRR mag. database by codeaholic · · Score: 1

      If this is done as a volunteer effort, I'd be happy to help, esp. with extracting the old data. Contact me using my Slashdot user name + att dot net. (I hope THAT fools the spambots!)

  27. Why PHP? by WhiteHorse-The+Origi · · Score: 1

    Why use PHP? I would think Python would be better because you can cross-compile the code to run on any machine using Jython (in case they stop hosting for you). Personally, I would do a full scrape of the data and put it in BibTeX .bib files or XML and then make your search page pass parameters to the Python program. That's what NASA and Google Scholar use (they may use Perl instead). I'm not sure about the database...

  28. Dspace by ericlondaits · · Score: 1

    Check out Dspace (http://www.dspace.org/). I'm by no means an expert in the area but it seems it might be what you need.

    --
    As a Slashdot discussion grows longer, the probability of an analogy involving cars approaches one.
  29. Hypercard 2.0? by AHuxley · · Score: 1

    Something like an open source hypercard stack?
    Anyone can understand a card system, enter unique data per card and save.
    Humans are good at that.
    Bring them all together and you have a huge digital stack to be sorted, searched or as the backend to a nice simple topic interface.
    Computers are great at that now.
    That would help your crowdsourcing, and if it's open source there are no closed MS format issues later on.

    --
    Domestic spying is now "Benign Information Gathering"
  30. Re:for my bunghole by OrangeCatholic · · Score: 1

    >Well until he does get it, any consideration of how to process it is somewhat moot.

    Not quite. He was clear enough to construct a data model. This customer knows what he wants. Problem is, it will take his own efforts to fill in the gaps (in terms of getting access).

    "Hi, I want you to install a refrigerator in my apartment. It needs to fit in a hole 30 inches wide by 30 inches deep."

    "Will you take a refrigerator 28 inches wide by 26 inches deep?"

    "Sure but....lemme talk to my landlord first."

    If Zeus descended from the sky and said, "I'll do whatever it takes to get this index online..."

    Would Zeus succeed, or would the customer say to him, "I'm not ready?"

    P.S. I'm NOT for hire on this job. I am not.even.a.programmer.anymore.

    I will, however, take queries as far as I check my email (which is unreliable) and as far as I check this page (until tomorrow at the least).

    You asked, you got your answers. 88 comments, perhaps 10 of them were useful. Anyone who says to "use X" is dumb. By the time you figure out how to use it, you could have written your own.

    This is 1-3 tables, which for a real-world analogy is like 1-3 sheets of paper. Customer says what? Landlord? Landlord rules 1-3 sheets of paper. Good luck with that access.

  31. DOS Data by nospam007 · · Score: 1

    If it's 20-year-old DOS, chances are that it's either Paradox or dBASE or any xBASE format, which could be easily opened with Access or even Winword.
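
    If the files did turn out to be dBASE III-style .dbf tables, reading them needs no special software. The Python sketch below parses the commonly documented DBF layout (32-byte header, 32-byte field descriptors, fixed-width records). Whether the real index uses this format is purely an assumption here; the sample file, field names and values are all invented, and memo fields and non-ASCII code pages are ignored.

```python
import struct

def read_dbf(data):
    """Parse a dBASE III-style table held in bytes; return (field_names, rows)."""
    # Record count (uint32), header length and record length (uint16 each),
    # little-endian, starting at byte offset 4 of the header.
    num_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)

    # 32-byte field descriptors follow the 32-byte header, ended by a 0x0D byte.
    fields = []
    pos = 32
    while data[pos] != 0x0D:
        name = data[pos:pos + 11].split(b"\x00", 1)[0].decode("ascii")
        length = data[pos + 16]
        fields.append((name, length))
        pos += 32

    rows = []
    for i in range(num_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] == b"*":      # a leading '*' marks a deleted record
            continue
        offset = 1                  # skip the deletion flag byte
        row = {}
        for name, length in fields:
            raw = record[offset:offset + length]
            row[name] = raw.decode("ascii", errors="replace").strip()
            offset += length
        rows.append(row)
    return [name for name, _ in fields], rows

def make_sample_dbf():
    """Build a tiny in-memory DBF file (invented data) for trying out read_dbf()."""
    def descriptor(name, length):
        # name (11 bytes), type 'C', 4-byte address, length, decimals, 14 reserved
        return (name.encode("ascii").ljust(11, b"\x00") + b"C" + b"\x00" * 4
                + bytes([length, 0]) + b"\x00" * 14)

    descriptors = descriptor("TITLE", 10) + descriptor("YEAR", 4)
    header_len = 32 + len(descriptors) + 1
    record_len = 1 + 10 + 4
    header = (bytes([0x03, 95, 1, 1])
              + struct.pack("<IHH", 2, header_len, record_len) + b"\x00" * 20)
    records = (b" " + b"Trestle".ljust(10) + b"1956"
               + b"*" + b"Removed".ljust(10) + b"1960")
    return header + descriptors + b"\x0d" + records + b"\x1a"

names, rows = read_dbf(make_sample_dbf())
```

    Every value comes back as a stripped string; converting years or page numbers to integers is left to whatever loads the rows into the new database.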

  32. No, no, NO! by RichiH · · Score: 3, Insightful

    Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even the first normalized form.

    Read http://en.wikipedia.org/wiki/Database_normalization

    You want to define a table "article_tags" or something with id, article_id, name, comment. Make the combination of article_id and name unique.

    * id is on auto-increase, not NULL
    * article_id is a foreign key to the id of the article, not NULL
    * name is the name of the tag, not NULL
    * comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL

    Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
    How many articles are tagged electric? SELECT COUNT(*) FROM article_tags WHERE name = "electric";
    Give me a list of all articles relating to foo or bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar".
    etc pp.
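
    The tag table described above could be sketched as follows. This uses Python's built-in sqlite3 module purely so the example is self-contained; the thread assumes MySQL, where the column types and auto-increment syntax differ. The articles table and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE articles (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE article_tags (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL REFERENCES articles(id),
    name       TEXT NOT NULL,
    comment    TEXT,                    -- optional, may be NULL
    UNIQUE (article_id, name)           -- one row per tag per article
);
""")

# Invented sample rows.
conn.executemany("INSERT INTO articles (id, title) VALUES (?, ?)",
                 [(1, "Wiring a reversing loop"), (2, "Catenary basics")])
conn.executemany("INSERT INTO article_tags (article_id, name) VALUES (?, ?)",
                 [(1, "electric"), (1, "wiring"), (2, "electric")])

# How many articles are tagged "electric"?
electric_count = conn.execute(
    "SELECT COUNT(*) FROM article_tags WHERE name = ?", ("electric",)
).fetchone()[0]

# Tagging the same article twice with the same name violates the constraint.
try:
    conn.execute("INSERT INTO article_tags (article_id, name) VALUES (?, ?)",
                 (1, "electric"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

    The unique constraint is what keeps crowd-sourced data entry from piling up duplicate tags when two volunteers index the same issue.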

    If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:

    http://en.wikipedia.org/wiki/Transitive_closure
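
    One way to get "all upper levels apply" without storing the closure is to compute it at query time with a recursive query. The Python/sqlite3 sketch below does that for a simplified tags table that has only a parent_id column (no separate glue table, so it is a variant of the design above, not the same thing). The tag names are invented, and recursive queries may not be available on every MySQL version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES tags(id),   -- NULL for a top-level tag
    name      TEXT NOT NULL
);
INSERT INTO tags (id, parent_id, name) VALUES
    (1, NULL, 'motive power'),
    (2, 1,    'locomotives'),
    (3, 2,    'electric'),
    (4, NULL, 'scenery');
""")

def tag_with_ancestors(conn, tag_id):
    """Return the names of a tag and every tag above it, nearest first."""
    rows = conn.execute("""
        WITH RECURSIVE chain(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM tags WHERE id = ?
            UNION ALL
            SELECT t.id, t.parent_id, t.name, chain.depth + 1
            FROM tags AS t JOIN chain ON t.id = chain.parent_id
        )
        SELECT name FROM chain ORDER BY depth
    """, (tag_id,))
    return [name for (name,) in rows]

print(tag_with_ancestors(conn, 3))
```

    An article tagged "electric" can then be treated as also belonging to "locomotives" and "motive power" without anyone entering those tags by hand.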

    Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and pretty much everything else.

    One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST_ heed this.

    http://en.wikipedia.org/wiki/SQL_injection
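
    To make the difference concrete, here is a small Python/sqlite3 sketch contrasting a bound parameter with string pasting. The thread's actual suggestion is PHP with PDO; this only shows the same idea in a self-contained form, and the table contents and the hostile input are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany("INSERT INTO articles (title) VALUES (?)",
                 [("Wiring a reversing loop",), ("Catenary basics",)])

def find_by_title(conn, title):
    """Look up articles by exact title, passing the value as a bound parameter."""
    # The ? placeholder keeps user input out of the SQL text entirely.
    rows = conn.execute("SELECT id, title FROM articles WHERE title = ?", (title,))
    return rows.fetchall()

hostile = "x' OR '1'='1"

# Safe: the hostile string is compared as a literal title and matches nothing.
safe_rows = find_by_title(conn, hostile)

# Unsafe, shown only for contrast: pasting input into the SQL changes the query
# into "... WHERE title = 'x' OR '1'='1'", which matches every row.
unsafe_rows = conn.execute(
    f"SELECT id, title FROM articles WHERE title = '{hostile}'"
).fetchall()
```

    The same search form input returns nothing in the first case and the whole table in the second.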

    Richard

    PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.

    1. Re:No, no, NO! by RichiH · · Score: 1

      The second statement should have read

      SELECT article_id FROM article_tags WHERE name = "foo" AND name = "bar"

      for obvious reasons.

    2. Re:No, no, NO! by ultranova · · Score: 1

      No, OR is correct here. AND doesn't find any rows because field "name" has only a single value, so 'name = "foo" AND name = "bar"' can't ever be true for any row. You want something like

      SELECT article_id FROM article_tags WHERE name = "foo" INTERSECT SELECT article_id FROM article_tags WHERE name = "bar"
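
      The three variants discussed here are easy to compare side by side. The sketch below uses Python's sqlite3 module with invented rows, and adds a GROUP BY / HAVING form that gives the same answer as INTERSECT; that form is handy on database engines that lack INTERSECT (older MySQL releases among them, as far as I know).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_tags (article_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO article_tags VALUES (?, ?)",
                 [(1, "foo"), (1, "bar"), (2, "foo"), (3, "bar")])

def ids(sql, params=()):
    """Run a query and return the sorted list of distinct article ids."""
    return sorted({row[0] for row in conn.execute(sql, params)})

# OR: articles carrying either tag.
either = ids("SELECT article_id FROM article_tags WHERE name = ? OR name = ?",
             ("foo", "bar"))

# AND on a single column: no row can have two names at once, so always empty.
never = ids("SELECT article_id FROM article_tags WHERE name = ? AND name = ?",
            ("foo", "bar"))

# INTERSECT: articles carrying both tags.
both_intersect = ids(
    "SELECT article_id FROM article_tags WHERE name = ? "
    "INTERSECT "
    "SELECT article_id FROM article_tags WHERE name = ?",
    ("foo", "bar"))

# Equivalent without INTERSECT: count the distinct matching tags per article.
both_grouped = ids(
    "SELECT article_id FROM article_tags WHERE name IN (?, ?) "
    "GROUP BY article_id HAVING COUNT(DISTINCT name) = 2",
    ("foo", "bar"))
```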

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

    3. Re:No, no, NO! by RichiH · · Score: 1

      Ah, yes thanks... I should not be allowed to post before coffee...

  33. BibTeX by GeniusDex · · Score: 1

    It may not be a complete solution, but have you looked at BibTeX? BibTeX itself is only a format for nicely stating the information you have available (which magazine, article title, which pages in the magazine, authors, etc), but in the entire BibTeX ecosystem a number of indexing systems are built. Quite a lot of them are for desktop use (so you can manage your own BibTeX entries), but I'd imagine there would be some web-based system for this as well.
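
    For a sense of what one index entry would look like in that format, here is a minimal Python sketch that renders an article record as a BibTeX @article entry. The record, the citation key and the magazine name are invented, and the function does no escaping of special characters, which a real exporter would need.

```python
def to_bibtex(key, record):
    """Render one article record (a dict of strings) as a BibTeX @article entry."""
    field_order = ["author", "title", "journal", "year", "month", "pages"]
    lines = ["@article{%s," % key]
    for field in field_order:
        if field in record:
            lines.append("  %s = {%s}," % (field, record[field]))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical record; the citation key and all values are invented.
entry = to_bibtex("smith1956trestle", {
    "author": "Smith, J.",
    "title": "Building a timber trestle",
    "journal": "Example Railroad Monthly",
    "year": "1956",
    "pages": "34--37",
})
print(entry)
```

    Because each entry is a few lines of plain text, a file of them diffs cleanly under version control, which suits a volunteer data-entry effort.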

  34. Sure About DOS? by Bungie · · Score: 1

    Are you sure it's a true DOS application and not a Win32 Console App? I know it is entirely possible for someone to write a CGI in DOS but it seems really weird to me that they would use DOS since it didn't have anything that would serve CGI, and coding a hand-rolled database format would be a lot of extra work.

    If it is using Win32 it might just be accessing a DAO database without using the mdb extension, which many companies do to make it look like a proprietary format you can't just open with MS Access. If you look at the raw data it might look crazy and unusable because JET databases use XOR to obfuscate the contents of the database file (and prevent you from extracting the strings inside).

    --
    The clash of honour calls, to stand when others fall.
    1. Re:Sure About DOS? by tebee · · Score: 1

      You're right, it is. A visit to the wayback machine found this page- http://web.archive.org/web/20070626092758/www.index.mrmag.com/tm.exe?tmpl=tm_info

      On which is written -

      The TM application is written in "C", and is based on an ISAM/Network database manager I wrote in the late 1980's. The code is highly portable, and versions exist for MS-DOS, Windows NT and several flavors of UNIX. I also run it on my HP Palmtop. The version running on this site is a Win32 console application.

      Which just goes to show I should not trust either my memory or what people tell me. Sadly it also indicates the data is in an unknown and probably unique implementation of ISAM.

      --
      N.B. this user is far too lazy to write a witty and intelligent sig.
  35. There is at least one misunderstanding. by Jane+Q.+Public · · Score: 1

    It simply cannot be the case that the original data is in an "unknown" format. If it were, it would never have been retrievable. The format might not be known to YOU, at this particular time, but that is not the same as unknown.

    Your first priority is to find out how the original data is stored and accessed. If as you say it is about 20 years old, I strongly suspect it is stored in a C-ISAM or D-ISAM database, and known code libraries are used to access it.

    You should then be able to lean heavily on existing code for retrieval of that information, and using a modern scripting language, transfer the data to a new, normalized relational database.

    You make it sound like this is some kind of archeological adventure into some kind of untranslatable code hieroglyphics of the distant past. But as I say that cannot be so, unless it was completely designed from the ground up by a single eccentric individual. The clues and the tools are there, if you know where to find them.

  36. Nail down the file format first by jgrahn · · Score: 1
    It seems to me that your core problem is data preservation for long periods of time; if you can save that 1930--present data, you don't want to lose it again. You should go for a plaintext file format, and be aware that you *are* using a file format.

    There are file formats for this. Probably there are XML languages if you like that kind of thing, but either of two older ones would serve you well I think: the refer(1) format for bibliographic databases, and the BibTeX format. At least the latter is still in use and you can download such indices for various journals -- see the Wikipedia BibTeX entry.

    When you have fixed the file format, *then* you can decide on indexing strategies and software. Probably you'll find that someone has already done that for your file format. Mike Lesk did it for refer back in the 1970s ...

  37. Semantic web by sogon · · Score: 1

    Just make one XHTML document with semantic annotations. There are plenty of solutions for indexing this information. Also a simple Google site: search will often suffice for most queries.

  38. Brewster Kahle's Digital Library by satellitedirect · · Score: 1

    Brewster Kahle is building a truly huge digital library -- every book ever published, every movie ever released, all the strata of web history ... It's all free to the public. A video describing his project can be found at : http://www.ted.com/talks/lang/eng/brewster_kahle_builds_a_free_digital_library.html Good luck with your project.

  39. A library catalog system is needed by Edgester · · Score: 1

    This sounds almost exactly like a library catalog system. If the system doesn't index articles, then just treat each article as a book in a multi-volume set. I know that several open source library systems exist. Look into those.

  40. Backup everything by Antique+Geekmeister · · Score: 1

    Seriously, first step, back up *EVERYTHING*. This includes your programs and your data.

    Then see if your ancient programs can be run inside a useful modern emulation environment, like "dosbox" or "freedos". That can buy you another 10 years.

    It also buys you access to the data without using your ancient hardware: you can read the backups and play with the data much more safely, to try and decode the format. Given the software's age, it's unlikely to be more sophisticated than a very simple index and tables that may be decodable with a good editor.

  41. Re: ISAM/Network database manager by gregben · · Score: 1

    If you or someone can get me the database files (from Kalmbach?) I am willing
    to try to extract useful data from them, into simple ASCII text file(s), suitable
    for loading into a relational database like Postgresql, for free.

  42. Use a blog by dgriff · · Score: 1

    Feed all the data into a blog, one magazine edition per blog entry. Some blog software lets you set the date of the entry - use that to set the date to the edition publication date. Enter the keywords as blog tags. You can expand the blog entry to contain such things as (e.g.) a picture of the cover, some short descriptive text. You then also get a free forum where people can discuss the edition in question.

  43. Zebra is great for bibliographic data by kylemhall · · Score: 1

    The data could easily be converted into MARC bibliographic records and indexed with Zebra. You could then use Zebra as a stand-alone Z39.50 server, or run Koha on top for easy searching. Zebra can search millions of records in seconds, so it would be ideal, considering this is essentially bibliographic data. I am a public library IT guy, and would seriously be willing to help out if you can use me. Just send me an email ( kyle DOT m DOT hall AT gmail.com ). I can take the raw data and convert it to a bulk MARC record set for you, and probably even offer alternative hosting if you'd like.

  44. Re:I work for a Library by SEWilco · · Score: 1
    Yes, book or periodical indexing software may be suitable. Just pick one with keyword support and an assortment of search fields.

    If the individual article summaries are also made available on individual pages, let Googlebot index them and people will be able to discover relevant individual articles through Google as well. Then anyone looking for something covered by an article will discover your index, as well as which issue of the magazine they need.

  45. CWIS Open Source Solution by Snowdog · · Score: 1

    It sounds like CWIS may be what you're seeking. It's a free web-based turnkey package, developed at the University of Wisconsin - Madison and funded in part by NSF under the National Science Digital Library initiative. CWIS is written in PHP/MySQL, includes a search engine, a recommender engine, and a raft of other features, and is currently in use in a wide array of contexts.

  46. Re:Where do we help? by tebee · · Score: 1

    At the moment things are a little fragmented but we seem to be congregating on this thread http://model-railroad-hobbyist.com/discountinued_mag_index

    I hope things will get a little more organized once we have a clear idea whether the original data is going to be available to us.

    I too have a website we could use; I'm currently putting up versions of some of the things that have been suggested here. It's at http://pc-cafe.co.uk/mr

    --
    N.B. this user is far too lazy to write a witty and intelligent sig.