Developing a Niche Online-Content Indexing System?

Sphinx or Lucene by Anonymous Coward · 2010-07-17 09:09 · Score: 3, Informative

Or did I misunderstand the question?

Re:Sphinx or Lucene by Anonymous Coward · 2010-07-17 09:35 · Score: 0

Yes, you misunderstood it. In fact, it's pretty clear that you didn't even bother to read anything he wrote. He clearly doesn't want to index a bunch of text for searching purposes. He basically just wants to build a directory web site, much like Yahoo! or the Open Directory Project, but targeted to his subject's niche.
Re:Sphinx or Lucene by tebee · 2010-07-17 09:46 · Score: 4, Informative

Yes, you did misunderstand.
We do not have the full text of the article online; all we have is its title, author and some manually created keywords. It's necessary to have access to the physical magazine to read the content of the article, but this is a hobby (model railroading) where many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines.
All the solutions I could find seemed to be based, like those two, on indexing the text of the articles.
It would be much better if we did have the text as well, but as I said there is the minor problem of copyright. The fact that the index has been run for the last 10 years by the major (dead tree) publisher in this field has also discouraged development in this direction.

--
N.B. this user is far too lazy to write a witty and intelligent sig.
Re:Sphinx or Lucene by martin-boundary · 2010-07-17 10:13 · Score: 3, Interesting

Even if you have only the title/author, you're still indexing text. Think of a tiny little text file containing two or three lines: title, author, keywords. You'll need a volunteer to type this in. Then you dump those files in a directory and run an indexer.
If this isn't what you have in mind, please elaborate.
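The "tiny text file per article" idea above can be sketched as a small inverted index. This is only an illustration of the concept, not how Sphinx or Lucene work internally; the sample records and the function names (`tokenize`, `build_index`, `search`) are made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample records: title, author and keywords only (no article text).
RECORDS = [
    {"id": 1, "title": "Weathering a Boxcar", "author": "J. Smith",
     "keywords": "weathering, freight cars"},
    {"id": 2, "title": "Wiring a Reverse Loop", "author": "A. Jones",
     "keywords": "wiring, track"},
    {"id": 3, "title": "Boxcar Kit Review", "author": "J. Smith",
     "keywords": "kits, freight cars"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each word to the set of record ids whose fields contain it."""
    index = defaultdict(set)
    for rec in records:
        for field in ("title", "author", "keywords"):
            for word in tokenize(rec[field]):
                index[word].add(rec["id"])
    return index


def search(index, query):
    """Return the sorted ids of records that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


INDEX = build_index(RECORDS)
print(search(INDEX, "boxcar"))
print(search(INDEX, "smith freight"))
```

Even with three short fields per article, this is text indexing: the search never needs the article body, only the words a volunteer typed in.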
Re:Sphinx or Lucene by hsmyers · 2010-07-17 10:18 · Score: 1

Not being up to speed on current open source that might prevent premature wheel re-invention my answer would be 'No'. That said, I don't see any particular trouble with the project itself. If I understand correctly, you've bare bones bibliographic information that you want to create an on-line index of. The notion of PHP and MySql seems sound although I suspect that Perl would work as well if not better, depending on the knowledge of your volunteer talent. I expose my bias here when I point out that text analysis is a particular strength of Perl. I'm currently involved with a project that does an enormous amount of semantic analysis which might be used to create key words on the fly for instance. Now that I think of it, there is no particular reason that the work couldn't be multi-lingual for that matter, leveraging the programmer base at your disposal. Continuing to think about it, I'd love to help--- reply to me at gmail.com if you are interested...
Re:Sphinx or Lucene by symes · 2010-07-17 10:22 · Score: 1

They have no plans to replace it as the original data is in an unknown format.
Well there aren't that many obvious candidates... any of these look familiar?
Re:Sphinx or Lucene by OrangeCatholic · 2010-07-17 10:26 · Score: 3, Informative

So let me get this straight: This is a single table? You have one table (spreadsheet), where each row represents one article. The columns would be title, author, and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delineated or whatever).
Then you need the standard row_id and whatever other crufty columns creep in. If this is all you need, you can do this in Excel (har har). Or install MySQL, create the table (we'll call it mr_article_list), then write the standard php scripts to add, edit, delete, and retrieve entries.
These scripts are basically just web forms that pass through the entered values into the database. You're talking a single code page for each of the inputs, and then a page each for the output/result, or 8 pages total.
For example, the mr_add.php script (mr_ stands for model railroad) retrieves a new row_id from the db. Then it presents a web form with input fields for the title, author, and keywords. Then it does db_insert(mr_article_list, $title, $author, $keywords). Then it calls mr_add2.php, which is either success or failure.
The edit, delete, and retrieve scripts are similarly simple. All you need is a linux box to do this, and the basic scripts could be written in two evenings (or one long one) - assuming you hired someone who does this for a living.
Now this is where it gets interesting:
>many clubs and individuals have vast libraries often spanning 5 or 6 decades of monthly magazines
Do you want to store this information as well, so that people know who to call to get the issue? I assume this would be the real useful feature. So now you need a second table, mr_sources, which is basically a list of clubs/people, so the columns in this table would be like row_id, name, address, phone number (standard phone book shit).
Then you need a third table, mr_article_sources, which is real simple, it just matches up the rows in the article list to the rows in the source list. Its columns are simply row_id, article_row_id, source_row_id. This is a long and narrow table that cross-indexes the two shorter, fatter tables (the list of articles, and the list of sources).
Example, article_id #19 is "How to shoot your electric engine off the tracks in under three seconds." Source_id #5 is Milwaukee Railroad Club, #7 is San Jose Railroad Surfers, and #9 is Bill Gates Private Book Collection. All three of them have this article. So your cross-index table would look like this:
01 19 05
02 19 07
03 19 09
When you search for article #19, it finds sources 5, 7, and 9 in the cross-index table, then queries the source table for the names and phone numbers of those three clubs (and displays them).
Finally, if you're wondering how to query three different tables at the same time, well, databases were made to do exactly this.
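The three-table layout described above can be tried out with a short sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption made only so the example runs anywhere); the table names come from the comment, while the column types, the phone numbers, and the helper `sources_for_article` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE mr_article_list (
    row_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    author TEXT
);
CREATE TABLE mr_sources (
    row_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    phone  TEXT
);
CREATE TABLE mr_article_sources (
    row_id         INTEGER PRIMARY KEY,
    article_row_id INTEGER NOT NULL REFERENCES mr_article_list(row_id),
    source_row_id  INTEGER NOT NULL REFERENCES mr_sources(row_id)
);
""")

# The example rows from the comment; phone numbers are placeholders.
conn.execute(
    "INSERT INTO mr_article_list VALUES (19, ?, NULL)",
    ("How to shoot your electric engine off the tracks in under three seconds",),
)
conn.executemany("INSERT INTO mr_sources VALUES (?, ?, ?)", [
    (5, "Milwaukee Railroad Club", "555-0105"),
    (7, "San Jose Railroad Surfers", "555-0107"),
    (9, "Bill Gates Private Book Collection", "555-0109"),
])
conn.executemany("INSERT INTO mr_article_sources VALUES (?, ?, ?)", [
    (1, 19, 5),
    (2, 19, 7),
    (3, 19, 9),
])


def sources_for_article(conn, article_id):
    """Join the cross-index table to the source table for one article."""
    rows = conn.execute("""
        SELECT s.name, s.phone
        FROM mr_article_sources AS x
        JOIN mr_sources AS s ON s.row_id = x.source_row_id
        WHERE x.article_row_id = ?
        ORDER BY s.row_id
    """, (article_id,))
    return rows.fetchall()


for name, phone in sources_for_article(conn, 19):
    print(name, phone)
```

A single JOIN does the lookup through the cross-index table, so the application never has to run one query per source.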
Re:Sphinx or Lucene by hsmyers · 2010-07-17 10:35 · Score: 1

Just noticed a thread on Hacker News on http://www.gotapi.com/html which might be of interest...
Re:Sphinx or Lucene by tolan-b · 2010-07-17 10:39 · Score: 1

I think you should still have a look at Sphinx and Lucene. You can put whatever data you want into them, in whatever schema you want (at least with Lucene, I believe with Sphinx too). You can then easily create a UI as a front end and let the indexing engine do the hard work of slicing and dicing by your criteria. I believe the Zend Framework library has a Lucene API.
Also if you do manage to go fulltext later then it'll mean less work.
Re:Sphinx or Lucene by Trepidity · 2010-07-17 10:46 · Score: 3, Interesting

If you have relatively little but highly structured data, running it through a general search engine like Lucene or Sphinx doesn't seem like the ideal solution, because it doesn't make it easy to do structured queries ("give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966").
A bibliography indexer would probably be a better choice. Two good free ones are Refbase or Aigaion. Both are targeted mainly at databases of scientific literature, so might need some tweaking for this purpose, though.

--
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
Re:Sphinx or Lucene by martin-boundary · 2010-07-17 10:53 · Score: 2, Interesting

Yes, I was mainly trying to point out that his problem is still conceptually a text indexing problem even if he doesn't have the text of the articles. A scientific bibliography database can be a good choice, as some journals can have arcane numbering systems, so they should be able to cope with a magazine collection.
Like someone else pointed out, though, if at some point he expects to get access to the full text or even just scans of the articles, he'd better have chosen a system that can easily expand to handle that.
Re:Sphinx or Lucene by rs79 · 2010-07-17 12:24 · Score: 2, Interesting

I do the same thing for tropical fish and wrote a shitload of C code. If this is an old DOS program it should port to C/UNIX really stupid easy.
Drop me a line if you want to and I'll ask you to send me some sample data. This might be really easy.

--
Need Mercedes parts ?
Re:Sphinx or Lucene by banjo+D · 2010-07-17 12:36 · Score: 1

I don't know about Sphinx but I agree that Lucene could be a good solution, for the reasons tolan-b lists. I work on a digital library cataloging project that indexes its metadata with Lucene. We use PHP to generate the user-facing website, which queries our Lucene index via a Solr server. We do have a highly structured metadata schema and we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966" (which somebody in another comment suggested is not easy to do with Lucene, but in our experience was very easy). And adding a Solr server on top makes it easy to include features like faceted search.
Re:Sphinx or Lucene by sporkboy · 2010-07-17 15:06 · Score: 1

The standard /. IANAL applies here, but I'm pretty sure that if you have legal access to the copyrighted text (ie you or someone you know owns a copy of the magazine) then it is ok to create a derivative work for the purposes of searching that work. This is the loophole that Google (name your favorite search engine here) uses, and they go so far as to offer cached versions of some sites.
Lucene, or a more friendly wrapper around it like SOLR, has the option of creating a search index based on an original text from which the original content cannot be extracted (indexed=true, stored=false on a field), so that would seem to cover the case of finding an article without violating the rights of the author or the publisher.
As for not having the text online, I'd suggest scraping the archive sites in the process of building your search index; it's pretty hard to search something that isn't digitized.
Best of luck, as this sounds like a worthwhile project. I do think that the volume of data you're discussing would fit easily in a SOLR instance that would consume very modest amounts of server resources to operate.
Re:Sphinx or Lucene by Jherico · 2010-07-17 17:45 · Score: 1

Solr in front of Lucene is a perfectly reasonable way to index highly structured information and allows structured queries.

--
Jherico
What can the average user can do to ensure his security? "Nothing, you're screwed"
Re:Sphinx or Lucene by OrangeCatholic · 2010-07-17 18:19 · Score: 1

It's really not a text indexing problem, unless you are going to throw out rdbms and use a flat text file.
If you will use relational database, then it is a 3-table problem at most. Articles, sources, and articles to sources. If you can join those, you have the core of a classic content management system.
From what I gather, they haven't even gotten that far. It is just a master index of articles that are available (which point to nothing in particular), so it is a 1-table problem.
For 1-table problems I generally use Excel.
Re:Sphinx or Lucene by OrangeCatholic · 2010-07-17 18:37 · Score: 1

>we do run queries that include things such as "give me all articles in Magazine including 'foo' in the title, published between 1950 and 1966"
SELECT * FROM banjo_articles WHERE title LIKE "%foo%" AND date BETWEEN "1950-01-01" AND "1966-12-31"
You're bragging that your "system" has a single line of code?
I've seen selects ten or twenty lines long, with multiple joins, and joins and selects within joins. Granted it's not fast, but it works, and it takes all of an hour (or less) to write such a query.
Re:Sphinx or Lucene by Anonymous Coward · 2010-07-17 22:49 · Score: 0

Use solr then.
It stores and searches documents, each being a tuple of fields, with every field being defined by a type, which can be parsed/indexed/stored in a specific way.
It has a powerful "filter" construct, so you can run searches against it with no search text but with "where conditions".
It has no db structure, so you can later alter your docs if, besides the author/date/title, you get access to the text.
It is incorporated into many oss cms (ezpublish being the first one that comes to my mind. Hey, why not use that for the site?)
Re:Sphinx or Lucene by Anonymous Coward · 2010-07-18 03:07 · Score: 0

and either five or so columns of keywords, or a single varchar column that would hold them all (comma-delineated or whatever).
You have just managed to violate the first normal form of database design in your first paragraph. Please hang your head in shame and then read up on normal forms, and until you do, you are not to be let anywhere near a database of any kind.
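For anyone wondering what the normalized alternative looks like, here is a minimal sketch in which each keyword gets its own row and a link table ties keywords to articles. It again uses sqlite3 purely for convenience; the table names (`articles`, `keywords`, `article_keywords`) and helper functions are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with keywords split out of the article row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE articles (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        author TEXT
    );
    CREATE TABLE keywords (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    );
    CREATE TABLE article_keywords (
        article_id INTEGER NOT NULL REFERENCES articles(id),
        keyword_id INTEGER NOT NULL REFERENCES keywords(id),
        PRIMARY KEY (article_id, keyword_id)
    );
    """)
    return conn


def add_article(conn, title, author, words):
    """Insert one article and link it to each keyword, creating keywords as needed."""
    cur = conn.execute(
        "INSERT INTO articles (title, author) VALUES (?, ?)", (title, author)
    )
    article_id = cur.lastrowid
    for word in words:
        word = word.strip().lower()
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        (keyword_id,) = conn.execute(
            "SELECT id FROM keywords WHERE word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO article_keywords VALUES (?, ?)",
            (article_id, keyword_id),
        )
    return article_id


def titles_for_keyword(conn, word):
    """Return the titles of all articles tagged with the given keyword."""
    rows = conn.execute("""
        SELECT a.title
        FROM articles AS a
        JOIN article_keywords AS ak ON ak.article_id = a.id
        JOIN keywords AS k ON k.id = ak.keyword_id
        WHERE k.word = ?
        ORDER BY a.title
    """, (word.strip().lower(),))
    return [row[0] for row in rows]


db = make_db()
add_article(db, "Weathering a Boxcar", "J. Smith", ["weathering", "freight cars"])
add_article(db, "Boxcar Kit Review", "J. Smith", ["kits", "Freight Cars"])
print(titles_for_keyword(db, "freight cars"))
```

With this layout an exact keyword lookup is an indexed equality match rather than a LIKE scan over a comma-separated string, and an article can carry any number of keywords.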
Re:Sphinx or Lucene by AmiMoJo · 2010-07-19 01:46 · Score: 1

Couldn't you just scan and OCR the magazines and then use that data to compile a searchable database? You could even supply extracts from the articles where the search terms appear, similar to what Google does. By not presenting the full text of the article I think you would be on safe ground copyright wise.
It would also be a good way of archiving old paper documents which can degrade over time. I'm not sure what copyright terms are in your country but some of those magazines might be in the public domain anyway now. Even if they are not the copyright owner might have disappeared by now.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Re:Sphinx or Lucene by Unequivocal · 2010-07-19 06:15 · Score: 1

Check out: http://xtf.sourceforge.net/
I think it uses lucene on the backend. It's designed to map meta-data sources to meta-data outputs via XSL templates. I talked with some of the developers recently and it sounds reasonable. If your inputs are binary then it's probably not much help but for XML-like inputs it might give you some of the capabilities you're looking for. HTH
Re:Sphinx or Lucene by Anonymous Coward · 2010-07-19 18:56 · Score: 0

Here's how to do that query:
q=title:foo&fq=periodical:Magazine&fq=publishDate:[1950 TO 1966]&fl=articleNumber
I'm not sure what's hard about that; Lucene handles structured queries really well. The trick is to supply the "non-guessable" fields as filter queries ("fq"); that way they have to appear in the result set, and there's no score associated with them. Plus they are cached, so the next one will be quick.
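A front end would normally build that request rather than hand-type it. The sketch below only assembles the URL, using Solr's standard `q`, `fq` and `fl` parameters and the `[a TO b]` range syntax; the host, the core name `articles`, and the field names are assumptions carried over from the example query, and values are inserted without any query-syntax escaping.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host and core name to the real install.
SOLR_SELECT = "http://localhost:8983/solr/articles/select"


def build_query_url(title_word, periodical, first_year, last_year):
    """Build a select URL: scored title match plus two unscored filter queries."""
    params = [
        ("q", f"title:{title_word}"),                           # relevance-scored part
        ("fq", f"periodical:{periodical}"),                     # filter: which magazine
        ("fq", f"publishDate:[{first_year} TO {last_year}]"),   # filter: inclusive range
        ("fl", "articleNumber"),                                # fields to return
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(build_query_url("foo", "Magazine", 1950, 1966))
```

Passing the parameters as a list of pairs lets `fq` appear twice, which Solr treats as two independent filters that must both match.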

It would help by Xamusk · 2010-07-17 09:10 · Score: 3, Insightful

if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.

Re:It would help by beakerMeep · 2010-07-17 09:34 · Score: 3, Funny

Maybe it's the type of magazines that people used to read "for the articles?"

--
meep
Re:It would help by bsDaemon · 2010-07-17 09:47 · Score: 2, Funny

I'm pretty sure porn indexing isn't niche... or a hobby. It's the true reason Google exists.
Re:It would help by tebee · 2010-07-17 09:51 · Score: 4, Informative

OK, the hobby is model railroading and the index was at http://index.mrmag.com/tm.exe but was removed, without warning, last week so there is not a lot to see.

--
N.B. this user is far too lazy to write a witty and intelligent sig.
Re:It would help by Anonymous Coward · 2010-07-17 10:02 · Score: 0

Who the fuck uses the suffix ".exe" for a Web page? These guys have bats in the belfry in general, it seems.
Re:It would help by ZERO1ZERO · 2010-07-17 10:09 · Score: 1

It's not a web page it's a DOS program, hence the ask slashdot....
Re:It would help by tebee · 2010-07-17 10:24 · Score: 1

It's a DOS program that runs on the server, rather like a CGI script. Its output is a web page.
It is a bit of a throwback to the dawn of the web when people thought up innovative ways to do things.

--
N.B. this user is far too lazy to write a witty and intelligent sig.
Re:It would help by Anonymous Coward · 2010-07-17 10:37 · Score: 0

Wrong crowd. He's hoping for people interested in online-content indexing systems, not a thread full of duffers commenting on the niche.
Re:It would help by BitterOak · 2010-07-17 14:04 · Score: 1

if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.
Maybe it's the type of magazines that people used to read "for the articles?"
And that's precisely the type of magazine that would catch the interest of the Slashdot crowd.

--
If I can be modded down for being a troll, can I be modded up for being an orc, or a balrog?
Re:It would help by Bungie · 2010-07-17 17:21 · Score: 1

CGI works by having the server execute the program (passing the data to it via environment variables, STDIN or the command line) and then retrieving the page's complete HTML code from STDOUT. You can use any file that can be executed and uses STDIN/STDOUT in this manner and that is located in a specified location (like cgi-bin). On Windows this would be any exe, com, pif, bat or cmd file, and the extension must be there for the operating system to determine that it is an executable. On Linux you can use any file that has +x permissions, compiled binaries or scripts with a shebang (#!) line at the beginning, so it can have any extension you want (or none) for a CGI.
People used to write a lot of CGI applications in Perl because of its text processing capabilities, but there were many CGIs that were compiled programs (written in languages like C). At one point Microsoft was really pushing the idea of easily writing CGI applications in Visual Basic and hosting them with IIS.
CGI fell out of popularity in favor of embedded scripting like PHP and ASP which have much less overhead (they don't have to create a new process to service every user request and wait for its output) and are much less complex for people to use (they don't require special directories or permissions).
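To make the mechanism concrete, here is a minimal sketch of a CGI-style program. It assumes a GET request, for which the CGI convention puts the query string in the `QUERY_STRING` environment variable; the parameter name `q` and the function `render_response` are invented for the example, and this is not a reconstruction of the tm.exe program.

```python
import html
import os
import sys
from urllib.parse import parse_qs


def render_response(query_string):
    """Build the full CGI response: a header block, a blank line, then HTML."""
    params = parse_qs(query_string)
    term = params.get("q", [""])[0]  # hypothetical search parameter
    body = (
        "<html><body><p>You searched for: "
        + html.escape(term)  # escape user input before echoing it back
        + "</p></body></html>"
    )
    return "Content-Type: text/html\r\n\r\n" + body


def main():
    # The web server sets QUERY_STRING and reads the response from STDOUT.
    sys.stdout.write(render_response(os.environ.get("QUERY_STRING", "")))


if __name__ == "__main__":
    main()
```

The server starts a fresh process like this for every request, which is the per-request overhead that embedded scripting avoids.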

--
The clash of honour calls, to stand when others fall.

pubmed by Anonymous Coward · 2010-07-17 09:12 · Score: 0

ask the pubmed people at NIH: http://www.ncbi.nlm.nih.gov/pubmed

Developing a Niche Online-Content Indexing System? by omar.sahal · 2010-07-17 09:13 · Score: 3, Insightful

I don't know if this would be helpful, but the people of Wikipedia must know a fair amount about running crowdsourced sites. Even if you can't talk with the higher ups there would be contributors who would know about best practices. Also when you deal with people they would be a lot more helpful if they benefit from helping you.

Just migrate it to VMware or KVM by RobiOne · 2010-07-17 09:13 · Score: 3, Informative

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.

--
-- Robi

Re:Just migrate it to VMware or KVM by Anonymous Coward · 2010-07-17 09:26 · Score: 0

without more info from OP, this is the end of the discussion.
Re:Just migrate it to VMware or KVM by pspahn · 2010-07-17 09:28 · Score: 1

This could work and allows you enough time to not come up with something lame.

--
Someone flopped a steamer in the gene pool.
Re:Just migrate it to VMware or KVM by OzPeter · 2010-07-17 09:37 · Score: 2

Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.
That assumes that the original data is available to the OP. It may be that it is not.

--
I am Slashdot. Are you Slashdot as well?
Re:Just migrate it to VMware or KVM by Threni · 2010-07-17 10:09 · Score: 1

> That assumes that the original data is available to the OP. It may be that it is not.
If only the article in some way made this clear.
"we are in negotiations to try and get the original data."
Oh, it does.
Re:Just migrate it to VMware or KVM by tebee · 2010-07-17 10:17 · Score: 1

As of now it is not available.
We are putting pressure on the current owners to make it available, as they have suffered a certain amount of bad publicity over this, but so far to no avail. They did purchase the program for real money 10 years ago, but the fact that they are unable to run it should indicate to them it has little or no value now.
My thoughts have been on the lines of running it on some old PC hanging off some ADSL line with dynamic DNS but virtualization may be a better idea. Does anyone offer virtual private servers that run DOS?

--
N.B. this user is far too lazy to write a witty and intelligent sig.
Re:Just migrate it to VMware or KVM by Anonymous Coward · 2010-07-17 10:26 · Score: 0

It seems to me that you don't need a VPS that runs DOS. Just signup for a Linux-based VPS and run DOSbox on top of it. The performance hit will be minimal.
Re:Just migrate it to VMware or KVM by Anonymous Coward · 2010-07-17 10:31 · Score: 0

I believe network solutions offers a virtual server with Domain/DNS that runs linux, but I am sure you could get one with DOS.
Re:Just migrate it to VMware or KVM by OzPeter · 2010-07-17 10:35 · Score: 1

"we are in negotiations to try and get the original data."
In other words the OP does not have the data. And from the OP's reply below it may be that they never get it.

--
I am Slashdot. Are you Slashdot as well?
Re:Just migrate it to VMware or KVM by OrangeCatholic · 2010-07-17 10:46 · Score: 1

Well the program and the data are two different things. At least to me they are.
All you need to do is run the program once, get a dump of the entire article list, and import it into your new MySQL table.
And running the program requires, what, DOS? Come on. Forget the web, that's out of the picture now with regards to the old, expired system. You just need ONE copy of the data and you can re-build the web interface yourself with php.
It sounds to me like the data is proprietary and they are being stingy with it. But what other use they have for it, I don't know. You could have all the private libraries index their own collections, and collate the results, but something tells me that would require an extensive level of participation.
Re:Just migrate it to VMware or KVM by b4dc0d3r · 2010-07-17 12:51 · Score: 2, Interesting

If you do get the original data, I'll volunteer to either disassemble the exe or RE the data format or preferably both. Just for the fun of it. Contact me at the /. nick over in the google mail system.
Offer to let them host a redirect if they want - interstitial advert page with a 'we have moved', and offer to redirect to that page if they are not the referrer for a certain timeframe. They get some advert money, you get the data, I have something to entertain myself with.
Gimme just the DOS program at least, I'll get you the format.
Re:Just migrate it to VMware or KVM by Anonymous Coward · 2010-07-17 19:28 · Score: 0

The problem isn't how to handle the data from the legacy solution, he wants to know what would be good modern solution for indexing and searching the article information with their specific constraints. The question wasn't related to using the legacy platform at all, he might not even have access to the old system. Even if he did he wants to migrate to something newer but first he needs to know what will do the job (thus the ask Slashdot)...
Re:Just migrate it to VMware or KVM by commanderfoxtrot · 2010-07-17 21:04 · Score: 1

Mod parent up!
Nothing necessarily wrong with the DOS program anyway- if it works, why break it?
You should be able to run it pretty easily with either a virtual machine or an emulator- you can then look at extracting from it the data and migrating it to a flashier site. Sticking with the DOS program sounds like the simpler solution for now.

--
http://blog.grcm.net/

put the data online if you can by Anonymous Coward · 2010-07-17 09:17 · Score: 1, Informative

There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.

Yeah having some kind of online query interface can be useful and you should certainly implement one if you can. But much more important is the actual data. Make a zip file for download, no SQL or PHP needed. The SQL and PHP can be done later.

Re:put the data online if you can by martin-boundary · 2010-07-17 10:19 · Score: 1

Very true. In fact, making the data available for download also solves the problem of bandwidth bills. After the initial bunch of people have downloaded their own copy, they can serve it from other websites, thus sharing the load.

The binary file shouldn't be hard to read by bartonski · 2010-07-17 09:27 · Score: 2, Informative

I would run the unix commands 'file' (you might get lucky and get a file type that it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack' function found in Perl or Ruby to extract data, which you can load into a database.
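Once a hex dump has revealed the layout, the extraction step might look like the sketch below, which uses Python's `struct` module (its counterpart to Perl's `pack`/`unpack`). The record layout here is entirely hypothetical (a 40-byte title, a 20-byte author and a 2-byte little-endian year), as is the guess that the text is in the DOS cp437 code page; the real file would have to be inspected first.

```python
import struct

# Hypothetical fixed-width layout: 40-byte title, 20-byte author, uint16 year.
RECORD = struct.Struct("<40s20sH")


def pack_record(title, author, year):
    """Build one record in the assumed layout (used here to fake sample data)."""
    return RECORD.pack(title.encode("cp437"), author.encode("cp437"), year)


def clean(raw):
    """Strip NUL/space padding and decode, assuming DOS code page 437."""
    return raw.rstrip(b"\x00 ").decode("cp437")


def parse_records(blob):
    """Walk the blob one record at a time and return a list of dicts."""
    records = []
    for offset in range(0, len(blob) - RECORD.size + 1, RECORD.size):
        raw_title, raw_author, year = RECORD.unpack_from(blob, offset)
        records.append(
            {"title": clean(raw_title), "author": clean(raw_author), "year": year}
        )
    return records


sample = pack_record("Weathering a Boxcar", "J. Smith", 1958) + pack_record(
    "Wiring a Reverse Loop", "A. Jones", 1972
)
for rec in parse_records(sample):
    print(rec)
```

The resulting dicts can be written straight into a database table or a CSV file, which takes the unknown format out of the picture.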

Try Ruby on Rails by olyar · 2010-07-17 09:28 · Score: 4, Funny

I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.

The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.

(Please Note: This post is a joke, and not an attempt to start a flame war).

--
Custom, hands-free Linux installs. Instalinux

Re:Try Ruby on Rails by greg1104 · 2010-07-17 13:43 · Score: 4, Funny

It's data for model railroading magazine, so not only are they used to rails, they already have protocols to serialize access to shared resources and prevent collisions.

Silly question by Anonymous Coward · 2010-07-17 09:32 · Score: 0

This may seem like a silly question, but if the data is in an unknown format and it's handled by an existing DOS program, why not just keep using that old DOS program? It still works, probably has low resources and is dealing with (I gather) mostly fixed data. Maybe bring in some people to try to reverse engineer the data format? But, really, just move the DOS program and data to a different server and save yourself months of effort.

Re:Silly question by Cylix · 2010-07-17 10:55 · Score: 1

Bad idea.
It's a bad idea for the same reason they don't want to host a DOS executable anymore.
Even if for some strange reason the text could not be retrieved from a binary blob (which is not likely), the application still works today.
A single command line wild card search would re-dump the text which could be parsed and stored in a simple database.

--
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
Re:Silly question by Anonymous Coward · 2010-07-17 14:47 · Score: 0

Who says it supports wildcards?

That is a data conversion project by mrmeval · 2010-07-17 09:37 · Score: 1

You could write a custom program that would scrape the data from a website you set up to allow that program to run stand-alone, or you could figure out what the data format is and write a program to convert that.

If you want to recreate the data from scratch then you'd need to set up a website your group would access and enter data. That would be crowd sourcing but you'd probably want something specific to your needs but using easily maintainable code.

As others have stated you could use virtualization. Inside the virtual machine you may even be able to run a LAMP stack and run the DOS program with dosbox running as an unprivileged user. http://www.dosbox.com/ http://www.virtualbox.org/ http://www.vmware.com/.

I would only consider the virtual solution a stop gap until you could get the database translated to something maintainable or recreate the data.

--
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty

Security anyone? by Anonymous Coward · 2010-07-17 09:47 · Score: 0

Running 20 years old code is a security disaster - now you want to replace it in what... php? Few people would call that an improvement.

Screen Scrape the Site by mbone · 2010-07-17 09:53 · Score: 1

See if you can get access to the site again, and screen scrape it. That should not be too hard (search for all articles beginning with "A", then "B", etc.). Then, it should be straightforward to enter it into MySQL or your database of choice.

(It is just possible the search functionality is still there, with just the HTML being taken down. The WayBack Machine could be your friend here...)

Re:Screen Scrape the Site by tebee · 2010-07-17 10:45 · Score: 1

If you could scrape the site, I would have done it years ago. Unfortunately the programmer built anti-scraping technology into the program to "protect his data". If you issue too many sequential requests it locks your IP out. Permanently! I discovered this about 8 years ago when I was doing some manual scraping and it did it to me.
If you look at the site ( http://index.mrmag.com/ ) on the Wayback Machine you can see the strange error you get. It locked that out too!

--
N.B. this user is far too lazy to write a witty and intelligent sig.
Re:Screen Scrape the Site by PerformanceDude · 2010-07-17 10:57 · Score: 1

Tebee,
My company has some pretty sophisticated data transformation tools that we use in forensics. You can connect with me via the /. friends system if you manage to get hold of the source data. We may be able to return it to you in something simple like CSV and then from there things should be easy.
Not promising a result but happy to at least take a look

--
Meus subcriptio est nocens Latin quoniam bardus populus reputo is sanus callidus

um... google. by Anonymous Coward · 2010-07-17 09:55 · Score: 0

1 put each entry in text form
2 let google see it for a minute or two.
3 there is no step 3.

Ask Pubmed guys by mapkinase · 2010-07-17 09:57 · Score: 2, Interesting

Ask guys behind the Pubmed

http://www.ncbi.nlm.nih.gov/pubmed

The database of scientific articles in the field of medicine and biology.

NCBI has the most generous software code licensing that is possible: the code is absolutely free, absolutely no restriction for distributing, changing, selling, even closing it. All because we, taxpayers, paid for it already.

I am surprised none of them reacted yet, I am sure they read /.

--
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.

Re:Ask Pubmed guys by GumphMaster · 2010-07-17 10:51 · Score: 1

Or perhaps the NASA Astrophysical Data Service http://adswww.harvard.edu/

--
Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button

Discogs as model by Anonymous Coward · 2010-07-17 10:01 · Score: 0

Discogs is a reasonably good example of a community effort.

Sadly, it and others, like Foobar, are still controlled by selfish people.

And a thousand Mac Fanbois ... by rueger · 2010-07-17 10:06 · Score: 2, Funny

... leap up and shout "Filemaker Pro! Cause it's so shiny and pretty!"

Oh, the number of times that I've heard that refrain... shudder ...

--
Three Squirrels

Re:And a thousand Mac Fanbois ... by h4rr4r · 2010-07-17 10:20 · Score: 1

Eww, the people responsible for that thing need to be led into the street and shot.
Until quite recently you could not even talk sql to it.
Re:And a thousand Mac Fanbois ... by arcsimm · 2010-07-17 11:08 · Score: 1

You know, I spent a semester of my life working for a department at my university that kept all of its operating information in FileMaker Pro databases. Of course, there were two of them, most of the data in one was replicated in the other, and if you actually wanted to *do* anything with the data in either, like have it show up on the departmental calendar or mailing list, you had to manually copy and paste it into still other databases. For most of that semester, my job was basically to function as a $10.00/hr database interface. Had I stayed on there any longer, my superiors would have probably showed up to work one day and discovered that all of their Filemaker DBs had mysteriously migrated into Postgres during the night...
Re:And a thousand Mac Fanbois ... by Bungie · 2010-07-17 20:48 · Score: 1

I have also spent a long time dealing with FileMaker, and it can be a huge PITA. Be thankful you didn't have to maintain a FileMaker Pro Server or web server for many people!
It is very easy for non-tech-savvy people to use it to build a bunch of databases and start using them, which is cool. The problem is that the databases have a very simple design and most people don't even know how to set up a relationship between two fields. They just drag and drop fields onto a form and let FileMaker figure out how to store and share the data.
Those databases then tend to evolve, and as they get more complex they are harder to manage using the simple interface that FileMaker Pro tries to provide. One person's quick inventory tracking database suddenly becomes a massive asset database used by the whole company years down the road, and you're left struggling to keep it running. FileMaker Pro and Lotus Domino are the worst for this kind of thing.
IIRC there are a few ways to extract the data from FileMaker Pro databases. There is an ODBC driver that comes with the FileMaker Pro client (at least it did back in the 3.x and 4.x days). That would be the easiest way to extract the data for other applications to use. FileMaker Pro 4.0 used to also come with a web server plugin that would use CDML to generate dynamic web pages from the database (of course Claris HomePage was the best tool to build CDML apps at the time).

--
The clash of honour calls, to stand when others fall.
Re:And a thousand Mac Fanbois ... by Anonymous Coward · 2010-07-17 21:51 · Score: 0

Filemaker pro is absolutely horrific! Control codes for colourising text?!?! Thankfully I only come across it occasionally when converting into a 'proper' format, but it's a running joke amongst the team when bespoking new systems that the existing data is in FM format.
Strangely enough though, only had one customer with it on Mac. The rest have been fools running PC version under Windows... when they already have office installed with Access... or even an SQL Server on the network. ?!!?
Re:And a thousand Mac Fanbois ... by BiggerIsBetter · 2010-07-18 00:23 · Score: 1

> Strangely enough though, only had one customer with it on Mac. The rest have been fools running PC version under Windows... when they already have office installed with Access... or even an SQL Server on the network. ?!!?
If you can show me a way to publish databases to the web that's as quick and easy as FileMaker Pro, I'd love to hear about it.

--
Forget thrust, drag, lift and weight. Airplanes fly because of money.

Drupal, hands down. by Beltanin · 2010-07-17 10:20 · Score: 2, Interesting

Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff related to article indexing, etc. Some handy resources, but there are far more, this was just a 1m search based on the conference alone... http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module , http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about , http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore , http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health

Re:Drupal, hands down. by SpzToid · 2010-07-17 22:13 · Score: 1

mod up seriously. Knowing what I know about Drupal + Solr, along with these fantastic examples, this is informative, truly.

--
You can't be ahead of the curve, if you're stuck in a loop.
Re:Drupal, hands down. by Anonymous Coward · 2010-07-20 14:09 · Score: 0

Seconded Drupal - it's great for this sort of thing.

Built in to mySQL by Salamanders · 2010-07-17 10:25 · Score: 1

MySQL 5's Fulltext index with the "natural language search" option might do everything you need with almost no overhead. That, plus PHP's PDO to connect to the database, and I think you might be done. How much data are we talking, anyhow? 10,000 magazine articles or less?

Wayback by martin-boundary · 2010-07-17 10:30 · Score: 3, Interesting

You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe, then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.

Re:Wayback by Cylix · 2010-07-17 10:52 · Score: 1

Definitely an easy re-write.
Just going to be painful to re-enter all that data if they can't use the original binary blob.
Long time ago I had a programming segment regarding binary blobs. Basically, unknown data structures within a binary. Provided they used no encryption it should be relatively painless to extract the data. It was trivial then and now I'm way better.

--
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
Re:Wayback by Hognoxious · 2010-07-17 15:02 · Score: 1

How will that help? As far as I understand, the pages are created on the fly, so without the "engine" behind them you won't get anything.

--
Confucius say, "Find worm in apple - bad. Find half a worm - worse."

Those bastards... by Anonymous Coward · 2010-07-17 10:31 · Score: 0

I am guessing the index in question is this one:

http://index.mrmag.com/

They just devalued a bookcase full of magazines in my basement...

File format, not the implementation details by frisket · 2010-07-17 10:44 · Score: 2, Insightful

It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.
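A minimal sketch of what a preservable XML dump of one index record could look like, using Python's standard library. The element names (index, article, issue, keyword) and the sample record are made up for illustration; nothing here reflects the layout of the original database.

```python
import xml.etree.ElementTree as ET

def record_to_xml(title, author, magazine, year, month, keywords):
    """Build one <article> element. The element names are illustrative only."""
    article = ET.Element("article")
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "author").text = author
    ET.SubElement(article, "issue", magazine=magazine,
                  year=str(year), month=str(month))
    for word in keywords:
        ET.SubElement(article, "keyword").text = word
    return article

# Made-up sample record, purely for demonstration.
root = ET.Element("index")
root.append(record_to_xml("Building a timber trestle", "A. Example",
                          "Example Railroad Monthly", 1975, 3,
                          ["bridges", "scratchbuilding"]))

xml_text = ET.tostring(root, encoding="unicode")

# Round trip: the text form parses back into the same structure.
parsed = ET.fromstring(xml_text)
print(parsed.find("article/title").text)
```

The same records can still be loaded into MySQL or anything else for serving; the XML file is only the archival copy.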

Re:File format, not the implementation details by jgrahn · 2010-07-17 23:24 · Score: 1

> It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.
You captured the main point in my refer/BibTeX posting better than I did. Thanks.
More than once I've had to salvage important data from obsolete database file formats. One instance was rare bird sightings in my area in the 1980s -- they had been painfully typed in over the years, but by 2005 they were sitting in RapidFile format on a half-dead 286 in someone's basement. People generally don't think about data preservation these days.
Another benefit of using text file formats (surely the only preservable ones, if you think in decades) is that they are easily handled by version control tools, thus easy to author in a distributed effort, easy to audit for changes, and so on.
Re:File format, not the implementation details by Anonymous Coward · 2010-07-18 03:00 · Score: 0

No, no, no, nooooooo!!!! Your suggestion would kill searchability and render the major benefits of using a database useless. We all like XML, it's a great language but it is not a panacea, it must be used where appropriate not used everywhere possible because it is "cool". What is needed here is searchable fields in a database. Wrapping this text up in XML at a per record level will kill performance stone dead. It's either database (the right choice) or XML file format (the wrong choice). Suggesting both is madness and suggesting using an XML file in preference to a database is just a bad design choice.

IMO it shouldnt be hard to re-parse the data by Wookie_CD · 2010-07-17 11:03 · Score: 1

If you're talking, like most of the commenters above, about retrieving the data from the server through tm.exe, then this does become an exercise in scraping. wget has builtin recursive-fetching capabilities and if you can access a complete index that would be a logical starting point. With my background, if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql. I'd read the data file(s) looking for textual content in a linked structure, and the rest is just research and a bit of perl work (or php etc, if you prefer). Once you figure out which table structure would contain the data, and you come up with a conversion which will put the data into an importable format, the job's almost done and you just need to bring in or write a CMS to access it. I have source code which would go towards some individual bits of a project like this, contact me if you like. Good luck...

hyperestraier by sugarmotor · 2010-07-17 11:11 · Score: 1

Take a look at http://hyperestraier.sourceforge.net/ ... there might be something newer by the same author, Mikio Hirabayashi

Extracting the text from whatever files you have would be a separate step.

--
http://stephan.sugarmotor.org

Data hoarders by tepples · 2010-07-17 11:12 · Score: 1

It also introduces the problem of people who download the whole data set just to collect it, with no intention of accessing the vast majority of it or serving it up to someone else.

Re:Data hoarders by quickOnTheUptake · 2010-07-17 13:58 · Score: 1

Just stick it on bittorrent, if there is a big demand.
Realistically, though, I doubt the database is very large (moreover, I doubt there are all that many people who would want this data). I mean, if you are indexing 50 magazines, over 100 years, with an average of 10 articles in each one, that's 50k articles. Let's say each article has 200B of data, that's, what? ~10 MB uncompressed?
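The arithmetic behind that estimate, worked through as a short script. Every figure is the commenter's guess rather than a measurement, and "each one" is read here as one magazine-year, which is what makes the 50k article count come out.

```python
# Back-of-envelope size estimate using the guessed figures from the comment.
magazines = 50
years = 100
articles_per_magazine_year = 10   # assumption: "each one" = one magazine-year
bytes_per_article = 200           # title, author and a few keywords

articles = magazines * years * articles_per_magazine_year
total_bytes = articles * bytes_per_article

print(f"{articles:,} articles, about {total_bytes / 1_000_000:.0f} MB uncompressed")
```

With these inputs the total is 50,000 articles and roughly 10 MB. Counting twelve issues a year instead would multiply both figures by twelve, which is still small by any modern standard.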

--
Mod points: Guaranteed to remove your sense of humor.
Side effects may include gullibility and temporary retardation

B& by tepples · 2010-07-17 11:16 · Score: 1

> wget has builtin recursive-fetching capabilities

Which will get the IP address of the machine running the scraper permanently banned. See the post above.

> if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql

It's likely that the raw data is encrypted. Based on the comments so far, I see no reliable indication of which country tebee operates from or whether that country has a DMCA-alike.

Re:B& by Wookie_CD · 2010-07-17 12:04 · Score: 1

This is all a bit academic until the content owner either agrees to reopen web access to a conversion team, or releases the source data for analysis.
Re:B& by b4dc0d3r · 2010-07-17 15:52 · Score: 1

It's not academic if we can show the poster some sort of very simple wiki-like CMS in which people with 6 decades of back issues might volunteer to enter/edit information. If everyone were organized, 100 people could enter the data in a weekend. Allowing time to edit and refine keywords, without copying the actual content, would add some time. And the backend database could end up more valuable than the original.
Scraping the data isn't possible, getting the data looks unlikely. So you recreate it. Have people claim an issue, and enter the data. People with few issues will claim the ones they have so people with more comprehensive coverage can focus on what no one else has. Bonus is, if no one else is interested, no one bothers to enter what they know, so the project self-immolates.

hoarding == massive replication by martin-boundary · 2010-07-17 11:21 · Score: 2, Interesting

Short term it's true that can eat some bandwidth, but long term that's the solution to the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.

Re:Developing a Niche Online-Content Indexing Syst by Tablizer · 2010-07-17 11:23 · Score: 1

WikiPedia's search stinks in my opinion. It's gotten better of late, but still not the Gold Standard by any stretch.

--
Table-ized A.I.

Alfresco + Drupal by Anonymous Coward · 2010-07-17 11:30 · Score: 0

Alfresco + Drupal

Fancy that by Anonymous Coward · 2010-07-17 11:34 · Score: 1, Funny

> One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. [...] The governing body for the hobby has agreed to host this

Huh, I didn't realize that porn had a governing body.

It's a library catalog. by oneiros27 · 2010-07-17 11:46 · Score: 1

Don't ask generic nerds -- ask library nerds : code4lib . They have a pretty active mailing list.

Also, there's oss4lib which is specifically for open source software, but I haven't seen much activity on their list in a while, and I think most of us are on both lists. (there's also a few cataloging specific lists, but they get to be all library-sciencey, with discussions of RDA and FRBR and cataloging aggregates).

--
Build it, and they will come^Hplain.

Re:It's a library catalog. by dangitman · 2010-07-17 12:50 · Score: 1

Denholm: It's settled. I've got a good feeling about you Jen and they need a new manager.
Jen: Fantastic! So, the people I'll be working with, what are they like?
Denholm: Standard nerds!
[Note: Not to be confused with standards nerds]

--
... and then they built the supercollider.

Using a Howitzer to Hunt Squirrels by salesgeek · 2010-07-17 12:22 · Score: 2, Insightful

Lots of people here are recommending using tools that are built for very large scale projects. Based on the fact you have a DOS based system that likely used a pretty common library for storing the data (something like c-tree, btrieve, a dbase library or simply saving binary data using whatever language the app was written in), using any RDBMS like MySQL or even SQLite probably would do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load than the DOS app could.

Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.

--
-- $G

This is the ModelRR mag. database by codeaholic · 2010-07-17 12:32 · Score: 1

The description suggests that this is the Model RR magazine DB. Checking Kalmbach (the company that hosted it), shows that, indeed, it is off line. (http://index.mrmag.com/) The DB was a very simple (by today's standards) index of articles.

As many posters have said, it should be easy (for a programmer) to pull the data from the DB -- if you can get the original data files from Kalmbach. The data was not complex, and 80's DBs tended to have simple file formats. As many suggested, a C++, Java, Python or other script can pull the data out and dump it to XML, MySQL, CSV files, etc. From there, it is easy to rehost it wherever needed.

My suggestion is to simply replicate the old (very dated, but simple) UI: both for searching and for data entry. That can be done very easily in PHP & MySQL. These tools are readily available on any web host making the task fairly simple (for someone familiar with these tools.) It also means that the site's webmaster should know what needs to be done to secure the app.

Getting a straight replacement up validates the whole process, and restores the existing functionality. Only at that point should you consider extending the system, perhaps using many of the good ideas noted above. Obvious extensions are to license the full text of articles to provide a full-text index (rather than just hand-entered keywords as in the current system.) Perhaps provide links to publishers who sell them online. Lots of ways to go.

Good luck. As a user of the DB, I'd love to see it back online & better than ever!

Re:This is the ModelRR mag. database by codeaholic · 2010-07-17 14:03 · Score: 1

If this is done as a volunteer effort, I'd be happy to help, esp. with extracting the old data. Contact me using my Slashdot user name + att dot net. (I hope THAT fools the spambots!)

How about a database? by Anonymous Coward · 2010-07-17 13:12 · Score: 0

You don't need niche software for this. You just need a simple database. It sounds like that's really all the existing solution is. Your data schema is simple enough that it would probably fit nicely in a couple of tables (Authors, Issues, and Articles come to mind). I think you're making this harder than it needs to be.
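A minimal sketch of that three-table layout, using SQLite from Python so it runs without a server. The table names follow the comment; the columns and the sample row are assumptions made for illustration, not the schema of the original index.

```python
import sqlite3

# Hypothetical layout: column names and sample data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE authors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE issues (
    id       INTEGER PRIMARY KEY,
    magazine TEXT NOT NULL,
    year     INTEGER NOT NULL,
    month    INTEGER NOT NULL,
    UNIQUE (magazine, year, month)
);
CREATE TABLE articles (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    page      INTEGER,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    issue_id  INTEGER NOT NULL REFERENCES issues(id)
);
""")

conn.execute("INSERT INTO authors (id, name) VALUES (?, ?)", (1, "A. Example"))
conn.execute("INSERT INTO issues (id, magazine, year, month) VALUES (?, ?, ?, ?)",
             (1, "Example Railroad Monthly", 1975, 3))
conn.execute("INSERT INTO articles (title, page, author_id, issue_id) VALUES (?, ?, ?, ?)",
             ("Building a timber trestle", 42, 1, 1))

# A title search joined back to author and issue, as a lookup page might run it.
rows = conn.execute("""
    SELECT a.title, au.name, i.magazine, i.year, i.month, a.page
    FROM articles a
    JOIN authors au ON au.id = a.author_id
    JOIN issues  i  ON i.id  = a.issue_id
    WHERE a.title LIKE ?
""", ("%trestle%",)).fetchall()
print(rows)
```

An article with several authors would need a link table between articles and authors; the single author_id column is kept here only to stay close to the three tables named above.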

Why PHP? by WhiteHorse-The+Origi · 2010-07-17 13:38 · Score: 1

Why use PHP? I would think Python would be better because you can cross-compile the code to run on any machine using Jython (in case they stop hosting for you). Personally, I would do a full scrape of the data and put it in BibTeX .bib files or XML and then make your search page pass parameters to the Python program. That's what NASA and Google Scholar use (they may use Perl instead). I'm not sure about the database...

I work for a Library by Anonymous Coward · 2010-07-17 14:52 · Score: 0

It may be overkill for what you want to do, but you should look at Evergreen, the open-source Integrated Library System (think card-catalog) used by the public libraries in the state of Georgia: http://www.open-ils.org/dokuwiki/doku.php?id=faqs:evergreen_faq_2 . It can certainly do what you want done, and a whole lot more. You can just ignore the parts about circulation (or strip them out). You may run into problems with library-specific jargon and standard practices that you don't necessarily need, but surely there's a librarian or two out there in the model railroad world.

A project very similar to Evergreen is Koha: http://koha.org/about

You may also want to look at LibraryThing: http://www.librarything.com/tour/. It's focused on books, but it may be possible to make it work with articles as well.

Re:I work for a Library by SEWilco · 2010-07-19 05:54 · Score: 1

Yes, book or periodical indexing software may be suitable. Just pick one with keyword support and an assortment of search fields.
If the individual article summaries are also made available on individual pages, let Googlebot index them and people will be able to discover relevant individual articles through Google as well. Then anyone looking for something covered by an article will discover your index, as well as which issue of the magazine they need.

for my bunghole by Anonymous Coward · 2010-07-17 14:58 · Score: 0

Well until he does get it, any consideration of how to process it is somewhat moot.

Which makes me wonder why he bothered asking, the fucking twerp.

Re:for my bunghole by OrangeCatholic · 2010-07-17 19:15 · Score: 1

>Well until he does get it, any consideration of how to process it is somewhat moot.
Not quite. He was clear enough to construct a data model. This customer knows what he wants. Problem is, it will take his own efforts to fill in the gaps (in terms of getting access).
"Hi, I want you to install a refrigerator in my apartment. It needs to fit in a hole 30 inches wide by 30 inches deep."
"Will you take a refrigerator 28 inches wide by 26 inches deep?"
"Sure but....lemme talk to my landlord first."
If Zeus descended from the sky and said, "I'll do whatever it takes to get this index online..."
Would Zeus succeed, or would the customer say to him, "I'm not ready?"
P.S. I'm NOT for hire on this job. I am not.even.a.programmer.anymore.
I will, however, take queries as far as I check my email (which is unreliable) and as far as I check this page (until tomorrow at the least).
You asked, you got your answers. 88 comments, perhaps 10 of them were useful. Anyone who says to "use X" is dumb. By the time you figure out how to use it, you could have written your own.
This is 1-3 tables, which for a real-world analogy is like 1-3 sheets of paper. Customer says what? Landlord? Landlord rules 1-3 sheets of paper. Good luck with that access.

Postgresql has good text indices by Anonymous Coward · 2010-07-17 15:17 · Score: 0

It's pretty easy to set up full-text indexes in PostgreSQL, if you're trying to make it a database-ish application. It's really flexible for stuff like this.

SWISH is a good, non-database index as well.

Of course, someone else already said lucene.

PHP might be "ok" for the web interface (especially if nothing else is available) but I wouldn't even think of using it to populate the index initially.

Dspace by ericlondaits · 2010-07-17 16:16 · Score: 1

Check out Dspace (http://www.dspace.org/). I'm by no means an expert in the area but it seems it might be what you need.

--
As a Slashdot discussion grows longer, the probability of an analogy involving cars approaches one.

Hypercard 2.0? by AHuxley · 2010-07-17 16:31 · Score: 1

Something like an open source hypercard stack?
Anyone can understand a card system, enter unique data per card and save.
Humans are good at that.
Bring them all together and you have a huge digital stack to be sorted, searched or as the backend to a nice simple topic interface.
Computers are great at that now.
That would help your crowdsourcing: if it's open source, there are no closed MS issues later on.

--
Domestic spying is now "Benign Information Gathering"

DOS Data by nospam007 · 2010-07-17 19:21 · Score: 1

If it's 20 years old DOS, chances are that it's either Paradox or dBASE or any xBASE format, which could be easily opened with Access or even Winword.

No, no, NO! by RichiH · 2010-07-17 19:32 · Score: 3, Insightful

Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practices. You fail to reach even the first normal form.

Read http://en.wikipedia.org/wiki/Database_normalization

You want to define a table "article_tags" or something with id, article_id, name, comment. Make the combination of article_id and name unique.

* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL

Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many articles are tagged electric? SELECT COUNT(*) FROM article_tags WHERE name = "electric";
Give me a list of all articles relating to foo and bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar".
etc pp.
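A minimal sketch of the tag table described above, run against an in-memory SQLite database from Python. The article_tags name and its four columns come from the post; the articles table, the sample rows and the choice of SQLite are assumptions made so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE articles (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE article_tags (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,       -- auto-increasing, never NULL
    article_id INTEGER NOT NULL REFERENCES articles(id),
    name       TEXT NOT NULL,
    comment    TEXT,                                     -- optional, may be NULL
    UNIQUE (article_id, name)                            -- one row per tag per article
);
""")

# Made-up sample data, purely for demonstration.
conn.executemany("INSERT INTO articles (id, title) VALUES (?, ?)",
                 [(1, "Wiring a reversing loop"), (2, "Scratchbuilding a depot")])
conn.executemany("INSERT INTO article_tags (article_id, name) VALUES (?, ?)",
                 [(1, "electric"), (1, "track"), (2, "structures")])

# How many articles are tagged "electric"?
electric_count = conn.execute(
    "SELECT COUNT(*) FROM article_tags WHERE name = ?", ("electric",)
).fetchone()[0]

# Which articles carry either of two tags?
either = conn.execute(
    "SELECT DISTINCT article_id FROM article_tags "
    "WHERE name = ? OR name = ? ORDER BY article_id",
    ("electric", "structures"),
).fetchall()

print(electric_count, either)
```

The UNIQUE constraint is what stops a volunteer from tagging the same article with the same keyword twice; a second identical insert raises sqlite3.IntegrityError.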

If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:

http://en.wikipedia.org/wiki/Transitive_closure

Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and pretty much everything else.

One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST_ heed this.
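A minimal sketch of the prepared-statement advice, using Python's sqlite3 module, where the ? placeholder binds the value instead of pasting it into the SQL text. The table contents, the articles_tagged helper and the hostile input are all made up for illustration; PHP's PDO offers the same idea through its own prepare/execute calls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_tags (article_id INTEGER NOT NULL, name TEXT NOT NULL)")
conn.executemany("INSERT INTO article_tags VALUES (?, ?)",
                 [(1, "electric"), (2, "scenery")])

def articles_tagged(conn, tag):
    """Look up article ids by tag with a bound parameter, never string concatenation."""
    cur = conn.execute(
        "SELECT article_id FROM article_tags WHERE name = ? ORDER BY article_id",
        (tag,),
    )
    return [row[0] for row in cur]

# Input a visitor might type into a search box in an attempt to match every row.
hostile = "electric' OR '1'='1"

print(articles_tagged(conn, "electric"))  # the one article with that tag
print(articles_tagged(conn, hostile))     # treated as a literal tag name, matches nothing
```

Because the hostile string is bound as data, the quote characters in it never reach the SQL parser, so it is compared as an odd-looking tag name and simply finds no rows.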

http://en.wikipedia.org/wiki/SQL_injection

Richard

PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.

Re:No, no, NO! by RichiH · 2010-07-17 19:35 · Score: 1

The second statement should have read
SELECT article_id FROM article_tags WHERE name = "foo" AND name = "bar"
for obvious reasons.
Re:No, no, NO! by ultranova · 2010-07-17 22:06 · Score: 1

No, OR is correct here. AND doesn't find any rows because field "name" has only a single value, so 'name = "foo" AND name = "bar"' can't ever be true for any row. You want something like
SELECT article_id FROM article_tags WHERE name = "foo" INTERSECT SELECT article_id FROM article_tags WHERE name = "bar"
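The difference between the three queries in this sub-thread can be checked on a few made-up rows. This sketch uses SQLite from Python; the sample data and the ids helper are assumptions. INTERSECT is not available in every engine (older MySQL releases lack it), so an equivalent GROUP BY / HAVING form is included as an alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_tags (
        article_id INTEGER NOT NULL,
        name       TEXT NOT NULL,
        UNIQUE (article_id, name)
    )
""")
# Article 1 has both tags, article 2 only foo, article 3 only bar.
conn.executemany("INSERT INTO article_tags VALUES (?, ?)",
                 [(1, "foo"), (1, "bar"), (2, "foo"), (3, "bar")])

def ids(sql, params):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql, params)]

tags = ("foo", "bar")

# OR: articles carrying at least one of the tags.
either = ids("SELECT DISTINCT article_id FROM article_tags "
             "WHERE name = ? OR name = ? ORDER BY article_id", tags)

# AND on a single row: never true, because one row holds one name.
both_naive = ids("SELECT article_id FROM article_tags "
                 "WHERE name = ? AND name = ?", tags)

# INTERSECT: articles that appear in both per-tag result sets.
both = ids("SELECT article_id FROM article_tags WHERE name = ? "
           "INTERSECT "
           "SELECT article_id FROM article_tags WHERE name = ? "
           "ORDER BY article_id", tags)

# Same result without INTERSECT.
both_grouped = ids("SELECT article_id FROM article_tags WHERE name IN (?, ?) "
                   "GROUP BY article_id HAVING COUNT(DISTINCT name) = 2 "
                   "ORDER BY article_id", tags)

print(either, both_naive, both, both_grouped)
```

On this data the OR query returns all three articles, the single-row AND returns nothing, and both the INTERSECT and the GROUP BY forms return only article 1.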

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:No, no, NO! by RichiH · 2010-07-18 00:18 · Score: 1

Ah, yes thanks... I should not be allowed to post before coffee...

BibTeX by GeniusDex · 2010-07-17 19:45 · Score: 1

It may not be a complete solution, but have you looked at BibTeX? BibTeX itself is only a format for nicely stating the information you have available (which magazine, article title, which pages in the magazine, authors, etc), but in the entire BibTeX ecosystem a number of indexing systems are built. Quite a lot of them are for desktop use (so you can manage your own BibTeX entries), but I'd imagine there would be some web-based system for this as well.

let me do what I do best: nitpicking. by Anonymous Coward · 2010-07-17 20:46 · Score: 0

"probably using php and mysql"

That's nice. Why are you asking then?

It seems to me that the data is a static index that is entirely read-only except for casual updating -- how often a year? Does the system need to stay online during updating then? Meaning that you don't really need an RDBMS or the poor imitation mysql gives you (yes yes it's shiny and oracle owns it, now shup). Some SQL interface might seem useful, but how many different queries are you going to write? How large are you expecting your userbase to scale?

Just to give an example of a different approach: You could probably write a couple small shell scripts to generate lists of things sorted a couple different ways into static html pages from some master file. That's duplicating the data alright, but how much is it really? Plus it'll compress really well and serving up gzipped content saves a lot on bandwidth too. Serving static html is almost always going to be faster and easier to maintain and so on than executing a scripting language that accesses a database backend for each pageview. Mind, you don't need to stick to static pages for everything, but with a little scripting you'd have something to show for your efforts within minutes, and you can expand in your copious free time later.
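A minimal sketch of that static-page approach, written in Python rather than shell. The record layout, the sample records and the file naming are assumptions for illustration; a real run would read the records from whatever master file the project settles on and write each page to disk.

```python
import html
from collections import defaultdict

# Hypothetical master data: (title, author, magazine, year, month).
RECORDS = [
    ("Track & Turnouts", "A. Example", "Example Railroad Monthly", 1975, 3),
    ("Building a timber trestle", "B. Sample", "Example Railroad Monthly", 1981, 11),
    ("Ballasting basics", "A. Example", "Example Railroad Monthly", 1990, 6),
]

def render_page(heading, records):
    """Render one static HTML page listing the given records."""
    items = "\n".join(
        f"<li>{html.escape(title)}, {html.escape(author)} "
        f"({html.escape(magazine)}, {year}-{month:02d})</li>"
        for title, author, magazine, year, month in records
    )
    safe_heading = html.escape(heading)
    return (f"<html><head><title>{safe_heading}</title></head>\n"
            f"<body><h1>{safe_heading}</h1>\n<ul>\n{items}\n</ul></body></html>\n")

def build_site(records):
    """Return {filename: html} with one by-title page and one page per author."""
    pages = {"by_title.html": render_page("Articles by title", sorted(records))}
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec[1]].append(rec)
    for author, recs in by_author.items():
        slug = "".join(c if c.isalnum() else "_" for c in author.lower())
        pages[f"author_{slug}.html"] = render_page(f"Articles by {author}", sorted(recs))
    return pages

pages = build_site(RECORDS)
print(sorted(pages))
```

Writing the result out is one loop over pages.items() with open(name, "w"), and the whole site can be regenerated from scratch whenever the master file changes.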

This brought to you by someone who wrote a static pages "cms" complete with custom markup and an index generator in a couple hundred lines of awk. Want to do it in perl? Show me the one-liner. Point is, the most well-trodden path isn't always the best for any given problem. In the case of assuming PHP+MySQL, it's the obvious choice for people who don't know any better. And ignoring the nature of the problem and the properties of the data is a bit of a pity.

Sure About DOS? by Bungie · 2010-07-17 21:24 · Score: 1

Are you sure it's a true DOS application and not a Win32 Console App? I know it is entirely possible for someone to write a CGI in DOS but it seems really weird to me that they would use DOS since it didn't have anything that would serve CGI, and coding a hand rolled database format would be a lot of extra work.

If it is using Win32 it might just be accessing a DAO database without using the mdb extension, which many companies do to make it look like a proprietary format you can't just open with MS Access. If you look at the raw data it might look crazy and unusable because JET databases use XOR to obfuscate the contents of the database file (and prevent you from extracting the strings inside).

--
The clash of honour calls, to stand when others fall.

Re:Sure About DOS? by tebee · 2010-07-17 22:32 · Score: 1

You're right, it is. A visit to the Wayback Machine found this page: http://web.archive.org/web/20070626092758/www.index.mrmag.com/tm.exe?tmpl=tm_info
on which is written:
The TM application is written in "C", and is based on an ISAM/Network database manager I wrote in the late 1980's. The code is highly portable, and versions exist for MS-DOS, Windows NT and several flavors of UNIX. I also run it on my HP Palmtop. The version running on this site is a Win32 console application.
Which just goes to show I should not trust either my memory or what people tell me. Sadly it also indicates the data is in an unknown and probably unique implementation of ISAM.

--
N.B. this user is far too lazy to write a witty and intelligent sig.

There is at least one misunderstanding. by Jane+Q.+Public · 2010-07-17 22:06 · Score: 1

It simply cannot be the case that the original data is in an "unknown" format. If it were, it would never have been retrievable. The format might not be known to YOU, at this particular time, but that is not the same as unknown.

Your first priority is to find out how the original data is stored and accessed. If as you say it is about 20 years old, I strongly suspect it is stored in a C-ISAM or D-ISAM database, and known code libraries are used to access it.

You should then be able to lean heavily on existing code for retrieval of that information, and using a modern scripting language, transfer the data to a new, normalized relational database.

You make it sound like this is some kind of archeological adventure into some kind of untranslatable code hieroglyphics of the distant past. But as I say that cannot be so, unless it was completely designed from the ground up by a single eccentric individual. The clues and the tools are there, if you know where to find them.

Nail down the file format first by jgrahn · 2010-07-17 22:43 · Score: 1

It seems to me that your core problem is data preservation for long periods of time; if you can save that 1930--present data, you don't want to lose it again. You should go for a plaintext file format, and be aware that you *are* using a file format.

There are file formats for this. Probably there are XML languages if you like that kind of thing, but either of two older ones would serve you well I think: the refer(1) format for bibliographic databases, and the BibTeX format. At least the latter is still in use and you can download such indices for various journals -- see the Wikipedia BibTeX entry.
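For a sense of what one index record looks like in BibTeX, here is a minimal sketch that formats a record as an @article entry. The to_bibtex helper, the citation key and the sample values are made up; author, title, journal and year are the usual @article fields, while keywords is a widely used but non-standard extra. The helper does no escaping of braces or special characters inside values.

```python
def to_bibtex(key, title, author, journal, year, month=None, pages=None, keywords=()):
    """Format one magazine article as a BibTeX @article entry (no value escaping)."""
    fields = [
        ("author", author),
        ("title", title),
        ("journal", journal),
        ("year", str(year)),
    ]
    if month:
        fields.append(("month", str(month)))
    if pages:
        fields.append(("pages", pages))
    if keywords:
        fields.append(("keywords", ", ".join(keywords)))
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@article{{{key},\n{body}\n}}"

# Made-up record, purely for demonstration.
entry = to_bibtex(
    "example1975trestle",
    title="Building a timber trestle",
    author="Example, A.",
    journal="Example Railroad Monthly",
    year=1975,
    pages="42--45",
    keywords=("bridges", "scratchbuilding"),
)
print(entry)
```

One entry per article in a plain .bib file keeps the data readable in any text editor and easy to diff under version control.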

When you have fixed the file format, *then* you can decide on indexing strategies and software. Probably you'll find that someone has already done that for your file format. Mike Lesk did it for refer back in the 1970s ...

Semantic web by sogon · 2010-07-17 23:44 · Score: 1

Just make one xhtml document with semantic annotations. There are plenty of solutions for indexing this information. Also a simple google site: will often suffice for most queries.

Brewster Kahle's Digital Library by satellitedirect · 2010-07-18 00:50 · Score: 1

Brewster Kahle is building a truly huge digital library -- every book ever published, every movie ever released, all the strata of web history ... It's all free to the public. A video describing his project can be found at : http://www.ted.com/talks/lang/eng/brewster_kahle_builds_a_free_digital_library.html Good luck with your project.

A library catalog system is needed by Edgester · 2010-07-18 02:27 · Score: 1

This sounds almost exactly like a library catalog system. If the system doesn't index articles, then just treat each article as a book in a multi-volume set. I know that several open source library systems exist. Look into those.

Backup everything by Antique+Geekmeister · 2010-07-18 03:02 · Score: 1

Seriously, first step, back up *EVERYTHING*. This includes your programs and your data.

Then see if your ancient programs can be run inside a useful modern emulation environment, like "dosbox" or "freedos". That can buy you another 10 years.

It also buys you access to the data without using your ancient hardware: you can read the backups and play with the data much more safely, to try and decode the format. Given the software's age, it's unlikely to be more sophisticated than a very simple index and tables that may be decodable with a good editor.
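A first look at an unknown data file usually starts by pulling out the readable text, which is what the Unix strings tool does. The sketch below does the same in Python on a made-up byte blob; the printable_runs helper and the sample bytes are assumptions, and it only works if the file stores text as plain, unencrypted ASCII.

```python
import re

def printable_runs(data: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")

# Made-up stand-in for a slice of a binary data file.
blob = b"\x00\x01Building a trestle\x00\x00\x02Example, A.\x00\xff\x03abc\x00"

for offset, text in printable_runs(blob):
    print(offset, text)
```

On a real backup copy the data would come from open(path, "rb").read(). The offsets matter as much as the text: regular spacing between runs is a strong hint of fixed-length records.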

Re: ISAM/Network database manager by gregben · 2010-07-18 05:58 · Score: 1

If you or someone else can get me the database files (from Kalmbach?), I am
willing to try, for free, to extract useful data from them into simple ASCII
text file(s) suitable for loading into a relational database like PostgreSQL.
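For a sense of what the loading step could look like, here is a small Python sketch that reads tab-separated text into a relational table and queries it. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and the column names, table name, and sample rows are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Invented tab-separated sample; a real export would be read from a file.
SAMPLE = (
    "title\tauthor\tmagazine\tissue_date\tkeywords\n"
    "Building a Timber Trestle\tDoe, Jane\tExample Railroad Magazine\t1975-03\tbridges;scratchbuilding\n"
    "Wiring a Reversing Loop\tRoe, Sam\tExample Railroad Magazine\t1982-11\twiring;track\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author TEXT,"
    " magazine TEXT,"
    " issue_date TEXT,"
    " keywords TEXT)"
)

# DictReader maps each data line to the column names in the header line.
records = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
conn.executemany(
    "INSERT INTO article (title, author, magazine, issue_date, keywords)"
    " VALUES (:title, :author, :magazine, :issue_date, :keywords)",
    records,
)
conn.commit()

# Parameterized query: find articles whose keyword list mentions "wiring".
rows = conn.execute(
    "SELECT title, issue_date FROM article"
    " WHERE keywords LIKE ? ORDER BY issue_date",
    ("%wiring%",),
).fetchall()
print(rows)
```

Keeping the keywords in one delimited column is the simplest layout; a separate keyword table would be the more conventional relational design once the data is clean.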

Why MySQL? by Anonymous Coward · 2010-07-18 09:09 · Score: 0

Given that it's in the hands of Oracle, a software firm not known for its openness, why choose MySQL?

Zotero? - and previous, slightly related thread by Anonymous Coward · 2010-07-18 09:21 · Score: 0

I checked to see if Zotero (http://www.zotero.org/ - Firefox bibliographic plugin) had previously been mentioned, and came up with this thread on building a personal DB of references. http://ask.slashdot.org/story/09/04/08/1939248/Building-a-Searchable-Literature-Archive-With-Keywords

As of version 2.0, Zotero (Firefox plugin) has features allowing sharing between individuals and creation of interest groups, and its import filters mean that if some of your material is already indexed, e.g. in WorldCat or in Google Scholar, you won't have to create entries from scratch (though you'll probably have to clean them up and add keywords).

Perpetuate the solution using FreeDOS by Anonymous Coward · 2010-07-18 12:27 · Score: 0

Why change the system? Just switch to FreeDOS and change your ISP to one who runs Linux.

Re:Developing a Niche Online-Content Indexing Syst by Anonymous Coward · 2010-07-18 13:43 · Score: 0

"I fooled you, I fooled you. I've got pig iron, I've got pig iron. I've got all pig iron"

EBSCO has Model Railroader full text by Anonymous Coward · 2010-07-18 17:15 · Score: 0

A little off topic. This is already done by for-profit companies.

For Model Railroader, full text is available, online, in the database "Masterfile Premier", for 2001 to present. This database has bibliographic information (title/author, etc) from 1996-present. In other words, you can search inside a very complete interface, and get a fair amount of full text. This is likely free if your public library, your school, or your state consortium subscribes to Masterfile Premier. (My public library does).

(Yes, I know you want data from 1930 on. Well, that might come in future years... they are always indexing more data).

You need to research how you can access this... EBSCO tries to make it arcane to access their website. To see what you can access from your own public library, go to your public library web site and check out "electronic resources" or "databases". All you need is a library card. You will likely find free online text of journals from companies such as GALE, EBSCO, and ProQuest.

A bit more background... EBSCO sells subscriptions for journal databases to state-wide consortia, public libraries, state libraries, and schools (from junior level up to universities). These subscriptions are paid for by your taxes. Use the service - you likely already pay for it. An example of a state consortium is Indiana's INCOLSA.

If you don't use these services, these companies are getting free money.

Use a blog by dgriff · 2010-07-18 20:00 · Score: 1

Feed all the data into a blog, one magazine edition per blog entry. Some blog software lets you set the date of the entry - use that to set the date to the edition publication date. Enter the keywords as blog tags. You can expand the blog entry to contain such things as (e.g.) a picture of the cover, some short descriptive text. You then also get a free forum where people can discuss the edition in question.

Library software by Anonymous Coward · 2010-07-18 21:09 · Score: 0

Your need seems to be similar (but on a smaller scale) to what a public library uses:
- index books and magazines by title, publication date, author, keywords
- query the index
Koha is the major open source software in this area but is probably too big for you. However, the Wikipedia page links to other software which may be of interest:
- PhpMyBibli
- Alexandria
- OpenBiblio
- GCstar

I also suggest you have a look at dedicated modules for CMS platforms such as Drupal. Drupal has the eXtensible Catalog Drupal Toolkit, which probably has the features you need.
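The two core functions listed above (index by title, author, keywords; query the index) are small enough to sketch directly. The Python below builds a toy in-memory inverted index and answers all-words queries. It is an illustration under assumed field names and invented records, not how Koha or any of the listed packages work internally.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()"


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def build_index(records):
    """Map each word to the set of record positions that contain it."""
    index = defaultdict(set)
    for rec_id, record in enumerate(records):
        for field in ("title", "author", "keywords"):
            for word in tokenize(record.get(field, "")):
                index[word].add(rec_id)
    return index


def search(index, query):
    """Return positions of records that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented records for illustration only.
records = [
    {"title": "Building a Timber Trestle", "author": "Doe, Jane",
     "keywords": "bridges, scratchbuilding"},
    {"title": "Wiring a Reversing Loop", "author": "Roe, Sam",
     "keywords": "wiring, track"},
]

index = build_index(records)
hits = search(index, "jane trestle")
print([records[i]["title"] for i in sorted(hits)])
```

For a few hundred thousand title/author/keyword records this kind of structure fits comfortably in memory, which is why off-the-shelf library software handles the job easily.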

eXtensible Catalog Drupal Toolkit by Anonymous Coward · 2010-07-18 21:51 · Score: 0

What the poster needs is essentially software for a small library.

The eXtensible Catalog Drupal Toolkit is probably a better fit.

Where do we help? by Anonymous Coward · 2010-07-19 02:28 · Score: 0

Okay, so I'm a model railroader and an officer in the NMRA's Midwest Region who heard all the uproar at the convention this weekend. I have a large stack of magazines and can help with typing index items from them. Is there a website or some location where work is being coordinated to get this going again? I have some webspace available (like maybe a subdomain of my model railroad website, http://riptrack.net) for such a coordination page if needed.

Re:Where do we help? by tebee · 2010-07-19 07:59 · Score: 1

At the moment things are a little fragmented, but we seem to be congregating on this thread: http://model-railroad-hobbyist.com/discountinued_mag_index
I hope things will get a little more organized once we have a clear idea whether the original data is going to be available to us.
I too have a website we could use; I'm currently putting up versions of some of the things that have been suggested here. It's at http://pc-cafe.co.uk/mr

--
N.B. this user is far too lazy to write a witty and intelligent sig.

Zebra is great for bibliographic data by kylemhall · 2010-07-19 02:47 · Score: 1

The data could easily be converted into MARC bibliographic records and indexed with Zebra. You could then use Zebra as a stand-alone Z39.50 server, or run Koha on top for easy searching. Zebra can search millions of records in seconds, so it would be ideal, considering this is essentially bibliographic data. I am a public library IT guy, and would seriously be willing to help out if you can use me. Just send me an email ( kyle DOT m DOT hall AT gmail.com ). I can take the raw data and convert it to a bulk MARC record set for you, and probably even offer alternative hosting if you'd like.
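To give a rough idea of the conversion, the Python sketch below maps one article record onto commonly used MARC 21 tags: 100 (main entry, personal name), 245 (title statement), 773 (host item, here the magazine and issue) and 653 (uncontrolled index term). It only produces simplified (tag, subfield, value) tuples; it is not a real MARC serialization, and the `to_marc_fields` helper, the field choices, and the sample record are assumptions for illustration.

```python
def to_marc_fields(record):
    """Map an article record to simplified (tag, subfield, value) tuples."""
    fields = []
    if record.get("author"):
        fields.append(("100", "a", record["author"]))    # main entry, personal name
    fields.append(("245", "a", record["title"]))         # title statement
    if record.get("magazine"):
        fields.append(("773", "t", record["magazine"]))  # host item title
    if record.get("issue"):
        fields.append(("773", "g", record["issue"]))     # issue/page details
    for keyword in record.get("keywords", []):
        fields.append(("653", "a", keyword))             # uncontrolled index term
    return fields


# Invented record for illustration only.
article = {
    "title": "Building a Timber Trestle",
    "author": "Doe, Jane",
    "magazine": "Example Railroad Magazine",
    "issue": "March 1975, p. 42",
    "keywords": ["bridges", "scratchbuilding"],
}

marc_fields = to_marc_fields(article)
for tag, subfield, value in marc_fields:
    print("%s $%s %s" % (tag, subfield, value))
```

A real conversion would go through a MARC library and a cataloguer's review of the field mapping before anything is loaded into Zebra or Koha.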

CWIS Open Source Solution by Snowdog · 2010-07-19 06:33 · Score: 1

It sounds like CWIS may be what you're seeking. It's a free web-based turnkey package, developed at the University of Wisconsin - Madison and funded in part by NSF under the National Science Digital Library initiative. CWIS is written in PHP/MySQL, includes a search engine, a recommender engine, and a raft of other features, and is currently in use in a wide array of contexts.

134 comments