Developing a Niche Online-Content Indexing System?
tebee writes "One of my hobbies has benefited for 20
years or so by the existence of an online index to all magazine
articles on the subject since the 1930s. It lets you list the
articles in any particular magazine or search for an article by
keyword, title or author, refining the search if necessary by
magazine and/or date. Unfortunately the firm which hosts the
index have recently pulled it from their website, citing security
worries and incompatibilities with the rest of their e-commerce
website: the heart of the system is a 20-year-old DOS program! They
have no plans to replace it as the original data is in an unknown
format. So we are talking about putting together
a team to build a open source replacement for this – probably using
PHP and MySQL. The governing body for the hobby has agreed to host
this and we are in negotiations to try and get the original data. We
hope that by volunteers crowd-sourcing the conversion, we will be
able to do what was commercially impossible." Tebee is looking for ideas about the best way to go about this, and for leads to existing approaches; read on for more.
tebee continues:
"It occurs to me that there could be
existing open-source projects that do roughly what we want to do —
maybe something indexing academic papers. But two days of trawling
through script sites and googling has not produced any results.
Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!
So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"
Remember that here we only point to the original article, we don't have the text of it online, though it has been suggested that we expand to do this. Unfortunately I think copyright considerations will prevent us from doing it, unless we can get our own version of the Google book agreement!
So does anyone know of anything that will save us the effort of writing our system or at least provide a starting point for us to work on?"
Or did I misunderstand the question?
if you said what hobby and index is that. Doing so would surely catch more interest from the Slashdot crowd.
ask the pubmed people at NIH: http://www.ncbi.nlm.nih.gov/pubmed
I don't know if this would be helpful, but the people of Wikipedia must know a far amount about running crowed sourced sites. Even if you can't talk with the higher ups there would be contributors who would know about best practices. Also when you deal with people they would be a lot more helpful if they benefit from helping you.
Leverage the power of virtualization to run your legacy platform for now, and have time to come up with other solutions.
-- Robi
There is an annoying "business model" that drives most commercial websites for greed reasons, and spreads from them to non-commercial websites for no good reason at all except lemming effect. That is when the site has an interesting chunk of data but instead of putting it online to download, wraps a web application around it to deal it out in dribs and drabs, so that users have to keep returning, clicking ads, and so forth.
Yeah having some kind of online query interface can be useful and you should certainly implement one if you can. But much more important is the actual data. Make a zip file for download, no SQLor PHP needed. The SQL and PHP can be done later.
I would run the unix commands 'file' (you might get lucky and get a file type that it understands), 'strings' (to find any ASCII strings within the data) and 'hd' (hex dump) to figure out the structure of the data. My guess is that the data format isn't very complicated. If you figure out how the file is structured, you should be able to use C, or something akin to the 'pack' function found in Perl or Ruby to extract data, which you can load into a database.
I'm sure that Ruby on Rails could have a fully functional web site made from this data in about half an hour.
The downside is that if more than two people try to access the data, it will display a whale suspended by balloons.
(Please Note: This post is a joke, and not an attempt to start a flame war).
Custom, hands-free Linux installs. Instalinux
This may seem like a silly question, but if the data is in an unknown format and it's handled by an existing DOS program, why not just keep using that old DOS program? It still works, probably has low resources and is dealing with (I gather) mostly fixed data. Maybe bring in some people to try to reserve engineer the data format? But, really, just move the DOS program and data to a different server and save yourself months of effort.
You could write a custom program that would scrape the the data from a website you setup to allow that program to run stand alone or you figure out what the data format is and write a program to convert that.
If you want to recreate the data from scratch then you'd need to set up a website your group would access and enter data. That would be crowd sourcing but you'd probably want something specific to your needs but using easily maintainable code.
As others have stated you could use virtualization. Inside the virtual machine you may even be able to run a LAMP stack and run the DOS program with dosbox running as as an unprivileged user. http://www.dosbox.com/ http://www.virtualbox.org/ http://www.vmware.com/.
I would only consider the virtual solution a stop gap until you could get the database translated to something maintainable or recreate the data.
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
Running 20 years old code is a security disaster - now you want to replace it in what... php? Few people would call that an improvement.
See if you can get access to the site again, and screen scrape it. That should not be too hard (search for all articles beginning with "A", then "B", etc.). Then, it should be straightforward to enter it into MySQL or your database of choice.
(It is just possible the search functionality is still there, with just the HTML being taken down. The WayBack Machine could be your friend here...)
1 put each entry in text form
2 let google see it for a minute or two.
3 there is no step 3.
Ask guys behind the Pubmed
http://www.ncbi.nlm.nih.gov/pubmed
The database of scientific articles in the field of medicine and biology.
NCBI has the most generous software code licensing that is possible: the code is absolutely free, absolutely no restriction for distributing, changing, selling, even closing it. All because we, taxpayers, paid for it already.
I am surprised none of them reacted yet, I am sure they read ./
I do not believe in karma. "Funny"=-6. Do good and forbid evil. Yours, Oft-Offtopic Flamebaiting Troll.
Discogs is a reasonably good example of a community effort.
Sadly. it and others, like Foobar, are still controlled by selfish people.
... leap up and shout "Filemaker Pro! Cause it's so shiny and pretty!"
...
Oh, the number of times that I've heard that refrain... shudder
Three Squirrels
Use Drupal (http://drupal.org), with Apache Solr (http://lucene.apache.org/solr/ and http://drupal.org/project/apachesolr) for indexing. At the last Drupalcon (SF 2010), there were even presentations by library staff related to article indexing, etc. Some handy resources, but there are far more, this was just a 1m search based on the conference alone... http://sf2010.drupal.org/conference/sessions/build-powerful-site-search-user-friendly-easy-install-search-lucene-api-module , http://sf2010.drupal.org/conference/sessions/how-build-jobs-aggregation-search-engine-nutch-apache-solr-and-views-3-about , http://sf2010.drupal.org/conference/sessions/case-studies-non-profits-jane-goodall-and-musescore , http://sf2010.drupal.org/conference/sessions/case-studies-academia-drupal-asu-john-hopkins-knowledge-health
MySQL 5's Fulltext index with the "natural language search" option might do everything you need with almost no overhead. That, plus PHP's PDO to connect to the database, and I think you might be done. How much data are we talking, anyhow? 10,000 magazine articles or less?
You can use the Wayback Machine to get a partial snapshot of the site. Try http://web.archive.org/web/*/http://index.mrmag.com/tm.exe, then follow the links on the archived page. If you vary the URL a bit, you might see even more missing data.
I am guessing the index in question is this one:
http://index.mrmag.com/
They just devalued a bookcase full of magazines in my basement...
It doesn't matter a damn what you use to serve the stuff; what matters is that the data is stored in something preservable and long-lasting like XML, otherwise you'll be back here in a few years. By all means use PHP and MySQL to make it available, but don't confuse the mechanisms used to serve the information with the file format in which it is stored under the hood.
If you're talking, like most of the commenters above, about retrieving the data from the server through tm.exe, then this does become an exercise in scraping. wget has builtin recursive-fetching capabilities and if you can access a complete index that would be a logical starting point. With my background, if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql. I'd read the data file(s) looking for textual content in a linked structure, and the rest is just research and a bit of perl work (or php etc, if you prefer). Once you figure out which table structure would contain the data, and you come up with a conversion which will put the data into an importable format, the job's almost done and you just need to bring in or write a CMS to access it. I have source code which would go towards some individual bits of a project like this, contact me if you like. Good luck...
Take a look at http://hyperestraier.sourceforge.net/ ... there might be something newer by the same author, Mikio Hirabayashi
Extracting the text from whatever files you have would be a separate step.
http://stephan.sugarmotor.org
It also introduces the problem of people who download the whole data set just to collect it, with no intention of accessing the vast majority of it or serving it up to someone else.
wget has builtin recursive-fetching capabilities
Which will get the IP address of the machine running the scraper permanently banned. See the post above.
if at all possible I would bypass the exe and just look at importing the raw data into a relational database like mysql
It's likely that the raw data is encrypted. Based on the comment so far, I see no reliable indication of from what country tebee operates or whether this country has a DMCA-alike.
Short term it's true that can eat some bandwith, but long term that's the solution of the problem you're facing right now. If you could ask a data hoarder to give you a copy of the website which just disappeared, then you wouldn't be asking today about how to recreate it from scratch.
WikiPedia's search stinks in my opinion. It's gotten better of late, but still not the Gold Standard by any stretch.
Table-ized A.I.
Alfresco + Drupal
> One of my hobbies has benefited for 20 years or so by the existence of an online index to all magazine articles on the subject since the 1930s. [...] The governing body for the hobby has agreed to host this
Huh, I didn't realize that porn had a governing body.
Don't ask generic nerds -- ask library nerds : code4lib . They have a pretty active mailing list.
Also, there's oss4lib which is specifically for open source software, but I haven't seen much activity on their list in a while, and I think most of us are on both lists. (there's also a few cataloging specific lists, but they get to be all library-sciencey, with discussions of RDA and FRBR and cataloging aggregates).
Build it, and they will come^Hplain.
Lots of people here are recommending using tools that are built for very large scale projects. Based on the fact you have a DOS based system that likely used a pretty common library for storing the data (something like c-tree, btrieve, a dbase library or simply saving binary data using whatever language the app was written in), using any RDBMS like MySQL or even SQLite probably would do the job. PHP, Python, Ruby and Perl would probably make writing the actual application a snap - and be able to handle more of a load that the DOS app could.
Here's to hoping you can get the data. Hopefully the vendor that pulled the database down realizes how important to marketing it is and reverses course.
-- $G
As many posters have said, it should be easy (for a programmer) to pull the data from the DB -- if you can get the original data files from Kalmbach. The data was not complex, and 80's DBs tended to have simple file formats. As many suggested, a C++, Java, Python or other script can pull the data out and dump it to XML, MySQL, CSV files, etc. From there, it is easy to rehost it wherever needed.
My suggestion is to simply replicate the old (very dated, but simple) UI: both for searching and for data entry. That can be done very easily in PHP & MySQL. These tools are readily available on any web host making the task fairly simple (for someone familiar with these tools.) It also means that the site's webmaster should know what needs to be done to secure the app.
Getting a straight replacement up validates the whole process, and restores the existing functionality. Only at that point should you consider extending the system, perhaps using many of the good ideas noted above. Obvious extensions are to license the full text of articles to provide a full-text index (rather than just hand-entered keywords as in the current system.) Perhaps provide links to publishers sell them online. Lots of ways to go.
Good luck. As a user of the DB, I'd love to see it back online & better than ever!
You don't need niche software for this. You just need a simple database. It sounds like that's really all the existing solution is. Your data schema is simple enough that it would probably fit nicely in a couple of tables (Authors, Issues, and Articles come to mind). I think you're making this harder than it needs to be.
Why use PHP? I would think Python would be better because you can cross-compile the code to run on any machine using Jython(in case they stop hosting for you). Personally, I would do a full scrape of the data and put it in BibTex .bib files or xml and then make your search page pass parameters to the python program. That's what NASA and Google Scholar use(they may use Perl instead). I'm not sure about the database...
It may be overkill for what you want to do, but you should look at Evergreen, the open-source Integrated Library System (think card-catalog) used by the public libraries in the state of Georgia: http://www.open-ils.org/dokuwiki/doku.php?id=faqs:evergreen_faq_2 . It can certainly do what you want done, and a whole lot more. You can just ignore the parts about circulation (or strip them out). You may run into problems with library-specific jargon and standard practices that you don't necessarily need, but surely there's a librarian or two out there in the model railroad world.
A project very similar to Evergreen is Koha: http://koha.org/about
You may also want to look at LibraryThing: http://www.librarything.com/tour/. It's focused on books, but it may be possible to make it work with articles as well.
Well until he does get it, any consideration of how to process it is somewhat moot.
Which makes me wonder why he bothered asking, the fucking twerp.
Postgresql is pretty easy to set up full text indexes, if you're trying to make it a database-ish application. It's really flexible for stuff like this.
SWISH is a good, non-database index as well.
Of course, someone else already said lucene.
PHP might be "ok" for the web interface (especially if nothing else is available) but I wouldn't even think of using it to populate the index initially.
Check out Dspace (http://www.dspace.org/). I'm by no means an expert in the area but it seems it might be what you need.
As a Slashdot discussion grows longer, the probability of an analogy involving cars approaches one.
Something like an open source hypercard stack?
Anyone can understand a card system, enter unique data per card and save.
Humans are good at that.
Bring them all together and you have a huge digital stack to be sorted, searched or as the backend to a nice simple topic interface.
Computers are great at that now.
That would help your crowd sourcing if its open source no MS closed issues later on.
Domestic spying is now "Benign Information Gathering"
If it's 20 years old DOS, chances are that it's either Paradox or dBASE or any xBASE format, which could be easily opened with Access or even Winword.
Your suggestions make sense, but suggesting to store comma-delimited plain text in a SQL table is wrong by any and all database standards & best practises. You fail to reach even the first normalized form.
Read http://en.wikipedia.org/wiki/Database_normalization
You want to define a table "tags" or something with id, article_id, name, comment. Make the combination of id, parent_id, name unique.
* id is on auto-increase, not NULL
* article_id is a foreign key to the id of the article, not NULL
* name is the name of the tag, not NULL
* comment is an optional comment explaining the tag (for example in the mouse-over or on the site listing everything with that tag), may be NULL
Not only is that easier to maintain in the long run (think of parsing plain text out of a VARCHAR. argh!), but all of a sudden, you have the data you _store_ available to _access_.
How many artcles are tagged electric? SELECT count (1) FROM article_tags WHERE name = "electric";
Give me a list of all article relating to foo and bar? SELECT article_id FROM article_tags WHERE name = "foo" OR name = "bar".
etc pp.
If you want to go really fancy with multi-level tags, replace article_id with parent_id (referring to the id in the same table) and create a relation table as glue. If you want all upper levels to apply, throw in a transitive closure:
http://en.wikipedia.org/wiki/Transitive_closure
Generally speaking, you want a table for magazines with their names, publication dates, publisher, whatnot; and only refer to them via foreign keys. Same goes for train models (which you could cross-ref via tags. Yay for clean db design!), authors, collectors, train clubs and and pretty much everything else.
One last word of advice: No matter what anyone tells you: Either you use a proper framework or you _ALWAYS_ use prepared statements. You get some performance benefits and SQL injection becomes impossible, for free! Repeat: Even if you ignore all the other tips above, you _MUST heed this.
http://en.wikipedia.org/wiki/SQL_injection
Richard
PS: You are more than welcome to reply to this post once you have your DB design hammered out. I will have a look & optimize, if you want.
It may not be a complete solution, but have you looked at BibTeX? BibTeX itself is only a format for nicely stating the information you have available (which magazine, article title, which pages in the magazine, authors, etc), but in the entire BibTeX ecosystem a number of indexing systems are built. Quite a lot of them are for desktop use (so you can manage your own BibTeX entries), but I'd imagine there would be some web-based system for this as well.
"probably using php and mysql"
That's nice. Why are you asking then?
It seems to me that the data is a static index that is entirely read-only except for casual updating -- how often a year? Does the system need to stay online during updating then? Meaning that you don't really need an RDBMS or the poor imitation mysql gives you (yes yes it's shiny and oracle owns it, now shup). Some SQL interface might seem useful, but how many different queries are you going to write? How large are you expecting your userbase to scale?
Just to give an example of a different approach: You could probably write a couple small shell scripts to generate lists of things sorted a couple different ways into static html pages from some master file. That's duplicating the data alright, but how much is it really? Plus it'll compress really well and serving up gzipped content saves a lot on bandwidth too. Serving static html is almost always going to be faster and easier to maintain and so on than executing a scripting language that accesses a database backend for each pageview. Mind, you don't need to stick to static pages for everything, but with a little scripting you'd have something to show for your efforts within minutes, and you can expand in your copious free time later.
This brought to you by a someone who wrote a static pages "cms" complete with custom markup and an index generator in a couple hundred lines awk. Want to do it in perl? Show me the one-liner. Point is, the most well-trodden path isn't always the best for any given problem. In the case of assuming PHP+MySQL, it's the obvious choice for people who don't know any better. And ignoring the nature of the problem and the properties of the data is a bit of a pity.
Are you sure it's a true DOS application and not a Win32 Console App? I know it is entirely possible for someone to write a CGI in DOS but it seems really weird to me that they would use DOS since it didn't have anything that would server CGI, and coding a hand rolled database format would be a lot of extra work.
If it is using Win32 it might just be accessing a DAO database without using the mdb extension, which many companies do to make it look like a proprietary format you can't just open with MS Access. If you look at the raw data it might look crazy and unusable because JET databases use XOR to obfuscate the contents of the database file (and prevent you from extracting the strings inside).
The clash of honour calls, to stand when others fall.
It simply cannot be the case that the original data is in an "unknown" format. If it were, it would never have been retrievable. The format might not be known to YOU, at this particular time, but that is not the same as unknown.
Your first priority is to find out how the original data is stored and accessed. If as you say it is about 20 years old, I strongly suspect it is stored in a C-ISAM or D-ISAM database, and known code libraries are used to access it.
You should then be able to lean heavily on existing code for retrieval of that information, and using a modern scripting language, transfer the data to a new, normalized relational database.
You make it sound like this is some kind of archeological adventure into some kind of untranslatable code hieroglyphics of the distant past. But as I say that cannot be so, unless it was completely designed from the ground up by a single eccentric individual. The clues and the tools are there, if you know where to find them.
There are file formats for this. Probably there are XML languages if you like that kind of thing, but either of two older ones would serve you well I think: the refer(1) format for bibliographic databases, and the BibTeX format. At least the latter is still in use and you can download such indices for various journals -- see the Wikipedia BibTeX entry.
When you have fixed the file format, *then* you can decide on indexing strategies and software. Probably you'll find that someone has already done that for your file format. Mike Lesk did it for refer back in the 1970s ...
Just make one xhtml document with semantic annotations There are plenty of solutions for indexing this information. Also a simple google site: will often suffice for most queries.
Brewster Kahle is building a truly huge digital library -- every book ever published, every movie ever released, all the strata of web history ... It's all free to the public.
A video describing his project can be found at :
http://www.ted.com/talks/lang/eng/brewster_kahle_builds_a_free_digital_library.html
Good luck with your project.
This sounds almost exactly like a library catalog system. If the system doesn't index articles, then just treat each article as a book in a multi-volume set. I know that several open source library system exist. Look into those.
Seriously, first step, back up *EVERYTHING*. This includes your programs and your data.
Then see if your ancient programs can be run inside a useful modern emulation enviornment, like "dosbox" or "freedos". That can buy you another 10 years.
It also buys you access to the data without using your ancient hardware: you can read the backups and play with the data much more safely, to try and decode the format. Given the software's age, it's unlikely to be more sophistated than a very simply index and tables that may be decodable with a good editor.
If you or someone can get me the database files (from Kalmbach?) I am willing
to try to extract useful data from them, into simple ASCII text file(s), suitable
for loading into a relational database like Postgresql, for free.
Given its in the hands of a software firm not known for its openness - Oracle, why choose MySQL?
I checked to see if Zotero (http://www.zotero.org/ - Firefox bibliographic plugin) had previously been mentioned, and came up with this thread on building a personal DB of references. http://ask.slashdot.org/story/09/04/08/1939248/Building-a-Searchable-Literature-Archive-With-Keywords
As of version 2.0, Zotero (Firefox plugin) has features allowing sharing between individuals and creation of interest groups, and its import filters mean that if some of your material is already indexed, eg in WorldCat or in Google Scholar, you won't have to create entries from scratch (though you'll probably have to clean and add keywords).
Why change the system? Just switch to FreeDOS and change your ISP to one who runs Linux.
"I fooled you, I fooled you. I've got pig iron, I've got pig iron. I've got all pig iron"
A little off topic. This is already done by for profit companies.
For Model Railroader, full text is available, online, in the database "Masterfile Premier", for 2001 to present. This database has bibliographic information (title/author, etc) from 1996-present. In other words, you can search inside a very complete interface, and get a fair amount of full text. This is likely free if your public library, your school, or your state consortium subscribes to Masterfile Premier. (My public library does).
(Yes, I know you want data from 1930 on. Well, that might come in future years... they are always indexing more data).
You need to research how you can access this... EBSCO tries to make it arcane to access their website. To see what you can access from your own public library, go to your public library web site and check out "electronic resources" or "databases". All you need is a library card. You will likely find free online text of journals from companies such as GALE, EBSCO, and ProQuest.
A bit more background.... EBSCO sells subscriptions for journal databases to State-wide consortia, public libraries, state libraries, and schools. (Junior to universities). These subscriptions are paid for by your taxes. Use the service - you likely already pay for it. An example of a state consortium is Indiana's INCOLSA.
If you don't use these services, these companies are getting free money.
Feed all the data into a blog, one magazine edition per blog entry. Some blog software lets you set the date of the entry - use that to set the date to the edition publication date. Enter the keywords as blog tags. You can expand the blog entry to contain such things as (e.g.) a picture of the cover, some short descriptive text. You then also get a free forum where people can discuss the edition in question.
Your need seems to be similar (but on a smaller scale) to what a public library uses:
- index books and magazines by title, publication date, author, keywords
- query the index
Koha is the major open source software in this area but probably too big for you. However the Wikipedia page links to others software which may be of interest:
- PhpMyBibli
- Alexandria
- OpenBiblio
- GCstar
Also I suggest you to have a look into dedicated modules for CMS platforms such as Drupal. Drupal has the eXtensible Catalog Drupal Toolkit which probably has the features you need.
The need of the poster is like a small library software.
The eXtensible Catalog Drupal Toolkit is probably a better fit.
Okay, so I'm a model railroader and an officer in the NMRA's Midwest Region who heard all the uproar at the convention this weekend. I have a large stack of magazines and can help with typing index items from them. Is there a website or some location where work is being coordinated to get this going again? I have some webspace available (like maybe a subdomain of my model railroad website, http://riptrack.net) for such a coordination page if needed.
The data could easily be converted into MARC bibliographic records and indexed with Zebra. You could then use zebra has a stand-alone Z29.50 server, or run Koha on top for easy searching. Zebra can search millions of records in seconds, so it would be ideal, considering this is essentially bibliographic data. I am a public library IT guy, and would seriously be willing to help out if can use me. Just send me a email ( kyle DOT m DOT hall AT gmail.com ). I can take the raw data and convert it to a bulk marc record set for you, and probably even offer alternative hosting if you'd like.
It sounds like CWIS may be what you're seeking. It's a free web-based turnkey package, developed at the University of Wisconsin - Madison and funded in part by NSF under the National Science Digital Library initiative. CWIS is written in PHP/MySQL, includes a search engine, a recommender engine, and a raft of other features, and is currently in use in a wide array of contexts.