Slashdot Mirror


Building a Fast Wikipedia Offline Reader

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
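To make highlights (2) and (3) concrete, here is a minimal sketch of title-word searching with ranked results. It is not the submitter's code and it does not use Xapian; it is a plain-Python stand-in, and the function names (`tokenize`, `build_index`, `search`) and the scoring rule (fraction of query words found in the title) are assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())

def build_index(titles):
    """Map each title word to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        for word in tokenize(title):
            index[word].add(title)
    return index

def search(index, query, limit=10):
    """Return (title, score) pairs; score is the fraction of query words matched."""
    words = set(tokenize(query))
    if not words:
        return []
    hits = defaultdict(int)
    for word in words:
        for title in index.get(word, ()):
            hits[title] += 1
    # Best match first; ties go to the shorter title, then alphabetical order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
    return [(title, count / len(words)) for title, count in ranked[:limit]]

# Hypothetical sample titles, not taken from the dump.
titles = ["Fourier transform", "Fast Fourier transform",
          "Joseph Fourier", "Laplace transform"]
index = build_index(titles)
results = search(index, "fourier transform")
for title, score in results:
    print(f"{score:.2f}  {title}")
```

A real full-text engine such as Xapian adds stemming, on-disk storage and proper probabilistic weighting; the sketch only shows why a word-to-title index lets several candidate articles be offered in ranked order.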

10 of 208 comments

  1. Re:2X by Brian+Gordon · · Score: 5, Informative

    Ahaha, 2.9GB? That's the text alone. Images will net you more than 200GB more. And yes, you do need a LAMP/WAMP and a working MediaWiki, but it wouldn't take 'days'; it would take a few hours max. Also is this guy aware that wikipedia is available on DVD already?

  2. Re:Uh.... by stephanruby · · Score: 3, Informative

    Programming something new to some people is like playing a video game.
    Speaking of which, http://www.pyweek.org/ is coming up this first week of September. It's time to dust off that python book (or borrow one from someone) and do whatever you have to do to get some days off that week.
  3. Re:2X by TubeSteak · · Score: 5, Informative

    "Also is this guy aware that wikipedia is available on DVD already?"

    Are you aware that the link you pointed to (1) is not the same thing as the link (2) the author pointed to?
    (1) http://schools-wikipedia.org/
    (2) http://download.wikimedia.org/enwiki/latest/

    (1) is 4625 articles hand-picked for school-age children, hence the website name
    (2) is a straight dump of Wikipedia

    Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
    --
    [Fuck Beta]
    o0t!
  4. It doesn't take days by BReflection · · Score: 4, Informative

    It only takes days if you use the php import script to import the sql dump, which was not designed for importing the entire dump.

    Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL and then takes a few hours to import into MySQL. Please note that you need a properly configured MySQL server, with at least 8GB of RAM, in order to efficiently run a local copy of Wikipedia.

    http://meta.wikimedia.org/wiki/Xml2sql

    --
    python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
  5. Mass inserts into mysql... by Splab · · Score: 3, Informative

    is very very slow when you do it on a normal installation. The reason is that MySQL comes with a "be nice to people who don't know what they are doing" setup. Go into my.cnf, find the buffer settings, crank them up and restart the server. It can really do a lot (especially if you are running InnoDB, which you of course are since MyISAM isn't a proper database).

  6. SDHC? by tepples · · Score: 4, Informative

    "Why? Have you seen the price of 4G flash cards recently?"

    Yes, and it's possible that you may need a new PDA in order to use SD cards larger than 2 GB. The 4 GB ones use a different protocol called SDHC that older PDAs may not support. It's analogous to the old ATA hard disk size barriers, especially the 137 GB (128 GiB) barrier. Or are most PDAs capable of being upgraded to handle SDHC?
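For readers wondering where the 137 GB (128 GiB) figure comes from, the arithmetic can be checked in a few lines. The sketch assumes the usual explanation of that barrier: the original ATA logical block addressing used 28-bit sector numbers with 512-byte sectors. The constant and function names are made up for the example.

```python
SECTOR_BYTES = 512   # traditional ATA sector size (assumption stated above)
LBA_BITS = 28        # width of a sector address in the original ATA LBA scheme

def addressable_bytes(lba_bits, sector_bytes=SECTOR_BYTES):
    """Largest capacity reachable with the given address width."""
    return (2 ** lba_bits) * sector_bytes

limit = addressable_bytes(LBA_BITS)
gib = limit / 2**30   # binary gigabytes
gb = limit / 10**9    # decimal gigabytes, as printed on drive labels

print(f"{limit} bytes = {gib:.0f} GiB = {gb:.1f} GB")
# prints "137438953472 bytes = 128 GiB = 137.4 GB"
```

The same capacity reads as 128 or 137 depending on whether the binary or the decimal unit is used, which is why both numbers appear in the comment.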
    1. Re:SDHC? by SCHecklerX · · Score: 2, Informative

      You can get a normal 4GB SD card from Transcend. I am using one in my sandisk sansa e140, which is definitely NOT SDHC compatible.

  7. better yet, a DS version by tepples · · Score: 3, Informative

    "A PSP is very portable (fits in your sweater/backpack), hackable"

    You have to buy a used PSP to be sure that you can hack it. New ones are likely to come with firmware version 3.51 or later, which is not cracked as of August 2007. The Nintendo DS, on the other hand, had its last major firmware update in September 2005 and is still cracked, with SLOT-1 modchips available at Wal-Mart for $30.

    "and has up to 8Gb of storage"

    So does a CompactFlash card in a GBA Movie Player in a Nintendo DS. It's a pity that the SLOT-1 adapters for DS haven't been shown to be compatible with SDHC.
  8. Re:Days? Please clarify by pla · · Score: 2, Informative

    "Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are duplicating a bunch of local clones of wiki, then simply copy down the raw MySql table data files rather than reload from delimited files etc. (One needs to make sure their version of MySql is compatible with the table file format.)"

    I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.

    DBs have their place. For a "real" Wiki, or more generally any data collection scenario where you can have a designated server, using a SQL store makes perfect sense.

    In most situations, however, the overhead of running a real database on the end-user's machine makes no sense (for the record, I consider this one of the biggest non-bug flaws with Vista, though I realize you can technically turn it off - With the resulting loss of functionality). The exact project mentioned in the FP forms the perfect example of this - He doesn't want to run a Wiki, he just wants to take a dump of it, do text-based searching, and extract pages in some reasonable form. Why would he want to even consider importing a nice single XML file into a memory-hungry form, from which he would still need a set of frontend tools to extract the desired data and convert it to a convenient viewable form?
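As a rough illustration of working on the XML dump directly, without a database, the sketch below streams pages out of a bzip2-compressed MediaWiki-style export using only the Python standard library. It is not the code from the article; `iter_pages` is a hypothetical helper, and the tiny in-memory sample stands in for the real dump file.

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_pages(stream):
    """Yield (title, text) pairs from a MediaWiki-style XML stream, one page at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop any XML namespace prefix
        if tag == "title":
            title = elem.text or ""
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, None
            elem.clear()                     # free the finished page's subtree

# Hypothetical miniature dump; a real one would be opened with bz2.open(path, "rb").
sample = (b"<mediawiki>"
          b"<page><title>A</title><revision><text>alpha</text></revision></page>"
          b"<page><title>B</title><revision><text>beta</text></revision></page>"
          b"</mediawiki>")
compressed = io.BytesIO(bz2.compress(sample))
with bz2.open(compressed, "rb") as stream:
    pages = list(iter_pages(stream))

print(pages)
```

Streaming like this still means decompressing from the start of the file to reach a given article, which is presumably why an approach built on a single .bz2 needs some way of jumping to the right part of the archive.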

  9. Re:There's a bug in TFA: Missing articles. by ttsiod · · Score: 4, Informative
    (Re-post: for some reason the response I sent some hours ago didn't appear) No, actually there is no bug. If you read the contents of the 'show.pl' script, you'll see that it adapts to a missing '' by reading from the next volume - the next recxxx...bz2.

    As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and encodes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
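The "read on from the next volume" behaviour described above can be sketched as follows. This is not the 'show.pl' script, which is Perl and is not reproduced in this thread; `read_article` is a hypothetical helper, the volumes are in-memory stand-ins for the split .bz2 files, and the closing marker used in the demo (`</text>`) is only an assumption chosen for illustration.

```python
import bz2

def read_article(volumes, start_volume, start_marker, end_marker):
    """Return bytes from start_marker through end_marker, continuing into later volumes.

    `volumes` is a list of independently bzip2-compressed byte strings.
    Returns None if either marker cannot be found.
    """
    data = bz2.decompress(volumes[start_volume])
    begin = data.find(start_marker)
    if begin < 0:
        return None
    data = data[begin:]
    next_volume = start_volume + 1
    search_from = 0
    while True:
        end = data.find(end_marker, search_from)
        if end >= 0:
            return data[: end + len(end_marker)]
        if next_volume >= len(volumes):
            return None
        # Rescan the tail too, in case the end marker straddles the boundary.
        search_from = max(0, len(data) - len(end_marker) + 1)
        data += bz2.decompress(volumes[next_volume])
        next_volume += 1

# Hypothetical two-article dump, cut in the middle of article B.
dump = (b"<page><title>A</title><text>alpha</text></page>"
        b"<page><title>B</title><text>beta</text></page>")
cut = dump.index(b"beta") + 2
volumes = [bz2.compress(dump[:cut]), bz2.compress(dump[cut:])]

article = read_article(volumes, 0, b"<title>B</title>", b"</text>")
print(article)
```

Working on bytes rather than decoded text avoids trouble if a volume boundary falls inside a multi-byte character. The sketch only covers an article whose end spills into a later volume; a start marker that is itself cut in two would need the same tail-rescanning treatment.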