Slashdot Mirror


Building a Fast Wikipedia Offline Reader

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."
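
The title-word search in highlights (2) and (3) can be illustrated with a rough sketch. This is plain Python standing in for the Xapian index, not the submitter's code: the function names `build_title_index` and `search`, the sample titles, and the scoring rule (number of matching query words, shorter titles first on ties) are all assumptions made for illustration. Xapian's own probabilistic ranking is not reproduced here.

```python
from collections import defaultdict


def build_title_index(titles):
    """Map each lowercase title word to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index


def search(index, query):
    """Return every title sharing a word with the query, best match first.

    Assumed scoring: one point per distinct query word found in the title.
    Ties go to the shorter title, then alphabetical order.
    """
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for title in index.get(word, ()):
            scores[title] += 1
    return sorted(scores, key=lambda t: (-scores[t], len(t.split()), t))
```

With titles such as "Alan Turing", "Turing machine" and "Machine learning", a query for "turing machine" would list "Turing machine" first because it matches both words, and the reader picks from the candidates that follow.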

5 of 208 comments

  1. Re:2X by Brian+Gordon · · Score: 5, Informative

    Ahaha, 2.9GB? That's the text alone. Images will net you more than 200GB on top of that. And yes, you do need a LAMP/WAMP stack and a working MediaWiki install, but it wouldn't take 'days'; it would take a few hours at most. Also, is this guy aware that Wikipedia is available on DVD already?

  2. Re:2X by TubeSteak · · Score: 5, Informative

    "Also is this guy aware that wikipedia is available on DVD already?"

    Are you aware that the link you pointed to (1) is not the same thing as the link the author pointed to (2)?
    (1) http://schools-wikipedia.org/
    (2) http://download.wikimedia.org/enwiki/latest/

    (1) is 4625 articles hand-picked for school-age children, hence the website name
    (2) is a straight dump of Wikipedia

    Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
    --
    [Fuck Beta]
    o0t!
  3. It doesn't take days by BReflection · · Score: 4, Informative

    It only takes days if you use the php import script to import the sql dump, which was not designed for importing the entire dump.

    Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL and then takes a few hours to import into MySQL. Please note that to run a local copy of Wikipedia efficiently you need a properly configured MySQL server with at least 8GB of RAM.

    http://meta.wikimedia.org/wiki/Xml2sql

    --
    python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
  4. SDHC? by tepples · · Score: 4, Informative

    "Why? Have you seen the price of 4G flash cards recently?"

    Yes, and it's possible that you may need a new PDA in order to use SD cards larger than 2 GB. The 4 GB ones use a different protocol called SDHC that older PDAs may not support. It's analogous to the old ATA hard disk size barriers, especially the 137 GB (128 GiB) barrier. Or are most PDAs capable of being upgraded to handle SDHC?
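
For anyone wondering where the two figures for the ATA barrier come from, here is the arithmetic as a quick sketch. It assumes the usual explanation, which the comment itself does not spell out: 28-bit sector addresses and 512-byte sectors. The constant names are made up for this example.

```python
# Assumptions (not stated in the comment): 28-bit LBA, 512-byte sectors.
LBA_BITS = 28
SECTOR_BYTES = 512

# Largest capacity addressable with that many sector addresses.
max_bytes = (2 ** LBA_BITS) * SECTOR_BYTES

decimal_gb = max_bytes / 10**9   # decimal gigabytes, as printed on drive labels
binary_gib = max_bytes / 2**30   # binary gibibytes

print(f"{max_bytes} bytes = {decimal_gb:.1f} GB = {binary_gib:.0f} GiB")
```

The same byte count reads as roughly 137 GB in decimal units and exactly 128 GiB in binary units, which is why both numbers show up for one barrier.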
  5. Re:There's a bug in TFA: Missing articles. by ttsiod · · Score: 4, Informative
    (Re-post: for some reason the response I sent some hours ago didn't appear) No, actually there is no bug. If you read the contents of the 'show.pl' script, you'll see that it adapts to a missing '' by reading from the next volume - the next recxxx...bz2.

    As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (like other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
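
The fallback of reading on into the next volume can be sketched roughly as follows. This is Python, not the actual 'show.pl' (which is Perl and is not reproduced here), and it makes several assumptions: the function name `read_article` is hypothetical, the volumes are passed in as a list of independently compressed bzip2 blobs rather than opened from the recxxx...bz2 files, and the closing tag to look for is a caller-supplied parameter.

```python
import bz2


def read_article(volumes, index, offset, end_tag):
    """Return the bytes of one article starting at `offset` in volume `index`.

    `volumes` is a list of independently compressed bzip2 blobs (assumed
    stand-ins for the split files). If `end_tag` is not found in the current
    volume, keep appending the decompressed text of the following volumes
    until it turns up or the volumes run out.
    """
    data = bz2.decompress(volumes[index])[offset:]
    nxt = index + 1
    while end_tag not in data and nxt < len(volumes):
        data += bz2.decompress(volumes[nxt])
        nxt += 1
    end = data.find(end_tag)
    # If the tag never appears, return whatever was recovered.
    return data if end < 0 else data[: end + len(end_tag)]
```

Because the search for the closing tag runs over the concatenated output, this sketch also copes with a tag that is itself cut in half at a volume boundary, so it does not rely on tags staying whole.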