Slashdot Mirror


Building a Fast Wikipedia Offline Reader

ttsiod writes "An internet connection is not always at hand. I wanted to install Wikipedia on my laptop to be able to carry it along with me on business trips. After trying and rejecting the normal (MySQL-based) procedure, I quickly hacked a much better one over the weekend, using open source tools. Highlights: (1) Very fast searching. (2) Keyword (actually, title words) based searching. (3) Search produces multiple possible articles, sorted by probability (you choose amongst them). (4) LaTeX based rendering for mathematical equations. (5) Hard disk usage is minimal: space for the original .bz2 file plus the index built through Xapian. (6) Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL — which, if you want to enable keyword searching, takes days."

208 comments

  1. Wow! by ferrocene · · Score: 2, Funny

    After doing all that, I think you may have missed your flight! :)

    --
    Most folk'll never lose a toe, and then again some folk'll...
    1. Re:Wow! by Anonymous Coward · · Score: 0

      Duh... I guess this dipshit (and everyone else here) does not comprehend the difference between a full text table index on the article bodies versus an actual "index" of topic titles, and the implication on search functionality.

      I'm guessing this is the first time he has encountered the length of time it takes to create indices on text fields in a large db. You see, to get fast searches on huge amounts of text one must do a little work in advance.

      Also I'd bet that the default offline wikipedia scripts define the table with indices, then import the data, forcing the index to be updated with every insert (will take a painfully long time). I bet that a simple modification of the scripts to define the tables without indices, import the data, then add the indices would speed it up considerably. (though index creation on large text databases will always take time no matter how you cut it)

      A searchable zipped file of article titles? Whoopee fuckin doo.

      RTFM assholes:
      http://dev.mysql.com/doc/refman/4.1/en/indexes.html
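
The load-first, index-afterwards idea suggested in this comment can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `articles`, the index name `idx_articles_title`, and the function `load_then_index` are all hypothetical, not taken from the actual Wikipedia import scripts.

```python
import sqlite3

def load_then_index(rows):
    """Bulk-load (title, body) rows first, then build the title index once."""
    conn = sqlite3.connect(":memory:")
    # The table is created WITHOUT any index, so inserts stay cheap.
    conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
    with conn:  # one transaction for the whole import
        conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)
    # The index is built a single time, after all the data is in place.
    conn.execute("CREATE INDEX idx_articles_title ON articles (title)")
    return conn

conn = load_then_index([("Coffee", "..."), ("Xapian", "...")])
print(conn.execute(
    "SELECT title FROM articles WHERE title = ?", ("Coffee",)
).fetchone())
```

Whether this ordering actually speeds up the real MySQL import would have to be measured; the sketch only shows the order of operations.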

    2. Re:Wow! by ivlianvs · · Score: 1

      Thanks! And now, what about a PDA version???

    3. Re:Wow! by CoolVibe · · Score: 1

      Be my guest... You do need that 2.9 GB file somewhere on your PDA first. That just *might* be an issue.

    4. Re:Wow! by Eunuchswear · · Score: 1

      Be my guest... You do need that 2.9 GB file somewhere on your PDA first. That just *might* be an issue.

      Why? Have you seen the price of 4G flash cards recently?
      --
      Watch this Heartland Institute video
    5. Re:Wow! by jj421 · · Score: 1

      There *IS* a PDA version using http://tomeraider.com/. The 2005 dump of the database weighs in at about 1.1GB without images (or less if you want a smaller, less complete set). For a 2007 version of the database, you can buy a version on DVD or process it yourself using the instructions here: http://infodisiac.com/Wikipedia/index.html.

    6. Re:Wow! by CoolVibe · · Score: 1

      Yes, in the USA... I live in the Netherlands where 2GB flash is just getting cheap.

    7. Re:Wow! by nschubach · · Score: 1

      Calm down... This is the world run by Microsoft. The gimme-now generation. As long as it's fast and easy to make, it doesn't matter if it takes a cluster of quad-core processors to run it. Just cut and paste any code you find online that works and be done with it.

      --
      Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
    8. Re:Wow! by Anonymous Coward · · Score: 0

      4GB is less than 40 Euro.

    9. Re:Wow! by Anonymous Coward · · Score: 0

      Man, do you have some sand in your vagina or something?

    10. Re:Wow! by slapout · · Score: 1

      I wonder if the 2.9 GB includes photos. If so, removing them might get it down to below 2GB and fit on a flash card.

      --
      Coder's Stone: The programming language quick ref for iPad
  2. Wow by jrwr00 · · Score: 1

    Great Job, that is the power of open source for ya.

    Now we need to work on porting that to other OSes and we will be set.

    1. Re:Wow by Dusty101 · · Score: 1

      Indeed. Any chance of a port of this for the Nokia Internet Tablets? This'd work nicely on one of those with a big SD card...

    2. Re:Wow by Anonymous Coward · · Score: 0

      Great job if it didn't have a fat bug. He uses bzip2recover to split the original 2.9GB data file into more manageable chunks without decompressing and recompressing. So each file boundary will almost certainly have a wikipedia article split across it. His code doesn't handle this case. With almost 14,000 files, that's a lot of articles that won't work properly.

      Easy fix: use a simple script to split the data into similar sized chunks that do align with article boundaries. Yeah, it will take a bit longer to install, but it will work right.
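
The "easy fix" described above might look something like this minimal sketch. It assumes the dump puts each closing `</page>` tag on a line of its own; the function name `split_on_page_boundaries` and the `max_chars` parameter are made up for illustration and are not part of the original poster's scripts.

```python
def split_on_page_boundaries(lines, max_chars):
    """Yield chunks of text, cutting only right after a closing </page> tag."""
    chunk, size = [], 0
    for line in lines:
        chunk.append(line)
        size += len(line)
        # Cut only when the chunk is big enough AND an article has just ended.
        if size >= max_chars and line.strip() == "</page>":
            yield "".join(chunk)
            chunk, size = [], 0
    if chunk:  # whatever is left after the last full-size chunk
        yield "".join(chunk)

sample = [
    "<page>\n", "<title>A</title>\n", "</page>\n",
    "<page>\n", "<title>B</title>\n", "</page>\n",
]
for piece in split_on_page_boundaries(sample, max_chars=10):
    print(repr(piece))
```

In practice the lines could be streamed from the dump with `bz2.open(path, "rt")` and each chunk written back out with `bz2.open(out_path, "wt")`, which costs a decompress/recompress pass but keeps every article inside a single file.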

    3. Re:Wow by Anonymous Coward · · Score: 0

      I currently use sdictionary for my offline wikipedia solution.

  3. Re:Why? by bn557 · · Score: 3, Insightful

    This may seem like a stupid, trivial, and pointless project, but the programmer may have gained something from it that he could use later in something you don't feel that way about. If the programmer enjoyed doing it, that might have led to a more productive coding session later in the day too.

    --
    Humans are slow, inaccurate, and brilliant; computers are fast, accurate, and dumb; together they are unbeatable
  4. Re:Why? by rabblerabble · · Score: 5, Funny

    I'll bite...Unfortunately, I don't have a basement, so therefore there are times that I am required to venture into the outer realm that happens to be heated by the big ball of gas known as Sol, as opposed to a pump ;P Seriously though, this is exactly what I have been looking for. What better way to show up your friends when they cry "You're wrong, google it!" knowing that there is no connection possible within twenty miles. Next time I'm drunk at the beach and someone wants to pretend to know the history of coffee harvesting, it's on.

  5. Ho-Hum ... by jabberwock · · Score: 5, Funny
    What, no auto update? No User Agreement? No disabled features that are enabled by a mammoth key? No product registration?


    Let us know when you're ready for prime time ... ;-)

    1. Re:Ho-Hum ... by OzRoy · · Score: 4, Insightful

      Auto-update would be interesting. How do you keep the data up to date without downloading the entire 2.9G again? Is there some sort of diff file you can download?

    2. Re:Ho-Hum ... by Anonymous Coward · · Score: 1, Interesting

      I think Wikipedia does in fact release diffs, but I wouldn't swear it.

    3. Re:Ho-Hum ... by MikkoApo · · Score: 1
      This is a nice example of combining different tools to produce a working solution. Auto-update feature would be required though, because the process seems slightly broken and will lose parts of the data.

      He splits the bz2 file into 900kb blocks. The original XML's tags might get broken when a start tag ends up in a different block than the end tag. When run, the bash script will ignore all the broken tags.

      Fixing that is probably pretty straightforward, but requires a bit more careful XML handling. Anyways, a nice effort and made me want to try the same in a different language :)

    4. Re:Ho-Hum ... by Anonymous Coward · · Score: 0

      Because of the way wikis work, I would not expect much data to change in the database, only to be added, i.e. addition of new articles, or new versions of articles (the most important data anyway). In SQL terms, there will be more INSERTs than UPDATEs. So in principle you could simply run a query to get all additions to the database after the last update time to keep your local copy in sync. Of course, I don't know if WP has any APIs available to let you do this...
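
As a rough sketch of the "get all additions after the last update time" idea: the example below uses Python's sqlite3 module with a made-up `revisions` table and a hypothetical helper `additions_since`. It does not reflect Wikipedia's real schema or any API it may offer, and it assumes timestamps are stored as ISO 8601 strings in one consistent format so that string comparison matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical append-only table: every edit is a new row, never an UPDATE.
conn.execute("CREATE TABLE revisions (title TEXT, body TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO revisions VALUES (?, ?, ?)",
    [
        ("Coffee", "old text", "2007-07-01T00:00:00Z"),
        ("Coffee", "new text", "2007-08-15T00:00:00Z"),
        ("Xapian", "first version", "2007-08-20T00:00:00Z"),
    ],
)

def additions_since(conn, last_sync):
    """Return (title, ts) for every revision added after last_sync."""
    cur = conn.execute(
        "SELECT title, ts FROM revisions WHERE ts > ? ORDER BY ts",
        (last_sync,),
    )
    return cur.fetchall()

print(additions_since(conn, "2007-08-01T00:00:00Z"))
```

A local copy would then only need to fetch and store the rows returned here, instead of the whole 2.9G dump.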

    5. Re:Ho-Hum ... by MichaelSmith · · Score: 2, Funny

      How do you keep the data up to date without downloading the entire 2.9G again?

      Not too hard if you have a sub-etha net connection handy. Better check that the article about The Earth which you have been working on hasn't been cut down to two words though.

    6. Re:Ho-Hum ... by Anonymous Coward · · Score: 0

      Check out www.webaroo.com. It satisfies all of them, and has better SNR, since a lot of spam is cleaned off.

    7. Re:Ho-Hum ... by gwern · · Score: 1

      Looking through WP:DUMP, there doesn't seem to be. However, since the dumps are provided, someone else could easily set up a rsync server or something. The toolserver even provides RSS feeds, so that someone could also set up a script to automatically download, decompress, and run diff (or whatever).

    8. Re:Ho-Hum ... by Anonymous Coward · · Score: 0
      Webaroo looks cool, though sadly it is MS-only.

      1) What are the system requirements for running Webaroo on desktop/laptop?
      To use Webaroo, you must have the following:
      • Windows Vista, Windows XP Service Pack 1+, or Windows 2000 Service Pack 4 (32-bit)
      • 1 GHz+ processor (Pentium, AMD Athlon or equivalent)
      • 512 MB RAM (To download large packs we recommend a 1GB RAM)
      • Web browser - Microsoft Internet Explorer 6.0+ or Mozilla Firefox 1.5+
      • Microsoft .NET 2.0 Framework (Downloaded during installation if required)
      • Broadband Internet connection to download and update content
      This article is about an open source solution that doesn't tie anyone to MS only (the tools used will work on Mac, Linux and MS Windows), sadly the same cannot be said of Webaroo.
    9. Re:Ho-Hum ... by superpulpsicle · · Score: 1

      It said the entire wikipedia package is Size : 4.37GB. All those articles and images only come out to this size?

    10. Re:Ho-Hum ... by JohnFluxx · · Score: 1

      That's only the text, and only in English.

      Images are another 17GB when I last checked a few years ago. Probably an order of magnitude larger now :)

  6. Take that, Mr Obviously A. Troll! by ampathee · · Score: 5, Funny

    Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!
    Hah! I'm going to start work on (let's see..) a random lolcat generator now, just to piss you off.
    1. Re:Take that, Mr Obviously A. Troll! by Anonymous Coward · · Score: 0

      Those fuckers at lolcats just rip off the work of anonymous and plaster a watermark a la eBaum's on the images they didn't even create. They've been labeled an enemy by anonymous and all those who proudly fly the banner of the one true Longcat.

    2. Re:Take that, Mr Obviously A. Troll! by Anonymous Coward · · Score: 0

      tacgnol is clearly superior.

    3. Re:Take that, Mr Obviously A. Troll! by Anonymous Coward · · Score: 0

      What the hell are Lolcats? It's Caturday, fucker. Get it right.

    4. Re:Take that, Mr Obviously A. Troll! by SoapDish · · Score: 2, Funny

      Make sure to write it in LOL-CODE! (http://lolcode.com/)

    5. Re:Take that, Mr Obviously A. Troll! by MarkRose · · Score: 4, Funny

      You mean something like lolcatgenerator.com? Looks like someone already tackled that important project! lol

      --
      Be relentless!
    6. Re:Take that, Mr Obviously A. Troll! by SpooForBrains · · Score: 1

      This makes me want to tackle that Global Toilet Database project I've been planning since I was 12 (at which point it was going to be a book)

      --
      "The dew has clearly fallen with a particularly sickening thud this morning"
    7. Re:Take that, Mr Obviously A. Troll! by imsabbel · · Score: 1

      If you REALLY want to piss him off, try writing the generator in LOLCODE (http://lolcode.com/)

      --
      HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
  7. 2X by Anonymous Coward · · Score: 0

    "Up to now, installing a local copy of Wikipedia is not for the faint of heart: it requires a LAMP or WAMP installation"

    I'm sorry it requires WHAT in order to display HTML pages?

    "Right now (August, 2007) the file is a 2.9GB download, always available from here. I"

      So an unmassaged file would fit onto a standard DVD. And a fully indexed file wouldn't take much more. Plus your equations could be images.

    1. Re:2X by computerman413 · · Score: 1

      Wikipedia is a little more than a bunch of HTML pages. From what little I understand, there's a massive database which composes Wikipedia, and there's a script which puts everything together into the Wikipedia we know and love.

    2. Re:2X by deftcoder · · Score: 1
      --
      Peace sells, but who's buying?
    3. Re:2X by Brian+Gordon · · Score: 5, Informative

      Ahaha, 2.9GB? That's the text alone. Images will net you more than 200GB more. And yes, you do need a LAMP/WAMP and working mediawiki, but it wouldn't take 'days' it would take a few hours max. Also is this guy aware that wikipedia is available on DVD already?

    4. Re:2X by TubeSteak · · Score: 5, Informative

      Also is this guy aware that wikipedia is available on DVD already?

      Are you aware that the link you pointed to (1) is not the same thing as the link (2) the author pointed to?
      (1) http://schools-wikipedia.org/
      (2) http://download.wikimedia.org/enwiki/latest/

      1 is 4625 articles hand picked for school age children, hence the website name
      2 is a straight dump of wikipedia

      Just imagine my surprise when the schools-wikipedia website didn't have the wiki article on Goatse!
      --
      [Fuck Beta]
      o0t!
    5. Re:2X by Gracenotes · · Score: 1

      Sort of, yes... Wikipedia is little more than a bunch of revisions of wikitext stored in the database, which is either retrieved from a server cache or parsed into HTML by the "script" when it needs to be accessed. Wikitext is a bit more light-weight than HTML, although you may not "know and love" it.

  8. Next up: by Anonymous Coward · · Score: 0

    Building Cory Doctorow a better haircut.

  9. Just settle it the old way by EmbeddedJanitor · · Score: 4, Funny

    Kick sand in their face!

    --
    Engineering is the art of compromise.
    1. Re:Just settle it the old way by rabblerabble · · Score: 3, Funny

      The goggles would work then. Your logic is flawed.

    2. Re:Just settle it the old way by morari · · Score: 1

      I only kick sand in the face of ninety-seven pound weaklings!

      --
      "He who can destroy a thing, controls a thing." --Paul Atreides, Dune
  10. Re:Why? by Short+Circuit · · Score: 1

    I'm on 14.4Kbps dial-up, you insensitive clod.

    And that's no joke. Noisy phone lines suck. It could be worse; I could have been on an OLPC machine in Africa.

  11. Uh.... by VonSkippy · · Score: 1

    Why?

    1. Re:Uh.... by dhwebb · · Score: 5, Interesting

      Programming something new to some people is like playing a video game. I love programming useless things just for the challenge. People who don't understand that have never had a true love for programming.

      --
      Only two things are infinite, the universe and human stupidity, and I'm not sure about the former.
    2. Re:Uh.... by hobbesmaster · · Score: 1

      So you can settle trivial arguments with your friends when away from an internet connection, duh!

      (Or to always have something to read on your laptop while traveling - this is what I would use it for)

    3. Re:Uh.... by Tablizer · · Score: 2, Insightful

      I love programming useless things just for the challenge.

      Have you ever worked on a project called "Clippey", by chance?

    4. Re:Uh.... by stephanruby · · Score: 3, Informative

      Programming something new to some people is like playing a video game.
      Speaking of which, http://www.pyweek.org/ is coming up this first week of September. It's time to dust off that python book (or borrow one from someone) and do whatever you have to do to get some days off that week.
    5. Re:Uh.... by Gazzonyx · · Score: 4, Funny

      I love programming useless things just for the challenge.

      Have you ever worked on a project called "Clippey", by chance?
      No, he said he has a love for programming; not a seething hatred for users. Besides, everyone knows programmers only hate admins. ;) On behalf of the programmers, I'd like to say that this isn't true; we love our admins. Who else makes sure that our connections*&#^$: Connection Reset By Peer
      --

      If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

    6. Re:Uh.... by Gazzonyx · · Score: 2, Funny

      So you can settle trivial arguments with your friends when away from an internet connection, duh!

      (Or to always have something to read on your laptop while traveling - this is what I would use it for)

      I bet you're quite the ladies man, huh?
      Sorry, I couldn't resist!
      --

      If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

    7. Re:Uh.... by Jugalator · · Score: 1

      First, I don't think this tool was useless -- it was a quick way of achieving his goal of off-line Wikipedia browsing. Second, when I program I prefer to make something useful to me, and I don't think it has to do with a lack of passion for programming. It's just that I'd rather see my time come to good use, even if I enjoy the process by itself.

      --
      Beware: In C++, your friends can see your privates!
    8. Re:Uh.... by LittleBigLui · · Score: 1

      No, he said he has a love for programming; not a seething hatred for users.


      As if that was possible.
      --
      Free as in mason.
    9. Re:Uh.... by nschubach · · Score: 1

      Hey baby! Did you know the male three toed sloth stays in the same tree his whole life and the females tend to move around from tree to tree a lot? So, can I buy you a refreshing Ballz drink?

      --
      Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
    10. Re:Uh.... by fotbr · · Score: 2, Interesting

      I was that way, once. Then other hobbies came along, and now I rarely do any programming that's not work related.

      It's funny how time changes you.

    11. Re:Uh.... by Gazzonyx · · Score: 1

      No, he said he has a love for programming; not a seething hatred for users.


      As if that was possible.

      Touche, my good man, touche.
      --

      If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.

  12. Re:Why? by RuBLed · · Score: 1

    Programmers shouldn't be wasting time on these trivial, pointless projects. We need their work in other more important projects!

    Ironically, you're already reading Slashdot. You have just wasted your time.















    You're reading it again, wasting more time ehh??....

    But the point is, if programming an offline wikipedia makes you happy and you don't need the money then you would understand....
  13. I hope by Nikron · · Score: 4, Funny
    That you don't dump the wiki at a bad time.

    George W Bush

    Is a dick head!!!!11

    --
    Disclaimer: Disregard the above post.
    1. Re:I hope by Anonymous Coward · · Score: 5, Funny

      You mean before someone makes it inaccurate again?

      Oh, nevermind, I see the problem:

      George W Bush

      Is a dick head!!!!11

      should be

      George W Bush

      Is a dick head!!!!!!

      Man, those out to mess with the content are getting more and more subtle...

    2. Re:I hope by Ours · · Score: 1

      Looks fine by me.

      --
      "Your superior intellect is no match for our puny weapons" - The Simpsons
  14. But... by Anonymous Coward · · Score: 2, Funny

    What's the point of it if there are no vandals or flame wars to make it interesting?

  15. Hitchhiker's guide here we come! by Brietech · · Score: 5, Funny

    Combine this and one of the new E-ink ebook readers, make it pretty rugged, slap a solar panel on the back and man. . . you have something really close to a genuine hitchhiker's guide to the galaxy. Ah, I love where technology is heading =)

    --
    I'm perfect in every way, except for my humility.
    1. Re:Hitchhiker's guide here we come! by Sneftel · · Score: 4, Funny

      As long as hitchhikers primarily need to know how to evolve a Pikachu into a Raichu, and how Benjamin Disraeli has been referenced in pop culture.

      --
      The opinions stated herein do not necessarily represent those of anybody at all. Deal with it.
    2. Re:Hitchhiker's guide here we come! by dch24 · · Score: 1
      Don't forget to put this on the cover, in large reassuring letters:

      DON'T PANIC
    3. Re:Hitchhiker's guide here we come! by Anonymous Coward · · Score: 1, Interesting

      how to evolve a Pikachu into a Raichu

      Well, if that isn't something a hitchhiker needs to know, it at least sounds like something they would need to know!

      Does it also give the probability of the Pikachu turning into a bulldozer?

    4. Re:Hitchhiker's guide here we come! by RandomWhiteMan · · Score: 5, Funny

      You laugh now, but just wait until you're stranded in the middle of Blackheath England, needing a ride from a conservative British History Scholar who has his son with him playing Pokemon Gold. Won't be so smug then, will you. I bet you won't even have your towel on you when this all goes down.

    5. Re:Hitchhiker's guide here we come! by kars · · Score: 1

      Yeah, it brings a whole new meaning to the term information -highway-...

      --
      Take life easy: one bit at a time.
    6. Re:Hitchhiker's guide here we come! by Anonymous Coward · · Score: 0

      You don't even know who I am.

      -Benjamin Disraeli

    7. Re:Hitchhiker's guide here we come! by Gromius · · Score: 5, Funny

      Yes, it's a perfect fit. Particularly as Wikipedia has now supplanted the Encyclopedia Britannica in many places as the standard repository of all knowledge and wisdom. Although it has many omissions, contains much that is apocryphal, or at least wildly inaccurate, it scores over the older more pedestrian work in two important ways.

              * 1. It is slightly cheaper
              * 2. It has the words "You can copy and edit me for free" inscribed in large friendly letters in the license.

      Also like the guide, although it cannot hope to be useful or informative on all matters, it does make the reassuring claim that where it is inaccurate, it is at least definitively inaccurate :)

    8. Re:Hitchhiker's guide here we come! by cowens · · Score: 4, Insightful

      Ah, but that is what the original HHGTTG was as well. Tons of info on alcohol and Eccentrica Gallumbits (the triple breasted whore of Eroticon Six), but the entry for Earth was: Harmless. Later it was expanded: Mostly Harmless.

    9. Re:Hitchhiker's guide here we come! by nstlgc · · Score: 5, Funny

      Just so we're clear, you can make Pikachu evolve into Raichu by using the Thunder Stone (which makes sense, since they're Electric Pokémon). However, due to the emotional value Pikachu has to trainers, most of them choose not to evolve him. Some Pokémon games even plain don't allow this. I hope this was helpful.

      --
      I'm Rocco. I'm the +5 Funny man.
    10. Re:Hitchhiker's guide here we come! by bacon_sarnie · · Score: 1

      I thought those large letters were friendly, rather than reassuring.

    11. Re:Hitchhiker's guide here we come! by MichaelSmith · · Score: 1

      You deserved a funny mod. Thanks for that. I feel happy and sad at the same time.

    12. Re:Hitchhiker's guide here we come! by creepynut · · Score: 1

      Re:Hitchhiker's guide here we come! (Score:3, Informative)
      This is why I love Slashdot. Even Pokemon trainers will take time out of their days to moderate Slashdot, proving the system works.
    13. Re:Hitchhiker's guide here we come! by ivlianvs · · Score: 1

      Share and enjoy!

    14. Re:Hitchhiker's guide here we come! by Aardpig · · Score: 1

      If you're stranded in the middle of Blackheath, I suggest a five minute walk over to the Hare and Billet, for a pint and some peanuts, would be best.

      --
      Tubal-Cain smokes the white owl.
    15. Re:Hitchhiker's guide here we come! by Frozen+Void · · Score: 1

      I have a nagging suspicion that people like you are the reason Wikipedia finally merged all the Pokémon into batch articles.

    16. Re:Hitchhiker's guide here we come! by msmiffy · · Score: 1

      Thank you for that - can't recall the last time that I read something that actually made me laugh out loud. Just as well I wasn't drinking my cup of something not quite entirely unlike a cup of tea, otherwise it would be coming out of my nose.

  16. Only 2 days huh by Anonymous Coward · · Score: 2, Funny

    I was able to build this in two days, most of which were spent searching for the appropriate tools. Simply unbelievable... toying around with these tools and writing less than 200 lines of code, and... presto!
    Give that man a job at Google.
    1. Re:Only 2 days huh by Anonymous Coward · · Score: 1, Funny

      Don't you mean ChaCha?

    2. Re:Only 2 days huh by Anonymous Coward · · Score: 0

      NO NO .. Give him a job at Microsoft

    3. Re:Only 2 days huh by dmdavis · · Score: 2, Funny

      Sorry, but he never states that his product is in beta.

  17. Days? Please clarify by Tablizer · · Score: 1

    compared to loading the 'dump' into MySQL -- which, if you want to enable keyword searching, takes days."

    Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are duplicating a bunch of local clones of wiki, then simply copy down the raw MySql table data files rather than reload from delimited files etc. (One needs to make sure their version of MySql is compatible with the table file format.)

    1. Re:Days? Please clarify by Speeple · · Score: 1

      Importing all that textual data into MySQL and building the full-text indexes is what I assume he means.

    2. Re:Days? Please clarify by pla · · Score: 2, Informative

      Do you mean searching takes days, or loading? Searching should be quick if you index the words. If you are duplicating a bunch of local clones of wiki, then simply copy down the raw MySql table data files rather than reload from delimited files etc. (One needs to make sure their version of MySql is compatible with the table file format.)

      I suspect the former, plus creating the index, plus the not inconsiderable overhead of running an SQL server.

      DBs have their place. For a "real" Wiki, or more generally any data collection scenario where you can have a designated server, using a SQL store makes perfect sense.

      In most situations, however, the overhead of running a real database on the end-user's machine makes no sense (for the record, I consider this one of the biggest non-bug flaws with Vista, though I realize you can technically turn it off - With the resulting loss of functionality). The exact project mentioned in the FP forms the perfect example of this - He doesn't want to run a Wiki, he just wants to take a dump of it, do text-based searching, and extract pages in some reasonable form. Why would he want to even consider importing a nice single XML file into a memory-hungry form, from which he would still need a set of frontend tools to extract the desired data and convert it to a convenient viewable form?

    3. Re:Days? Please clarify by Tablizer · · Score: 1

      Compared to bloated GUI's and fat device drivers, most database engine overhead is relatively minor in comparison these days. I suspect the author just didn't want to bother to tune MySql. There's also SqLite that may lighten the burden.

    4. Re:Days? Please clarify by pla · · Score: 1

      Compared to bloated GUI's and fat device drivers, most database engine overhead is relatively minor in comparison

      I certainly wouldn't go that far. In the memory it takes to tolerably run a Wiki starting with a real dump, you could easily run three or four entire virtual systems. A basic XP or RedHat/Gnome system runs decently in 256MB. Import a 2.5GB BZipped Wiki with MySQL limited to 256MB and tell me how responsive it feels.



      I suspect the author just didn't want to bother to tune MySql.

      Nor should he need to! He doesn't want to set up his own full-fledged Wiki, he just wants to search and display what amounts to a text file; And a text file already neatly broken into tidy organizational units thanks to XML.

      I don't argue that a good database has its uses. But instead of accusing that he "didn't want to bother", can you think of a single reason to bother, short of bringing the dump up as a writeable live Wiki?



      There's also SqLite that may lighten the burden.

      On this point, I agree with you - And it probably would have made the FP author's task easier. But even that reasonable compromise would count as massive overkill for what the FP wanted to do.

      Just because you can use a CNC to carve your name into a block of wood doesn't mean a jackknife or even a rusty nail won't do the same (depending on the precision you need). And taking this analogy even further, a jackknife will work in the middle of a post-apocalyptic wilderness, while the CNC requires most of the resources of modern civilization just to sit there and hum quietly at you.

    5. Re:Days? Please clarify by Tablizer · · Score: 1

      He doesn't want to set up his own full-fledged Wiki, he just wants to search and display what amounts to a text file; And a text file already neatly broken into tidy organizational units thanks to XML.

      If it's not to be large and you want KISS, why not just use the file system then?

  18. Faster than a speeding slug... by kcbrown · · Score: 1

    "Orders of magnitude faster to install (a matter of hours) compared to loading the 'dump' into MySQL -- which, if you want to enable keyword searching, takes days."

    But....but....I thought MySQL was fast!

    :-)

    --
    Use 'slashdot stuff' in the subject line in any email you send me if you want to get past the spam filter.
    1. Re:Faster than a speeding slug... by larry+bagina · · Score: 1

      remember when slashdot's comment parent id index overflowed (24 bits ought to be enough for anybody!) and slashdot was broken for 36 hours or so while it reindexed?

      As an aside, postgresql would be slower to do the initial data load but the table is accessible all the while. It's even accessible while indexing is taking place.

      --
      Do you even lift?

      These aren't the 'roids you're looking for.

    2. Re:Faster than a speeding slug... by Blackknight · · Score: 1

      Once data is imported MySQL is usually fast. Of course SQLite would be even faster, and no database server is needed, it's just a regular file. For a project like this SQLite would be a perfect fit, I actually wrote my entire site using it.

  19. Re:Why? by rabblerabble · · Score: 0

    At least you'd have a line of young bucks waiting to crank the OLPC for a few minutes... -->interpret that however you'd like, it works.

  20. Re:Just hope you don't get an effed image. by Anonymous Coward · · Score: 2, Insightful

    And that doesn't happen offline? Only naive people like you need to be worried about reading Wikipedia.

    There are bastards of every academic, social, and financial background.

  21. Give it a thunderstone, and Family Guy... by Cyno01 · · Score: 1

    It's really sad that I know both of those.

    --
    "Sic Semper Tyrannosaurus Rex."
  22. Re:Just hope you don't get an effed image. by Tacvek · · Score: 4, Insightful

    My very serious question to you is how much better do you think things are at a "real" encyclopedia. They have many of the same problems, but they are just not public. "Real" encyclopedias can be just as inaccurate as Wikipedia on many articles. For a quick first reference, Wikipedia is an ideal tool. Just be sure to take things with a grain of salt if you are not checking the sources for further information. Guess what though, the same applies to "real" encyclopedias too. One difference is that with "real" encyclopedias, you always lack revision information, and you often lack information about the sources used by the editors. (Some encyclopedias are better than others in that respect.)

    --
    Stylish sheet to fix many problems in Slashdot's D3: https://gist.github.com/801524
  23. local resource, better interface by Zork+the+Almighty · · Score: 1

    Because it might be useful to have something stored locally. I travel a lot with my laptop and I would like this. I would also appreciate the convenience of not having to fire up a web browser for Wikipedia. You can search articles from the command line. You could also potentially write a better search feature, i.e., bolt on some code to combine and summarize multiple related articles. The approach the guy used (a bunch of small bz2 files) is interesting and potentially useful. I'd say this was one of the better articles to hit Slashdot lately, and I'm glad I read it.

    --

    In Soviet America the banks rob you!
  24. Save some time by Anonymous Coward · · Score: 0

    Uhh, why not just save yourself a few hours and download the HTML pages from here http://static.wikipedia.org/downloads/April_2007/en/ and search with Google Desktop indexer http://desktop.google.com/?

  25. Good part of the page: the explanation by phliar · · Score: 4, Insightful

    For a change it's not just a link to a .tar.gz somewhere, but an actual article where he goes through what he did, and (more important) why he did things that way. Good reading even if you don't want an off-line Wikipedia.

    --
    Unlimited growth == Cancer.
    1. Re:Good part of the page: the explanation by cpu88 · · Score: 1

      yeah, you are right. Checking programme information of the mini-tv in front of him on the flight.

  26. Re:Just hope you don't get an effed image. by Bombula · · Score: 1

    It might defend on the topic/field in question. The articles you reference seem to be focused on tech stuff. I use wikipedia primarily for socioeconomic reference material, and find it in general to be pretty solid. There are places where the depth is limited, but it's definitely my first-reach resource as long as I have an internet connection - mainly because many of the specific things I'm after might not be in a general encyclopedia like Britannica - intertemporal equilibrium, hedonic regression, Edgeworth's limit theorem, the Bertrand paradox etc, etc.

    --
    A-Bomb
  27. Re:Just hope you don't get an effed image. by Short+Circuit · · Score: 1

    > And there's more, but you get the idea. Collusion to ruin people's lives when they run afoul of admins, corrupt editors doing and getting favors from the head honcho himself, pet pages that end up with incorrect information, speculation, or specious reasoning, and a general air of arrogance and groupthink reinforcing an internal idea that they can do no wrong.

    You missed a few, such as product placement pages and ancient "This page doesn't conform to {{?}} standards" tags. That, and obscure fields get limited attention.

    > Why bother, seriously?

    Because the breadth of material covered in Wikipedia is unparalleled, as is the timeliness of information in many fields of interest. And it's a hell of a lot more compact than a 100 lb encyclopedia set, and cheaper to boot.
  28. Re:Just hope you don't get an effed image. by Bombula · · Score: 1

    Yikes, defend = depend

    --
    A-Bomb
  29. Re:Just hope you don't get an effed image. by Anonymous Coward · · Score: 0

    "Collusion to ruin people's lives when they run afoul of admins, corrupt editors doing and getting favors from the head honcho himself, pet pages that end up with incorrect information, speculation, or specious reasoning, and a general air of arrogance and groupthink reinforcing an internal idea that they can do no wrong."

    Plus, I hear that's all tripled in the last six months!

  30. Re:Just hope you don't get an effed image. by poopdeville · · Score: 1

    Worthington's Law is the only economics anybody needs to know.

    --
    After all, I am strangely colored.
  31. Re:Just hope you don't get an effed image. by gad_zuki! · · Score: 3, Funny

    Yes, the paper encyclopedias are missing all the anime trivia. Christ, it's embarrassing to see "references in pop culture" sections which just spell out every geeky guy stereotype. I don't know why those people don't get banned. Everything in existence has an anime reference. That is unsettling.

  32. Re:Why? by thePsychologist · · Score: 5, Insightful

    Realize that some of the greatest things done by humankind were from doing "pointless projects" as you call them. Prime numbers for instance were studied by mathematicians just for fun, and now look, they're used for cryptography. Try doing your banking without them.

    Complex numbers originated from something "useless" like trying to solve the quartic polynomial in radicals...try building a bridge without them. In fact all of science is built upon people going in random tangents doing things they enjoy, discovering seemingly "useless facts" but most of it becomes useful *and* gives us an idea of the universe in which we live.

    Only working on immediate practical problems is very shortsighted, and if mandated throughout the academic community, would mean the death of innovation and most discoveries.

    --
    "What lies behind us, and what lies before us are tiny matters compared to what lies within us." Ralph Waldo Emerson
  33. It doesn't take days by BReflection · · Score: 4, Informative

    It only takes days if you use the PHP import script to import the SQL dump, which was not designed for importing the entire dump.

    Use the ANSI C implementation, which takes about 20 minutes to convert the XML to SQL and then takes a few hours to import into MySQL. Please note that you need a properly configured MySQL server in order to efficiently run a local copy of Wikipedia, which must have at least 8GB of RAM.

    http://meta.wikimedia.org/wiki/Xml2sql

    --
    python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
    1. Re:It doesn't take days by BReflection · · Score: 1

      By the way, he could have saved himself a lot of time if he had just purchased a WikiStick http://www.wikistick.com/

      --
      python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
    2. Re:It doesn't take days by jschrod · · Score: 1
      OK, I bite.

      Your URL leads to a domain parking page. Google search for Wikistick didn't bring results on the first page either. AFAIK, full Wikipedia (text and images) is too large for a USB stick.

      What did you want to tell us?

      --

      Joachim

      People don't write Manifestos any more -- what's going on in this world? [Frank Zappa]

    3. Re:It doesn't take days by gbjbaanb · · Score: 1

      Is it? I read the DB is 2.9 GB, which is all he's taken. You don't *need* the images, but I imagine they'd take a fair amount of space!

    4. Re:It doesn't take days by Ant+P. · · Score: 1

      > Please note that you need a properly configured MySQL server in order to efficiently run a local copy of Wikipedia, which must have at least 8GB of ram.

      In other words, your post is a complete waste of space given that he said in TFS that this is for a laptop.
  34. Linda Mack! by Anonymous Coward · · Score: 1, Funny

    I would be concerned that Slimvirgin and the other intelligence agent(s) might not be able to revert and ban the edits I would be making offline. Maybe Jimbo can give them authority to come rough me up at home and beat my lcd with a hammer.

    http://yro.slashdot.org/article.pl?sid=07/07/27/1943254

  35. I know the feeling by aepervius · · Score: 5, Insightful

    They say to you that their hobby is painting/music/walking/repairing old cars/gardening/building scale models etc... And they seem to think that their hobbies are perfectly acceptable. But as soon as you say you like to program stuff, they don't understand how this could be a hobby. They mostly fail to recognize that every one of us has something in common: the joy of the act of creation. The fact that our hobby entails creating something immaterial and full of "logic" does not matter. It is still a joy.

    --
    C. Sagan : A demon haunted world:
    http://www.amazon.com/gp/product/0345409469/
    visit randi.org
    1. Re:I know the feeling by cpu88 · · Score: 1

      "I like painting" "Umm.. crafting" "Music" "Guitar" "Hiking" "PROGRAMMING" "WHAT?" Most of them don't even know what programming is. And after you explain what the hobby is about, they reply with an odd expression: "Come on, don't stay in your room the whole day."

    2. Re:I know the feeling by nschubach · · Score: 1

      You forgot long walks on the beach, skiing, rock climbing and photography. (Which I find ironic since everyone loves these things and you rarely ever see photos of it happening...)

      --
      Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
    3. Re:I know the feeling by fotbr · · Score: 1

      Heh, I hate walking on the beach, skiing, and rock climbing. Give me sailing, woodworking, and metalsmithing, amongst many others.

      Photography I do enjoy, but I have no delusions that everyone and their brother wants to see my photos, so I don't slap them all over Flickr. On the other hand, my house is decorated entirely with photos I've taken, since *I* like them.

    4. Re:I know the feeling by orgelspieler · · Score: 1
      Here you go! I couldn't find one of somebody photographing while skiing past a rock-climber on a beach, sorry.
  36. Mass inserts into mysql... by Splab · · Score: 3, Informative

    is very very slow when you do it on a normal installation, the reason is MySQL comes with a "be nice to people who don't know what they are doing" setup. Go into the my.cnf and find the buffer settings, crank them up and restart the server. It can really do a lot (especially if you are running InnoDB which you of course are since MyISAM isn't a proper database).

    1. Re:Mass inserts into mysql... by Ant+P. · · Score: 1

      Neither is SQLite, but that doesn't mean nobody should use it ever.

  37. Xapian by paltemalte · · Score: 1

    For those who didn't know, Xapian, the search engine he used for this, is really awesome. It's very fast, stable, actively developed and packs some pretty impressive features. It's written in C++, but has bindings for Perl, Python, PHP, Java, Tcl, C#, and Ruby. If you need an embedded search function on a site, you should check it out.

    I've used it for over 2 years on various sites and am really pleased with it.
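
For anyone who has not used a search library, here is a toy stand-in for what a title-word index does, in plain Python. To be clear, this is NOT Xapian's API: the class name and the scoring rule (count of matching query words) are invented for the sketch, whereas Xapian ranks with a probabilistic weighting scheme and stores its index on disk.

```python
from collections import defaultdict


class TitleIndex:
    """Toy in-memory inverted index over article titles."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of titles containing it

    def add(self, title):
        for word in title.lower().split():
            self.postings[word].add(title)

    def search(self, query, limit=10):
        scores = defaultdict(int)
        for word in query.lower().split():
            for title in self.postings.get(word, ()):
                scores[title] += 1
        # Titles matching the most query words first; ties broken alphabetically
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        return ranked[:limit]
```

A search returns several candidate titles in ranked order, which is the "multiple possible articles, you choose amongst them" behaviour from the summary.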

    --
    Sam has one liberty, which he sacrifices for one security. Can you tell me what Sam has now?
    1. Re:Xapian by mikeboone · · Score: 1

      I've been using Xapian too, and it works great. Mod Parent Up.

  38. What?? by icydog · · Score: 5, Funny

    TFA is:

    1. Not a thinly-veiled attempt to advertise a crappy product
    2. Not bashing Microsoft
    3. Not about somebody who is trolling open-source (i.e. SCO)
    4. Not about Bush taking away all our rights and ending freedom
    5. Not about voting fraud and the end of democracy/America/the world
    6. Not decrying Vista DRM and its ties to the MAFIAA
    7. Posted on Slashdot

    Furthermore, TFA is interesting and informative.

    Am I in heaven?

    1. Re:What?? by mosiadh · · Score: 0

      Am I in heaven?
      No, that's upstairs. Invitation only.
    2. Re:What?? by TooTechy · · Score: 1

      When you drop your toast in the morning I bet it lands jam side up - Every Time!

      Look out for Jim Carrey living next door.

      Wake yourself up before Freddy Krueger arrives.

      Hit yourself on the head with a slice of lemon wrapped around a large gold brick.

      Man you could use this little tool soooo much. I want it installed on my Nokia N770. NOW!

    3. Re:What?? by Pollardito · · Score: 4, Funny

      it'll get posted again tomorrow just to maintain expectations

    4. Re:What?? by Anonymous Coward · · Score: 0

      Yeah, exactly! Slashdot is going old-school again, huh ;-)

  39. Re:Just hope you don't get an effed image. by GPL+Apostate · · Score: 1

    I use Wikipedia, as I use the Internet, for geeky computer stuff and electronics tech, tools, and info.

    I can't imagine ever taking it that seriously that I would use it for mainstream 'non-nerd' stuff.

    --
    Microsoft says legacy (serial/parallel) ports are bad. They don't obfuscate the hardware enough.
  40. Re:Just hope you don't get an effed image. by Jugalator · · Score: 1

    Unfortunately for Wikipedia, the quality or lack of it in competing encyclopedias does not resolve the problems in Wikipedia. I hope Wikipedia can work on these issues because I am seeing some of it too. I'm also seeing article rot being quite common, in that old articles deteriorate, and not really from a lack of good will either. Someone discusses the problem in a blog here: http://nonbovine-ruminations.blogspot.com/2007/02/where-are-stable-versions.html

    --
    Beware: In C++, your friends can see your privates!
  41. C&D Tomorrow? by fishbowl · · Score: 1

    Can't help but assume there will be a cease and desist order in the /. headlines tomorrow.

    --
    -fb Everything not expressly forbidden is now mandatory.
  42. The Point? by photomonkey · · Score: 1

    I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?

    The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

    Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.

    If it's an academic project, that's really cool, but I don't see a practical point to it.

    --
    Message contains 1 attachment: spam.gif
    1. Re:The Point? by Mr.+Roadkill · · Score: 4, Insightful

      I know that not everyone has a permanent connection to the net everywhere they go, but what is the point of storing a local copy of Wikipedia?
      Ummm... I think the whole point is, as you've pointed out, that not everyone has a permanent connection to the net everywhere they go. Or maybe they don't have access to everything they'd like even if they *do* have net access everywhere, or want to pay extravagant data rates while out and about.

      Joe has all-you-can-eat broadband at home, or an understanding employer with a fat pipe, and spends two hours each day on the train. Two and a half gig per month (and let's face it, you probably don't want to update it more frequently than that) and he's got probably half his reading material sorted out.

      Wang lives in Buttfuckistan, a fictional country with totalitarian leanings with too many real-world counterparts. The Great Firewall of Buttfuckistan (i.e. squidguard, under the control of Buttfuckistan Telecom, and settings in the routers to drop non-port-80 traffic half the time) makes it impossible to reliably access Wikipedia from inside their borders, which is a great shame because the entry on Buttfuckistan is particularly unflattering. Once a month, Joe sticks a DVD with five minutes from an old re-run of Friends and an encrypted dump of Wikipedia in an airmail envelope and sends it to Wang.

      Mary is still at secondary school, and her particular school has wifi access for students who are encouraged to purchase their own laptops, but since the local pastor discovered http://en.wikipedia.org/wiki/Image:Dream_of_the_fishermans_wife_hokusai.jpg they've been forced to add wikipedia to the school's blocklist. Which is a pity, because it's a great first-approximation source for material or research directions, but there you go. Mary can make a local copy through her home broadband connection, and can access it locally on her laptop wherever she goes - even at school, or church. Bill, Jillian and Mungo (the pastor's son) find out about this, and now all four of them take it in turns to make the copy each month, sharing the bandwidth costs. Their friends Harry and Sally, who don't have broadband but are great friends of the other four, also get copies... and there are plans to distribute the copies further, as a kind of teenage grass-roots knowledge-sharing and social-justice effort.

      Still can't see the point?
    2. Re:The Point? by Riktov · · Score: 1

      >>
      The beauty of it is that it is online and always up-to-date (wrong, or less wrong).

      Trying to capture it locally seems to me to be like trying to print The Internet. By the time it's done spooling, it's out of date.
      >>

      Sure, what's the point of reading an old version of the history of the Battle of Hastings, or the technical specifications of the P-51 Mustang, or the characteristics of a dominant seventh chord? After six months, it's completely obsolete and worthless, right?

    3. Re:The Point? by epine · · Score: 1

      By the time it's done spooling, it's out of date.

      Well, yes, your copy might not include the plot spoilers for Deathly Hallows or the latest exploit by Lindsay Lohan, which excludes all but the narrowest academic use. Once the cellulosic ethanol process is a bit more mature, we can recover some prime real estate where our public libraries used to reside. The children's books alone will fuel a Hummer H1 for almost a year.
  43. WP:1.0 wants you by Titoxd · · Score: 1

    Dude, WP:1.0 wants YOU.

  44. Why didn't he post his howto on wikipedia? by nullchar · · Score: 1

    Wikipedia seems the best place for the author's "how to download and use offline".

    1. Re:Why didn't he post his howto on wikipedia? by jonadab · · Score: 1

      > Wikipedia seems the best place for the author's "how to download and use offline".

      No Original Research.

      --
      Cut that out, or I will ship you to Norilsk in a box.
    2. Re:Why didn't he post his howto on wikipedia? by Anonymous Coward · · Score: 0

      Bad idea -- giving Wiki instructions on how to download and reproduce itself is how SkyNet gets started.

  45. Re:Just hope you don't get an effed image. by ZzzzSleep · · Score: 2, Funny
    Blatantly stolen from David Morgan-Mar.

    In many of the more relaxed corners of the Outer Eastern Rim of the Internet, Wikipedia has already supplanted the great Encyclopaedia Britannica as the standard repository of all knowledge and wisdom, for though it has many omissions and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important respects.

    First, it is slightly cheaper; and secondly it has the words "anyone can edit" inscribed in large friendly letters on its cover.
  46. What about moulin? by maubp · · Score: 1

    How is this different to moulin which is a fully interactive, offline version of the entire Wikipedia (without pictures) on a CD-ROM:

    http://moulinwiki.org/l/en/

    1. Re:What about moulin? by Dillon2112 · · Score: 1

      Well, for starters, it's in English.

    2. Re:What about moulin? by maubp · · Score: 1
      Now I've read both articles:

      This guy's work required about 3GB for the compressed Wikipedia data dump (split up into compressed chunks using bzip2recover), plus Python, Perl, a little search library (Xapian) and a web framework (Django). He seems to be working in English only, and doesn't seem to provide a "why" or who this might be useful to.

      Moulin has a concrete aim in mind: they are starting with the much smaller French version of Wikipedia, and have built a CD-ROM sized offline viewer for release in West Africa. They've also been working on other languages, including right-to-left support for Farsi and Arabic. It sounds like they plan to have the English language version of Wikipedia as an offline DVD, but the technical details seem a little thin on the ground on their webpage (but there is source code).
      http://moulinwiki.org/l/en/

    3. Re:What about moulin? by LordSnooty · · Score: 1

      Is Moulin anything to do with Kiwix? Cos these guys were also building an offline WP viewer, though only featuring 2000 important articles. Dev seems to have stopped now, a pity as it was a nice package with an excellent page viewer. Ideal for slapping on a laptop and providing something to read when you're away from net.

    4. Re:What about moulin? by YourExperiment · · Score: 1

      It has pictures, and it's not on a CD-ROM.

    5. Re:What about moulin? by Peganthyrus · · Score: 1

      and doesn't seem to provide a "why" or who this might be useful to

      Um, anyone who wants to have the entire English version of Wikipedia on their local machine, for those times when they're away from the net?

      People who "would love to have Wikipedia on their laptop, since this would allow them to instantly check for things they want regardless of their location (business trips, hotels, etc). Others simply don't have an Internet connection - or they don't want to dial up one every time they need to check something." (from TFA)

      Hell, I could have used this just the other day, when I was in a cafe and wanted to look something up, but didn't want to pay $7 into their wireless gateway for the privilege. I think I might try and adapt his process to work on OSX! Point the web browser at my own machine and search.

      --
      egypt urnash minimal art.
  47. ...or the HTML export feature? by georgewilliamherbert · · Score: 1

    There's a one-button (for admins) export-the-whole-wiki-as-html feature in modern MediaWiki software installs...

    But hey, two days and a few hundred lines of code is cool. You geek (verb). If we always took the easy way out we'd be using Windows and have committed suicide long ago.

  48. MyISAM/InnoDB by Anonymous Coward · · Score: 0

    (especially if you are running InnoDB which you of course are since MyISAM isn't a proper database)



    MyISAM is very limited compared to other databases but at least it's a lot faster for some specific (useful) loads. InnoDB is not faster than other databases but is still rather limited. Ergo: use MySQL with MyISAM if your problem is a good fit to its capabilities, use another database (PostgreSQL, Firebird, MSSQL, ...) otherwise.

  49. can we get a PSP version of it? by mu22le · · Score: 3, Interesting

    A PSP is very portable (fits in your sweater/backpack), hackable, and has up to 8 GB of storage. I have been dreaming for a year about porting Wikipedia to it. Unfortunately I'm not familiar with the kind of programming needed and I could never find the time...

    1. Re:can we get a PSP version of it? by tritohc · · Score: 1

      The PSP browser is capable of reading local html files, if that is the type of backup you have of wikipedia. To access the root directory, point your browser to file:/

    2. Re:can we get a PSP version of it? by mu22le · · Score: 1

      There is no such thing as a "local html backup of wikipedia" (the Wikipedia dump is encoded in XML and is bzip2-compressed; it could never fit on your PSP anyway if you did not compress it).

      What you need is a server that uncompresses the pages (encodes them in HTML) and hands them to the browser, but I do not think the PSP has enough juice to run the server and the browser at the same time.

      The idea behind the iPod implementation is to have a single lightweight program that searches the compressed archive and displays the page with some basic markup (italic, bold, links).

      This approach would fit the PSP very well (while the solution used in the article is a quick solution for a workstation). The main problem is that there is no GUI toolkit for the PSP, so writing an interface is quite hard (for me, anyway :)

  50. There's a bug in TFA: Missing articles. by dannycim · · Score: 4, Insightful

    There's a serious problem with the article's way of treating the data that I didn't see addressed.

    The wikipedia database file is one large bzip2'ed XML file which the author splits into blocks of 900k (bzip2's natural blocking) which he then parses for the "title" and "text" XML tags.

    The problem with that approach is that some of these tags may well end up being split over block boundaries, so some articles risk being missed. EG:

    END-OF-BLOCK: blablablabla...blabla[/text][othertag][ti

    START-OF-NEXT-BLOCK: tle][sometag]blablablablabla...

    So searching for "[title]" in both blocks separately like TFA does will fail for one article.

    (I've used square brackets instead of lessthans and greaterthans because slashdot won't let me use them.)
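
To make the concern concrete, here is a small Python sketch of one way a scan can catch a tag that straddles two blocks: carry the last few characters of each chunk over into the next one. This is only an illustration of the general technique, not what TFA's scripts do; the function name and the sample chunks are made up.

```python
def count_tag(chunks, tag):
    """Count occurrences of `tag` in a stream of text chunks,
    including occurrences split across a chunk boundary."""
    keep = len(tag) - 1  # longest piece of `tag` that can dangle at a chunk end
    tail = ""
    total = 0
    for chunk in chunks:
        window = tail + chunk
        total += window.count(tag)
        # The tail is shorter than the tag, so it can never hold a complete
        # occurrence; nothing gets counted twice.
        tail = window[-keep:] if keep else ""
    return total


# The example from the comment above, with real angle brackets
chunks = ["blabla</text><othertag><ti", "tle><sometag>blabla"]
naive = sum(chunk.count("<title>") for chunk in chunks)  # per-block search
careful = count_tag(chunks, "<title>")
```

Here the per-block search finds nothing while the carry-over version finds the one split tag.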

    1. Re:There's a bug in TFA: Missing articles. by dannycim · · Score: 1

      Oh, and I forgot: Articles' text which also end up over block boundaries will appear truncated.

    2. Re:There's a bug in TFA: Missing articles. by ttsiod · · Score: 1

      No, actually there is no bug. If you read the contents of the 'show.pl' script, you'll see that it adapts to a missing '' by reading from the next volume - the next recxxx...bz2.
      As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (as other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space.
      So you see - no bugs :-)

    3. Re:There's a bug in TFA: Missing articles. by dannycim · · Score: 1

      I'm only seeing one call to "ShowTopic" in show.pl, which does the decompression of a block. I thought about the possibility that bzip2 would be unlikely to split words, but there's still the possibility that the title and text tags from one set could be split over two adjacent blocks.

      Anyway, gotta go to work. When I come back, I'll do some more in-depth sleuthing. :)

    4. Re:There's a bug in TFA: Missing articles. by Anonymous Coward · · Score: 0

      > bzip2 (as other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text)

      This is how many compressors work (LZXX based), but not bzip2 which uses a combination of BWT+MTF and Huffman coding.

      So with bzip2 this actually *could* happen
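
A toy Python sketch of the first two stages named above, the Burrows-Wheeler transform followed by move-to-front. It is deliberately naive (it sorts every rotation and uses a NUL sentinel, both choices made just for this illustration), and it leaves out the run-length and Huffman stages of the real bzip2. The point is only that the output is a reordering of the input symbols turned into small integers, with no "reference to an earlier token" anywhere.

```python
def bwt(text, sentinel="\0"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, sentinel="\0"):
    """Invert bwt() by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


def move_to_front(text):
    """Replace each symbol by its position in a recency-ordered alphabet."""
    alphabet = sorted(set(text))
    out = []
    for ch in text:
        i = alphabet.index(ch)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))  # most recent symbol moves to front
    return out
```

Running bwt() on a string with repeated tags groups equal characters together, and move_to_front() then turns those runs into runs of zeros, which is what makes the later entropy coding effective.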

    5. Re:There's a bug in TFA: Missing articles. by ttsiod · · Score: 4, Informative
      (Re-post: for some reason the response I sent some hours ago didn't appear) No, actually there is no bug. If you read the contents of the 'show.pl' script, you'll see that it adapts to a missing '' by reading from the next volume - the next recxxx...bz2.

      As for the title, what you describe can't happen because of a fortunate side-effect: when compressing, bzip2 (as other compressors) looks for previous appearances of a token (in this case, '<title>') and codes a reference to it (instead of the full text) to save space. Since "text" and "title" appear all the time in these blocks (at least once for each article), they will NOT be split - they will be encoded as "references", and therefore, what you describe shouldn't happen (I hope :-)
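
The "read from the next volume" fallback described above can be sketched in Python roughly as follows. This is not a port of show.pl: the function name is hypothetical, the volumes are plain in-memory bz2 byte strings standing in for the rec*.bz2 files, and using `</text>` as the closing marker is an assumption made for the example.

```python
import bz2


def read_article(volumes, index, start_marker, end_marker="</text>"):
    """Return the text from `start_marker` through `end_marker`, starting in
    volumes[index] and continuing into later volumes if the end is missing.

    `volumes` is a list of independently compressed bz2 byte strings.
    """
    start_b = start_marker.encode("utf-8")
    end_b = end_marker.encode("utf-8")

    data = bz2.decompress(volumes[index])
    start = data.find(start_b)
    if start < 0:
        return None

    buf = data[start:]
    nxt = index + 1
    # Keep appending the next volume until the closing marker shows up.
    # Working on bytes also copes with a marker or a multi-byte character
    # that is itself split across two volumes.
    while end_b not in buf and nxt < len(volumes):
        buf += bz2.decompress(volumes[nxt])
        nxt += 1

    end = buf.find(end_b)
    if end >= 0:
        buf = buf[: end + len(end_b)]
    return buf.decode("utf-8", errors="replace")
```

If the closing marker never appears, the sketch returns whatever it has, which corresponds to the truncated-article case raised earlier in the thread.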

    6. Re:There's a bug in TFA: Missing articles. by owlstead · · Score: 1

      Sounds like a border case that needs some testing. Of course, if you are up to it. Missing a single word in 900K does not seem to be much of a problem to me, especially if the 900K is all text. I mean, what are the chances of it actually happening?

  51. SCO is used to improve the image of Novell by Anonymous Coward · · Score: 0

    They need a "bad guy" on the block in order to redirect all the user hatred to them and the other (much worse) companies will pass by quietly. Like Microsoft and Novell.

    Is anybody surprised that we are flooded with SCO stories?

    Also companies are very afraid of stock-boom. With the SCO example they are trying to prove that stock boom can happen only if you are up against other -bigger- companies. While the fact is that stock boom is just around the corner for Microsoft and Novell; people have to simply alert the investors.

  52. awesome. just what I need. why a waste???? by nairbv · · Score: 1

    why are people saying this is useless? it's just what I need, and I was surprised when I went looking and couldn't find it when I first wanted it.

    I cruise around in a sailboat. my longest passage was 35 days. how I would have LOVED to have been able to read wikipedia articles on that passage, even if they were a few weeks old. What do I care if an article is a few weeks old? 35 days at sea I'd have read a paper encyclopedia if I had one, but my boat isn't big enough to carry the weight of a paper encyclopedia. It sails like shit as it is from how many books I have stuffed in my V-berth.

    and sometimes I'm on some random little island with no internet access for periods of time... hanging out with a bunch of other sailors, and of course we get into discussions that leave us wishing we could go google something.

    Even in Bora Bora, they had internet but it was $24/hour, on crappy old computers! this would have been great!

    and now! now I'm in China! They block parts of wikipedia. yeah I can set up an SSH tunnel when I happen to have internet access available, but how great it is to have a local (though somewhat outdated) copy of wikipedia, including any blocked articles!

    sounds great!

  53. Hook it in to your desktop search... or... by argent · · Score: 1
    I didn't know about the wikipedia raw database, or I'd probably have done something like this myself, and hooked it into the UNIX "locate" db, or Spotlight, or maybe...

    $ man w locate
    GNU Locate
    From Wikipedia, the free encyclopedia
      (redirected from Locate)
    ...
    This software-related article is a stub. You can help Wikipedia by expanding it.
  54. SDHC? by tepples · · Score: 4, Informative

    > Why? Have you seen the price of 4G flash cards recently?

    Yes, and it's possible that you may need a new PDA in order to use SD cards larger than 2 GB. The 4 GB ones use a different protocol called SDHC that older PDAs may not support. It's analogous to the old ATA hard disk size barriers, especially the 137 GB (128 GiB) barrier. Or are most PDAs capable of being upgraded to handle SDHC?
    1. Re:SDHC? by SCHecklerX · · Score: 2, Informative

      You can get a normal 4GB SD card from Transcend. I am using one in my sandisk sansa e140, which is definitely NOT SDHC compatible.

  55. wikipedia on java mobile? by chrb · · Score: 1

    Are there any projects putting wp on java enabled phones? It would be pretty cool to settle those arguments any time, any place.

  56. I have one called firefox by DragonTHC · · Score: 1

    I just need an offline wikipedia now.

    --
    They're using their grammar skills there.
  57. In what namespace? by tepples · · Score: 1

    > > Wikipedia seems the best place for the author's "how to download and use offline".
    > No Original Research.

    In the article namespace. Research about Wikipedia appears to be encouraged in the Wikipedia: namespace.
  58. Curiously by Anonymous Coward · · Score: 0

    What would happen if a title element is split between two bzip files? Or the content of an article is split in the same way?

  59. better yet, a DS version by tepples · · Score: 3, Informative

    > A PSP is very portable (fits in your sweater/backpack), hackable

    You have to buy a used PSP to be sure that you can hack it. New ones are likely to come with firmware version 3.51 or later, which is not cracked as of August 2007. The Nintendo DS, on the other hand, had its last major firmware update in September 2005 and is still cracked, with SLOT-1 modchips available at Wal-Mart for $30.

    > and has up to 8Gb of storage

    So does a CompactFlash card in a GBA Movie Player in a Nintendo DS. It's a pity that the SLOT-1 adapters for DS haven't been shown to be compatible with SDHC.
  60. Re:Just hope you don't get an effed image. by AaronLawrence · · Score: 1

    Well I agree, and I think there is a (near?) consensus in WP policy that such things are not useful, unless they are particularly notable, and should be deleted.

    --
    For every expert, there is an equal and opposite expert. - Arthur C. Clarke
  61. Or a PalmOS version of it? by Uninvited+Guest · · Score: 1

    Good thinking. I was just wondering the same thing about my PalmOS PDA. It has plenty of storage available. I wonder if the existing Python port would be sufficiently powerful to run this.

    --
    Sometimes I worry that I'll develop Alzheimer's disease, but no one will notice.
  62. Re:Why? by Anonymous Coward · · Score: 0

    knowing that there is no connection possible within twenty miles

    Or you could get a cellular card for your laptop. Then you'd have to be somewhere pretty damn remote in order to need a local copy of Wikipedia.
  63. Re:Why? by nschubach · · Score: 1

    Tell me we haven't reached a point where friends will argue, know the location of the nearest cellular tower and debate simply because they know you can't Google it or look it up on Wikipedia... What did you do before the Internet? Carry around a complete set of encyclopedias just to prove that you're right all the time?

    I'd find new friends. Seriously. NOTHING is that important.

    --
    Every time I start to have faith in humanity, I ruin it by driving to work between 7 and 8 am.
  64. wikipedia for iPod is already here... by mu22le · · Score: 1

    The thing is, a portable version of Wikipedia has already been developed:
    http://encyclopodia.sourceforge.net/en/index.html

    for the iPod; also, the Encyclopodia ebook format (basically indexed, bzip2-compressed articles or blocks) is far better suited for portable devices.
    Now if any PSP/DS/Palm developer is reading this...

  65. Re:Why? by CodyRazor · · Score: 0

    It's much better to get a Pocket PC phone and have it on that; then no matter where you are you can immediately call bullshit on people trying to sound smart. I love calling people on that bullshit "10% of your brain" thing.

    --
    So Skulldilocks threw acid on the schoolchildrens' faces, cause somebody from the bible told her to do it!
  66. lucene by wwmedia · · Score: 1

    He'd be better off using Lucene for search; it's faster than MySQL.

  67. Not enough framewurkz by Gen.Anti · · Score: 1

    The core idea for accessing bzip-compressed data is interesting.

    The later execution, however... Perl, Python, PHP, Xapian, Django all together as the runtime, and add to this C code for the preparation (I might be wrong about the details, just skimmed), for such a small application.

    The other poster who rejoices about how wonderfully old-skool TFA is, is obviously right. This kind of duct-tape Linux development feels badly sooo 90's and smells like maintenance, performance, installation and portability problems.

    Keeping it all in just one of the scripting languages would make it much more serious. (Perl, or maybe Bash for the easiest installation?)

    1. Re:Not enough framewurkz by toganet · · Score: 1

      You must be a Windows developer. What you call "duct-tape Linux development" is what the Unix/Linux development philosophy is all about -- individual tools optimized to do a single job, that can be strung together to accomplish arbitrary tasks.

    2. Re:Not enough framewurkz by Gen.Anti · · Score: 1

      PHP and Python and Perl (Django less so ;-)) are actually optimized to do many jobs! Note I suggested Bash myself.

  68. Re:Why? by jshowlett · · Score: 1

    This project *is* of immediate practical interest to its creator.

  69. sdict? by jimmyfergus · · Score: 1

    Anyone with experience of sdict?

    They offer a dictionary reader for various systems, including portable devices, and dictionaries including Wikipedia.

    Unfortunately their Wikipedia dict is old (January), but it seems like a good approach for laptops or other small devices. When I get an 8 GB SDHC card I'm going to try it on my Nokia N800.

  70. Time to fork by jbeaupre · · Score: 1

    In a time honored tradition, he could start a flame war over a trivial aspect of the existing lolcatgenerator, then decide to fork the code. If you're going to do something useless, might as well piss as many people off as possible before doing something redundant.

    --
    The world is made by those who show up for the job.
  71. Re:Why? by jbeaupre · · Score: 0

    Dang it! I was all set to throw a plank across the creek behind my house. Now I've got to get out my math books or die in the effort. Curse you!

    --
    The world is made by those who show up for the job.
  72. Re:Why? by macro187 · · Score: 1

    And LaTeX sucks for this. It is hard to use and nonstandard to PDF.

    Correct me if I'm wrong (perhaps using this new Wikipedia tool?) but isn't LaTeX still quite a bit more capable than PDF in terms of typesetting?
  73. Not New by Baavgai · · Score: 1

    While interesting, it's certainly reinventing the wheel. There are lots of methods for doing this found on the site itself ( http://en.wikipedia.org/wiki/Wikipedia:Database_download ) including static content already marked up.

    Also, as others have noted, his chopping the file into chunks means you're going to lose at least one article per chunk.

    I'd implement this with a compressed file system and maybe some symlinks. Happily, the static content is already there for the taking. Some find and grep on a file system should be enough to do a title search with little overhead. The web server (for searches from a browser) need be little more than a daemon; in Perl, Python, etc., you could do it in less than fifty lines.
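
    A rough sketch of the find-and-grep idea, assuming the static dump is a directory tree with one file per article and the title encoded in the file name (that layout, and the name `search_titles`, are assumptions for illustration):

```python
import os

def search_titles(root, words):
    """Yield paths under `root` whose file name contains every word
    in `words`, case-insensitively (roughly `find | grep -i`)."""
    wanted = [w.lower() for w in words]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lowered = name.lower()
            if all(w in lowered for w in wanted):
                yield os.path.join(dirpath, name)
```

    Calling `search_titles(dump_dir, ["linear", "algebra"])` walks the whole tree on every search, so it trades the index-building time for slower lookups. Whether that is acceptable depends on how many files the dump holds.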

  74. Gears? by pragma_x · · Score: 1

    I didn't see any posts on this so I thought I'd bring it up. I think the author took the long way around.

    The author did some nifty hacking that resulted in the following stack of dependencies:

            * Perl 5.8.5
            * Python 2.5
            * PHP 5.2.1
            * Xapian 1.0.2
            * Django 0.9.6

    He cited not wanting to use an RDBMS since he's not writing to the database, just reading. I can give him that, but it seems like it caused more trouble than it's worth.

    This leaves me wondering: why not just use Google Gears and be done with it? Sure, the hacking part would shift largely to the JavaScript side of things (it would be mostly wiki conversion), but you'd have the other bits (web server and storage) already worked out. All you'd have to do is slap together some little app to insert the XML data into the database.

    1. Re:Gears? by Gen.Anti · · Score: 1

      Man, I've posted about this just above. I've even made an effort of reformatting the pasted list. And it's you who got a point! Should have mentioned Google. *cries*

    2. Re:Gears? by pragma_x · · Score: 1

      Thanks for the props, but don't be so hard on yourself. Getting modded up around here is a crap shoot at best.

  75. Now get it on a mobile phone by Cato · · Score: 1

    I live in an area with fairly bad mobile signal - I'm always trying to look things up on Wikipedia but finding I can't. Fortunately my Treo 680 smartphone can take 8GB SDHC cards (http://en.wikipedia.org/wiki/Treo_680), so I could fit this on with room to spare for MP3s and photos, and future growth of Wikipedia. Very tempting, though I'd need to port it to something like Lua and GCC - obviously the porting would be fairly trivial by the time Palm releases its Linux-based Treos in early 2008...

  76. They have the internet for computers? by Anonymous Coward · · Score: 0

    I think the most interesting thing is how this guy admits MySQL sucks donkey balls, but still gets promoted to the front page.

  77. Re:Why? by 644bd346996 · · Score: 1

    Actually, the only heated intellectual debates I regularly get into are on camping trips, where cellular reception is very rare. Worse yet, my friends often expect me to have encyclopedic knowledge with which to settle such debates.

    Conversations around a campfire can go anywhere.

  78. Re:Why? by Tweekster · · Score: 1

    Because I can

    which is one of the greatest motivations for human advancements.

    --
    The phrase "more better" is acceptable English. suck it grammar Nazis
  79. OLPC by thekaran · · Score: 1

    Preload this in OLPC

  80. New PSPs work too! by Anonymous Coward · · Score: 0

    You have to buy a used PSP to be sure that you can hack it.

    Not anymore! See the Lumines User Mode Exploit discovered in June. We got my son the game a month ago and he's gone crazy downloading home-brew games and emulators onto his PSP.

    1. Re:New PSPs work too! by tepples · · Score: 1

      You have to buy a used PSP to be sure that you can hack it.

      Not anymore! See the Lumines User Mode Exploit discovered in June.

      That works for firmware up to 3.50. But new PSPs in retail stores come with 3.51.
    2. Re:New PSPs work too! by mu22le · · Score: 1

      You can still find PSPs with fw 2.x in the stores (a friend of mine bought a 2.80 a while ago).

      I understand you like your DS a lot and I hope someone writes a wikipedia browser for it but, please, be real! You are not going to find many PSPs with fw 3.51 on the US market for the next few months. It's not like they update them all the same night a new fw comes out :)

  81. Re:I hope.... Actually by LinuxEagle · · Score: 1

    The database dumps are static, and until more recent dumps are available, they will stay the same. So if George is ok now, he will be ok 12 hours from now.

  82. TomeRaider does this for Pocket PCs by Mex · · Score: 1

    Yeah, TomeRaider costs money, which already marks it as a loss against the "thrifty" slashdot crowd ;) , but I think it's worth it.

    http://www.tomeraider.com/

    They provide Wikipedia versions that you can use with their e-Reader. I bought one because I can use it on my Pocket PC, and it's just awesome having the Wikipedia available instantly, anytime. It's the fucking Hitchhiker's Guide version 0.1, gyat damn. =)

  83. Missing moderation by Spaceman40 · · Score: 1

    Where's the moderation for +1, Awesome?

    --
    I [may] disapprove of what you say, but I will defend to the death your right to say it.
  84. already exists by Anonymous Coward · · Score: 0

    http://www.webaroo.com/ already has a tool for this. They have packages on numerous topics, including a Wikipedia package

  85. Return policy? by tepples · · Score: 1

    You are not going to find many PSPs with fw 3.51 on the US market for the next few months. It's not like they update them all the same night a new fw comes out :)

    Which U.S. retail chains that sell PSPs have a good return policy in case I do get unlucky when I buy one?
    1. Re:Return policy? by mu22le · · Score: 1

      I really don't know... here in Europe you can (in principle) return anything within a week without any explanation.

  86. Why didn't he use 'like' ? by martinicus · · Score: 1

    From TFA:

    "The result of the import process was also not exactly what I wanted: I could search for an article, if I knew it's exact name; but I couldn't use parts of the name to search; it was all or nothing. To allow these "free-style" searches to work, one must create the search index - which I'm told, takes days to build. DAYS!"

    This seems kind of stupid to me... could he not have done an SQL query using the moral equivalent of "select articletext where title like '%thingiaminterestedin%'", meaning you don't need to know the exact title?
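
    For illustration, here is roughly what such a query looks like, sketched with SQLite from Python rather than the MySQL setup in TFA. The table and column names (`articles`, `title`, `body`) and the helper `search` are made up for the example. Note that a pattern with a leading `%` generally cannot use an ordinary index on the title column, so each search scans every title.

```python
import sqlite3

# In-memory stand-in for the article table; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Linear algebra", "..."),
        ("Algebraic geometry", "..."),
        ("Xapian", "..."),
    ],
)

def search(conn, fragment):
    """Return titles containing `fragment` anywhere (substring match)."""
    pattern = f"%{fragment}%"
    rows = conn.execute(
        "SELECT title FROM articles WHERE title LIKE ? ORDER BY title",
        (pattern,),
    )
    return [title for (title,) in rows]

print(search(conn, "algebra"))
```

    This gets partial-title matching without building a full-text index, but it gives no ranking of the hits, which is part of what a search engine like Xapian adds.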

  87. so, xapian is great by thc4k · · Score: 1

    Uhm, so he used an offline copy of Wikipedia with a search engine to search through it? Wow, that's like... the very reason Xapian exists. All the highlights except LaTeX come from Xapian. This is like 1. Download Wikipedia 2. Install Xapian 3. Make Xapian index Wikipedia 4. ???? 5. Slashdot!

  88. Re:awesome. just what I need. why a waste???? by Ken_g6 · · Score: 1

    This Chinese police. You under arrest for around going Great Firewall. Please to turn self in at nearest station.

    Oops! Forgot to post anonymously, dangit!

    --
    (T>t && O(n)--) == sqrt(666)