Slashdot Mirror


Extracting HTML and Images from MHT and CHM?

smoon asks: "I've got a boatload of .CHM files, and have recently run across some .MHT files. The .MHT format appears to be how MS Internet Explorer saves a web site for offline viewing, and appears to be a basic MIME format with all images inlined as base64-encoded data. I've poked around a bit looking for a utility that will extract the HTML files and associated graphics so that I can view these in Linux, but no luck so far. The .CHM format is billed as 'compiled' HTML, and boils down to the equivalent of a tarball, in some Microsoft proprietary format, that shows a series of web pages. MS has a .CHM developers kit that allows you to extract all of the data, but the links stop working and it ends up not being very useful. Does anyone know of code that can extract the HTML and associated images from .CHM or .MHT files?"

3 of 14 comments

  1. Online Documentation by SteveX · · Score: 3, Informative
    They're not for IIS; the .chm format is Microsoft's way of distributing help and online documentation. It's a replacement for the old .HLP file that uses HTML instead of RTF.

    A .CHM is a compilation of HTML files with support for a tree-style view of the documents in it, as well as binary files (examples), images, browse order (associating a "forward" button with the page that comes after the current page), searching, etc.

    It's a pretty handy way of distributing online documentation, kinda like PDF but for HTML.

    Being HTML, you can still dynamically resize the window and have the text reflow. In my opinion that's its big advantage over PDF. A PDF is basically a rendering of a page, which is not really what you want for an online help system.

    It probably ends up just being a bunch of standard filenames inside a .CAB file (the .CAB format is what Microsoft puts a lot of their install packages and other archives into).

    As for the format it's in, here's what I found on Google.

    - Steve

  2. CHM extractor. by kyz · · Score: 4, Informative

    First off, MHT is just a MIME message; run munpack (the unpacking tool from the mpack package) on it.
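
    If mpack is not at hand, the same idea can be sketched with Python's standard email module, since an .MHT file is a MIME multipart message whose parts carry a Content-Location header. This is only a minimal sketch: the function name extract_mht and the numbered file-naming scheme are assumptions made up for illustration, not part of any existing tool.

```python
import email
import os
import re


def extract_mht(mht_path, out_dir):
    """Write every leaf MIME part of an .mht file into out_dir.

    Hypothetical helper for illustration; returns the list of files written.
    """
    with open(mht_path, "rb") as fh:
        msg = email.message_from_binary_file(fh)

    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # container parts have no payload of their own

        # Use the last path segment of Content-Location as the base name.
        location = str(part.get("Content-Location", ""))
        location = location.split("?")[0].split("#")[0].replace("\\", "/")
        name = location.rstrip("/").rsplit("/", 1)[-1] or "part"

        # Keep only safe characters and prefix the part index so that
        # two parts with the same base name do not overwrite each other.
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", name)
        target = os.path.join(out_dir, "%03d_%s" % (index, safe))

        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        with open(target, "wb") as out:
            out.write(payload)
        written.append(target)
    return written
```

    Calling extract_mht("saved.mht", "saved_files") would drop the HTML and images into saved_files. The links inside the HTML still point at the original URLs, so they would need rewriting by hand or by a further script before the pages display correctly offline.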

    However, for CHM extraction, you can use this portable CHM extractor. I don't think Matthew has officially released it, but it should be OK to use. Get in touch with him if you want.

    --
    Does my bum look big in this?
  3. HTML Help Workshop by vukv · · Score: 2, Informative

    HTML Help Workshop from MS can perfectly extract all of the files in a .chm, in their original state. Open it up, click on decompile and voila. I don't see how you managed to run into problems; I have done so with hundreds of .chm files. Sure, it doesn't always save all of the .chm-related data, but the files are always preserved in their original state.