Extracting HTML and Images from MHT and CHM?

Posted by Cliff on Sunday January 27, 2002 @12:51AM from the it's-all-about-the-(proprietary)-file-formats dept.

smoon asks: "I've got a boatload of .CHM files, and have recently run across some .MHT files. The .MHT format appears to be how MS Internet Explorer saves a web site for offline viewing, and appears to be a basic MIME format with all images inlined as base64-encoded data. I've poked around a bit looking for a utility that will extract the HTML files and associated graphics so that I can view these in Linux, but no luck so far. The .CHM format is billed as 'compiled' HTML, and boils down to the equivalent of a tarball, in some Microsoft proprietary format, that shows a series of web pages. MS has a .CHM developers kit that allows you to extract all of the data, but the links stop working and it ends up not being very useful. Does anyone know of code that can extract the HTML and associated images from .CHM or .MHT files?"

Online Documentation by SteveX · 2002-01-27 01:56 · Score: 3, Informative

They're not for IIS; The .chm format is Microsoft's way of distributing help and online documentation. It's a replacement for the old .HLP file that uses HTML instead of RTF.
A .CHM is a compilation of HTML files with support for a tree-style view of the documents in it, as well as binary files (examples), images, browse order (associating a "forward" button with the page that comes after the current page), searching, etc.
It's a pretty handy way of distributing online documentation, kinda like PDF but for HTML.
Being HTML, you can still dynamically resize the window and have the text reflow. In my opinion that's its big advantage over PDF. A PDF is basically a rendering of a page, which is not really what you want for an online help system.
It probably ends up just being a bunch of standard filenames inside a .CAB file (the .CAB format is what Microsoft puts a lot of their install packages and other archives into).
As for the format it's in, here's what I found on Google.
- Steve

CHM extractor. by kyz · 2002-01-27 05:10 · Score: 4, Informative

First off, MHT is just a MIME message; run munpack (the decoder that ships with mpack) on it.
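If a MIME unpacker isn't handy, the same idea can be sketched with Python's standard email package, since an MHT file parses as an ordinary multipart message. This is only a rough sketch: the function name extract_mht and the output naming scheme (part index plus the last component of each part's Content-Location) are my own assumptions, not anything the format requires.

```python
import email
import os
from email import policy


def extract_mht(mht_bytes, out_dir):
    """Write every non-multipart MIME part of an MHT archive into out_dir.

    Returns a list of (saved_path, content_type, content_location) tuples.
    """
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes the base64 / quoted-printable transfer encoding
        data = part.get_payload(decode=True) or b""
        location = str(part.get("Content-Location", ""))
        # Assumed naming scheme: keep only the last component of the original
        # URL and prefix the part index so equal names cannot collide.
        name = os.path.basename(location.split("?")[0].rstrip("/")) or "part"
        name = "".join(c if c.isalnum() or c in "._-" else "_" for c in name)
        path = os.path.join(out_dir, f"{index:03d}_{name}")
        with open(path, "wb") as fh:
            fh.write(data)
        saved.append((path, part.get_content_type(), location))
    return saved
```

Calling extract_mht(open("page.mht", "rb").read(), "out") should leave the HTML and each image as separate files in out/. The HTML will most likely still point at the original URLs recorded in Content-Location, so the image links need rewriting to the saved file names before the page displays properly offline.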

However, for CHM extraction, you can use this portable CHM extractor. I don't think Matthew has officially released it, but it should be OK to use. Get in touch with him if you want.

--
Does my bum look big in this?