Extracting HTML and Images from MHT and CHM?
smoon asks: "I've
got a boatload of .CHM files, and have recently run across some .MHT files.
The .MHT format appears to be how MS Internet Explorer saves a web site
for offline viewing, and appears to be a basic MIME-format with all images
inlined as base64 encoded data. I've poked around a bit looking for a utility
that will extract the HTML files and associated graphics so that I can view
these in Linux, but no luck so far. The .CHM format is billed as a 'compiled'
HTML, and boils down to the equivalent of a tarball, in some Microsoft
proprietary format, that shows a series of web pages. MS has a .CHM
developers kit that allows you to extract all of the data, but the links stop
working and it ends up not being very useful. Anyone know of a
code that can extract the HTML and associated images from .CHM or
.MHT files?"
A .CHM is a compilation of HTML files with support for a tree style view of the documents in it, as well as binary files (examples), images, browse order (associating a "forward" button with the page that represents the page after the current page), searching, etc...
It's a pretty handy way of distributing online documentation, kinda like PDF but for HTML.
Being HTML you can still dynamically resize the window and have the text reflow - In my opinion that's it's big advantage over PDF. A PDF is basically a rendering of a page - not really what you want for an online help system.
It probably ends up just being a bunch of standard filenames inside a .CAB file (the .CAB format is what Microsoft puts a lot of their install packages and other archives into).
As for the format it's in, here's what I found on Google.
- Steve
First off, MHT is just a mime message, run mpack on it.
However, for CHM extraction, you can use this portable CHM extractor. I don't think Matthew has officially released it, but it should be OK to use. Get in touch with him if you want.
Does my bum look big in this?