Extracting HTML and Images from MHT and CHM?
smoon asks: "I've
got a boatload of .CHM files, and have recently run across some .MHT files.
The .MHT format appears to be how MS Internet Explorer saves a web site
for offline viewing, and appears to be a basic MIME format with all images
inlined as base64-encoded data. I've poked around a bit looking for a utility
that will extract the HTML files and associated graphics so that I can view
these in Linux, but no luck so far. The .CHM format is billed as a 'compiled'
HTML, and boils down to the equivalent of a tarball, in some Microsoft
proprietary format, that shows a series of web pages. MS has a .CHM
developers kit that allows you to extract all of the data, but the links stop
working and it ends up not being very useful. Anyone know of
code that can extract the HTML and associated images from .CHM or
.MHT files?"
I've never tried it, but MIME::Tools should deal with .mht files, since they are (at first glance at least) valid MIME documents. Some assembly required.
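For anyone who would rather not touch Perl, the same idea can be sketched with Python's standard email package. This is a rough, untested-on-real-IE-output sketch that assumes the .mht really is a well-formed MIME multipart message; the function name extract_mht and the partNNN file naming are my own inventions, not part of any tool.

```python
import email
import mimetypes
import os
from email import policy


def extract_mht(mht_bytes, out_dir):
    """Write every non-multipart MIME part of an .mht archive into out_dir.

    Returns a dict mapping each part's Content-Location (or a generated
    placeholder when the header is missing) to the local file it was saved as.
    """
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    os.makedirs(out_dir, exist_ok=True)
    saved = {}
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Guess a file extension from the declared MIME type (assumption:
        # the type is declared correctly; otherwise fall back to .bin)
        ext = mimetypes.guess_extension(part.get_content_type()) or ".bin"
        path = os.path.join(out_dir, f"part{index:03d}{ext}")
        with open(path, "wb") as fh:
            fh.write(payload)
        location = part.get("Content-Location") or f"part{index}"
        saved[str(location)] = path
    return saved
```

Call it with the raw bytes of the file, e.g. extract_mht(open("page.mht", "rb").read(), "page_files"). The returned mapping from original URL to local file is the "some assembly required" part: the links inside the saved HTML still point at the original URLs, so you would need to rewrite them using that mapping before the page displays properly offline.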
"don't fall into the fallacy of believing that Perl can solve social problems. Maybe Perl 6 can, but that's a ways off"
This opens up options for using IE to extract text and graphics. If you just need the odd item, go to the page in Help, copy the URL to IE, and use IE's Save features. Mass dumps are harder. You could probably do it programmatically, since IE is a COM Automation server. But there are differences in the way links work, and you'd have to figure those out.
It's my understanding that CHM files are built using something called the IStorage interface, but I'm vague on the specifics. There's some interesting Delphi software here (look for the "CHM Explorer Demo"). This Usenet thread is also intriguing.
I have to take this excuse to bemoan the poor state of online Help engines. More and more documentation is moving online, yet nobody seems to be really working on it. JavaHelp looks promising, but it doesn't seem to attract much interest outside the Java community. (Are the issues technical or political?) The KDE web site boasts of an "XML help system", but that just means they do their authoring in XML -- which is good, but the content is actually distributed as a large collection of HTML files, managed by a standard web indexer and a poorly integrated TOC manager. GNOME does something similar.
Which means that HTML Help, despite its flaws and limitations, is actually the best widely-used Help Engine! A sad state of affairs.
This isn't rocket science. All you need is a way to combine a lot of files into a resource archive (no sane product integrator wants to include thousands of separate text files in a product), an indexer, an outliner, and some basic UI engineering. But I don't see any real open-source effort at all. Come on, do you want people to use Linux on the desktop or not?