Extracting HTML and Images from MHT and CHM?

Posted by Cliff on Sunday January 27, 2002 @12:51AM from the it's-all-about-the-(proprietary)-file-formats dept.

smoon asks: "I've got a boatload of .CHM files, and have recently run across some .MHT files. The .MHT format appears to be how MS Internet Explorer saves a web site for offline viewing, and appears to be a basic MIME format with all images inlined as base64-encoded data. I've poked around a bit looking for a utility that will extract the HTML files and associated graphics so that I can view these on Linux, but no luck so far. The .CHM format is billed as 'compiled' HTML, and boils down to the equivalent of a tarball, in some Microsoft proprietary format, that shows a series of web pages. MS has a .CHM developers kit that allows you to extract all of the data, but the links stop working and it ends up not being very useful. Anyone know of code that can extract the HTML and associated images from .CHM or .MHT files?"

14 comments

try running them through IIS by DrSkwid · 2002-01-27 01:03 · Score: 0

using a web site ripper and sorting it from there

but this, of course, is a prime example of how to lose big time.

It will be interesting to see if any of the MS apologists have much positive to say about .chm files et al.

A prime example of why not to go with The Beast. I bet these aren't even really supported much in IIS these days either (but that's a wild guess from once being stuck on the MS treadmill of bugfixes and "oh, everything works differently now, I'd better spend 2 days reading MSDN").

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Online Documentation by SteveX · 2002-01-27 01:56 · Score: 3, Informative

They're not for IIS; The .chm format is Microsoft's way of distributing help and online documentation. It's a replacement for the old .HLP file that uses HTML instead of RTF.
A .CHM is a compilation of HTML files with support for a tree-style view of the documents in it, as well as binary files (examples), images, browse order (which page the "forward" button leads to from the current page), searching, etc.
It's a pretty handy way of distributing online documentation, kinda like PDF but for HTML.
Being HTML, you can still dynamically resize the window and have the text reflow. In my opinion that's its big advantage over PDF. A PDF is basically a rendering of a page, which is not really what you want for an online help system.
It probably ends up just being a bunch of standard filenames inside a .CAB file (the .CAB format is what Microsoft puts a lot of their install packages and other archives into).
As for the format it's in, here's what I found on Google.
- Steve
1. Re:Online Documentation by DrSkwid · 2002-01-27 02:49 · Score: 1
  
  Oh yeah, I remember now, sorry ppl
  
  I was thinking of IDC / HTX from the IIS 3 days
  
  sure glad I didn't get suckered into those!
  
  the helpfiles format is good, I agree.
  
  The Windows Help version of the PHP manual is much easier to use than its DocBook HTML cousin.
  
  Oh crap! that almost makes me an MS apologist
  
  ahhhhhh, I'm going crazy like the androids in the "I love you. slap!" sequence in Star Trek
  
  --
  There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Slashdot is for this? by dostick · 2002-01-27 03:31 · Score: 0, Flamebait

Like Slashdot is the right place for discussing this. Let's bring our problems here!
I need to extract MPEG2 from Echostar DVR-5000. Anyone done this?
Why not create a topic about this!?
1. Re:Slashdot is for this? by Anonymous Coward · 2002-01-27 03:43 · Score: 0
  
  If you look under the hood you'll notice an MPEG decoder chip. You just need to grab the signals going into that chip.
  
  That wasn't so hard now was it?
Perl! by Howie · 2002-01-27 03:49 · Score: 3, Interesting

I've never tried it, but MIME::Tools should deal with .mht files, since they are (at first glance at least) valid MIME documents. Some assembly required.
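
For anyone without Perl handy, roughly the same idea can be sketched with Python's standard `email` package instead of MIME::Tools. This is only a sketch under assumptions: `extract_mht` and the numbered file-naming scheme are made up for illustration, and it assumes each part carries a Content-Location header, as MHTML parts normally do.

```python
import email
import os
import re
from email import policy


def extract_mht(mht_bytes, out_dir):
    """Write each leaf MIME part of an MHT archive to out_dir; return the paths."""
    msg = email.message_from_bytes(mht_bytes, policy=policy.default)
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # multipart containers carry no payload of their own
        # MHTML parts usually record their source URL in Content-Location.
        location = str(part.get("Content-Location", ""))
        name = location.split("?")[0].rstrip("/").rsplit("/", 1)[-1]
        # Keep only filename-safe characters (naming scheme is an assumption).
        name = re.sub(r"[^A-Za-z0-9._-]", "_", name) or "part"
        path = os.path.join(out_dir, f"{index:03d}_{name}")
        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        with open(path, "wb") as fh:
            fh.write(payload)
        written.append(path)
    return written
```

Something like `extract_mht(open("saved.mht", "rb").read(), "out")` should then leave the HTML and images in `out/`. The links inside the HTML will most likely still point at the original URLs, so they would need rewriting before the page displays properly offline.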

--
"don't fall into the fallacy of believing that Perl can solve social problems. Maybe Perl 6 can, but that's a ways off"
No, but... by Anonymous Coward · 2002-01-27 03:56 · Score: 0

...I do want to extract MPEG-2 from OnStar in my Ford? Can I ask that question here?
CHM extractor. by kyz · 2002-01-27 05:10 · Score: 4, Informative

First off, MHT is just a MIME message; run munpack (the unpacking half of the mpack package) on it.

However, for CHM extraction, you can use this portable CHM extractor. I don't think Matthew has officially released it, but it should be OK to use. Get in touch with him if you want.

--
Does my bum look big in this?
Not exactly a CAB file by kyz · 2002-01-27 05:14 · Score: 2

The CAB format is a structure for archiving files, which can use MSZIP, LZX or Quantum compression. CHM is a completely different structure for collating HTML pages and lots of metadata, which uses a slightly tweaked form of LZX compression. That's the only thing the formats have in common.
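
A quick way to see that the two containers really are different is to look at the first four bytes of the file. The sketch below assumes the usual signatures, `ITSF` for CHM and `MSCF` for CAB; the function name `sniff_container` is just something made up for illustration.

```python
def sniff_container(header: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if header.startswith(b"ITSF"):
        return "chm"  # signature at the start of CHM (ITS storage) files
    if header.startswith(b"MSCF"):
        return "cab"  # signature at the start of Microsoft Cabinet files
    return "unknown"
```

Feed it something like `open(path, "rb").read(4)`. It only identifies the wrapper and says nothing about which compression is used inside.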

--
Does my bum look big in this?
Using IE as an extractor, Plus an offtopic rant by fm6 · 2002-01-27 07:10 · Score: 5, Interesting

All the functionality for browsing CHM files is built into IE. Here's a typical URL:
ms-its:C:\WINNT\Help\Gstart.chm::/whats_new_in_windows_2000_server.htm
You can get the URL for any given CHM page from its properties dialog, available from the right-click menu. Looks like a label, but it's really a read-only text box, so it's easy to copy.
This opens up options for using IE to extract text and graphics. If you just need the odd item, go to the page in Help, copy the URL to IE, and use IE's Save features. Mass dumps are harder. You could probably do it programmatically, since IE is a COM Automation server. But there are differences in the way links work, and you'd have to figure those out.
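If you do go the programmatic route, the URL shape above is easy enough to build and take apart. A small sketch, assuming the only structure is `ms-its:` + path to the .chm + `::/` + page inside it; the helper names and the page name `example.htm` are hypothetical, and nothing here actually talks to IE.

```python
def make_its_url(chm_path: str, page: str) -> str:
    """Build an ms-its URL for a page stored inside a CHM file."""
    return f"ms-its:{chm_path}::/{page.lstrip('/')}"


def split_its_url(url: str):
    """Split an ms-its URL into (path to the CHM file, page inside it)."""
    scheme, _, rest = url.partition(":")
    if scheme.lower() != "ms-its":
        raise ValueError("not an ms-its URL: " + url)
    # The first '::' separates the CHM file path from the internal page.
    chm_path, sep, page = rest.partition("::")
    if not sep:
        raise ValueError("missing '::' separator: " + url)
    return chm_path, page.lstrip("/")


url = make_its_url(r"C:\WINNT\Help\Gstart.chm", "example.htm")
print(url)
print(split_its_url(url))
```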
It's my understanding that CHM is built using something called the IStorage interface, but I'm vague on the specifics. There's some interesting Delphi software here (look for the "CHM Explorer Demo"). This Usenet thread is also intriguing.
I have to take the excuse to bemoan the poor state of online Help engines. More and more documentation is moving online, yet nobody seems to be really working on it. JavaHelp looks promising, but it doesn't seem to attract much interest outside the Java community. (Are the issues technical or political?) The KDE web site boasts of an "XML help system", but that just means they do their authoring in XML -- which is good, but the content is actually distributed as a large collection of HTML files, managed by a standard web indexer and a poorly integrated TOC manager. GNOME does something similar.
Which means that HTML Help, despite its flaws and limitations, is actually the best widely-used Help Engine! A sad state of affairs.
This isn't rocket science. All you need is a way to combine a lot of files into a resource archive (no sane product integrator wants to include thousands of separate text files in a product), an indexer, an outliner, and some basic UI engineering. But I don't see any real open-source effort at all. Come on, do you want people to use Linux on the desktop or not?
HTML Help Workshop by vukv · 2002-01-27 07:29 · Score: 2, Informative

HTML Help Workshop from MS can perfectly extract all of the files in a .chm, in their original state. Open it up, click on decompile and voila... I don't see how you managed to run into problems; I have done so with hundreds of CHM files. Sure, it doesn't always save all CHM-related data, but the files themselves are always preserved in their original state.
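With hundreds of files, clicking through the GUI gets old. One possible way to batch it, which is a different route from the Workshop's decompile button, is the `-decompile` switch of the Windows help viewer `hh.exe`. That switch is widely reported rather than something taken from this thread, so treat it as an assumption; the helper names below are hypothetical and this only works on Windows.

```python
import subprocess


def decompile_command(chm_path, out_dir):
    """Build the argument list: hh.exe -decompile <output folder> <file.chm>."""
    return ["hh.exe", "-decompile", out_dir, chm_path]


def decompile(chm_path, out_dir):
    """Run the decompile step (Windows only; assumes hh.exe is on PATH)."""
    subprocess.run(decompile_command(chm_path, out_dir), check=True)
```

Looping `decompile` over a directory of .chm files, with one output folder per file, would keep the extracted pages from overwriting each other.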
MHT by Danielle+Gatton · 2002-01-27 20:57 · Score: 1

I am not a programmer, but this works for me. The "code", such as it is, was taken directly from the Perl MIME::Parser module man page. Install MIME::Tools first, and then make a script like:
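---
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Parser;

# Parse the MIME message on STDIN; each part is written to a
# per-message subdirectory created under /tmp.
my $parser = MIME::Parser->new;
$parser->output_under("/tmp");
my $entity = $parser->parse(\*STDIN);
---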

(call it unmht.pl for this example).
Make that script executable (chmod +x unmht.pl), and then just "cat yourmhtfile.mht | ./unmht.pl". The files contained within the mht should all end up in a subdirectory under /tmp in this case. If you change "output_under" to "output_dir", it will dump them all directly in /tmp, or in whatever directory you specified there.