Why Can't We Reverse Engineer .DOC?
Good question.
I wonder if it has something to do with the mentality of the players involved. I don't think Sun, Corel or Lotus ever thought that they might be able to get together so that they could compete on the Office market, I think they all looked to carve out pieces of the market with their own suites, making such collaboration impossible. Despite popular misperception, Applix does not convert DOC, it converts RTF (which may be close enough for some people). Star Office is striving toward this holy grail, but they aren't quite there yet. So maybe it's not too late for folks to pool resources and finally get the job done. In fact, with the eyes of the court on Microsoft, now might be the perfect time.
On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer? I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed, but I wouldn't say that it's impossible ... not for three big corporations, nor for thousands of loosely organized coders. It's one thing to have control of a file format, but it's another to be put into the position of having to change the format constantly in order to stay in the game. If Microsoft is placed in this situation, the onus would be on them to either concede the format until the next major release is made, or shorten the upgrade cycle on Office. How many businesses would stick with an office suite which forced users to upgrade every eight weeks just to remain compatible? If something like this were to happen, we might finally be able to put a dent in the everpresent Office monopoly.
So why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format. Have we tried, or has Microsoft's reputation, both professionally and legally, kept people from really thinking about it?
- .DOC is an OLE Document
- OLE Document parsers are available for most platforms. There's even one for Perl
- The .DOC format is documented on the MSDN CDs - where else would you expect this documentation to appear?
- So no reverse engineering is needed. Just follow the spec
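To make the "OLE Document" point concrete, here is a minimal Python sketch that checks whether a file starts with the compound-file signature and pulls a few fields out of the 512-byte header. The field offsets are the ones I understand the published compound-file layout to use; the function name and the returned keys are made up for illustration, and this is nowhere near a full parser.

```python
import struct

# Signature found at the start of OLE compound files (Word .doc, Excel .xls, ...).
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(data: bytes) -> dict:
    """Parse a few header fields; `data` must hold at least the first 512 bytes."""
    if len(data) < 512:
        raise ValueError("too short to hold a compound-file header")
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE compound file")
    # Offsets 24..33: minor version, major version, byte order mark,
    # sector shift, mini-sector shift (all little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offsets 44..51: number of FAT sectors, first directory sector.
    fat_sectors, first_dir_sector = struct.unpack_from("<2I", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

Usage would be something like `parse_cfb_header(open("report.doc", "rb").read(512))`, where `report.doc` is a placeholder name. Getting from here to the actual Word streams means walking the FAT and the directory, which is where the real work starts.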
What truth remains is that the doc format changes from release to release of MS Word, so developers have to track these changes. The format is also large and complex, so it has remained fairly niche in the open source world.
Matt.
Want XML + Apache + Stylesheets? Get AxKit.
DOC isn't a difficult file format. It's pretty well documented in various places around the web.
The thing is, DOC is a compound file format, meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean; it'll just pass them on to Visio, Excel, Photoshop etc. to read and understand.
DOC is a hugely extensible file format, and you can't support everything DOC can, because DOC can theoretically support just about anything... especially Windows applications.
And no, that was not done through evil intent. Believe it or not, integration of applications is very much something that good software engineers strive for.
If you have a problem with it, just wait a few years (or maybe a decade) for KOffice etc to mature, and watch people complain as documents created on the Linux version of KOffice won't work because someone decided to embed in their document some python code, or an xpaint image.
http://www.wotsit.org
| Ceci n'est pas une pipe.
Well I can't imagine why but Microsoft, on the other hand, has a strong profit motive. Once the file format changes, as it does every year (or faster) people start getting emails with the new format in attachments. If they could just use a filter then they wouldn't have to upgrade from Word 6 or whatever was the last version that actually offered them new features they needed.
An obfuscated format means that filters are hard to write, so such people are forced to upgrade, which == cash for Bill. In fact, according to M$ this is their single biggest source of revenue.
I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?
Undoubtedly, if they're ever forced to release it. In fact, since you mention it, the release of the source code would be useful almost exclusively for the .h files with the data structures in them. Frankly, who gives a damn about the rest of the code? I can write my own bugs, thanks.
DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything.
Which means that at some point they'll start changing the definition of XML to close out competitors. They've always taken this approach, why do you think they won't this time?
When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
You mean like this one?
When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"
When twits like you attack M$ for the wrong reasons it makes it harder to get the unobsessed to listen to the valid complaints against M$.
Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial.
Care to point out these 65000 bugs that relate to DOC formats?
BTW remember when Office 97 came out and could not save to an Office 95 .doc format? It actually saved to RTF but gave a .doc extension. Corel's WP could save to the real Office 95 .doc, which made it more MS compatible than MS was.
Personally I think MS is using its illegal monopolistic practices to make calls to secret Windows APIs to give it an advantage.
Today's vices may be tomorrow's virtues.
As far as reverse engineering the file format goes, it's all but impossible. Now that UCITA is here it will get even tougher. I just hope AutoCAD knows not to shoot itself in the foot by suing its own users. If the problem ever amounted to a threat to AutoCAD's market share there would probably be quite a backlash.
All you have to do is implement large portions of Windows, COM and Windows Apps to make it work. It uses OLE Structured Storage. OLE (COM/ActiveX) is a Windows thing. To make OLE Structured Storage work on other OSes, you have to make COM available, and use it to read and write the doc. Microsoft did this for the Macintosh, for example.
So, to properly read and write .doc files, you either have to:
1) run Windows and Word
2) run MacOS and Word
3) port COM to another OS and write a Word-alike
Yummy. Anyone written COM for Linux lately? TummyX's "it's open, it's open, stop whining" aside, .DOC is not open because the technology it depends on is not open. I'm sure the fellow who wrote a Word viewer in his C programming course did it on Windows, where COM and other Windows APIs are available. If he did it on Unix or BeOS or something, he should speak up.
Open file formats are important for interoperability and choice. Non-open ones are important for limiting choice and maintaining control. Knowledge shared is power lost, as Aleister Crowley said.
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
The truth is that those specifications are inaccurate and incomplete with regard to Word 97 and 2K. Every person who has tried to implement an import filter has run into that problem. The end result is that you sit down and create Word documents on one PC (or a virtual PC with VMware), then go through them with a hex editor to figure out what each symbol does.
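For anyone who hasn't done the hex-editor routine, the core of it is just comparing two saves of nearly the same document. A minimal Python sketch, assuming you saved one file, changed a single property, and saved again; the function name is mine, not from any existing tool:

```python
def diff_offsets(a: bytes, b: bytes):
    """Yield (offset, byte_in_a, byte_in_b) for every position where the inputs differ.

    A missing byte (one input is shorter) is reported as None.
    """
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else None
        y = b[i] if i < len(b) else None
        if x != y:
            yield i, x, y
```

This only helps when the change does not shift the rest of the file. Once an edit inserts bytes, everything after it shows up as different, which is part of why this kind of work eats so much time.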
To put that all in perspective the two paragraphs above save to 1 KB ( minimum displayed file size on Win98 ) in HTM or text format. In MSWord
Everybody who does this reverse engineering has to start from scratch. Every company that tries to read *.doc files has to put people to work doing it. Combining efforts would be very prudent. Let's start by getting the Open Source teams together on this; then we can invite IBM, Corel, Sun, etc. to join.
We need someone to advocate the benefits of an LGPL or even BSD licensed library set to corps who must otherwise do it all themselves. This is what ESR is useful for, so go and call him.
--= Isn't it surprising how badly I spell ?
1. .DOC is documented, this question is lame FUD. Quit bashing Microsoft.
Well, if it's so well documented, then why can't I open a Word document in WordPerfect? And please don't tell me it's because the Word document can contain embedded things like Excel and Access parts. I'm just talking about a regular word processing document with text and a little formatting. Our MIS guys tell me it does work, but they apparently received this information from the WordPerfect 8 packaging rather than from experimentation, because it doesn't work on my computer and they have been unable to show me where it works on theirs.
2. Why are you picking on poor Microsoft? Do you really think they would purposely obfuscate their own code and make it difficult not only on the rest of the world, but themselves as well? Do you really think they're purposely trying to make it difficult for other companies to use the .DOC format?
Um, well yes, that's exactly what I think. What planet have you people been living on for the last 20 years? Of course Microsoft wants to make it difficult for other word processors to use its format. They pretty much have a monopoly in the Office arena and they want to keep it. If you could go out and buy WordPerfect for $100 less than Word and still be able to use the .DOC format perfectly, how would that help Microsoft? They have done things like this in the past and they will continue to do them as long as they can.
On a more positive note, I'll say that I do think that Microsoft Office is a good product. I mean it works and it does a lot of cool stuff (even though that makes it bloated). The problem is in the way Microsoft has used the power that Office has given them, not in the product itself. And I'm not just bashing Microsoft. I fully believe that if Sun or Corel were in their place they'd be doing the same thing. The bottom line is that consumers are suffering because of proprietary formats. This is one of the big reasons why computers have not made us more productive (or at least as productive as we could be). I can't count the number of hours I've spent simply trying to convert documents from one format to another.
Check out AbiWord.
The analogy is actually more apt than you'd think.
The .doc file format is fairly well documented, as these things go, although there are some proprietary aspects, like the VBA streams. It's not that tough to open up a Word doc in your own program and parse the file correctly.
The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.
Parsing and display are two different things. While half a million apps can parse HTML, no two of them seem to display it in quite the same way. The question here is a bit like pointing out that no browser displays things like (IE|Netscape). Well, no they don't, but that has nothing to do with an inability to reverse engineer the file format.
Let's not lose sight of the real goal here: that .DOC will become a quaint historical curiosity as Open Source file formats become the standard! Do your part by NEVER using MS's proprietary file formats. Even if you use MS at work, save your files as .RTF and advise your less-hip coworkers to do so as well. (I would say save as .HTM, except that Word produces EXTREMELY ugly HTML).
I am very much afraid that we live in interesting times.
Yes, but doesn't that require that you own the Latest And Greatest (*cough*) version of Word?
I think the point is that you have to pay Microsoft the full price of the office suite for the 'privilege' of using newer document formats. That effectively limits the life of your software purchase so that you have to buy a completely new copy whenever there is a document format change - at that point, why not just use it as your primary version?
THAT is where the rub lies - at that point, you start sending out copies that can only be read in the newer version, and your colleagues begin upgrading as well. It's an endless cycle.
People want to break that cycle so that they can either use a competing Office program based only on its merits or stick with a previous version which they feel was better than the next version (i.e. Mac users who upgraded to Word 6, but wish they had stuck with the previous version).
I believe this is a worthy goal.
- Jeff A. Campbell
- VelociNews (http://www.velocinews.com)
- Jeff
Actually, I think that a post along the lines of:
"Those things that you think of as bugs? Those are not bugs. They are actually hot grits. Which are in my pants."
would have been considerably less lame than the actual post made. Just my two cents.
--
-jacob
At one time, you could download the specs for the binary file format. Now, according to:
http://support.microsoft.com/support/kb/articles/Q211/6/41.ASP
You need to write to an e-mail address and explain why you want it. It also says that the formats for earlier versions of Word are no longer available.
For what it's worth.
At Wotsit. Microsoft Word 6.0, 8.0, Word 97, and Palm Pilot doc files were all reverse engineered.
I believe there are actually 2 problems here:
1) As I think several people have touched on, the problem here isn't the documentation, since Microsoft through MSDN etc. has documented the Word file format. The problem is that the only specifications on how to correctly render the Word documents are the Word rendering engine itself. Without the ability to see the exact logic that Word uses to render certain formatting codes (read: source code), it is impossible to reverse-engineer a 100%-compatible converter/viewer. It is a similar situation to what the Samba team faces: the SMB/CIFS protocols have been documented by Microsoft, but the only implementation of those protocols is Windows NT/2000, so Samba in reality must be coded to re-implement NT, not implement the CIFS specifications. The difference here, of course, is that CIFS apparently has a complete spec that Microsoft simply ignores, rather than the Word situation where they purposefully keep people in the dark on how things should be done.
2) The reason that you can't just watch what the Word rendering engine does and duplicate it is that it's stupid. From my experience working with Word itself and with wvWare to convert Word files to HTML, it's obvious that Word just throws odd formatting codes wherever it pleases, and never bothers to clean them up. Often tags to end bold formatting (converted to </b> by wvWare) are just randomly placed in the document, nowhere near where any bolding is supposed to occur. The same goes for font sizing/coloring: Word seems to place odd, irrelevant font codes in places, only to override them with the correct codes a few lines later (often without canceling the first codes). In other words, it's a mess. With the Word source code, one may be able to figure out the (supposed) logic behind the mess; without it, I fear anyone is simply grasping at straws, especially since MS's continuous changes to Office keep everyone guessing about what Word is actually doing underneath it all.
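To show what cleaning up those stray codes could look like on the converter side, here is a small Python sketch. It assumes the document has already been parsed into a list of (properties, text) runs, which is my own simplification and not how wvWare represents things. Formatting applied to empty text is dropped, and neighbouring runs with identical properties are merged.

```python
def normalize_runs(runs):
    """Drop runs with no text and merge adjacent runs whose properties match.

    `runs` is a list of (props, text) pairs, where props is a dict such as
    {"bold": True}. This layout is a hypothetical one chosen for the example.
    """
    merged = []
    for props, text in runs:
        if not text:
            # A formatting change that covers no characters has no visible effect.
            continue
        if merged and merged[-1][0] == props:
            merged[-1] = (merged[-1][0], merged[-1][1] + text)
        else:
            merged.append((dict(props), text))
    return merged
```

A pass like this tidies the output, but it cannot tell you which of two conflicting codes Word itself would honour, and that is the part you need the rendering logic for.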
My US$0.02 of course.
--Mythos
Right you are, sir! In today's "free" market, there are a slew of businesses which wield monopoly power, but which don't want you to know about it. Consider:
Cisco Systems has a market value comparable to Microsoft's, and has even exceeded it at times, by maintaining a total stranglehold on the network hardware market. Although they would have us believe that Cisco's strategy is "providing a reliable, top-quality product and good support," a number of internal memos have recently been leaked indicating that Cisco plans to start including support for the "upgraded" IPv6 "extension," putting them in a position to use the "embrace and extend" strategy to leverage their large market share into an almost total monopoly on the Internet's physical infrastructure.
The Lego corporation has a long history of introducing new block designs which render the old blocks almost totally useless from an aesthetic perspective. "I spent all my lawn-mowing money on the medieval set," said a sniffling little boy who asked not to be identified, "but then the Technics came out, and all my spears and stuff wouldn't fit anywhere on the walking robot I built unless I mixed those brown spear-holder blocks in, and then my robot looks yucky." He also pointed out, as is well known, that Lego has broken Technics color-compatibility with their new Mindstorm upgrade, by switching red dye #5 for #8, and yellow #2 for #7. Alas, the legal hassles that await anyone foolish enough to reverse-engineer Lego's proprietary block-connection protocols have ensured that Lego has reigned unchallenged as the only source for toys you can build cool shit with, despite their inferior product. The "accidental" death of Abe Fromage and the subsequent collapse of Tinkertoys spelt the end of competition, even before Lego started blatantly cloning "CPU" and "robotics" technology from the computer industry for use in their "innovative" Mindstorm toys.
Furthermore, Red Lobster, Denny's, and other chain/corporation/restaurant/franchise establishments regularly use unconscionable terms in the dining agreements they make with their patrons. As large corporations, they play from a position of strength: With their high-priced lawyers and large bankrolls, they can freely impose their will on the consumer (commonly by the use of so-called "walk-through" agreements: the restaurant posts its dining agreement on its wall, and you are considered to have "agreed" simply by choosing to dine there, regardless of whether you have read or even noticed the sign). Examples of this include:
It is sad, but the powermongering megacorporations who really run our country also have merciless teams of wedgie-men and noogie-goons at their command, and they have bamboozled the media and the government into abusing Microsoft to benefit their own bottom line. What with communistic government interference, backlash from the misinformed public, and the software piracy that is rampant in today's industry, Microsoft can barely stay afloat, let alone research more of the innovative, professionally engineered products the software community has come to expect from them, like Microsoft Bob, the dancing Office paper clip, and email clients that do it all at the click of a mouse! Yay Microsoft! Go Bill! One world, one web, one program!
Well I remember trying to write a 6 page research paper during my college days using Word 6.0. I spent a lot of time tweaking the format, making sure that it would stay on 6 pages and not more. Then I brought that file to school to print, and when Word 7.0 opened it, BINGO, all the formatting was destroyed and it now took 6 pages and 2-3 lines! I tweaked it there and when it got sent to the HP LaserJet, it came out as 6 pages and 2-3 lines again! Truly MS Word is WYSIWYG!
The point is that MS's .doc is a joke specification, if it ever was at all. Sure you could read the files, but the specification is NOT COMPLETE. That is why many people are having a hard time converting.
Can you please explain why MS can read the document graphics, but can't maintain format consistency? They seem to have improved a lot in this regard, but so what? All the other guys trying to write a compatible editor are exactly in the position MS itself was a few years ago.
This is because Word 6.0 rendered according to screen metrics, Word 7.0 rendered according to printer metrics for better quality output at low font sizes, and Word 8.0 now renders according to *font design* metrics, which means that while it'll look reasonably like what you get on the printer, and obey margins, it will squish the fonts a pixel or two together at times to get the best fit.
It's nothing to do with the .DOC format at all - it's all to do with the render layer.
Simon
Coming soon - pyrogyra
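The metrics explanation above is easy to reproduce with a toy model. The Python sketch below does greedy word wrapping with a fixed per-character advance, and the same text then fits on one line under one set of metrics and spills onto a second under another. The numbers are invented units, not real screen or printer metrics, and Word's actual line breaker is certainly more involved than this.

```python
def count_lines(text: str, advance: int, line_width: int) -> int:
    """Greedy word wrap: count lines when every character is `advance` units wide."""
    lines, used = 1, 0
    for word in text.split():
        width = len(word) * advance
        gap = advance if used else 0  # one space before every word except the first
        if used and used + gap + width > line_width:
            lines += 1
            used = width
        else:
            used += gap + width
    return lines


sample = "aaaa bbbb cccc"
# Hypothetical metrics: the second device reports slightly wider glyphs.
narrow_metrics_lines = count_lines(sample, advance=10, line_width=140)
wide_metrics_lines = count_lines(sample, advance=11, line_width=140)
```

Here the text fits exactly with the narrower advance and overflows with the wider one. Accumulate that over six pages and a couple of extra lines at the end is about what you would expect, with the file itself unchanged.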
By now we know that both the DWG and DOC formats have been reverse engineered. We also know that it really does not matter. Autodesk/MS control the data formats. Their rendering of the data is the reference implementation -- and they both change the format at will. They both exploit run-time and new version peculiarities in their rendering of the data.
When it comes time for a company to decide which product to invest in, when it's time to choose if they want to use the proprietary product or some wannabe cheap-o competitor, the answer is always the same. Go with the standard bearer. And that really is the correct answer. The price differential is completely and totally irrelevant. Corporations invest a lot more in labor and data than they invest in any one version of a software product. The "open source" factor is -- if not irrelevant -- not appreciated. It is secondary at best.
Look at IntelliCAD. They attempted to commoditize R12 AutoCAD. Supposedly nobody wanted any of the features crammed into post-R12, post-multiplatform AutoCAD. R13 was a bitter pill for AutoCAD customers and loyalists. Supposedly IntelliCAD would allow drafter/designers to draw basic 2D engineering drawings just as well as R13++ for half the price. More importantly, they thought they had given companies that had huge investments in DWG data a viable alternative -- a way out. They could jump from the ship they were supposedly dissatisfied with and seek alternatives.
But you know what? Nobody took the offer.
Not before IntelliCAD was "open source" and not after.
It turns out that Autodesk was able to pull off R14 and salvage their reputation. Turns out customers were not all that dissatisfied with Autodesk -- which they correctly saw as a well entrenched, healthy (==rich) partner, committed to investing in both AutoCAD and other forward looking design products and technologies. Turns out AutoCAD is very capable of getting the drafting job done. Besides, IntelliCAD was for shit. Still is. And when Visio sacked the original IntelliCAD development team -- a very idealistic and motivated group -- because ICAD was released prematurely with bugs and feature gaps, any idealism or customer loyalty went out the window. ICAD was exposed for what it had become -- a cheap knock off with no future. The so-called open sourcing of IntelliCAD was just window dressing. The fact was that Visio had interred its mistake in preparation for acquisition by MS. (It also parted ways with the folks that had inspired IntelliCAD, FWIW.)
So what does this have to do with .DOC?
You could come out with a .DOC compatible word processor without a super-human effort. But without the VBA, without the quirky rendering, without all the nuances and endless litany of features of Word it would be nothing more than a knock-off. It would have to beat Word on functional terms in order to be attractive. That would be a very tall order. Like it or not, Word and AutoCAD are very mature products. Maybe they attempt to do too much. Maybe they are bloated with features that any one customer does not want or need. But a whole lot of customers are well served by these products. They get the job done for a broad spectrum of customers.
They are both going to be very, very hard to dislodge.
It's their game to lose.
Beating them on the merits will be damned hard, and possibly not enough.
And, just to goad anyone still reading, being "open source" or not has nothing to do with it.
If open source is a strategic advantage, it will have to do with stamina and longevity. Eventually MS/Autodesk will find it hard to keep milking their cash cows. Eventually they will find it harder and harder to justify continued investment in these products. Eventually the WinX platforms both products are married to will fade. At that point, when Word and AutoCAD stagnate, they may be vulnerable to an open source community that can run endlessly on no cash, that can build bridges to newer, more current technologies.
I'm not holding my breath.
In fact, I've changed jobs to get out of the CAD industry. The action is elsewhere. I may not live long enough to see AutoCAD take a fall. It may never happen.
PS: In the CAD space, the most interesting open source activity is not IntelliCAD. The Matra folks have a more interesting offering. IntelliCAD is a corpse. OpenDWG may prove useful if and when the action moves beyond AutoCAD. If that future is to involve open source, it will more likely be centered on Matra than OpenDWG.
It's GPLed; granted, it needs work. So scoot onto the abiword mailing list and cvs down the latest version, get hacking on it and sort it out.
ole2 is fully sorted out with libole2, excel is being handled by gnumeric.
Whatever wv does not handle is not for lack of documentation or design; it's simply a matter of spending some time at it. Easy peasy. Info on the MSDN docs can be got from here. They can be gotten off the MSDN 1998 July cd, or you can get some of them from wotsit.org. I even wrote ivt2html for you to convert the office.ivt file into html. Like, what else do you need?
90% of all the hard work has been done. wv can parse fast and simply with no bother (which was a nightmare to do), and it can construct the correct PAP (paragraph properties) and CHP (character properties) for a given run of text. It feeds you the correct characters, charset and font, the TAP (table properties), graphic properties and a handle to graphics, the correct OLE handle for embedded objects, document properties etc. There is an example html conversion program included for reference (wvHtml).
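For readers who haven't met PAP/CHP before, the rough idea is that each run of text ends up with one fully resolved set of properties, built from a style plus whatever overrides apply to that run. The Python sketch below illustrates that layering only. The class, its three fields and the function name are all hypothetical; this is not wv's API, and the real CHP has far more fields.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CharProps:
    """Heavily reduced, hypothetical stand-in for a CHP (character properties)."""
    bold: bool = False
    italic: bool = False
    half_points: int = 20  # assumed default: sizes in half-points, so 20 = 10 pt


def resolve_run(style: CharProps, overrides: dict) -> CharProps:
    """Return the effective properties for a run: the style with overrides applied."""
    return replace(style, **overrides)
```

For example, `resolve_run(CharProps(bold=True, half_points=28), {"italic": True})` gives a bold, italic, 14 pt run. A consumer of a library like wv only ever sees the resolved result, which is the point of having the library do that work.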
I put together libwmf to convert wmf files into something useful as well. There's a half done implementation of an Escher (the graphics for Office) importer floating around in there as well.
There's also an implementation of a Summary Stream displayer for all ole2 documents.
I even busted my ass and dragged together the right bunch of motivated people to help implement the decryption module for word 97, 95 and 6, and that was not fun at all to say the least.
The hard work is done; if you want something improved you have a very very solid base to work from. Yes the spec is confusing, yes it's not a great format, yes it sort of moves over time, but in a fairly rational way that can be supported with some work. There are any number of equally crap formats with weak documentation supported in various tools.
There is just this false myth that the Microsoft formats are impenetrable and/or not available. Just download wv; fair enough, there might be problem documents, and if there are, just debug wv and get onto the abiword list and work it out with them. If something fails it can be fixed and improved; it's not a case of "ah well, it's a MS format, nothing can be done". If you truly want to handle Microsoft formats there are a number of people working on it that you can help.
So it's right there for the right bunch of motivated people to work on.
C.
I sometimes write stuff