Why Can't We Reverse Engineer .DOC?
Good question.
I wonder if it has something to do with the mentality of the players involved. I don't think Sun, Corel or Lotus ever thought that they might be able to get together so that they could compete on the Office market, I think they all looked to carve out pieces of the market with their own suites, making such collaboration impossible. Despite popular misperception, Applix does not convert DOC, it converts RTF (which may be close enough for some people). Star Office is striving toward this holy grail, but they aren't quite there yet. So maybe it's not too late for folks to pool resources and finally get the job done. In fact, with the eyes of the court on Microsoft, now might be the perfect time.
On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer? I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed, but I wouldn't say that it's impossible ... not for three big corporations, nor for thousands of loosely organized coders. It's one thing to have control of a file format, but it's another to be put into the position of having to change the format constantly in order to stay in the game. If Microsoft were placed in this situation, the onus would be on them to either concede the format until the next major release is made, or shorten the upgrade cycle on Office. How many businesses would stick with an office suite which forced users to upgrade every eight weeks just to remain compatible? If something like this were to happen, we might finally be able to put a dent in the ever-present Office monopoly.
So why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format. Have we tried, or has Microsoft's reputation, both professionally and legally, kept people from really thinking about it?
- .DOC is an OLE Document
- OLE Document parsers are available for most platforms. There's even one for Perl
- The .DOC format is documented on the MSDN CDs (where else would you expect this documentation to appear?), so no reverse engineering is needed. Just follow the spec
What truth remains is that the doc format changes from release to release of MS Word, so developers have to track these changes. The format is also large and complex, so it has remained fairly niche in the open source world.
Matt.
Want XML + Apache + Stylesheets? Get AxKit.
DOC isn't a difficult file format. It's pretty well documented in various places around the web.
The thing is, DOC is a compound file format, meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean; it'll just pass them on to Visio, Excel, Photoshop etc. to read and understand.
DOC is a hugely extensible file format, and you can't support everything DOC can, because DOC can theoretically support just about anything... especially Windows applications.
And no that was not done through evil intent. Believe it or not, integration of applications is very much something that good software engineers strive for.
If you have a problem with it, just wait a few years (or maybe a decade) for KOffice etc to mature, and watch people complain as documents created on the Linux version of KOffice won't work because someone decided to embed in their document some python code, or an xpaint image.
Here's another "try to act professional, but bash Microsoft at the same time" type post. Pretty typical of Linux users...
"On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer?"
Well, considering DOC can store ANYTHING, including the description of 3D objects, yes.
"I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed"
I see, Microsoft == Evil, so DOC must be created to obfuscate. Very smart of you.
Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read? I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?
"but I wouldn't say that it's impossible ... not for 3 big corporations, nor for thousands of loosely organized coders."
Yes, those poor, poor companies like Sun with their open software like Java and Corel Office need to band together and blow up Microsoft. Resistance is not futile!
Please.
DOC isn't going to be very important in a few years anyway; Microsoft are moving to XML-based everything. Serialization of COM services will be XML based rather than binary based as it is today as well. Just don't complain when your documents are 100MB.
http://www.wotsit.org
| Ceci n'est pas une pipe.
Response2: Yeah! Let's just do it!
This question misses the whole point. The problem (from following the AbiWord list for a while) is not that the .doc file format needs to be reverse engineered, it's that the format is such a piece of crap that you can't implement the spec.
Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial. :)
--
-- Slashdot sucks.
When I used StarOffice I saw horribly broken formatting that was magically cured when I installed Microsoft/Monotype fonts on my Linux box with StarOffice. This suggests that Word formatting is very inflexible regarding changing parameters of the media (as opposed to, say, TeX, which will adapt to any size of anything as long as it makes sense), and every slight difference in algorithms (the never-documented ones, not the "packaging") can cause horrendous miscalculation of the formatting.
Contrary to the popular belief, there indeed is no God.
BTW, remember when Office 97 came out and could not save to an Office 95 .doc format? It actually saved to RTF but gave the file a .doc extension. Corel's WP could save to the real Office 95 .doc, which made it more MS-compatible than MS was.
Personally, I think MS is using its illegal monopolistic practices to make calls to secret Windows APIs to give it an advantage.
Today's vices may be tomorrow's virtues.
The answer for why the big office suite vendors haven't banded together in the same manner as the OpenDWG Alliance seems pretty self-evident to me. I'm sure that each of these software manufacturers has at one time or another signed an NDA with regard to the MS Office file formats. Once they did that, they were precluded from sharing that information amongst themselves. End of question.
As for why they signed those NDAs? Again self-evident: early access. If Corel or Lotus wanted to be able to support the new file formats in a timely fashion, they needed to know what the spec was well in advance -- TechNet doesn't get that sort of new information fast enough. For that matter, when you subscribe to TechNet, you're signing a limited NDA with Microsoft; I'd check the fine print before I depended upon TechNet information...
Are you moderating this down because you disagree with it,
We call it art because we have names for the things we understand.
For my purposes, and the purposes of the company for which I work: what good are reverse-engineered DWG file formats if you still can't get a good, affordable CAD package on anything but Ms-Win? My company is presently standardizing on applications. And (I'm sure MS would be overjoyed to hear this) it looks like the Unix boxen are on their way out. Why? One of the reasons is AutoCrap. It's available only on Ms-Win. Our customers and vendors demand files in AutoCrap format. There are no price-competitive CAD packages available for Unix anymore. (Bentley has dropped support for MicroStation on Unix--in case you didn't know. Note to Bentley: you screwed up! By dropping MicroStation for Unix you removed any incentive for us to consider your product.) So bye-bye to our reliable, low-TCO Unix workstations and X-terminals :-(.
So even though we're evaluating StarOffice to use instead of MS Office, and even though we're evaluating non-MS email clients and other non-MS client apps: even if these pan out the Unix environment is still probably doomed because of AutoCrap :-(. (Then there's Visio and other stuff.)
IMO many vendors, by not making their apps available on non-MS platforms, are missing the boat by failing to differentiate themselves from the run-of-the-mill "Me too! I do Microsoft" crowd. With things happening like the surge of interest in Linux as a potentially viable workstation platform, Solaris for free and Sun hardware getting quite affordable, this seems to me to be narrow-minded. Particularly with respect to vendors like Bentley--who already had Unix versions of their products.
Sigh...
As far as reverse engineering the file format goes, it's all but impossible. Now that UCITA is here it will get even tougher. I just hope AutoCAD knows not to shoot itself in the foot by suing its own users. If the problem ever amounted to a threat to AutoCAD's market share there would probably be quite a backlash.
As a figment of your imagination. I've tried them all and they all fail almost as soon as you leave the area of text only docs. In fact none of them print even text only docs well enough for professional use.
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
hmm, Corel's word processor (WordPerfect) won't read .doc's perfectly. Sun's StarOffice is even worse on importing .doc's.
they are good at importing simple .doc files (.doc's mainly consisting of text) but are pretty bad at importing .docs that have images, tables and other stuff...
-- http://electronicintifada.net --
All you have to do is implement large portions of Windows, COM and Windows Apps to make it work. It uses OLE Structured Storage. OLE (COM/ActiveX) is a Windows thing. To make OLE Structured Storage work on other OSes, you have to make COM available, and use it to read and write the doc. Microsoft did this for the Macintosh, for example.
So, to properly read and write .doc files, you either have to:
1) run Windows and Word
2) run MacOS and Word
3) port COM to another OS and write a Word-alike
Yummy. Anyone written COM for Linux lately? TummyX's "it's open, it's open, stop whining" aside, .DOC is not open because the technology it depends on is not open. I'm sure the fellow who wrote a Word viewer in his C programming course did it on Windows, where COM and other Windows APIs are available.
If he did it on Unix or BeOS or something, he should speak up.
Open file formats are important for interoperability and choice. Non-open ones are important for limiting choice and maintaining control. Knowledge shared is power lost, as Aleister Crowley said.
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
WordPad calls upon the text import filters installed by Windows and Microsoft Office to convert .DOC files to RTF and then reads the RTF file.
-Martin
SoftMaker Office for Windows|Linux|Android
The truth is that those specifications are inaccurate and incomplete with regard to Word 97 and 2K. Every person who has tried to implement an import filter has run into that problem. The end result is that you sit down and create Word documents on one PC (or virtual PC with VMware), then go through them with a hex editor to figure out what symbol does what.
To put that all in perspective the two paragraphs above save to 1 KB ( minimum displayed file size on Win98 ) in HTM or text format. In MSWord
Everybody who does this reverse engineering has to start from scratch. Every company that tries to read *.doc files has to put people to work doing it. Combining efforts would be very prudent. Let's start by getting the Open Source teams together on this, then we can invite IBM, Corel, Sun, etc. to join.
We need someone to advocate the benefits of an LGPL or even BSD licensed library set to corps who must otherwise do it all themselves. This is what ESR is useful for, so go and call him.
--= Isn't it surprising how badly I spell ?
1. .DOC is documented, this question is lame FUD. Quit bashing Microsoft.
Well, if it's so well documented, then why can't I open a Word document in WordPerfect? And please don't tell me it's because the Word document can contain embedded things like Excel and Access parts. I'm just talking about a regular word processing document with text and a little formatting. Our MIS guys tell me it does work, but they apparently received this information from the WordPerfect 8 packaging rather than from experimentation, because it doesn't work on my computer and they have been unable to show me where it works on theirs.
2. Why are you picking on poor Microsoft? Do you really think they would purposely obfuscate their own code and make it difficult not only on the rest of the world, but themselves as well? Do you really think they're purposely trying to make it difficult for other companies to use the .DOC format?
Um, well yes, that's exactly what I think. What planet have you people been living on for the last 20 years? Of course Microsoft wants to make it difficult for other word processors to use its format. They pretty much have a monopoly in the Office arena and they want to keep it. If you could go out and buy WordPerfect for $100 less than Word and still be able to use the .DOC format perfectly, how would that help Microsoft? They have done things like this in the past and they will continue to do them as long as they can.
On a more positive note, I'll say that I do think that Microsoft Office is a good product. I mean it works and it does a lot of cool stuff (even though that makes it bloated). The problem is in the way Microsoft has used the power that Office has given them, not in the product itself. And I'm not just bashing Microsoft. I fully believe that if Sun or Corel were in their place they'd be doing the same thing. The bottom line is that consumers are suffering because of proprietary formats. This is one of the big reasons why computers have not made us more productive (or at least as productive as we could be). I can't count the number of hours I've spent simply trying to convert documents from one format to another.
Check out AbiWord.
So because MS wants to keep out competitors, it is entitled to make you find another job simply because you wanted to exercise your choice in software. In my book, hurting innocent people is EVIL!
The analogy is actually more apt than you'd think.
The .doc file format is fairly well documented, as these things go, although there are some proprietary aspects, like the VBA streams. It's not that tough to open up a Word doc in your own program and parse the file correctly.
The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.
Parsing and display are two different things. While half a million apps can parse HTML, no two of them seem to display it in quite the same way. The question here is a bit like pointing out that no browser displays things like (IE|Netscape). Well, no they don't, but that has nothing to do with an inability to reverse engineer the file format.
"why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format."
Not necessarily true. A format can be encrypted with PGP, and a connection to the Internet may be required to read a document encoded with this format. Try and reverse engineer that.
You can't handle the truth.
Let's not lose sight of the real goal here: that .DOC will become a quaint historical curiosity as Open Source file formats become the standard! Do your part by NEVER using MS's proprietary file formats. Even if you use MS at work, save your files as .RTF and advise your less-hip coworkers to do so as well. (I would say save as .HTM, except that Word produces EXTREMELY ugly HTML).
I am very much afraid that we live in interesting times.
You know, I've been thinking about this.
The obfuscation isn't actually in the .DOC format; it's in the fact that Word itself reads the statements contained within the .DOC format in confusing and illogical ways.
Yet this readability has been maintained from Office 95 through Office 97 to Office 2000. (Let's not even talk about Word for Mac!)
This just isn't possible unless Microsoft has internal conformance specifications that they follow from revision to revision.
We know the specs exist because it literally would have been impossible for Microsoft to have functioned without them.
98% of Word documents don't use any advanced Word features. In fact, 98% of Word documents could be saved in RTF format and lose nothing of value in the translation. With these specifications, the #1 thing companies could do would be to implement a DOC->RTF filter *at the mail gateway* and be done with 98% of macro viruses.
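As a rough illustration of the first half of such a gateway filter, here is a minimal Python sketch that only detects attachments which look like OLE compound files (the container Word documents are wrapped in). It does not attempt the DOC->RTF conversion itself. The function name `find_ole_attachments` is made up for this example; the 8-byte signature is the one OLE compound files are known to start with.

```python
from email import message_from_bytes
from typing import List

# 8-byte signature found at the start of OLE compound files (.doc, .xls, ...).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def find_ole_attachments(raw_message: bytes) -> List[str]:
    """Return filenames of message parts whose decoded body starts with OLE_MAGIC.

    Hypothetical helper: a real gateway would hand these parts to a converter
    or quarantine them; this sketch only reports them.
    """
    msg = message_from_bytes(raw_message)
    flagged = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload and payload.startswith(OLE_MAGIC):
            flagged.append(part.get_filename() or "(unnamed)")
    return flagged
```

Checking the content rather than the file extension matters here, since a renamed attachment would slip past an extension-only rule.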
Will it happen? Nah. The Word Monopoly is just too critical to Microsoft's success. It really is.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
"Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"
;-)
It's very simple really. Unlike Autodesk, which uses some form of logic to create their file formats, Microsoft uses heavy encryption seeded with a semi-random number.
This number is based on the millions of dollars Bill Gates is worth at the year of release. In fact, the file formats for Office 95, 97, and 2000 are identical - it's just that Bill Gates has been worth more at the time of their release, so the file was encrypted differently.
This is why it's so important for the Microsoft stock price to jump around. If it stood still then the file formats wouldn't change, which means people wouldn't buy the latest version of Office, which means the stock price wouldn't change, and so forth in an infinite loop.
An alternative solution that I believe may be the answer is to create an open community equivalent to the W3C (is that right? The body that maintains the HTML standard, whether or not it is followed; it's late and my brain is getting foggy) for office document formats.
This is the only answer to the problem, and the W3C is a great example. It takes the control away from Microsoft, a company that uses the spec as a means of driving upgrade sales and maintaining their monopoly (the real purpose of .DOC these days), and places control back with the consumer.
The idea for a common doc format could be marketed successfully based on two points.
First, a common doc format would allow companies and individuals to save large amounts of money by not having to upgrade to the latest version of Word every two years. This would impact a company's bottom line.
Second, a common doc format would provide companies and individuals with a level of "insurance" that older document types that hold important data would not at some point in the future become obsolete.
Neither of these points even brings up the obvious benefit to the rest of us that use non-MS systems. It would increase competition in the word processing arena and would probably move us towards a world where .DOC and .HTML could be interchangeable. Based on the above points, many companies would require that their employees maintain company data using the new open standard.
--
Mmmm...send them files in TeX format?
Seriously though, knowing what the person can read on the other end and sending them that is courteous. Unfortunately, too many people think that everyone uses what they do.
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
At one time, you could download the specs for the binary file format. Now, according to:
http://support.microsoft.com/support/kb/articles/Q211/6/41.ASP
You need to write to an e-mail address and explain why you want it. It also says that the formats for earlier versions of Word are no longer available.
For what it's worth.
At Wotsit. Microsoft Word 6.0, 8.0, Word 97, and Palm Pilot doc files were all reverse engineered.
I believe there are actually 2 problems here:
1) As I think several people have touched on, the problem here isn't the documentation, since Microsoft through MSDN etc. has documented the Word file format. The problem is that the only specifications on how to correctly render the Word documents are the Word rendering engine itself. Without the ability to see the exact logic that Word uses to render certain formatting codes (read: source code), it is impossible to reverse-engineer a 100%-compatible converter/viewer. It is a similar situation to what the Samba team faces: the SMB/CIFS protocols have been documented by Microsoft, but the only implementation of those protocols is Windows NT/2000, so Samba in reality must be coded to re-implement NT, not implement the CIFS specifications. The difference here, of course, is that CIFS apparently has a complete spec that Microsoft simply ignores, rather than the Word situation where they purposefully keep people in the dark on how things should be done.
2) The reason that you can't just watch what the Word rendering engine does and duplicate it is that it's stupid. From my experience working with Word itself and wvWare to convert Word files to HTML, it's obvious that Word just throws odd formatting codes wherever it pleases, and never bothers to clean them up. Often tags to end bold formatting (converted to </b> by wvWare) are just randomly placed in the document, nowhere near where any bolding is supposed to occur. The same goes for font sizing/coloring: Word seems to place odd, irrelevant font codes in places, only to override them with the correct codes a few lines later (often without canceling the first codes). In other words, it's a mess. With the Word source code, one may be able to figure out the (supposed) logic behind the mess; without it, I fear anyone is simply grasping at straws, especially since MS's continuous changes to Office keep everyone guessing about what Word is actually doing underneath it all.
My US$0.02 of course.
--Mythos
Now I have tried WordPerfect 8 for Linux, and the Word filter does not work on more than half the documents that I have. StarOffice 5.1 does a pretty good job of this and from what I hear it is getting better. However, I know that if you start doing some complex things in Word then StarOffice may not read all of the document. They are working on this though. Apparently StarOffice 5.2 is supposed to have pretty good support for Word files.
On another note, there are several open source projects that are working on reading these formats, one of which I believe is called AbiWord. Although its native output will not be Word, last time I talked with them they were working on a Word filter.
send flames > /dev/null
Only 'flamers' flame!
There is a lot of confusion here about whether or not the .DOC format has been documented, because there are two layers to the file format. First, there is the Word document format itself, which Microsoft has published in some MSDN CD versions. It is also available from places like www.wotsit.org. This specification is inaccurate in places but close enough to make Word document conversion possible. Caolan McNamara has a very good start on a Word-to-HTML converter at www.wvware.com. The Word document format changed in the transition from Word 6 to Word 97, and is the same in Word 2000.
However, Word documents since version 6 are wrapped in OLE Compound Documents, which Microsoft also uses for .XLS files. The Compound Document format is not officially documented anywhere in Microsoft documentation, as far as I can tell. (But see below for a patent that might disclose this structure...) The MSDN library samples invariably use Windows system calls to access data in Compound Documents, and reveal nothing about the file format.
There have been some efforts to reverse-engineer this format:
http://arturo.directmail.org/filtersweb/ and
http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/guide.html.
A Compound Document contains a tree structure of data streams, which seems like a simple enough structure but it is implemented using a very complex file format. The lack of complete documentation of this format is a major impediment to development of robust open-source code that will access the Microsoft Office file formats.
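To give a feel for the outermost layer of that container, here is a minimal Python sketch that checks for the 8-byte signature OLE compound files start with and reads a few header fields. The field offsets used (version numbers at bytes 24-27, byte-order mark at 28, sector shift at 30) follow the header layout as it is commonly described in public write-ups of the format; treat them as an assumption to verify against real files. The function name `read_ole_header` is made up for this example, and nothing here walks the stream tree itself.

```python
import struct

# 8-byte signature found at the start of OLE compound files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_ole_header(data: bytes):
    """Return a few header fields, or None if data is not a compound file.

    Assumed layout: signature (8 bytes), class ID (16 bytes, skipped), then
    four little-endian 16-bit fields starting at offset 24.
    """
    if len(data) < 32 or not data.startswith(OLE_MAGIC):
        return None
    minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    return {
        "minor_version": minor,
        "major_version": major,
        "byte_order": byte_order,          # 0xFFFE marks little-endian
        "sector_size": 1 << sector_shift,  # a shift of 9 gives 512-byte sectors
    }
```

Passing the first few dozen bytes of a .doc or .xls file to `read_ole_header` is enough; everything past the header (allocation tables, directory entries, streams) is where the real complexity described above begins.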
A second potential impediment is a nest of patents that Microsoft has built around the Compound Document format. These are just a few:
US5467472: Method and system for generating and maintaining property sets with unique format identifiers
US5715441: Method and system for storing and accessing data in a compound document using object linking
US5506983: Method and system for transactioning of modifications to a tree structured file
US5706504: Method and system for storing data objects using a small object data stream
There are a fair number of patents (IBM seems to have some possibly related ones as well). You can find them here: http://patent.womplex.ibm.com/home. A search for "((compound document) and microsoft)" lists 24 patents. It would not be surprising if a serious effort to provide open-source access to Microsoft Office documents ran into legal threats because of these patents.
Interestingly, the last one looks like it might disclose the Compound Document format, which Microsoft would have to disclose to satisfy the patent office. The description looks right, but the diagrams do not seem to be available from the IBM site. Looks like I'll have to dig some more -- anyone know how to get the full text and images for U.S. Patent 5,706,504?
I've reverse engineered a number of Microsoft file formats.
Several versions of the .DOC file format were only available by signing an NDA. The 97 format was released publicly, but the latest releases of the .DOC format have not been documented.
I was somewhat involved in the reverse engineering of one of the .DOC formats when I was reverse engineering the .HLP file format. The person doing the .DOC format believed there would be some similarities in the two, so we worked on them together.
It turned out that there were some very small similarities, but not enough to be very helpful to us.
Reverse engineering a .DOC file would be fairly easy. It's also incredibly tedious.
The best way to do it is to start with small files: Start with a file with 1 letter, then two letters, then three.
Then make one of the letters bold, then make one italic, then make one bold italics. Then put each letter in a cell in a table, and so on ad infinitum.
Between each step, do a hex dump and compare the files. Eventually everything starts to fall into place.
After that's done, write a converter or dumper for a .DOC file. Then start testing that on a bunch of different .DOC files until you find files that break it. Look to see what's different about those files, fix your code, and repeat, again, ad infinitum.
Depending on how diligent you are, you can probably get 99% of it.
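The "hex dump and compare" step can be sketched in a few lines of Python. This is a minimal illustration, not a real diff tool: it compares two byte strings offset by offset and reports each run of bytes that differs. The function names `diff_bytes` and `print_diff` are made up for this example.

```python
def diff_bytes(old: bytes, new: bytes):
    """Return (offset, old_chunk, new_chunk) for each run of differing bytes.

    Only same-offset comparison is done. An insertion shifts everything after
    it, so this works best on files where a small change leaves the overall
    layout in place.
    """
    runs = []
    start = None
    limit = min(len(old), len(new))
    for i in range(limit):
        if old[i] != new[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            runs.append((start, old[start:i], new[start:i]))
            start = None
    if start is not None:
        runs.append((start, old[start:limit], new[start:limit]))
    if len(old) != len(new):
        # Whatever extends past the shorter input counts as one final run.
        runs.append((limit, old[limit:], new[limit:]))
    return runs


def print_diff(old: bytes, new: bytes) -> None:
    """Print each differing run as hex, one line per run."""
    for offset, a, b in diff_bytes(old, new):
        print(f"{offset:08x}: {a.hex(' ')} -> {b.hex(' ')}")
```

In practice you would read two saved test documents (say, the one-letter file and the two-letter file) with `open(path, "rb").read()` and pass both to `print_diff`, then note which offsets move as the document changes.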
Personally, I've done about all the reverse-engineering that I want to do, so I'm not going to do it, but if someone wants to follow these instructions, it's probably the easiest way to go. Also, I'd keep the Word 97 specs handy so you can see any similarities that have been carried over from that version to the latest.
Good luck.
I agree with you about Word and Windows, but PDF? I like PDF. And there are free viewers (GhostScript) for it, too. I don't even have AcroRead on my box... I just use gv.
TO BUY A NEW CAR WOULD MAKE YOU SEXUALLY ATTRACTIVE.
Right you are, sir! In today's "free" market, there are a slew of businesses which wield monopoly power but don't want you to know about it. Consider:
Cisco Systems has a market value comparable to Microsoft's, and has even exceeded it at times, by maintaining a total stranglehold on the network hardware market. Although they would have us believe that Cisco's strategy is "providing a reliable, top-quality product and good support," a number of internal memos have recently been leaked indicating that Cisco plans to start including support for the "upgraded" IPv6 "extension," putting them in a position to use the "embrace and extend" strategy to leverage their large market share into an almost total monopoly on the Internet's physical infrastructure.
The Lego corporation has a long history of introducing new block designs which render the old blocks almost totally useless from an aesthetic perspective. "I spent all my lawn-mowing money on the medieval set," said a sniffling little boy who asked not to be identified, "but then the Technics came out, and all my spears and stuff wouldn't fit anywhere on the walking robot I built unless I mixed those brown spear-holder blocks in, and then my robot looks yucky." He also pointed out, as is well known, that Lego has broken Technics color-compatibility with their new Mindstorm upgrade, by switching red dye #5 for #8, and yellow #2 for #7. Alas, the legal hassles that await anyone foolish enough to reverse-engineer Lego's proprietary block-connection protocols have ensured that Lego has reigned unchallenged as the only source for toys you can build cool shit with, despite their inferior product. The "accidental" death of Abe Fromage and the subsequent collapse of Tinkertoys spelt the end of competition, even before Lego started blatantly cloning "CPU" and "robotics" technology from the computer industry for use in their "innovative" Mindstorm toys.
Furthermore, Red Lobster, Denny's, and other chain/corporation/restaurant/franchise establishments regularly use unconscionable terms in the dining agreements they make with their patrons. As large corporations, they play from a position of strength: With their high-priced lawyers and large bankrolls, they can freely impose their will on the consumer (commonly by the use of so-called "walk-through" agreements: the restaurant posts its dining agreement on its wall, and you are considered to have "agreed" simply by choosing to dine there, regardless of whether you have read or even noticed the sign). Examples of this include:
It is sad, but the powermongering megacorporations who really run our country also have merciless teams of wedgie-men and noogie-goons at their command, and they have bamboozled the media and the government into abusing Microsoft to benefit their own bottom line. What with communistic government interference, backlash from the misinformed public, and the software piracy that is rampant in today's industry, Microsoft can barely stay afloat, let alone research more of the innovative, professionally engineered products the software community has come to expect from them, like Microsoft Bob, the dancing Office paper clip, and email clients that do it all at the click of a mouse! Yay Microsoft! Go Bill! One world, one web, one program!
--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
I worked on Bento. I was not the designer. Jed Harris was the designer (Ira Reuben the coder). Jed said Bento was an experimental first cut prototype that was pushed into production, and I agree with this view.
The design goal was only rather similar to *.doc. Unfortunately, since Bento was a version one prototype, it never had a redesign for ease in reading and writing until I designed one.
Gat1024: And that was designed from the get-go to be cross platform.
It was technically cross platform, but Bento was very unfriendly as a clearly understandable format. Its big mistake was to use physical stream embedding instead of logical embedding, so the recursive flow of control was a nightmare to analyze. The format had physically discontiguous streams embedded inside other physically discontiguous streams, which would give almost anyone the shudders.
Gat1024: Think of it as component hell. And it is unavoidable no matter who does it. This goes for KOffice as well. Complexity is a runaway train. I should say entropy. Since we're tending towards chaos here.
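To make the "discontiguous streams inside discontiguous streams" problem concrete, here is a toy model in Python. It is not Bento's actual on-disk layout; the `Stream` class, the extent lists and `read_stream` are all made up for illustration. The only point is that when a child stream's pieces are addressed inside its parent's bytes, and the parent is itself scattered around the file, every read has to recurse through the whole chain of parents.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# An extent is (offset, length) within whatever byte space the stream lives in.
Extent = Tuple[int, int]


@dataclass
class Stream:
    """Hypothetical stream record: scattered extents plus an optional parent."""
    extents: List[Extent]
    parent: Optional["Stream"] = None  # None means extents index the raw file


def read_stream(data: bytes, stream: Stream) -> bytes:
    """Reassemble a stream's bytes, resolving parents recursively."""
    if stream.parent is None:
        space = data  # extents point straight into the file
    else:
        # Physical embedding: first reassemble the parent, then cut the
        # child's pieces out of the parent's reassembled bytes.
        space = read_stream(data, stream.parent)
    return b"".join(space[off:off + length] for off, length in stream.extents)


# Toy file: the parent is scattered in two pieces, the child in two pieces
# of the parent.
data = b"__ab__cdef__"
parent = Stream(extents=[(2, 2), (6, 4)])
child = Stream(extents=[(1, 2), (4, 1)], parent=parent)
```

With logical embedding the parent would only name the child, and the child's extents would point straight into the file, so no recursion through the parent would be needed.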
You are correct that every open format can embed opaque content that cannot be understood, so all component systems suffer from the risk of component hell.
I would not accept any amount of money to reverse engineer the Office doc format as a regular job, because it would tend to be too hard and frustrating to deal with the complexity under ongoing changes.
Furthermore, I would not trust any junior engineer who did accept such a job, so I would avoid the product based on such work, under the theory it would be fragile and buggy. Am I a pessimist, or what?
David McCusker, former Bento guy
Values have meaning only against the context of a set of relationships.
I really think this was my entire point. They both do the same thing. The only difference is that one (TeX) is an open format dating to the early 80s, while the other (.doc) is a proprietary format that changes every 2-3 years. They both do the same thing, so what possible justification could there be for using the second? Assuming, for a moment, that M$ is, as they claim, concerned with producing real benefits to their customers, I don't see any point to .doc. Do you? If so, please explain it.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
If it's compatibility we are looking for here, why would we expect MS to do it?
Because their customers expect them to make decisions that make their software better for the user, particularly when those decisions would come at little or no (or negative, in the case of maintaining a consistent document format) cost to Microsoft. The fact that Microsoft repeatedly changed the Word format costs them and their competitors money for additional programming work on filter and import/export code, and costs their users money for repeated unnecessary upgrades and incompatibility hassles with other programs. Looking at the Microsoft+competitors+users system as a whole, there is no benefit to anyone for Microsoft to use a poorly documented, convoluted format without an accurate public specification.
However, looking at MS, competitors, and users independently, it's obvious that while the value of the system as a whole is reduced by Microsoft's decisions, the handicap that it gives to competitors and the additional revenues it generates from users causes more of that value to end up as cash in Microsoft's hands.
This isn't the way a free market is supposed to work. If someone makes an inferior product, I'm supposed to be able to switch to a different producer and not be adversely impacted by said product. (And as a side effect, my readily available choices encourage all producers not to produce inferior products.) Unfortunately, when you add network effects, i.e. the requirement that my new product be compatible with the old, suddenly Microsoft has the ability to use an existing large marketshare as its own "benefit", to make it self-sustaining, to reduce or eliminate that choice.
I'm not saying that, after thinking about it, it doesn't make sense for Microsoft to do just that. I'm just saying that, to consumers used to having a wide selection of companies competing solely based on price and quality for their purchasing dollars, it certainly counts as "unexpected".
MS already does this, with HTML^H^H^H^Hrubbish that gets spat out from Word2k, when you do a "save as html". It's rather frightening, actually, to see the actual code.
Office output is fully XML/XSL transform compliant - which is why Opera can handle it perfectly fine.
Also, a lot of the stuff in there is for round-tripping; it doesn't get used by a browser for display - the XSL transform just deletes it to all intents and purposes.
Simon
Coming soon - pyrogyra
That pretty much sums it up, and from a Microsoft VP, no less. You can pretend that Microsoft is a benevolent company all you want, but that doesn't change the facts.
The facts being that Vinod Valloppillil wasn't a Microsoft VP, nor even anywhere near that. He was a grunt.
Now if you'd said he was a Microsoft V V, then I'd have to agree with you.
I can sit down right here and now and write a document that claims the best way for Microsoft to make money is to take Linus, strap him to a chair with electrodes on his testicles, and fry him like a bug on a hotplate.
This document would get leaked.
Does this mean that this is happening in real life? Well, goddamnit YES! Linus is strapped to a chair! Right now! With electrodes on his testicles!
Funny how you never saw the leaked document which says that Microsoft would be better off if they gave all the lower-level peter-principle'd management a good kicking, stopped the infighting, and stopped the use of brute force in their development practices.
Simon
Coming soon - pyrogyra
That way no application means no editing rather than no picture, which is how dear old "we know what you want" MS have done it.
Actually, MS did it the way you described above as the way it should be done. It's called View Caching. And most converters don't bother because they haven't implemented all of OLE Structured Storage (or at least enough of it to be able to *use* that part).
Simon
Coming soon - pyrogyra
A bit of background:
The .doc format and the entire idea of files within files has a lot of merit, as does the concept of only dealing with content supported by one's hardware. However, given the lack of openness of the Word file formats and of COM, and the lack of security, Microsoft strikes out.
Word is a COM object and uses COM extensively. OLE was at the roots of COM but these days OLE is just another set of COM objects that one either implements or uses. So.. from the get go, one needs to implement COM, and also IStorage/IStream, on Linux, to get at Word. This would be ok if COM were an open standard, but it ain't. Where in MSDN is the VTABLE format for COM? It isn't there.
Strike 1 against Microsoft. By strike I mean that they are doing the usual evil empire thing by not opening up COM.
Philosophically, IStorage / IStream are a set of COM objects (read Libraries), for divvying up a file into its own directory mechanism. The rationale for doing this is that end users want to copy documents as entire entities, and not deal with 200 or even 2000 subdirectories or small files that might comprise a total document. In Microsoft speak, a document must be a moveable entity, and in that regard, COM library based documents are entirely defensible. However, what goes into each of those subdirectory entries, or streams, is free to remain largely undocumented. It is the design intent of COM to ensure interoperability between closed interfaces. At this, COM does stunningly well. You can script against COM in any language... but the medium for interchange is an application that you must always have in order to view the document.
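For what it's worth, the outer "file as its own directory" container can be recognized without any COM at all. Below is a minimal Python sketch that checks the 8-byte signature at the start of an OLE2 compound file and decodes a few fixed-offset header fields. The function name `parse_cfb_header` and the returned dictionary keys are my own; the offsets assume the compound-file header layout as it has been publicly documented, so treat them as an assumption rather than gospel. It only reads the header; walking the FAT and the directory to reach the actual streams is left out.

```python
import struct

# Signature that opens every OLE2 compound file (the container behind .doc).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header: bytes) -> dict:
    """Decode a few fixed-offset fields from a 512-byte compound file header."""
    if len(header) < 512:
        raise ValueError("need at least 512 bytes of header")
    if header[:8] != CFB_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Offsets 24..33 (assumed layout): minor version, major version,
    # byte-order mark, sector shift, mini-sector shift, all little-endian u16.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    # Offsets 44 and 48 (assumed layout): FAT sector count and the sector
    # number where the directory (the list of storages/streams) starts.
    fat_sectors, first_dir_sector = struct.unpack_from("<2I", header, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

Usage would be something like `parse_cfb_header(open("some.doc", "rb").read(512))`, where `some.doc` is a placeholder for any Word 97-era file. Everything interesting, of course, is in what the individual streams contain, which is exactly the part that stays undocumented.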
Strike 2 against Microsoft. COM IS an excellent piece of software engineering, but it is engineered to do the hypocritical thing. The easiest way to make things interoperable is to post the source...
Much ado has been made about Word changing file formats. The critics of Word say that it is unnecessary to change file formats between releases. This is nonsensical hogwash. New features mean new data requirements, and new data requirements mean new file formats. Every other application on the planet has versions of formats and downward compatibility problems. Have you tried looking at a style sheet page in Netscape 2.0? That Word changes file formats is reasonable.
Hit: Microsoft.
Some criticism has been made about how a Word document changes appearance based on the display or print device. This is in keeping with the philosophy of Windows - which is to enable software features only if the hardware is present to support them. This is radically different from Unix, but this hardware-centric approach of Windows IS defensible on many merits.
Hit: Microsoft.
Word has, in effect, an autoexec scripting mechanism with no sandbox and no security besides that which the user security context of the OS offers. Since Windows 98 effectively runs everyone as root, the vast majority of Windows Word users are flying blind into a cliff.
Strike Three: Microsoft.
The bottom line is this:
If Microsoft had opened the Word file format, then Word files would have been the de facto web page of the Internet, not HTML. That we are doing HTML and HTML rendering engines is testimony to how badly Microsoft missed a golden opportunity with Word. To protect their Word Processing IP, they made sure a non-Word file format (HTML) would become the lingua franca of the Internet. That by itself is a compelling argument in favor of open file formats.
This is my sig.
By now we know that both the DWG and DOC format have been reverse engineered. We also know that it really does not matter. Autodesk/MS control the data formats. Their rendering of the data is the reference implementation -- and they both change the format at will. They both exploit run-time and new version peculiarities in their rendering of the data.
When it comes time for a company to decide which product to invest in, when it's time to choose if they want to use the proprietary product or some wannabe cheap-o competitor, the answer is always the same. Go with the standard bearer. And that really is the correct answer. The price differential is completely and totally irrelevant. Corporations invest a lot more in labor and data than they invest in any one version of a software product. The "open source" factor is -- if not irrelevant -- not appreciated. It is secondary at best.
Look at IntelliCAD. They attempted to commoditize R12 AutoCAD. Supposedly nobody wanted any of the features crammed into post-R12, post-multiplatform AutoCAD. R13 was a bitter pill for AutoCAD customers and loyalists. Supposedly IntelliCAD would allow drafters/designers to draw basic 2D engineering drawings just as well as R13++ for half the price. More importantly, they thought they had given companies that had huge investments in DWG data a viable alternative -- a way out. They could jump from the ship they were supposedly dissatisfied with and seek alternatives.
But you know what? Nobody took the offer.
Not before IntelliCAD was "open source" and not after.
It turns out that Autodesk was able to pull off R14 and salvage their reputation. Turns out customers were not all that dissatisfied with Autodesk -- which they correctly saw as a well entrenched, healthy (==rich) partner, committed to investing in both AutoCAD and other forward looking design products and technologies. Turns out AutoCAD is very capable of getting the drafting job done. Besides, IntelliCAD was for shit. Still is. And when Visio sacked the original IntelliCAD development team -- a very idealistic and motivated group -- because ICAD was released prematurely with bugs and feature gaps -- any idealism or customer loyalty went out the window. ICAD was exposed for what it had become -- a cheap knock off with no future. The so-called open sourcing of IntelliCAD was just window dressing. The fact was that Visio had interred its mistake in preparation for acquisition by MS. (It also parted ways with the folks that had inspired IntelliCAD, FWIW.)
So what does this have to do with .DOC?
You could come out with a .DOC compatible word processor without a super-human effort. But without the VBA, without the quirky rendering, without all the nuances and endless litany of features of Word it would be nothing more than a knock-off. It would have to beat Word on functional terms in order to be attractive. That would be a very tall order. Like it or not, Word and AutoCAD are very mature products. Maybe they attempt to do too much. Maybe they are bloated with features that any one customer does not want or need. But a whole lot of customers are well served by these products. They get the job done for a broad spectrum of customers.
They are both going to be very, very hard to dislodge.
It's their game to lose.
Beating them on the merits will be damned hard, and possibly not enough.
And, just to goad anyone still reading, being "open source" or not has nothing to do with it.
If open source is a strategic advantage, it will have to do with stamina and longevity. Eventually MS/Autodesk will find it hard to keep milking their cash cows. Eventually they will find it harder and harder to justify continued investment in these products. Eventually the WinX platforms both products are married to will fade. At that point, when Word and AutoCAD stagnate, they may be vulnerable to an open source community that can run endlessly on no cash, that can build bridges to newer, more current technologies.
I'm not holding my breath.
In fact, I've changed jobs to get out of the CAD industry. The action is elsewhere. I may not live long enough to see AutoCAD take a fall. It may never happen.
PS: In the CAD space, the most interesting open source activity is not IntelliCAD. The Matra folks have a more interesting offering. IntelliCAD is a corpse. OpenDWG may prove useful if and when the action moves beyond AutoCAD. If that future is to involve open source, it will more likely be centered on Matra than OpenDWG.
I myself developed a syntax directed editor in 1985 called ALICE -- see this page to download it for DOS or Linux -- which still 15 years later does more than Intellisense.
There are some MS innovations but this is also 20 year old stuff.
Has it been over a year since you last donated to the Electronic Frontier Foundation?
It's GPLed; granted, it needs work. So scoot onto the abiword mailing list and cvs down the latest version, get hacking on it and sort it out.
ole2 is fully sorted out with libole2, excel is being handled by gnumeric.
What is not handled by wv is not by lack of documentation or design, it's simply a matter of spending some time at it. Easy peasy. Info on the MSDN docs can be got from here. They can be gotten off the MSDN 1998 July cd, or you can get some of them from wotsit.org. I even wrote ivt2html for you to convert the office.ivt file into html. Like what else do you need.
90% of all the hard work has been done. wv can parse fast and simply with no bother, which was a nightmare to get working. It can construct the correct PAP (paragraph properties) and CHP (character properties) for a given run of text, and feed you the correct characters, charset and font, the TAP (table properties), graphic properties and handles to graphics, the correct OLE handle for embedded objects, document properties, etc. There is an example html conversion program included for reference (wvHtml).
I put together libwmf to convert wmf files into something useful as well. There's a half done implementation of an Escher (the graphics for Office) importer floating around in there as well.
There's also an implementation of a Summary Stream displayer for all ole2 documents.
I even bust my ass and dragged together the right bunch of motivated people to help implement the decryption module for word 97, 95 and 6, and that was not fun at all to say the least.
The hard work is done; if you want something improved you have a very very solid base to work from. Yes the spec is confusing, yes it's not a great format, yeah it sort of moves over time, but in a fairly rational way that can be supported with some work. There are any number of equally crap formats with weak documentation supported in various tools.
There is just this false myth that the Microsoft formats are impenetrable and/or not available. Just download wv. Fair enough, there might be problem documents; if there are, just debug wv and get onto the abiword list and work it out with them. If something fails it can be fixed and improved, it's not a case of "ah well, it's a MS format, nothing can be done". If you truly want to handle Microsoft formats there are a number of people working on it that you can help.
So it's right there for the right bunch of motivated people to work on. C.
I sometimes write stuff
This is the exact kind of attitude that should turn people away from MS. Why? Because it is Bill Gates's explicit goal (and he goes on TV to say this) that MS wants to bring computing to the masses.
Pretend you are him, and you want to achieve this goal. What means should you use? Closed file formats with lousy specifications? How does that bring computing to the masses when they are prevented from speaking to the Unix Priesthood?
If you, as a MS lackey and worshipper, believe that this is not MS's responsibility, then please go take it up with Bill, your prophet. He has stated publicly and many times that this is his goal. Remind him that MS's duty is to the stockholders and they should make as much money as possible. Please tell him that, and also tell him to STOP LYING to the American public.
"Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"
Because the formats suck...?
It's 10 PM. Do you know if you're un-American?