Slashdot Mirror


Why Can't We Reverse Engineer .DOC?

DanPeng asks: "It looks like Autodesk has been pulling the same kind of proprietary file-format monopoly tactics with AutoCAD that Microsoft has been pulling with Office. The difference between Office and AutoCAD, however, is that an organization, the OpenDWG Alliance, has been formed by competing companies to reverse-engineer the AutoCAD DWG format. With the amount of funding that it gets, it is actually quite functional and successful, with millions of users. Even when Autodesk revised the format for AutoCAD 2000, the OpenDWG Alliance fully reverse-engineered it within eight weeks. Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"

Good question.

I wonder if it has something to do with the mentality of the players involved. I don't think Sun, Corel or Lotus ever thought that they might be able to get together so that they could compete in the Office market; I think they all looked to carve out pieces of the market with their own suites, making such collaboration impossible. Despite popular misperception, Applix does not convert DOC, it converts RTF (which may be close enough for some people). Star Office is striving toward this holy grail, but they aren't quite there yet. So maybe it's not too late for folks to pool resources and finally get the job done. In fact, with the eyes of the court on Microsoft, now might be the perfect time.

On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could a file format that deals with text and its presentation really be that much more difficult to reverse engineer? I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed, but I wouldn't say that it's impossible ... not for three big corporations, nor for thousands of loosely organized coders. It's one thing to have control of a file format, but it's another to be put into the position of having to change the format constantly in order to stay in the game. If Microsoft is placed in this situation, the onus would be on them to either concede the format until the next major release is made, or shorten the upgrade cycle on Office. How many businesses would stick with an office suite which forced users to upgrade every eight weeks just to remain compatible? If something like this were to happen, we might finally be able to put a dent in the ever-present Office monopoly.

So why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format. Have we tried, or has Microsoft's reputation, both professionally and legally, kept people from really thinking about it?

11 of 337 comments

  1. .DOC not exactly proprietary by Matts · · Score: 5
    I don't know why this myth continues to propagate:
    • .DOC is an OLE Document
    • OLE Document parsers are available for most platforms. There's even one for Perl
    • The .DOC format is documented on the MSDN CDs - where else would you expect this documentation to appear?
    • So no reverse engineering is needed. Just follow the spec
    What truth remains is that the doc format changes from release to release of MS Word. So developers have to track these changes. The format is also large and complex, so it's remained fairly niche in the open source world.
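
    To make the "it's just an OLE Document" point concrete, here is a minimal sketch in Python that reads a few fixed-offset fields from the container header, assuming the standard 512-byte compound file header layout. The function name parse_ole_header and the dictionary keys are made up for illustration. It only identifies the container; it says nothing about the Word streams stored inside it.

```python
import struct

# Magic bytes that open every OLE compound file (the container used by .DOC).
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_ole_header(data: bytes) -> dict:
    """Read a few fixed-offset fields from the 512-byte compound file header.

    Hypothetical helper for illustration; it does not walk the FAT or the
    directory, so it cannot list or extract streams.
    """
    if len(data) < 512:
        raise ValueError("an OLE compound file header is 512 bytes long")
    if data[:8] != OLE_SIGNATURE:
        raise ValueError("not an OLE compound file")
    # Offsets 24..33: minor version, major version, byte order mark,
    # sector shift, mini sector shift (all little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offsets 44 and 48: number of FAT sectors, first directory sector.
    fat_sectors, first_dir_sector = struct.unpack_from("<2I", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_directory_sector": first_dir_sector,
    }
```

    Feeding it the first 512 bytes of a file is enough to tell whether the file is a compound document at all, which is the easy part of the job.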
    --

    Matt. Want XML + Apache + Stylesheets? Get AxKit.
    1. Re:.DOC not exactly proprietary by martin-k · · Score: 5
      Close but no cigar.

      1. Physically reading a storage file is not the problem. Making sense out of the streams in the file much more so ...

      2. The Word 97 documentation *was* on the MSDN CDs. Microsoft pulled it about two years ago. (So much for keeping hundreds of old MSDN CDs around ...)

      3. The Word 2000 additions have never been documented in public.

      4. The MSDN documentation is vague and sometimes plain wrong.

      You get about 85% of a Word converter from coding along the Microsoft docs. It's the remaining 15% that's the hard thing.

      -Martin

  2. Uh duh by TummyX · · Score: 5

    DOC isn't a difficult file format. It's pretty well documented in various places around the web.

    The thing is DOC is a compound file format. Meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file mean, it'll just pass them on to Visio, Excel, Photoshop etc to read and understand.

    DOC is a hugely extensible file format, and you can't support everything DOC can, because DOC can theoretically support just about anything...especially Windows applications.
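
    A rough sketch of that hand-off idea in Python, under the assumption that a reader has already split the file into named streams. The function load_streams, the handler registry, and the embedded stream name used in the usage note are all hypothetical; this illustrates the design, not how Word itself is implemented.

```python
from typing import Callable, Dict, Tuple

# Hypothetical handler type: takes a stream's raw bytes, returns decoded text.
Handler = Callable[[bytes], str]


def load_streams(
    streams: Dict[str, bytes], handlers: Dict[str, Handler]
) -> Tuple[Dict[str, str], Dict[str, bytes]]:
    """Decode the streams we have a handler for and keep the rest opaque."""
    understood: Dict[str, str] = {}
    opaque: Dict[str, bytes] = {}
    for name, payload in streams.items():
        handler = handlers.get(name)
        if handler is None:
            # No component registered for this stream: keep the bytes
            # untouched so they could be written back unchanged.
            opaque[name] = payload
        else:
            understood[name] = handler(payload)
    return understood, opaque
```

    Called with streams such as {"WordDocument": b"...", "EmbeddedDrawing": b"..."} and a handler only for "WordDocument", the drawing ends up in the opaque dictionary. A reader on a platform without the embedding application is in exactly that position.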

    And no that was not done through evil intent. Believe it or not, integration of applications is very much something that good software engineers strive for.

    If you have a problem with it, just wait a few years (or maybe a decade) for KOffice etc to mature, and watch people complain as documents created on the Linux version of KOffice won't work because someone decided to embed in their document some Python code, or an xpaint image.

  3. A Site About File Formats by ekmo · · Score: 5
    --

    | Ceci n'est pas une pipe.
  4. Re:Ok, here we go again... by nagora · · Score: 5
    Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?

    Well I can't imagine why but Microsoft, on the other hand, has a strong profit motive. Once the file format changes, as it does every year (or faster) people start getting emails with the new format in attachments. If they could just use a filter then they wouldn't have to upgrade from Word 6 or whatever was the last version that actually offered them new features they needed.

    An obfusticated format means that filters are hard to write so such people are forced to upgrade which == cash for Bill. In fact, according to M$ this is their single biggest source of revenue.

    I guess Microsoft will go out of its way next to obfusticate their source code to make it more difficult for the OSS community to read their source?

    Undoubtedly, if they're ever forced to release it. In fact, since you mention it, the release of the source code would be useful almost exclusively for the .h files with the data structures in them. Frankly, who gives a damn about the rest of the code? I can write my own bugs, thanks.

    DOC isn't going to be very important in a few years anyway, Microsoft are moving to XML based everything.

    Which means that at some point they'll start changing the definition of XML to close out competitors. They've always taken this approach, why do you think they won't this time?

    When a twit like you starts defending M$ the question I always want to ask is "If they're not a pack of shits why do they bribe, threaten, steal and lie? Do you think it's some sort of hobby?"

    TWW

    --
    "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
  5. Oh, sure, it's "documented" and "open" by 1010011010 · · Score: 4

    All you have to do is implement large portions of Windows, COM and Windows Apps to make it work. It uses OLE Structured Storage. OLE (COM/ActiveX) is a Windows thing. To make OLE Structured Storage work on other OSes, you have to make COM available, and use it to read and write the doc. Microsoft did this for the Macintosh, for example.

    So, to properly read and write .doc files, you either have to:

    1) run Windows and Word
    2) run MacOS and Word
    3) port COM to another OS and write a Word-alike

    Yummy. Anyone written COM for Linux lately? TummyX's "it's open, it's open, stop whining" aside, .DOC is not open because the technology it depends on is not open. I'm sure the fellow who wrote a Word viewer in his C programming course did it on Windows, where COM and other Windows APIs are available.
    If he did it on Unix or BeOS or something, he should speak up.

    Open file formats are important for interoperability and choice. Non-open ones are important for limiting choice and maintaining control. Knowledge shared is power lost, as Aleister Crowley said.

    --
    Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
    1. Re:Oh, sure, it's "documented" and "open" by 1010011010 · · Score: 4
      TummyX wrote:

      Do you even know what COM is?

      Yes. It's Microsoft's Component Object Model. A formalized descendant of Object Linking and Embedding, which was originally a method of making compound documents with Word and Excel.

      .DOC is an OLE Structured Storage format which can store data streams meant for other programs, like Visio. Those programs also do not have open formats.

      The practice of passing around Word documents in email because "everyone must be able to read them, right?" is a problem. If someone sends you a document in their favorite proprietary format, you should send them back a document in your favorite proprietary format. Maybe then people will start to understand the need for open, well-documented formats.


      I usually insert Visio diagrams in my Word documents, and I certainly don't expect to be able to edit those diagrams when I open it up at university with StarOffice.


      And isn't that a tragedy.

  6. Why can't we reverse engineer HTML? by Rilke · · Score: 5

    The analogy is actually more apt than you'd think.

    The .doc file format is fairly well documented, as these things go, although there are some proprietary aspects, like the VBA streams. It's not that tough to open up a Word doc in your own program and parse the file correctly.

    The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.

    Parsing and display are two different things. While half a million apps can parse HTML, no two of them seem to display it in quite the same way. The question here is a bit like pointing out that no browser displays things like (IE|Netscape). Well, no they don't, but that has nothing to do with an inability to reverse engineer the file format.
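
    A toy illustration of the parse-versus-display gap, in Python. It assumes a stand-in "renderer" that does nothing but wrap text to a fixed number of characters per line; the function lay_out and the two widths are invented for the example and have nothing to do with Word's real layout engine.

```python
import textwrap


def lay_out(text: str, chars_per_line: int) -> list:
    """Stand-in renderer: break already-parsed text into display lines."""
    return textwrap.wrap(text, width=chars_per_line)


# The same parsed content, handed to two "renderers" whose only difference
# is how many characters fit on a line (think: different printer metrics).
PARSED_TEXT = "The quick brown fox jumps over the lazy dog"
layout_a = lay_out(PARSED_TEXT, 20)
layout_b = lay_out(PARSED_TEXT, 30)

for label, layout in (("A", layout_a), ("B", layout_b)):
    print(f"renderer {label}: {len(layout)} lines")
    for line in layout:
        print("   ", line)
```

    Both renderers received identical, correctly parsed input, yet the line breaks land in different places. Matching another program's page breaks means matching its layout decisions, and those are not stored in the file.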

  7. Re:Ok, here we go again... by Darchmare · · Score: 4

    Yes, but doesn't that require that you own the Latest And Greatest (*cough*) version of Word?

    I think the point is that you have to pay Microsoft the full price of the office suite for the 'privilege' of using newer document formats. That effectively limits the life of your software purchase so that you have to buy a completely new copy whenever there is a document format change - at that point, why not just use it as your primary version?

    THAT is where the rub lies - at that point, you start sending out copies that can only be read in the newer version, and your colleagues begin upgrading as well. It's an endless cycle.

    People want to break that cycle so that they can either use a competing Office program based only on its merits or stick with a previous version which they feel was better than the next version (i.e. Mac users who upgraded to Word 6, but wish they had stuck with the previous version).

    I believe this is a worthy goal.

    - Jeff A. Campbell
    - VelociNews (http://www.velocinews.com)

    --

    - Jeff
  8. Well, I'll be darned.... by ZoneGray · · Score: 4

    At one time, you could download the specs for the binary file format. Now, according to:

    http://support.microsoft.com/support/kb/articles/Q211/6/41.ASP

    You need to write to an e-mail address and explain why you want it. It also says that the formats for earlier versions of Word are no longer available.

    For what it's worth.

  9. -1 (Offtopic) by Chops · · Score: 5
    Do you think MS is the only multi-million dollar business to lie and cheat? I've got news for you. THEY ALL DO. However, MS does it to enforce a monopoly, while other companies do it to try to get a monopoly. That's why it's wrong. The problem is once you get to being a monopoly you have to stop doing all the things that got you there. But don't talk about MS like they are so much worse than other companies. They aren't. They are just the biggest, and most documented.

    Right you are, sir! In today's "free" market, there are a slew of businesses which wield monopoly power, but which don't want you to know about it. Consider:

    Cisco Systems has a market value comparable to Microsoft's, and has even exceeded it at times, by maintaining a total stranglehold on the network hardware market. Although they would have us believe that Cisco's strategy is "providing a reliable, top-quality product and good support," a number of internal memos have recently been leaked indicating that Cisco plans to start including support for the "upgraded" IPv6 "extension," putting them in a position to use the "embrace and extend" strategy to leverage their large market share into an almost total monopoly on the Internet's physical infrastructure.

    The Lego corporation has a long history of introducing new block designs which render the old blocks almost totally useless from an aesthetic perspective. "I spent all my lawn-mowing money on the medieval set," said a sniffling little boy who asked not to be identified, "but then the Technics came out, and all my spears and stuff wouldn't fit anywhere on the walking robot I built unless I mixed those brown spear-holder blocks in, and then my robot looks yucky." He also pointed out, as is well known, that Lego has broken Technics color-compatibility with their new Mindstorm upgrade, by switching red dye #5 for #8, and yellow #2 for #7. Alas, the legal hassles that await anyone foolish enough to reverse-engineer Lego's proprietary block-connection protocols have ensured that Lego has reigned unchallenged as the only source for toys you can build cool shit with, despite their inferior product. The "accidental" death of Abe Fromage and the subsequent collapse of Tinkertoys spelt the end of competition, even before Lego started blatantly cloning "CPU" and "robotics" technology from the computer industry for use in their "innovative" Mindstorm toys.

    Furthermore, Red Lobster, Denny's, and other chain/corporation/restaurant/franchise establishments regularly use unconscionable terms in the dining agreements they make with their patrons. As large corporations, they play from a position of strength: With their high-priced lawyers and large bankrolls, they can freely impose their will on the consumer (commonly by the use of so-called "walk-through" agreements: the restaurant posts its dining agreement on its wall, and you are considered to have "agreed" simply by choosing to dine there, regardless of whether you have read or even noticed the sign). Examples of this include:

    • "Shirt and shoes required" -- usually extended at the whim of the management to cover any situation that might cut into their bottom line. You must keep your shirt buttoned, shoes and feet off the table, wear pants (although it says nothing of this in the dining agreement), and wear all clothing "correctly" (again, at the whim of the management) -- even if you're wearing shoes, placing your socks on your ears will earn you a quick ticket to the street.
    • Even though you have paid in full for the meal, none of it is "yours" to do with as you see fit -- only licensed to you. You cannot throw your potato. You cannot hold a puppet show with your broccoli. You cannot gargle anything. And don't even think about trying to take "your" plate, ashtray, silverware, or table out the door with you -- if you read the fine print, you'll find that these items were only "licensed" to you for the duration of the meal!

    It is sad, but the powermongering megacorporations who really run our country also have merciless teams of wedgie-men and noogie-goons at their command, and they have bamboozled the media and the government into abusing Microsoft to benefit their own bottom line. What with communistic government interference, backlash from the misinformed public, and the software piracy that is rampant in today's industry, Microsoft can barely stay afloat, let alone research more of the innovative, professionally engineered products the software community has come to expect from them, like Microsoft Bob, the dancing Office paper clip, and email clients that do it all at the click of a mouse! Yay Microsoft! Go Bill! One world, one web, one program!