Slashdot Mirror


Dark Corners of the OpenXML Standard

Standard Disclaimer writes "Most here on Slashdot know that Microsoft released its OpenXML specification to counter ODF and to help preserve its market position, but most people probably aren't aware of all the interesting legacy code the OpenXML specification has brought to light. This article by Rob Weir details many of the crazy legacy features in the dark corners of OpenXML. As it concludes after analyzing specification requirements like suppressTopSpacingWP, 'so not only must an interoperable OOXML implementation first acquire and reverse-engineer a 14-year old version of Microsoft Word, it must also do the same thing with a 16-year old version of WordPerfect.'"

17 of 250 comments

  1. Backwards compatibility by Bob54321 · · Score: 3, Interesting

    I thought most people considered themselves lucky if their documents could open in successive versions of Office. Why would anyone want to implement support for really old versions if Microsoft does not do it themselves?

    --
    :(){ :|:& };:
  2. Re:The author is exactly right. by Zaiff Urgulbunger · · Score: 2, Interesting

    Totally agree. I wonder how it managed to get approved by ECMA? IIRC only IBM didn't agree to its approval; all other parties (whoever they are) agreed. I don't understand what they felt was good about this "standard" especially given that ODF had already been approved.

  3. Don't forget the page counts... by Anonymous Coward · · Score: 5, Interesting


    ODF spec page count: 722.

    OpenXML spec page count: 6000 !!

  4. Re:Basically by Nicopa · · Score: 4, Interesting

    No. ODF has several real, factual benefits. It may have originated in a single product, but it reuses existing standard technologies (SVG, CSS...). It has properly designed XML tags that act as "markup", whereas in OpenXML the tags act as containers for chunks of data. ODF tries to separate content from style.

    And about your RTF suggestion... can I draw diagrams with RTF? Can I have a ToC? Can I do complex styling? Can I have a "gallery" of styles? Can I include images? No. RTF is not a solution.

  5. Forbidden partial implementation? by tepples · · Score: 4, Interesting

    OOXML is just as open as ODF

    The behavior of years-old proprietary word processing software is included by reference into OOXML. How is any spec that includes by reference the behavior of proprietary software exactly "open"? True, implementors could produce a partial implementation of the spec that degrades away the legacy baggage (more or less) gracefully, but some standards' patent licensors forbid implementors to publish a partial implementation. I don't know if this applies to OOXML's license.

  6. Re:Basically by megabyte405 · · Score: 2, Interesting

    Actually, I think for most of the things you suggest, you can do them - I know AbiWord supports them at least. (images, complex styles, TOC) RTF's really not the old dog it seems to be - keep in mind that for copy/paste of any sort of rich text to work in any sensible manner on Windows, one _must_ support RTF well.

    --
    I recognize people by their sigs. Is that a bad thing?
  7. Disadvantages of ISO by BillGatesLoveChild · · Score: 5, Interesting

    Once it is ratified as an ISO Standard, the standard is locked up and anyone who wants a copy has to buy it from ISO. These are copyrighted. They're not cheap; thousands of dollars. Out of the reach of the average hobbyist, and not listed anywhere on the Internet. That 6,000 page draft will vanish into the mists of time.

    Larger Companies can afford this, but garage companies and hobbyists definitely can't. So what's the chance of an open source or even small upstart challenging Microsoft's Documentonopoly? Zero.

    Want another example? ISO country codes. The country codes (e.g. .us, .jp) are actually ISO, and ISO ended up backing off on a demand for royalties for this(!) But if you want state codes (e.g. California, Kantou), well, forget it unless you want to buy them off ISO. http://www.alvestrand.no/pipermail/ietf-languages/2003-September/001472.html

    ISO aren't the only ones guilty of doing this. IEEE do it as well. Want the latest simulation standard? Then get out your checkbook: http://standards.ieee.org/catalog/olis/compsim.html

    ISO and the IEEE are enemies of openness. Microsoft is taking a page out of their gamebook.

    ISO or IEEE certification is a *bad* thing.

  8. MS are slow learners by WebCowboy · · Score: 3, Interesting
    ...but they do learn....slowly...eventually.

    Their "open" XML format for office docs is a prime example of this.

    I think Steve Jobs was the one who first said "Microsoft just doesn't get it". Microsoft was probably the very first third-party software developer for the Mac and this was Jobs' reaction to Microsoft's first Mac applications (I think a port of Multiplan--which was re-incarnated into Excel IIRC, and MSBasic). They really WERE "tasteless", ugly and took almost no advantage of the revolutionary GUI interface--their DOSness really showed through--I think in the case of Multiplan the mouse could be used only to jump the cursor to a certain cell and that was it--the rest was all like in DOS.

    MS Windows is another example--Microsoft didn't "get it" well enough until the third major release. Now MS is SLOWLY "getting it" with the beneficial characteristics of XML standards. Microsoft's early XML efforts are like Windows 1.0--there is some very rudimentary understanding of the mechanics but not the philosophy of XML, and I wonder if this is why SOAP ended up NOT so simple (given Microsofties were involved in its creation and seemed to be trying to make it a DCOM-in-XML-but-dumber thing). Microsoft's "Version1" XML might look like this:

    <Soap:Envelope>
      <Soap:Body>
        <wsWriteLegacyData>
          <encodedBinaryData>
    SDFgkdfkljSDFJLDFSJKLkjdfbks df jklsdfklj;hk/jkjnb.kndf
    jk.sdfjkldfsddfsdfkkjsdfh kvbkjnkjkjksdfkjsdfkeuieru903
    oijooeoefvkmefmklef lmkseflkvfeklmlmermklemleflmdvldflk
          </encodedBinaryData>
        </wsWriteLegacyData>
      </Soap:Body>
    </Soap:Envelope>
    "See? We're using XML and SOAP! We're hip! We're cooool! You can't say we don't play by the rules now!"

    Of course, this is an obtuse, opaque and obfuscated way to use XML and totally NOT in the spirit of interoperability and openness. I won't even go into the nifty XML tools MS has made...nifty to use but they've done a lot to obliterate the S out of SOAP in their crazy output.

    OOXML (Opaque and Obfuscated XML) standard is "version 2.0"--they're doing their best to eliminate ambiguity but now we've gone over to hyper-specificity, and the standard is being shared a bit better...problem is that they don't fully describe the interpretation of the standard elements so as to keep its advantage. All they've done is taken every formatting option and mapped it to an XML element--it is monolithic and completely non-extensible. But hey, at least it's publicly available and doesn't involve weirdness like encoded-binary-blobs.

    In a few years MS will reach version 3.0 of "getting" XML...
  9. the real hitch - it never was clear by Erris · · Score: 4, Interesting

    The hitch here is that *not* having them means tons and tons of reverse engineering, and that's only after tracking down every release of every version of every MS Office ever.

    The real hitch, as the article hints, is that the releases are contradictory. For instance, the Mac version of small caps is different from others. This is part of the reason Word is so bloated and does not preserve printing type setting from one machine to the next.

    Ten years ago, a state agency I was working for was forced to move from Word Perfect to Word. Hundreds, if not thousands, of documents were painstakingly converted from one format to the other. The typesetting, which they had never had a problem with previously, was easily broken by moves from one machine to the other or by changing printers. That is the kind of thing that no program can account for - it was broken from then and can not be created correctly today. It's also probably the reason for all of the nebulous "guidance" sections that don't tell you anything other than to look at, and presumably measure, old printed examples. Not even M$ knows what it was really doing in the field. As I saw at the time, no two were alike.

    Of course, the time to get things right is not in your XML; it's when you import the document. The author tells us this in so many words. The XML should be general enough to encompass any kind of typesetting. It is the importing program's task to figure out what the old format wanted things to look like. As the author points out, the spec does not do anything other than create something impossible to follow. It's not going to magically make things look right no matter how hard they wish it would.
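
    A rough sketch of that idea in Python, under stated assumptions: the importer for a legacy format resolves its quirks once, at import time, into general layout properties, so the stored document carries no application-specific compatibility flags. The `Paragraph` fields, the `suppress_top_spacing` flag, and the rule it applies are all invented for illustration; none of them comes from the OOXML or ODF specs.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    space_before_pt: float  # explicit, application-neutral layout property


def import_legacy_paragraph(text, space_before_pt, at_page_top, suppress_top_spacing):
    """Resolve a (hypothetical) legacy quirk while importing.

    Assumed rule: when the legacy flag is set, a paragraph at the top of a
    page loses its leading space. The result is stored as a plain number,
    not as a "behave like application X" flag.
    """
    if suppress_top_spacing and at_page_top:
        space_before_pt = 0.0
    return Paragraph(text=text, space_before_pt=space_before_pt)


para = import_legacy_paragraph(
    "Chapter 1", 12.0, at_page_top=True, suppress_top_spacing=True
)
print(para)
```

    The point of the design is only that the quirk lives in the importer, where the knowledge of the old program belongs, rather than in the file format.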

    --
    DMCA, Hollings, Palladium. What might have sounded like paranoia is now common sense.
  10. Blind leading the blind by frisket · · Score: 2, Interesting
    It's instructive to observe the panic-ridden frenzy with which Microsoft have approached the business of using XML as a file format. The marketing influence is all too plain to see, with the result that they feel an inner compulsion to preserve the appearance of the document at all costs, sacrificing all logic and common-sense to do it.

    OOo did the same, but with greater elegance and less haste because they were ahead of the field. Corel screwed it up with WordPerfect by keeping their stylesheet format proprietary so that transfer between WP document code and XML was made as hard as possible (a Class A blunder, given that their XML editor is actually quite good). AbiWord makes a good job of saving DocBook XML, but it's not trying to pretend it's reimportable; it screws up LaTeX formidably, though, by trying to pretend that it absolutely has to preserve line-length and font-size, which is evidence of the same neurotic attitude as Microsoft.

    The problem in all cases is not that the assorted authors and coders don't understand XML (although some of them clearly failed that test too), but that they don't understand documents. This is particularly true at Microsoft, where leaders such as Jean Paoli have been proselytizing XML for years. They still think a document is a jumble of letters; they have no idea of structure, and the DOM is simply laughable as a non-model of a document. Microsoft's particular problem with XML is that they came to it too late, and viewed it as a way of storing data, not text...indeed to this day many XML users, trained with Microsoft blinkers on, are unaware that XML can be used for normal text documents.

    With this level of ignorance surrounding Microsoft, it's hardly unexpected that they should blunder so badly.

  11. Re:Then get the customer to supply it by Anonymous Coward · · Score: 1, Interesting
    They have it so why not let you see what you're supposed to be working to - doesn't cost them anything

    Because they won't be inconvenienced by 10 competing vendors fighting over who gets to study their copy of the standards. Instead, they'll just pick a vendor who has their own copy.

  12. Guillaume Portes = Bill Gates by tendays · · Score: 2, Interesting

    I don't know how many of you noticed: The fictional name "Guillaume Portes" is actually a literal translation of "Bill Gates" in French ...

  13. OOXML's Origin Is Not The Problem by NickFortune · · Score: 3, Interesting
    ODF is a nice idea in theory, but really, it's a similar situation (OpenOffice.Org internal dataformat jammed into a standard, so designed with OO.o in mind by necessity)
    The ODF format must necessarily describe the structure and layout of an office document. There's no need for it to reflect the internal data structures of any specific application, except to the extent that they too describe office documents.

    OOXML includes data elements that should be part of internal import routines rather than being enshrined in the document format, and it includes elements that are not specified except by reference to applications for which no public specs exist. This is the problem, not the fact that OOXML is derived from MS Office file formats.

    RTF. It may not get press attention, but it's actually a fairly well-documented standard, has been working as an interchange format for years, and yet is designed with enough expandability that it's still useful with the kinds of documents produced today. It's a true de-facto standard.
    Well, I was a big fan of RTF at one time. But a few years back I found that documents with any kind of formatting more complex than paragraph+justification+font just weren't working between MS Office and back. I don't know if this was because the format couldn't cope, or because of faulty implementations. In either case, it led me to give up on RTF.

    In any event, to be a replacement, RTF would need to work for spreadsheets and presentations at a minimum - something I don't think there's a lot of support for in the current RTF specification. We'd also lose the benefits of an XML based format, which given the amount of work on the seamless integration of XML documents into databases, web services and other data management applications means losing a lot of functionality.

    for those who really want interoperability, RTF is the way to go with today's software
    Interoperability is only part of the problem. We also want a spec that can be fully and freely implemented by anyone, which isn't under the control of any single vendor. We want a format to which we can entrust documents, knowing that in twenty years' time there will be an application capable of reading them.

    an unnecessary dichotomy is drawn between OpenXML and ODF with regard to their design goals - both are repurposed native formats for a single application.
    I don't know what you mean by native in this case, but the repurposing of OOXML isn't the problem. It's one of size and obfuscation, and, as TFA points out, specification by reference to closed formats and the behaviour of extinct proprietary software. These are non-trivial problems with OOXML which are not (to the best of my knowledge) found in ODF.

    There's nothing wrong with ODF. Re-creating it based on the non-XML RTF would be a waste of time and effort.

    --
    Don't let THEM immanentize the Eschaton!
  14. Re:MIcrosoft sucks. by Gr8Apes · · Score: 2, Interesting

    They did. It was "resolved" by being disallowed. There is no per machine "MS tax" anymore. That fell victim in the IBM suit, I believe. (So many suits, so long ago, so many beers.... ahh - that explains it!)

    --
    The cesspool just got a check and balance.
  15. Re:Basically by megabyte405 · · Score: 2, Interesting

    Well, AbiWord serializes its internal data structure into XML, so it's not an exact dump - it lets us do things like have backward-compatible additions such as LaTeX and MathML equations and include an image preview of the equation as a fallback, for instance. There are things you can do to make your internal format more lucid, and binary->text is one of those things: I can fix almost anything that can go wrong with an AbiWord doc (usually only happens in dev releases, but sometimes strange things happen) with Notepad.

    (And I would say that as long as it's well-documented and in a useful manner, if you're just using it for internal/non-archival data storage and need a lot of speed, using the internal structure would make sense.)

    --
    I recognize people by their sigs. Is that a bad thing?
  16. Re:The power of legacy systems... by clodney · · Score: 2, Interesting

    I suspect that Microsoft has near zero dev docs to work from, at least when it comes to emulating Word 95 or WP 5.1. In the early days MS was known as a very freewheeling culture, and internal docs were very rare.

    Even more to the point, the guidance sections essentially say "the existing implementations are buggy and we can't actually describe the precise behavior for arbitrary input". In the case of previous versions of Word they probably just have copies of the old modules that they feed the data to. For WordPerfect emulation I bet someone just did a black box reverse engineering that *mostly* works. While it may be documented, the rules are going to seem completely arbitrary.

    I am often faced with this kind of situation at work. The requirements call for significant enhancement to a feature, and it seems clear that the best way to implement it is to write the code from scratch. But the feature still has to maintain backwards compatibility. So you start studying the code, and find that it has subtle bugs that you now have to recreate in a new implementation.
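
    One common way to cope with this, sketched in Python with made-up functions: before rewriting, record what the old code actually does on sample inputs (bugs included) and require the new code to match. `legacy_word_count` and its quirk are invented purely for illustration and have nothing to do with Word or WordPerfect.

```python
def legacy_word_count(text):
    # Hypothetical old implementation with a quirk: splitting on a single
    # space means consecutive spaces produce empty "words" that get counted.
    return len(text.split(" ")) if text else 0


def new_word_count(text):
    # Rewrite that has to stay bug-compatible, so it reproduces the quirk
    # on purpose: one "word" per space, plus one.
    if not text:
        return 0
    return text.count(" ") + 1


SAMPLES = ["", "one", "two words", "double  space", " leading", "trailing "]


def mismatches(old, new, samples):
    """Return (input, old_result, new_result) for every disagreement."""
    return [(s, old(s), new(s)) for s in samples if old(s) != new(s)]


print(mismatches(legacy_word_count, new_word_count, SAMPLES))
```

    An empty list means the rewrite agrees with the old behavior on every sample, which says nothing about inputs nobody thought to record.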

  17. Looks the same--who cares! by Anonymous Coward · · Score: 2, Interesting
    'so not only must an interoperable OOXML implementation first acquire and reverse-engineer a 14-year old version of Microsoft Word, it must also do the same thing with a 16-year old version of WordPerfect.'
    Someone needs to tell every developer of word processing and page layout software on the planet to abandon the 'must look the same' obsession described by the above. Why worry about making content in application B look like content in Application A? I create books out of Word files submitted by several people. The last thing I want is all the inconsistent formatting from each of them to control a book's look.

    Named styles is the answer. If a paragraph is body text, call it that. If it's an inset quote, call it a quote. If a term is in italics, label it as italicized style not Times Italic 12 point. But don't get all hung up in the distinctions between Times Roman and Times New Roman. The purpose of XML is to define what something is. Not what someone thought it ought to look like on Tuesday three weeks ago.
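
    A minimal Python sketch of the named-styles idea, assuming a toy document model: content is tagged with what it is, and each application or book design supplies its own mapping from style names to appearance. The style names, fonts, and the `render` helper are illustrative assumptions, not any real word processor's API.

```python
# Content tagged by role, with no fonts or point sizes attached.
document = [
    ("body", "It was a dark and stormy night."),
    ("quote", "Call me Ishmael."),
    ("emphasis", "Really."),
]

# The look lives in a stylesheet owned by the rendering application.
book_styles = {
    "body": {"font": "Garamond", "size_pt": 11, "italic": False},
    "quote": {"font": "Garamond", "size_pt": 10, "italic": False},
    "emphasis": {"font": "Garamond", "size_pt": 11, "italic": True},
}


def render(doc, styles):
    """Describe how each chunk would be set under the given stylesheet."""
    lines = []
    for style_name, text in doc:
        style = styles[style_name]
        italic = " italic" if style["italic"] else ""
        lines.append(f"[{style['font']} {style['size_pt']}pt{italic}] {text}")
    return lines


for line in render(document, book_styles):
    print(line)
```

    Swapping in a different stylesheet changes the look of every contributor's text at once, without touching the content.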

    Ditto transfers between applications. Why is there so much effort devoted to importing every little odd quirk of Word into InDesign as if the quirk mattered. Bring in the text tagged with what it is and let InDesign determine what it looks like. InDesign is far more powerful and predictable than Word anyway.

    It is, of course, to Microsoft's advantage for everyone to define RTF and now OpenXML as the "standard" and obsess over the sort of things described above. But there's no sane reason for this obsession to exist. If you want a text to always look the same, use PDF. If you want a document to look good, make it look good in the application you're using. Don't try to make that application retain the "sorta-looks-ok" feel of another application. That's too much work for too little result. It's why all too many ordinary users shrug their shoulders and buy Word rather than hassle with import quirkiness.