Slashdot Mirror


Office 2003 and XML

zachlipton writes "Internet World is reporting that initial reports from Office 2003 beta testers don't look good for those hoping to share documents with non-MS systems using the XML file format. Gary Edwards, the OpenOffice.org representative for the OASIS XML file-format group, is quoted as saying "although it's still early in the review process, it does look as though XP XML has been so seriously crippled as to be useless to anyone but the big content management and collaboration system providers." Apparently, all formatting and presentation information is removed from the XML. Furthermore, Office's new collaboration features will only work with users who are also running Office 2003 (requiring Windows 2000 or 2003) that are connecting over XP servers." So Microsoft will continue its efforts to lock in users with proprietary formats, and hopefully the rest of the world will produce an XML standard document format without them.

10 of 502 comments

  1. I have Office 2003 and this article is BS by Anonymous Coward · · Score: 5, Informative

    I have Office 2003 Beta 2 freshly downloaded from MSDN. This article is completely wrong. I did the following:

    1. Opened a heavily formatted .DOC Word document with tables, multiple fonts, etc.
    2. Saved the document as XML.
    3. Opened up the XML document in Word and it looks EXACTLY like the original .DOC format.

    I also opened the XML file in a text editor and sure enough it contains complete formatting information.
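
    The text-editor check can also be done in code. This is only a rough sketch, not part of the original test: the sample below is a hand-written stand-in for a saved file, the namespace URI is the one from the Beta 2 sample posted in the reply (other builds may use a different one), and formatting_summary is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the Beta 2 sample in this thread (an assumption
# for any other build of Word).
W = "http://schemas.microsoft.com/office/word/2003/2/wordml"

# Hand-written stand-in for a saved document, not real Word output.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="StyleBoldCentered">
      <w:pPr><w:jc w:val="center"/></w:pPr>
      <w:rPr><w:b/></w:rPr>
    </w:style>
  </w:styles>
</w:wordDocument>"""


def formatting_summary(xml_text):
    """Count named styles, paragraph-property and run-property blocks."""
    root = ET.fromstring(xml_text)
    return {
        "styles": len(root.findall(f".//{{{W}}}style")),
        "paragraph_props": len(root.findall(f".//{{{W}}}pPr")),
        "run_props": len(root.findall(f".//{{{W}}}rPr")),
    }


summary = formatting_summary(SAMPLE)
print(summary)
```

    If every count came back zero for a heavily formatted document, the article would have a point. Non-zero counts mean the formatting made it into the XML.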

    1. Re:I have Office 2003 and this article is BS by RanmaSan · · Score: 4, Informative

      It's not pretty, but it works:

      <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <?mso-application progid="Word.Document"?>
      <w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:SL="http://schemas.microsoft.com/schemaLibrary/2003/2/core" xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:wx="http://schemas.microsoft.com/office/word/2003/2/auxHint" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xml:space="preserve">
      <o:DocumentProperties><o:Title>Some Centered Bolded Text</o:Title><o:Author>Mark McWilliams</o:Author><o:LastAuthor>Mark McWilliams</o:LastAuthor><o:Revision>1</o:Revision><o:TotalTime>2</o:TotalTime><o:Created>2003-03-13T17:30:00Z</o:Created><o:LastSaved>2003-03-13T17:32:00Z</o:LastSaved><o:Pages>1</o:Pages><o:Words>10</o:Words><o:Characters>57</o:Characters><o:Company>i-FRONTIER</o:Company><o:Lines>1</o:Lines><o:Paragraphs>1</o:Paragraphs><o:CharactersWithSpaces>66</o:CharactersWithSpaces><o:Version>11.4920</o:Version></o:DocumentProperties>
      <w:fonts><w:defaultFonts w:ascii="Times New Roman" w:fareast="Times New Roman" w:h-ansi="Times New Roman" w:cs="Times New Roman"/><w:font w:name="Tahoma"><w:panose-1 w:val="020B0604030504040204"/><w:charset w:val="00"/><w:family w:val="Swiss"/><w:pitch w:val="variable"/><w:sig w:usb-0="61007A87" w:usb-1="80000000" w:usb-2="00000008" w:usb-3="00000000" w:csb-0="000101FF" w:csb-1="00000000"/></w:font></w:fonts>
      <w:styles><w:versionOfBuiltInStylenames w:val="3"/><w:latentStyles w:defLockedState="off" w:latentStyleCount="156"/>
      <w:style w:type="paragraph" w:default="on" w:styleId="Normal"><w:name w:val="Normal"/><w:rsid w:val="7765DB"/><w:rPr><w:rFonts w:ascii="Arial" w:h-ansi="Arial"/><wx:font wx:val="Arial"/><w:sz-cs w:val="24"/><w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA"/></w:rPr></w:style>
      <w:style w:type="character" w:default="on" w:styleId="DefaultParagraphFont"><w:name w:val="Default Paragraph Font"/><w:semiHidden/></w:style>
      <w:style w:type="table" w:default="on" w:styleId="TableNormal"><w:name w:val="Normal Table"/><wx:uiName wx:val="Table Normal"/><w:semiHidden/><w:rPr><wx:font wx:val="Times New Roman"/></w:rPr><w:tblPr><w:tblInd w:w="0" w:type="dxa"/><w:tblCellMar><w:top w:w="0" w:type="dxa"/><w:left w:w="108" w:type="dxa"/><w:bottom w:w="0" w:type="dxa"/><w:right w:w="108" w:type="dxa"/></w:tblCellMar></w:tblPr></w:style>
      <w:style w:type="list" w:default="on" w:styleId="NoList"><w:name w:val="No List"/><w:semiHidden/></w:style>
      <w:style w:type="paragraph" w:styleId="StyleBoldCentered"><w:name w:val="Style Bold Centered"/><w:basedOn w:val="Normal"/><w:rsid w:val="7765DB"/><w:pPr><w:pStyle w:val="StyleBoldCentered"/><w:jc w:val="center"/></w:pPr><w:rPr><wx:font wx:val="Arial"/><w:b/><w:b-cs/><w:sz-cs w:val="20"/></w:rPr></w:style>
      <w:style w:type="paragraph" w:styleId="SmallTitle"><w:name w:val="Small Title"/><w:basedOn w:val="StyleBoldCentered"/><w:rsid w:val="7765DB"/><w:pPr><w:pStyle w:

  2. The authors of the article didn't bother to RTFM.. by malakai · · Score: 5, Informative

    The point of the Office 2003 "Save as XML" with the "Data Only" checkbox is _NOT_ to be a poor man's Save As XHTML. It's designed to allow the data of the document to get placed into an XML document based on a schema. You literally can make your own schema file/XSD, and use a tool inside Word to map the elements of a Word document to elements of the schema. If you simply map a paragraph to a string you will lose formatting. Unless of course you define in your schema how you'd like to store formatting information. But that is generally overkill.

    Think of a resume. You could define an XSD for a resume, and be able to save resumes against this XSD, as validated pure XML.
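
    To make the resume idea concrete, here is a rough sketch of what "data only" output and a validity check could look like. Everything in it is invented for illustration: the element names, the REQUIRED tuple and the missing_fields helper are hypothetical, and the hand-rolled check is only a toy stand-in for real XSD validation (Python's standard library has no XSD validator).

```python
import xml.etree.ElementTree as ET

# Hypothetical "data only" output for a resume; element names are invented.
RESUME_XML = """<resume>
  <name>Jane Doe</name>
  <email>jane@example.com</email>
  <job><title>Editor</title><years>3</years></job>
</resume>"""

# Toy stand-in for a schema: the child elements <resume> must contain.
REQUIRED = ("name", "email", "job")


def missing_fields(xml_text, required=REQUIRED):
    """Return the required child elements that the document lacks."""
    root = ET.fromstring(xml_text)
    if root.tag != "resume":
        return list(required)
    return [tag for tag in required if root.find(tag) is None]


print(missing_fields(RESUME_XML))
print(missing_fields("<resume><name>Jane Doe</name></resume>"))
```

    Note there is no font or alignment anywhere in that XML. For this use case that is the intent, since whoever consumes the resume data wants the fields and not the look.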

    Now, if you want to produce a document, using an XML syntax but want to combine both data and presentation, then you want WordML.

    WordML uses Word's own tags to mark up the Word document. I was going to show you an example of WordML but I don't feel like escaping all the greater-than/less-than signs. Anyhow, WordML contains all the formatting and everything necessary to display a Word document as it is supposed to look.

    I think this Open Office guy is looking for a devil in Office 11 that isn't there. That or he didn't read the friggin manual.

    -Malakai

  3. Re:Separating Content from Presentation a Good Thi by djoham · · Score: 5, Informative

    This may be bad (keeping in mind the jury is still out on exactly how Microsoft is making this work) because in the case of office documents, the style is actually *part* of the content, from the perspective of Joe Office User.

    If Microsoft just puts the raw text data into a .xml file, then that .xml file is practically useless to anyone who wants to collaborate with the original author since all of the styling information is lost.

    As an example of a good way to do this (IMHO), take a look at how OpenOffice.org builds its files. When you make a .sxw (the default Writer format) you're actually taking the raw data of the document, the styling rules for the document and a few other important bits and pieces and zipping them up into a single file.

    After unzipping this file, the following directory structure was exposed:

    content.xml
    META-INF/manifest.xml
    meta.xml
    mimetype
    settings.xml
    styles.xml

    With this type of design, you can get the best of both worlds. Technically, there is a separation between your presentation and content which allows simple programmatic access to the data when necessary. At the same time, this design allows for full collaboration between people who also consider the styling of the data to be part of the content because the style rules for the content are included with the document.
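
    As a rough illustration of that programmatic access, the sketch below builds a toy zip package with the member names from the listing above and reads the content and style parts separately. The XML inside each member is a placeholder I made up, not real OpenOffice.org markup, the mimetype string is given from memory, and build_package and read_part are hypothetical helper names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Toy package using the member names listed above. The contents are
# placeholders, not real OpenOffice.org markup.
MEMBERS = {
    "mimetype": "application/vnd.sun.xml.writer",
    "content.xml": "<content><p style='Title'>Hello</p></content>",
    "styles.xml": "<styles><style name='Title' bold='true'/></styles>",
    "meta.xml": "<meta/>",
    "settings.xml": "<settings/>",
    "META-INF/manifest.xml": "<manifest/>",
}


def build_package(members):
    """Zip the given name -> text mapping into an in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_part(package_bytes, name):
    """Parse one XML member of the archive and return its root element."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return ET.fromstring(zf.read(name))


package = build_package(MEMBERS)
content = read_part(package, "content.xml")
styles = read_part(package, "styles.xml")
print(content.find("p").text)             # the data, no styling needed
print(styles.find("style").get("name"))   # the style rules travel along
```

    A tool that only wants the text reads content.xml and stops there. A tool that wants to reproduce the look reads styles.xml as well.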

    With xml-saved Office documents containing only data and no style, collaboration with non-Office users (and apparently Win9x users as well) will be no better off than before. Perhaps worse, assuming the binary .doc, .xls etc. formats have changed and will need to be reverse-engineered again.

    If this article is true and Microsoft has decided to remove the styling of their xml-saved office documents, I see two possible reasons for this:

    The first is obvious. You're not using Office? Ok, second class citizen, here's the data but in a format that is next to useless for you to use.

    The second possibility involves Microsoft just not being where they want to be with the Office XML sharing. Keep in mind that it took OpenOffice.org something like a year and a half or so to define their XML interchange format. Microsoft may be going there, but due to overwhelming inertia, it just might not be going there very quickly.

    Personally, I think the first option is the most likely. However, with OpenOffice.org working with OASIS and others on a common XML interchange format, I'm hoping Microsoft will be forced by the marketplace into option 2.

    Best regards,

    David

  4. The article is blatantly wrong... by malakai · · Score: 5, Informative

    Read some other articles, or better yet get ahold of a beta and try it out. The authors of this article will feel like schmucks when they realize what they missed.

    First off, by default, if you save the Word document as XML, it gets saved as WordML, which preserves Word's styles and formatting in an XML name-space that's separate from the one bound to the schema-controlled data.

    If you check off the checkbox "Data Only" then you will lose all formatting and your own XSD will be used to map this document into XML data.

    WordML looks like an XML-ified RTF language. It would be trivial to create an XSL stylesheet that transforms WordML into HTML/CSS with all the formatting (that HTML is capable of), directly mimicking MS Word. OpenOffice could also eat WordML quite easily and have all the formatting/style of Word.
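
    To give a feel for how mechanical that mapping is, here is a minimal sketch in Python rather than XSL (the standard library has no XSLT engine, so this is a stand-in for a stylesheet, not one). The sample document is hand-written, the namespace is the Beta 2 one from the sample posted earlier in the thread, and it assumes the usual body/paragraph/run/text nesting (w:body, w:p, w:r, w:t), which the truncated sample does not show. Only bold and paragraph alignment are handled.

```python
import xml.etree.ElementTree as ET
from html import escape

# Beta 2 namespace from the sample in this thread (may differ elsewhere).
W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
NS = {"w": W}

# Hand-written stand-in; assumes the w:body / w:p / w:r / w:t nesting.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}"><w:body>
  <w:p>
    <w:pPr><w:jc w:val="center"/></w:pPr>
    <w:r><w:rPr><w:b/></w:rPr><w:t>Some Centered Bolded Text</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Plain &amp; simple.</w:t></w:r></w:p>
</w:body></w:wordDocument>"""


def run_to_html(run):
    """Turn one run into HTML text, wrapped in <b> if the run is bold."""
    text = escape("".join(t.text or "" for t in run.findall("w:t", NS)))
    bold = run.find("w:rPr/w:b", NS) is not None
    return f"<b>{text}</b>" if bold else text


def paragraph_to_html(par):
    """Turn one paragraph into a <p>, carrying over its alignment if set."""
    jc = par.find("w:pPr/w:jc", NS)
    align = jc.get(f"{{{W}}}val") if jc is not None else None
    style = f' style="text-align:{align}"' if align else ""
    body = "".join(run_to_html(r) for r in par.findall("w:r", NS))
    return f"<p{style}>{body}</p>"


def wordml_to_html(xml_text):
    root = ET.fromstring(xml_text)
    return "\n".join(paragraph_to_html(p) for p in root.findall("w:body/w:p", NS))


html_out = wordml_to_html(SAMPLE)
print(html_out)
```

    A real stylesheet would express the same rules as templates matching w:p and w:r, and would need many more of them for tables, fonts and named styles.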

    What the authors of this article are REALLY bitching about is the fact that MS didn't buy into the OpenOffice Document Specification from OASIS. MS prolly sees OASIS as the US sees the UN. Defunct, not needed.

    If you describe your data using XML semantics, and all it takes to convert from semantic style A to B is some XSL, then who cares about forcing everyone to use one specific format?

    -malakai

  5. WordML by malakai · · Score: 4, Informative

    If you "Save as XML" in Office 11, then by default the data is saved as WordML. WordML is an XML version of MS's internal storage format (basically RTF). OpenOffice could quite easily write an interpreter for WordML. Hell, I could write a WYSIWYG editor for WordML in a day. If that. It's pretty simple if you understand the basics of RTF.

    It's only when you Save as XML with the "Data Only" checkbox that you get into stripping formatting (and rightly so). Word WARNS you about this. In addition, you can specify your own XSD to save to. And Word will VALIDATE this for you. Not to mention, you can use a Word tool to map elements of Word documents to elements of your schema. DAMN COOL.

    In addition (as if that isn't enough) when you save, in either way, you have the option of specifying an XSL style sheet. It'll go ahead and transform the output for you as part of the save.

    The only thing the OpenOffice people are upset about is that MS didn't buy into the OASIS/OpenOffice Document Specification. Tough shit. I'll write them an XSL that'll work against WordML to solve that for them. Lazy bastards.

    -malakai

  6. Save As XML = WordML by malakai · · Score: 5, Informative
    Taken from a real review of the XML/Office features:

    Once valid, the document can be saved as XML in two ways. The default is to create WordML, which preserves Word's styles and formatting in an XML name-space that's separate from the one bound to the schema-controlled data. You can optionally save through an XSLT transformation which, in a publish-to-the-Web scenario, could translate WordML formatting into HTML/CSS formatting. Alternatively, if you tick the Save as Data option, you can instead save just the raw XML data. In that case, you can bind one or more XSLT stylesheets to the document, each of which can generate WordML styles and formatting.
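
    The two save modes can be pictured with a small sketch. It rests on assumptions the review does not spell out: that schema-controlled elements wrap ordinary WordML paragraphs, and that "Save as Data" amounts to keeping the custom-namespace elements and their text while dropping everything in the WordML namespace. The urn:example:resume namespace, the sample document and the helper names are all invented.

```python
import xml.etree.ElementTree as ET

# Beta 2 WordML namespace from the sample in this thread.
W = "http://schemas.microsoft.com/office/word/2003/2/wordml"
# Hypothetical namespace for a user-defined schema.
R = "urn:example:resume"

# Hand-written stand-in: custom elements wrapping WordML paragraphs.
MIXED = f"""<w:wordDocument xmlns:w="{W}" xmlns:r="{R}"><w:body>
  <r:resume>
    <r:name><w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Jane Doe</w:t></w:r></w:p></r:name>
    <r:email><w:p><w:r><w:t>jane@example.com</w:t></w:r></w:p></r:email>
  </r:resume>
</w:body></w:wordDocument>"""


def is_wordml(elem):
    return elem.tag.startswith(f"{{{W}}}")


def collect(elem, parent_out):
    """Copy custom elements under parent_out; see through WordML wrappers."""
    for child in elem:
        if is_wordml(child):
            # Keep the text of w:t runs, drop every other WordML detail.
            if child.tag == f"{{{W}}}t" and child.text:
                parent_out.text = (parent_out.text or "") + child.text
            collect(child, parent_out)
        else:
            collect(child, ET.SubElement(parent_out, child.tag))


def data_only(xml_text):
    """Return the first custom-namespace element with formatting stripped."""
    holder = ET.Element("holder")
    collect(ET.fromstring(xml_text), holder)
    return holder[0]


result = data_only(MIXED)
ET.register_namespace("r", R)
print(ET.tostring(result, encoding="unicode"))
```

    The bold on the name is gone from the output, which is the behaviour the article complained about. In the default mode the same document would keep both namespaces side by side.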


    InternetNews is authored by morons.

    -malakai
    1. Re:Save As XML = WordML by Hangtime · · Score: 5, Informative

      Same thing with Excel: you can save as XML with formatting or not. This comes from Excel XML saved with formatting. Quite simply, the article is flamebait.

      <Style ss:ID="s26" ss:Parent="s16">
      <Borders>
      <Border ss:Position="Bottom" ss:LineStyle="Continuous" ss:Weight="1"/>
      <Border ss:Position="Top" ss:LineStyle="Continuous" ss:Weight="1"/>
      </Borders>
      <Font ss:FontName="Times New Roman" x:Family="Roman" ss:Size="12" ss:Bold="1"/>
      <NumberFormat ss:Format="_(* #,##0_);_(* \(#,##0\);_(* &quot;-&quot;??_);_(@_)"/>
      </Style>
      <Style ss:ID="s27">
      <Alignment ss:Vertical="Bottom"/>
      <Borders/>
      <Font ss:FontName="Geneva"/>
      <Interior/>
      <NumberFormat/>
      <Protection/>
      </Style>
      <Style ss:ID="s28">
      <Font ss:FontName="Geneva" ss:Size="12"/>
      <NumberFormat ss:Format="0.0"/>
      </Style>

      <Stuff in between here to get around Lameness filter>

      <Style ss:ID="s27">
      <Alignment ss:Vertical="Bottom"/>
      <Borders/>
      <Font ss:FontName="Geneva"/>
      <Interior/>
      <NumberFormat/>
      <Protection/>
      </Style>
      <Style ss:ID="s28">
      <Font ss:FontName="Geneva" ss:Size="12"/>
      <NumberFormat ss:Format="0.0"/>
      </Style>
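
      For anyone who wants to read those style records outside Excel, a minimal sketch follows. The pasted fragment does not include its namespace declarations, so the two URIs below are filled in from the published Office XML spreadsheet format and should be treated as an assumption here. The Styles wrapper element and the font_table helper are likewise mine, not part of the post.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are not shown in the pasted fragment (assumption).
SS = "urn:schemas-microsoft-com:office:spreadsheet"
X = "urn:schemas-microsoft-com:office:excel"

# Two of the styles from the post, wrapped so they parse on their own.
SAMPLE = f"""<Styles xmlns="{SS}" xmlns:ss="{SS}" xmlns:x="{X}">
  <Style ss:ID="s26" ss:Parent="s16">
    <Font ss:FontName="Times New Roman" x:Family="Roman" ss:Size="12" ss:Bold="1"/>
  </Style>
  <Style ss:ID="s28">
    <Font ss:FontName="Geneva" ss:Size="12"/>
    <NumberFormat ss:Format="0.0"/>
  </Style>
</Styles>"""


def q(name):
    """Qualify a local name with the spreadsheet namespace."""
    return f"{{{SS}}}{name}"


def font_table(xml_text):
    """Map each style ID to its font name, size and bold flag."""
    table = {}
    for style in ET.fromstring(xml_text).findall(q("Style")):
        font = style.find(q("Font"))
        if font is None:
            continue
        table[style.get(q("ID"))] = {
            "name": font.get(q("FontName")),
            "size": font.get(q("Size")),
            "bold": font.get(q("Bold")) == "1",
        }
    return table


fonts = font_table(SAMPLE)
print(fonts)
```

      Fonts, sizes and number formats are all plain attributes, so any XML parser can get at them.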

  7. Goldfarb's conjecture by RobotWisdom · · Score: 4, Informative
    I think the point is that if you save to their XML specification, you will lose all your document formatting.

    I think the root of the confusion goes back to Goldfarb's original theory for SGML-- that the styles in a document are secondary to the structures, and should be kept separate.

    This has been a religious conviction ever since, despite the fact that most authors are messy and intuitive, and SGML-etc are very, very rigid and unintuitive. The rationalisation is that messy authors can just represent their styles using 'fake' (ad hoc) XML, but if this turns out to be 90% of the real users of MS Office, then I think MS could indeed save valid XML, but it won't be portable in any useful sense.

  8. On XML file formats.. by PeekabooCaribou · · Score: 4, Informative
    I realize this is redundant by now, but I think this is important enough to warrant a few duplicated posts. For Microsoft's XML format to be useful (and even worth implementing), it's going to require some advantages above and beyond what plain text formatting offers. The only completely useless XML format would be:
    <document>
    This is my document.
    Second paragraph.
    </document>
    I make the assumption that at least some tags are applied, such as some sort of paragraph tags and the like. I may be going out on a limb here, but I would even assume that their final XML format will produce documents identical to .doc files. I would also assume that I could pass this file off to Joe in marketing, and he would see a document identical to the one I saw. What I'm getting at is that style has to be held somewhere. If the XML file has no style associated with it, then congratulations, Microsoft, you did it right. But if Word can display the right formatting, then so can anyone else. (Assuming Word doesn't store the styles in a proprietary format, which I don't think is beyond them.) But why am I even writing this? From the article:
    However, Mark McWilliams, a software engineer and Office 2003 beta tester, said he has seen nothing to indicate that Office 2003 removes formatting information from files saved in .xml. He noted that he opened a heavily formatted .doc Microsoft Word file, saved the file as XML, and later opened the file in Word 2003. "The opened XML document looks exactly like the original .doc file," he said. "And if I open up the XML file in a text editor, I can see that all of the formatting is properly maintained in the XML file."
    Time will tell.
    --
    "I'll say it again for the logic-impaired." -- Larry Wall.