Office 2003 and XML
zachlipton writes "Internet World is reporting that initial reports from Office 2003 beta testers don't look good for those hoping to share documents with non-MS systems using the XML file format. Gary Edwards, the OpenOffice.org representative for the OASIS XML file-format group, is quoted as saying "although it's still early in the review process, it does look as though XP XML has been so seriously crippled as to be useless to anyone but the big content management and collaboration system providers." Apparently, all formatting and presentation information is removed from the XML. Furthermore, Office's new collaboration features will only work with users who are also running Office 2003 (requiring Windows 2000 or 2003) and who are connecting over XP servers." So Microsoft will continue its efforts to lock in users with proprietary formats, and hopefully the rest of the world will produce an XML standard document format without them.
March 13, 2003
Will Office 2003 Lead to Lock-in?
By Thor Olavsrud
With the recent beta release of Microsoft Office 2003 out the door, many customers got their first look at what Microsoft hopes will re-write the office productivity landscape with a new ecosystem of collaborative functionality based on XML. But will organizations have to scrap their piecemeal systems and buy into an entirely Microsoft architecture to tap it?
That's the contention of Gary Edwards, a Web application design consultant and OpenOffice.org's representative on the OASIS Open Office XML Format Technical Committee.
Edwards said that Office 2003 beta's handling of the XML file format means that firms will not be able to tap the rich collaborative features of Office 2003 without resorting to proprietary Microsoft file formats. And to truly unlock its collaborative potential, firms will have to standardize on the Windows XP operating system (Office 2003 won't run on Windows 9x), as well as Windows 2003 Server, SharePoint Server, Exchange Server, etc. As for the file formats, he called Office 2003's XML "crippled," because it strips files of all presentation and formatting information when saving them in the XML file format. It does not do this when saving files in Microsoft's proprietary file formats.
"Although it's still early in the review process, it does look as though XP XML has been so seriously crippled as to be useless to anyone but the big content management and collaboration system providers," Edwards said. "Reports are that when saving to XML, [Office 2003] strips out the presentation and formatting information, leaving near raw content. It appears, at least from the non-enterprise systems user's perspective, that all the really cool collaborative advantages are based on saving files in the XP proprietary format. Which means that 'all' the users in the collaborative effort must be on the XP platform, using XP Office, connecting through XP servers. What kind of universal connectivity and exchange is that? XP users won't even be able to collaborate equally with the 200 million Win9x users. Not unless they upgrade."
However, Ronald Schmelzer, founder and senior analyst of XML research firm ZapThink, noted that Microsoft's approach aligns more closely with a core tenet of XML theory: the separation of process and data.
"The idea is for XML not to specify how the information should be processed, but rather leave that task to XSL templates and other post-XML processing steps," he said. "XML is supposed to be a presentation-neutral format."
Still, Schmelzer said that becomes more tricky when integration goes beyond the enterprise itself.
"I think when it goes beyond intra-business integration to cross-industry and inter-organization integration, the question will be how much of the data they exchange do they want loaded with presentational and operational functionality and how much do they want to leave to the individual implementation of the company?" he said. "This is really not an answerable question -- because it depends on the scenario. The problem with standards is that there are so many of them. The resolution here is to look at how companies and industries will adopt XML in their verticals and then determine which aspects of that should be embodied in standards and which should be embodied in products. Experience shows that companies and industries can hardly agree on the data, let alone the representation, so erring on the side of 'less' in the XML body makes more sense."
Microsoft chose not to respond to questions about presentation and formatting in their XML vs. their proprietary file format, simply noting that the native file format for InfoPath (the application for creating XML forms in Office 2003) is .xml.
"The native file format for InfoPath forms is .xml, which makes it easy for companies to integrate InfoPath forms into their existing business processes -- one of the
Windows 2003 Server is coming out on April 24.
Well, I believe this is the best thing that we could have hoped for coming from Microsoft. Even a division between content and format will allow data to be transferred more easily from one format to another.
You could have the same XML content file output to PDF, RTF, PostScript, and any number of other formats. Separating data from format is one of the strengths of XML. This is much better than a straight binary format or (ugh) RTF. Separating data and format is a good thing.
You don't keep xml data in your xsl stylesheets do you?
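To make the "one content file, many outputs" idea concrete, here is a minimal sketch in Python. The `memo` document and its element names are invented for illustration, and plain functions stand in for the XSL stylesheets a real pipeline would more likely use.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-only document: no fonts, colors, or layout.
CONTENT = """<memo>
  <title>Quarterly report</title>
  <para>Sales rose in the first quarter.</para>
  <para>Costs were flat.</para>
</memo>"""


def to_text(root):
    """Render the memo as plain text with an underlined title."""
    title = root.findtext("title")
    lines = [title, "=" * len(title)]
    lines += [p.text for p in root.findall("para")]
    return "\n".join(lines)


def to_html(root):
    """Render the same memo as a minimal HTML fragment."""
    body = ET.Element("div")
    ET.SubElement(body, "h1").text = root.findtext("title")
    for p in root.findall("para"):
        ET.SubElement(body, "p").text = p.text
    return ET.tostring(body, encoding="unicode")


root = ET.fromstring(CONTENT)
print(to_text(root))
print(to_html(root))
```

Both renderers read the same parsed tree, so adding another output format only means adding another function; the content file itself never changes.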
Why, o why must the sky fall when I've learned to fly?
Anyway, Office has a ridiculously complicated format. Any XML that it generates will most likely be a nightmare even if they don't try to make it that way.
Um... XP has been out for quite a while now... do you mean Office 2003???
Anyway, I don't see what the fuss is all about... if everybody would care to read the article, it describes exactly what they are doing.
Key Point: XML is for DATA. DATA, not formatting. XSL is for formatting. The content is stored in XML. The content (data) is what would be needed by other systems. InfoPath (formerly XDocs) also has content (in XML) AND formatting (in XSL).
It can use the XML from Word/Excel with ease... you should try it. By having the content stored in XML, it makes it very easy to take that Word/Excel Document and submit it to a Web Service for further processing.
Bill
It's my Sig and you can't have it. Mine! All Mine!
The point of XML is to separate the presentation from the content anyway. If you add formatting and what have you directly into the XML, you have defeated that purpose. That is why there is XSL and CSS. Those are the things you are supposed to use for the actual presentation and formatting.
I have Office 2003 Beta 2 freshly downloaded from MSDN. This article is completely wrong. I did the following:
1. Opened a heavily formatted .DOC Word document with tables, multiple fonts, etc.
2. Saved the document as XML.
3. Opened up the XML document in Word and it looks EXACTLY like the original .DOC format.
I also opened the XML file in a text editor and sure enough it contains complete formatting information.
The point of the Office 2003 "Save as XML" with the "Data Only" checkbox is _NOT_ a poor man's Save As XHTML. It's designed to allow the data of the document to get placed into an XML document based on a schema. You literally can make your own schema file/XSD, and use a tool inside Word to map the elements of a Word document to elements of the schema. If you simply map a paragraph to a string you will lose formatting. Unless of course you define in your schema how you'd like to store formatting information. But that is generally overkill.
Think of a resume. You could define an XSD for a resume, and be able to save resumes against this XSD, as validated pure XML.
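As a rough illustration of what "data only" output for a resume might look like, the Python sketch below builds a small resume document and checks that the required fields are present. The element names are hypothetical, and the field check is only a stand-in for real XSD validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical resume layout; a real deployment would express this as an XSD.
REQUIRED_FIELDS = ("name", "email", "experience")


def build_resume(name, email, jobs):
    """Build a data-only resume document (no formatting information)."""
    root = ET.Element("resume")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "email").text = email
    exp = ET.SubElement(root, "experience")
    for employer, title in jobs:
        job = ET.SubElement(exp, "job", employer=employer)
        job.text = title
    return root


def missing_fields(root):
    """Return the required child elements that are absent from the resume."""
    return [f for f in REQUIRED_FIELDS if root.find(f) is None]


resume = build_resume("Pat Example", "pat@example.com", [("Acme", "Engineer")])
print(ET.tostring(resume, encoding="unicode"))
print(missing_fields(resume))
```

Nothing in the output says how the resume should look; any consumer that knows the element names can pull out the data and apply its own presentation.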
Now, if you want to produce a document, using an XML syntax but want to combine both data and presentation, then you want WordML.
WordML uses Word's own tags to mark up the Word document. I was going to show you an example of WordML but I don't feel like escaping all the greater-than/less-than signs. Anyhow, WordML contains all the formatting and everything necessary to display a Word document as it is supposed to look.
I think this Open Office guy is looking for a devil in Office 11 that isn't there. That or he didn't read the friggin manual.
-Malakai
-Malakai
A Dragon Lives in my Garage
This may be bad (keeping in mind the jury is still out on exactly how Microsoft is making this work) because in the case of office documents, the style is actually *part* of the content, from the perspective of Joe Office User.
If Microsoft just puts the raw text data into a .xml file, then that .xml file is practically useless to anyone who wants to collaborate with the original author since all of the styling information is lost.
As an example of a good way to do this (IMHO), take a look at how OpenOffice.org builds their files. When you make a .sxw (the default writer format) you're actually taking the raw data of the document, the styling rules for the document and a few other important bits and pieces and zipping them up into a single file.
After unzipping this file, the following directory structure was exposed:
content.xml
META-INF/manifest.xml
meta.xml
mimetype
settings.xml
styles.xml
With this type of design, you can get the best of both worlds. Technically, there is a separation between your presentation and content which allows simple programmatic access to the data when necessary. At the same time, this design allows for full collaboration between people who also consider the styling of the data to be part of the content because the style rules for the content are included with the document.
With xml-saved Office documents containing only data and no style, collaboration between non-office users (and apparently Win9x users as well) will be no better off than before. Perhaps worse, assuming the binary .doc, .xls etc formats have changed and will need to be reverse-engineered again.
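The zipped-package design described above can be sketched in a few lines of Python. The file names match the listing, but the XML inside each part is a made-up placeholder, far simpler than what OpenOffice.org really writes, and the package is built in memory rather than read from a real .sxw file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Placeholder parts; real OpenOffice.org files contain much richer XML.
PARTS = {
    "mimetype": "application/vnd.sun.xml.writer",  # assumed MIME string
    "content.xml": "<content><p style='Heading'>Hello</p></content>",
    "styles.xml": "<styles><style name='Heading' weight='bold'/></styles>",
}


def build_package(parts):
    """Zip the named parts into a single in-memory package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_content_and_styles(package_bytes):
    """Open the package and parse content and styles separately."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        names = sorted(zf.namelist())
        content = ET.fromstring(zf.read("content.xml"))
        styles = ET.fromstring(zf.read("styles.xml"))
    return names, content, styles


package = build_package(PARTS)
names, content, styles = read_content_and_styles(package)
print(names)
print(content.find("p").text, styles.find("style").get("weight"))
```

A program that only wants the data reads `content.xml` and ignores the rest, while a collaborator who cares about appearance still receives `styles.xml` in the same file.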
If this article is true and Microsoft has decided to remove the styling of their xml-saved office documents, I see two possible reasons for this:
The first is obvious. You're not using Office? Ok, second class citizen, here's the data but in a format that is next to useless for you to use.
The second possibility involves Microsoft just not being where they want to be with the Office XML sharing. Keep in mind that it took OpenOffice.org something like a year and a half or so to define their XML interchange format. Microsoft may be going there, but due to overwhelming inertia, it just might not be going there very quickly.
Personally, I think the first option is the most likely. However, with OpenOffice.org working with OASIS and others on a common XML interchange format, I'm hoping Microsoft will be forced by the marketplace into option 2.
Best regards,
David
Office 2003 preserves the presentation markup in the XML file. You can prove this to yourself, if you want (and if you're willing to run the O11 beta): create a test document in Word 2003, save it as XML, and look at the result. The presentation data is in the file -- and it isn't even encrypted.
Wow. Someone on /. suggesting we use an MS file format.
For those of you that aren't aware, RTF is an 'Open' format created by MS. All native Word files I've looked at ('97 and earlier) used an RTF-derived format. The RTF spec is available from Microsoft, and is the most obfuscated document I have ever had the misfortune of having to read (in the end I gave up and derived the format from the output of WordPad, it was easier).
I am TheRaven on Soylent News
Read some other articles, or better yet get ahold of a beta and try it out. The authors of this article will feel like schmucks when they realize what they missed.
First off, by default, if you save the Word document as XML, it gets saved as WordML, which preserves Word's styles and formatting in an XML namespace that's separate from the one bound to the schema-controlled data.
If you check off the checkbox "Data Only" then you will lose all formatting and your own XSD will be used to map this document into XML data.
WordML looks like an XML'ified RTF language. It would be trivial to create an XSL stylesheet that transforms WordML into HTML/CSS with all formatting (that HTML is capable of) which directly mimics MS Word. OpenOffice could also eat WordML quite easily and have all the formatting/style of Word.
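A rough sketch of that kind of WordML-to-HTML conversion follows, written in Python with ElementTree rather than as an XSL stylesheet. The sample document is heavily simplified, handles only paragraphs, runs and bold, and assumes the namespace URI and element names of the final Word 2003 release, which may not match what the beta emits.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed WordML namespace (final Word 2003 release); the beta may differ.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Simplified sample: one paragraph with a bold run and a plain run.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r>
      <w:r><w:t> and plain</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def run_to_html(run):
    """Convert one text run, wrapping it in <b> if the run is bold."""
    text = escape(run.findtext("w:t", default="", namespaces=NS))
    is_bold = run.find("w:rPr/w:b", NS) is not None
    return f"<b>{text}</b>" if is_bold else text


def wordml_to_html(xml_text):
    """Convert each paragraph of the simplified document to an HTML <p>."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iterfind("w:body/w:p", NS):
        runs = "".join(run_to_html(r) for r in p.findall("w:r", NS))
        paragraphs.append(f"<p>{runs}</p>")
    return "\n".join(paragraphs)


print(wordml_to_html(SAMPLE))
```

Real documents carry far more than bold text (tables, styles, fonts, lists), so a complete converter would be a good deal larger than this.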
What the authors of this article are REALLY bitching at is the fact that MS didn't buy into the OpenOffice Document Specification from OASIS. MS prolly sees OASIS as the US sees the UN. Defunct, not needed.
If you describe your data using XML semantics, and all it takes to convert from semantic style A to B is some XSL, then who cares about forcing everyone to use one specific format?
-malakai
-Malakai
A Dragon Lives in my Garage
If you "Save as XML" in Office 11, then by default the data is saved as WordML. WordML is an XML version of MS's internal storage format (basically RTF). OpenOffice could quite easily write an interpreter for WordML. Hell, I could write a WYSIWYG editor for WordML in a day. If that. It's pretty simple if you understand the basics of RTF.
It's only when you Save as XML with the "Data Only" checkbox that you get into stripping formatting (and rightly so). Word WARNS you about this. In addition, you can specify your own XSD to save to. And Word will VALIDATE this for you. Not to mention, you can use a Word tool to map elements of Word documents to elements of your schema. DAMN COOL.
In addition (as if that isn't enough), when you save, in either way, you have the option of specifying an XSL style sheet. It'll go ahead and transform the output for you as part of the save.
The only thing the OpenOffice people are upset about is that MS didn't buy into the OASIS/OpenOffice Document Specification. Tough shit. I'll write them an XSL that'll work against WordML to solve that for them. Lazy bastards.
-malakai
-Malakai
A Dragon Lives in my Garage
InternetNews is authored by morons.
-malakai
-Malakai
A Dragon Lives in my Garage
The point is, if you have to translate to another format, you hire a developer to do it once, and the XSLT stylesheet that he/she develops can be reused again and again to transform documents. Maybe make a drag & drop script to do the transformation, or possibly a web based back end solution. You don't have to write a separate XSLT stylesheet for every single document, just once to support a required combination of input and output formats.
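The "write it once per format pair, reuse it forever" idea can be sketched as a small converter registry. This is Python rather than XSLT, and the `note-xml` input format and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Registry of converters keyed by (input format, output format).
CONVERTERS = {}


def converter(src, dst):
    """Decorator that registers a function for one format pair."""
    def register(func):
        CONVERTERS[(src, dst)] = func
        return func
    return register


@converter("note-xml", "text")
def note_to_text(xml_text):
    # "note-xml" is a made-up input format used only for this sketch.
    root = ET.fromstring(xml_text)
    return f"{root.findtext('subject')}: {root.findtext('body')}"


def convert(xml_text, src, dst):
    """Look up the converter for the format pair and apply it."""
    return CONVERTERS[(src, dst)](xml_text)


docs = [
    "<note><subject>Beta</subject><body>Office 2003 installed.</body></note>",
    "<note><subject>Test</subject><body>Saved as XML.</body></note>",
]
results = [convert(d, "note-xml", "text") for d in docs]
print(results)
```

The converter is written once and then applied to every document of that type; supporting a new format pair means registering one more function, not touching the documents.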
std::disclaimer<std::legalese> sig=new std::disclaimer; sig->dump(); delete sig;
XML is no place for presentation markup. That should be done with XSL.
I think the root of the confusion goes back to Goldfarb's original theory for SGML: that the styles in a document are secondary to the structures, and should be kept separate.
This has been a religious conviction ever since, despite the fact that most authors are messy and intuitive, and SGML-etc are very, very rigid and unintuitive. The rationalisation is that messy authors can just represent their styles using 'fake' (ad hoc) XML, but if this turns out to be 90% of the real users of MS Office, then I think MS could indeed save valid XML, but it won't be portable in any useful sense.
Isn't Office 2k3 supposed to support DRM encryption? If so then this would make the file format useless since it will be encrypted.
From a pure business standpoint (not technical), a closed proprietary file format is essential if you want consumer lock-in to keep prices sky high. If a competitor can write software that can read your files and format them properly then you lose your file format monopoly and would have to compete with everyone else.
http://saveie6.com/
The article says it all. To quote:
However, Mark McWilliams, a software engineer and Office 2003 beta tester, said he has seen nothing to indicate that Office 2003 removes formatting information from files saved in .xml. He noted that he opened a heavily formatted .doc Microsoft Word file, saved the file as XML, and later opened the file in Word 2003.
"The opened XML document looks exactly like the original .doc file," he said. "And if I open up the XML file in a text editor, I can see that all of the formatting is properly maintained in the XML file."
He also noted that when saving a file, a user has the option of saving in a "data only" XML format which does remove formatting.
Why don't people just read the article before posting about how terrible this is?
If you look past the first couple of paragraphs, you can see that formatting is basically kept, except when graphics are used (precisely where would the graphics fit in an XML file?), and the feature that this guy (and most people) seem to be bitching about is the option which allows data-only. Surely this is a good feature, as we can have formatting stuff which is interoperable with other MS Words, and an option which is just data!
"I'll say it again for the logic-impaired." -- Larry Wall.
Dumbass. No wonder you posted anonymously. Judge Jackson ruled Microsoft THE COMPANY was a monopoly AND had broken the law and done illegal things.
The laws apply to the companies, NOT their PRODUCTS.
Liberty.
I took the trouble to go to Internet World and read the ENTIRE article. The portion of the article quoted on /. clearly implies the information being received about MS is secondhand. "REPORTS ARE [emphasis added] that when saving to XML, [Office 2003] strips out the presentation and formatting information...". The person quoted is a representative of the OASIS OpenOffice XML Format Technical Committee, so there is a definite risk of bias, particularly when coupled with secondhand information.
The article goes on to quote someone who actually is an Office 2003 beta tester. He claims that saving in an XML format does not, in fact, strip out the formatting, and states the tests he ran to confirm this.
The source of confusion may be in different XML formats supported by Office 2003. There are two, one of which strips out all of the formatting information, while the other does not. A lively debate then ensues between the pros and cons of both approaches.
Well, it does produce a burning sensation... but no, its a very, very strong drink. A byproduct of making wine, and a fun one at that.
This whole mess comes from how MS office products generate too much formatting (e.g., html in office)
Ever try to edit an Office document saved as HTML?
150k of crap html for a simple one line 'hello world' text document.