Tim Bray on Microsoft Office
jgeelan writes "The co-inventor of XML, Tim Bray, has been talking about the newly XML-enabled version of Microsoft Office, code-named 'Office 11' and tells XML-Journal that 'when the huge universe of MS Office documents becomes available for processing by any programmer with a Perl script and a bit of intelligence, all sorts of wonderful new things can be invented that you and I can't imagine.'"
The most important question, besides whether the MS Word XML format will be well-documented enough, is whether it will be the default saving format. Most MS Office users simply don't care enough to save MS Word documents in RTF, for example, even though it's more than good enough for the vast majority of documents.
Not the main issue of the article, but it is unfair to single out someone as the inventor of XML, which is just a streamlined version of SGML, which in turn evolved from IBM's GML.
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
One small such point is when IBM gave out the specs to their hardware for PC allowing everyone to clone it, while Apple did not.
This could be such a point. Maybe in 10 years we'll look back at this and ask ourselves "Why the heck did MS XML-enable their Office app, releasing the hold that they had?"
Only time will tell I guess.
I Play Hattrick
You are not entitled to your opinion. You are entitled to your informed opinion. -- Harlan Ellison
MS is trying to time this right.
Right now they are seeing diminishing sales and a possibly shrinking market share. Most of the Danish public sector is looking to save money using OpenOffice/StarOffice.
MS needs to increase their compatibility with other options, as they would otherwise force customers to convert every single user away from MS at once, instead of OpenOffice coming in slowly.
They can also hope that their format sets the standard, and that the other companies will have to play catch-up rather than the other way around.
...all sorts of wonderful new things can be invented that you and I can't imagine...
When will MS ever learn that we don't WANT to imagine how wonderful the MS Office Universe is?
When will I end this grieving ? When will my future begin ?
As far as I can tell, one of the major reasons many businesses refuse to change over from Microsoft Office to cheaper options is file compatibility. As our company's IT admin put it recently on the suggestion of using OpenOffice, "I get sent hundreds of Microsoft Word, Excel and Access documents a week. I need to know that I can open and access every single one of those without problems". An example of proprietary file formats helping Microsoft keep the monopoly.
However, if Microsoft Office documents become "built around an open, internationalized standard", i.e. XML, would this not enable the people behind OpenOffice, StarOffice etc. to achieve total 100% file compatibility and thus negate Microsoft's largest advantage with Office?
Of course, this could be yet another Microsoft "embrace and extend" tactic, à la Kerberos. Incorporate the standard in a bastardised form, claim standards compatibility, then pollute it so you must be using Microsoft technology to properly interact with it.
Janie took my gun...
Just look at an HTML file exported from Word2k. I would not call that compatible with any HTML I've ever learned. Most probably the XML file exported from Office 11 will be a Microsoft-specific file, specifying lots of Office-specific ActiveX (aka OLE) info that cannot be emulated. And, hey, they can probably store binary data in XML. The only change is that most competing products will emit files that Word can easily read, i.e. M$ will get the biggest benefits.
Just because the file format, instead of binary, is "human readable", does not make it more open.
For "any programmer with a Perl script and a bit of intelligence" it doesn't make a difference if you read bytes (binary) or XML structures.
As long as you don't get a DTD with extensive comments on how to interpret the elements, along with some promise/guarantee that the DTD won't change every minor release, there is no real improvement at all.
The fact that XML is human readable is irrelevant, since no human will read the files; programs such as Perl scripts will. For them it makes hardly any difference; it is only marginally easier since you can use an existing XML parser instead of rolling your own (which is no big deal using the right tools such as YACC).
This 'openness' comes at a good time for Microsoft. They suggest openness at a time when they are criticized and attacked because of file-format lock-in. Many 'advisors' will be misled, blinded as they are by buzzwords such as XML, and actually believe that this solves the issue.
Because it doesn't matter if everyone is able to read, modify and generate Office-compatible files. People will use Office products in the future. Opening the file formats doesn't change anything.
XML makes it easy to create programs that will depend on MS Office. So this only makes it easier to create programs which depend on Microsoft products.
It's just like the old SGML module for Word they used to have about 6 years ago. My guess is that there will be some significant drawback to saving documents in XML, such as loss of some formatting information. That would convince users not to save in the XML format... but that isn't the important thing to Microsoft.
More significantly, there might be small incompatibilities, or ways that Word-created XML documents deviate slightly from what is normal and proper in XML. Perhaps Word will make some (intentional) mistakes when reading back XML files generated in other applications, just like Word's old SGML module would choke on many proper SGML documents.
Make no mistake: the fact that almost everybody is using Office and the associated file formats makes it very hard for a new contender to enter the office suite market. Microsoft must be aware of the power they have over the market with their Office file formats. Think of it: when you exchange files with other businesses, you have two realistic choices of file formats: Office or plaintext. And now Microsoft is introducing compatibility with an open and well-defined markup language, in favour of their proprietary language? I'll believe it when I see it.
If construction was anything like programming, an incorrectly fitted lock would bring down the entire building...
Maybe they need a migration path away from the win32-based format they use now. .NET also seems to follow that path. Remember that MS needs access to other platforms than the i386/desktop in the future - mobile devices for instance. Keeping a format that is basically a binary image from a PC is good for locking out competition, but not when you have to start competing with yourself.
Perhaps these announcements of XML compatible office file formats are just stalling tactics? MS has done it before.
MS now has a serious competitor in StarOffice/OpenOffice.org. And that competitor has two compelling advantages - it's cheaper/free, and open XML file formats. So when clued-up IT people say to their Pointy-Haired Bosses that they should use StarOffice/OpenOffice.org, PHBs can respond "but MS is doing that next year. We can avoid all the disruption of changing office suites just by waiting a bit and upgrading to the next version of MS Office. Besides, we're already paying for it." Then when MS actually releases Office 11, they will have used all sorts of devious and subtle devices to keep their lock-in of the file format, and MS and PHBs will be happy.
- Spreadsheet::WriteExcel
- Spreadsheet::ParseExcel
(There are also simpler interfaces if you want them.) Or you could go the whole hog and use a SAX writer like XML::SAXDriver::Excel to create the documents from XML yourself.
(This is not to say I don't think XML native formats are cool and will have many uses; I'm just pointing out what you can do now.)
-- Sorry, I can't think of anything funny to say here.
<uueWord2kDocument>
M"@D)("!'3E
M("`@(%9E7)I9VAT("A#*2`Q.3DQ
M($9R9
M92!V97)B87
</uueWord2kDocument>
Yes, the point of XML files is that their _syntax_ is simple and easily parseable by computers. But that doesn't tell you anything about the _semantics_ of a document. And as long as there is no proper documentation on what the mess of tags in your XML file means, there's hardly any way for you to hack together a Perl script to, say, extract plain text, or convert the Word XML file to an OpenOffice.org XML file, or whatever else comes to mind.
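To make the syntax-versus-semantics point concrete, here is a minimal Python sketch. The vocabulary in `SAMPLE` (`x1`, `x2`, `q`, `z`) is invented for illustration and is not taken from any real Office format; the only assumption is that you are handed well-formed XML with no documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, undocumented vocabulary: every tag name here is made up.
SAMPLE = """<doc>
  <x1 k="7">Quarterly report</x1>
  <x2><q a="b">Sales rose.</q><z>3F2A</z></x2>
</doc>"""

def all_text(xml_source):
    """Return every non-blank text chunk in document order.

    Parsing needs no knowledge of the format, but nothing here can tell
    body text apart from metadata or encoded application data.
    """
    root = ET.fromstring(xml_source)
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]

print(all_text(SAMPLE))
```

The parser happily hands back "3F2A" alongside the real sentences. Whether that is a checksum, a style id or part of the text is exactly what only documentation can tell you.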
SQL Server has had an XML web gateway since version 2000. You can run any query and output it as xml or have an xml template pull the query and transform the results with XSL, all without one line of server side script.
ASP.NET uses XML for all the human-readable files, and the IIS in Windows .NET Server finally uses Apache-style configuration files which are also XML.
Pedro
----
The Insomniac Coder
So, what happens when someone wants to email an XML-enabled Word document... Does it somehow become encrypted on its way out of the database, remain scrambled on its way over the internet, and reassemble itself into nice XML once it arrives on the recipient's computer? Doesn't sound like XML to me?!
I don't believe any of this crap is going to happen from MS. Not for a New York second.
Dark-masked B.Gates approaching you:
"I find your lack of faith....disturbing."
I really have my doubts about whether Microsoft will allow "any programmer with a Perl script and a bit of intelligence" to muck around with Office documents.
Why not? After all, the high-quality ActiveState port of Perl to Win32 exists because Microsoft paid for it, and you can download it for free. Not only that, but if you want to write your own code to manipulate Office documents, you have been able to do that for years in VBA - all the Office programs expose rich APIs. In fact, they are composed of Objects that you can instantiate and use in your own programs if you want - all MS care about is that there is a licensed copy of Office on the user's machine. One of the easiest ways to do charting is to simply reuse a bit of Excel, for example. From there it's a short hop via COM to any program you want.
I'm guessing their XML document format will be just as hard to decipher as the current Office formats.
The fact that Office documents have been in a proprietary format in the past is actually unimportant, since the interfaces to the applications (and hence their documents) are well documented (check MSDN or Barnes & Noble if you don't believe me). So the reason that Microsoft are doing this is that they lose nothing and gain from making the platform even more attractive to developers.
I've recently been reviewing a dozen different software tools for converting from Word to XML.
So far the best tool I found is upCast (free for personal use) from http://www.infinity-loop.de/
To convert a Word file:
* Use Word's AutoFormat feature to convert visual formatting to Word styles
* Redefine all the text as Word styles
* Run upCast to convert to XML using the "XML (content, no DTD)" filter
* Run HTML Tidy from http://tidy.sourceforge.net/ with the parameters -xml -utf8 -clean -bare .
Other tools that might be worth a second look:
* Majix (Open Source) - http://www.tetrasix.com/
* WorX SE - http://www.xyvision.com/
* XML MarkupKit (in German) - http://www.eds.schema.de/download/MarkupKit/
* DocSoft LLC Word-to-XML - http://www.docsoft.com/w2xml.htm
The thread a couple of weeks ago about the death of META headers will apply 1000 times worse for semantic tags-- if the semantic web is going to work at all it needs to start from headers describing the webpage as a whole.
(Also, what's with XML-Journal's claim the article has three pages when it only has two?)
The OpenOffice group should get together with the rest of the guys (AbiWord, KOffice and maybe WordPerfect) and work out a format that can be submitted to the ISO. Possibly based on the OpenOffice format.
Then governments and corporations will adopt it for official documents so they can read their own documents in ten years.
When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
Okay, so it'll be harder to mount a windows partition effectively, but this doesn't affect transmission of documents, especially if they're stored in an XML format. As for me, I think it's more valuable to have files that I can read outside of their native filesystem rather than have a readable filesystem filled with unreadable files.
There are 2 problems with the current format of Microsoft Office files:
This is mostly solved (thanks to years of trials and errors).
This is definitely more difficult, as nobody knows Office internals and how they expect such additional data to be. The StarOffice guys managed to do an acceptable job, at the price of years of trial and error. It's like looking at a dump of your computer's memory, guessing what's code, what's data, what's padding and the meaning of every byte...
Now, does an XML format simplify things? Well, yes, just as an RTF text is easier to manage than a pure binary format. But nothing prevents putting extra cruft in an XML document, so instead of having to use a hex editor you may now use a text editor. Giving a correct interpretation of tags and attributes is still something that only Microsoft can do, unless it publishes the full specifications (present and future: after all, XML is eXtensible, right?).
Personally, I think that:
Of course, this will never happen. Instead, MS will continue to push their own "open" XML based file formats. Microsoft Kerberos, anyone?
People who think they know everything are a great annoyance to those of us who do.
With XML Schema and DTDs, you can validate various aspects of the data without writing a custom validator.
With XPath and XPointer you can refer to parts of an XML document without needing to understand what the document contains.
With XSL you can translate all or parts of the document from one format to the other without your application needing to know the structure, and without needing to understand more of the format than the parts you are extracting.
With SAX and the DOM you can programmatically traverse and extract information from an XML file without having to write a custom parser.
With CSS an editor or viewer for instance can use a standard mechanism of applying styles to elements without hardcoding the style attributes for elements anywhere.
With XML namespaces, you can intersperse data in various formats in the same file, and the components handling each of the vocabularies need not know anything about the other components - an example would be embedding SVG in HTML: The HTML renderer doesn't need to understand any of the SVG tags, only that it should delegate contents with other namespaces to another component. And the SVG renderer couldn't care less about the HTML.
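A rough Python sketch of that delegation idea, using only the standard library. The "renderers" here are toy stand-ins (they just describe what they see), and the function names are made up for the example; only the two namespace URIs are real.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>A circle:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
      <circle cx="5" cy="5" r="4"/>
    </svg>
  </body>
</html>"""

def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag

def describe_svg(elem):
    """Toy 'SVG component': knows nothing about the surrounding HTML."""
    shapes = [split_tag(child.tag)[1] for child in elem]
    return "svg subtree with shapes: " + ", ".join(shapes)

def walk(elem, foreign_handlers, out):
    """Toy 'HTML component': handles its own namespace, delegates the rest."""
    ns, local = split_tag(elem.tag)
    if ns != XHTML_NS:
        handler = foreign_handlers.get(ns)
        if handler is not None:
            out.append(handler(elem))
        return  # never descend into a vocabulary we do not understand
    out.append("html:" + local)
    for child in elem:
        walk(child, foreign_handlers, out)

def outline(xml_source):
    out = []
    walk(ET.fromstring(xml_source), {SVG_NS: describe_svg}, out)
    return out

print(outline(DOC))
```

The HTML side never looks at `circle`, and the SVG side never sees `p`; the namespace URI is the only thing they share.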
And this doesn't even touch on the benefits of all the various interchange formats that have been specified on top of these base technologies.
The importance of XML is that it opens up the doors for building interchangeable components that operate on data without needing any hardcoded application-specific knowledge of the data.
Most of the time, you still have to write some code to tie it all together, but you don't have to build your own parsers, your own document object model, your own styling system, your own way of handling contained data of other types, your own way of transforming data between formats, etc.
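As a small illustration of "some code to tie it all together" without a custom parser, here is a sketch using Python's standard `xml.sax` module. The `TextCollector` class and `extract` helper are names invented for this example, and the sample document is made up.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Collect the text inside every element with a given name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # > 0 while inside a wanted element
        self.chunks = []
        self.results = []

    def startElement(self, name, attrs):
        if self.depth or name == self.wanted:
            self.depth += 1

    def characters(self, content):
        if self.depth:
            self.chunks.append(content)

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

def extract(xml_source, wanted):
    handler = TextCollector(wanted)
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return handler.results

print(extract("<doc><title>Hi</title><p>One <b>two</b></p><p>Three</p></doc>", "p"))
```

All the tokenising, entity handling and well-formedness checking comes from the library; the handler only says which element it cares about.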
For me as a software developer XML delivered years ago. I use XML technologies daily, and it saves me work.
Except I will look to xml.openoffice.org to write some XSLT transformations to take Microsoft Office documents and liberate them once and for all.
Once I can move my team of 20 people to OpenOffice with no real worries or complaints about 'interchanging' files with lusers still using Microsoft, I will.
BUT, have you ever looked at an HTML file generated by Microsoft Word? It is a GREAT example of how they can pollute a standard into something unreadable.
I suspect that they will copyright or otherwise lock up their DTD/Schema, and try to lash out at anyone that uses them in other than 'approved' ways.
It is simply not what others are claiming: <?xml version="1.0"?><data>blahblah</data>
¦ ©® ±
Think of it: when you exchange files with other businesses, you have two realistic choices of file formats: Office or plaintext
I think PDF is a viable (growing even) third option. Adobe is "evil" just like MS (remember Sklyarov)... regardless, PDF is nice and it works well, and the files are way smaller than Word docs.
Microsoft is switching from a proprietary file format, to XML, and the first 100 comments are all flaming MS. WTF does it take to make you people happy?
They've already shown with .NET that they can make an entire programming framework (and at least 3 associated languages) into an open standard and even have them ratified by ECMA and maybe even ISO. Because of this people have already managed to port Perl, Python and many other languages to this framework before it even came out of beta! The guys at Ximian have even managed to port quite a bit of the framework itself as part of the Mono Project.
So perhaps instead of perpetually slating Microsoft, you could get off your arse and do something useful instead.
Nick...
So you can read Office documents with other programs as long as you have Office and MS dev tools?
You do see the folly in that, right?
-Kevin
Doing XML stuff with OpenOffice is supergreat. It took me half-an-hour to study the format enough to write an XSLT stylesheet that extracts all strings from an OO document.
Now I wrote, just for demonstration, the following XSLT example in just a few minutes, usable directly with xsltproc in Linux.
The example prints all the Heading paragraphs in an OO Writer document, indented according to the header level.
<?xml version='1.0'?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:table="http://openoffice.org/2000/table"
    xmlns:draw="http://openoffice.org/2000/drawing"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:number="http://openoffice.org/2000/datastyle"
    xmlns:svg="http://www.w3.org/2000/svg"
    xmlns:chart="http://openoffice.org/2000/chart"
    xmlns:dr3d="http://openoffice.org/2000/dr3d"
    xmlns:math="http://www.w3.org/1998/Math/MathML"
    xmlns:form="http://openoffice.org/2000/form"
    xmlns:script="http://openoffice.org/2000/script"
    version='1.0'>
  <xsl:output method="text" encoding="ISO-8859-1"/>
  <!-- Print all headings, indented two spaces per level. -->
  <xsl:template match="text:h">
    <xsl:value-of select="substring('                    ', 1, (@text:level - 1) * 2)"/>
    <xsl:text>* </xsl:text>
    <xsl:value-of select="text()"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <!-- Don't output any other text. -->
  <xsl:template match="text()">
  </xsl:template>
</xsl:stylesheet>
The result would be something like:
* Top-level heading such as a chapter
  * Second-level heading (section)
  * Another section
    * Subsection
      * Subsubsection
  * Yet another section
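For anyone without xsltproc at hand, roughly the same outline can be sketched in Python with the standard library. This assumes the OpenOffice.org 1.x text namespace `http://openoffice.org/2000/text`; the `SAMPLE` document and the `heading_outline` name are made up for the example, and a real content.xml has a different root and far more content.

```python
import xml.etree.ElementTree as ET

# Assumption: OpenOffice.org 1.x text namespace.
TEXT_NS = "http://openoffice.org/2000/text"

# Hypothetical, heavily trimmed stand-in for a Writer content.xml.
SAMPLE = """<office:body xmlns:office="http://openoffice.org/2000/office"
             xmlns:text="http://openoffice.org/2000/text">
  <text:h text:level="1">Chapter</text:h>
  <text:p>Body text.</text:p>
  <text:h text:level="2">Section</text:h>
  <text:h text:level="3">Subsection</text:h>
</office:body>"""

def heading_outline(content_xml):
    """Return one '* title' line per heading, indented two spaces per level."""
    root = ET.fromstring(content_xml)
    lines = []
    for heading in root.iter("{%s}h" % TEXT_NS):
        level = int(heading.get("{%s}level" % TEXT_NS, "1"))
        title = "".join(heading.itertext()).strip()
        lines.append("  " * (level - 1) + "* " + title)
    return lines

print("\n".join(heading_outline(SAMPLE)))
```

One difference from the stylesheet: `itertext()` also picks up text inside nested spans within a heading, whereas `text()` in the XSLT only takes the heading's direct text nodes.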
I've seen the native Word XML format (alpha mind you, so it might get changed). It isn't exactly pretty, and if I had to write code to extract all the paragraphs that contained the word "foo" in bold it would give me a bit of a headache, but I could do it.
The word "foo" in bold single-underline looks something like
<r>
<rf>
<rp class="bold"/>
<rp class="underline" lines="1"/>
</rf>
foo</r>
Yeah, it's pretty verbose.
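To give a feel for the "find foo in bold" exercise, here is a hedged Python sketch against the fragment above. It assumes the `r`/`rf`/`rp` element names and the `class` attribute exactly as shown in this post; the enclosing `<p>` element in `SAMPLE` and the function name are invented for the example, and the real (alpha) format may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph built from the run fragment quoted in the post.
SAMPLE = """<p>
  <r><rf><rp class="bold"/><rp class="underline" lines="1"/></rf>foo</r>
  <r>foo again, but plain</r>
  <r><rf><rp class="bold"/></rf>bar</r>
</p>"""

def bold_runs_containing(xml_source, word):
    """Return the text of every run that is marked bold and contains word."""
    root = ET.fromstring(xml_source)
    hits = []
    for run in root.iter("r"):
        classes = {rp.get("class") for rp in run.iter("rp")}
        # The run's text follows </rf>, so gather all text under <r>.
        text = "".join(run.itertext()).strip()
        if "bold" in classes and word in text:
            hits.append(text)
    return hits

print(bold_runs_containing(SAMPLE, "foo"))
```

A bit of a headache, as promised, but nothing that needs more than a stock XML parser.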
Near as I can tell, it is 100% round-trip-able, i.e. you save as that file format, you read it in again, you hit Ctrl-S and it saves again; about as good as a native format. Now someone needs to write some script-ware to run Word in batch mode to xml-ify server directories with zillions of Office docs.
I think the reason MS is doing this is obvious. Look at their financials - they *really* need people to upgrade to the new version of Office. End-users don't buy Office any more, CIOs and the like do. These people are just not gonna be impressed by another new word-processing feature, but they might be motivated to upgrade if they thought that they were opening up all their data to re-use by other programs.
I expect that with any luck we'll get a secondary industry built around doing cool unexpected stuff to Office docs. Don't want to sound over-excited here, but a huge amount of all the intellectual capital in the world is sitting around in Office docs, and this makes it noticeably more re-usable. Has to be a good thing.
Cheers, Tim
What a bunch of pseudo-technical garbage!
I have a Masters in Computer Science with a focus on databases and storage technology and very little of what you said makes any sense to me. There's nothing easier than getting at data stored in SQL. Where I work, we've shipped a few products where we didn't document the schema because it was too complex and we didn't feel we could support it. Within weeks, almost all of our major customers had it reverse-engineered anyway. SQL is very easy to get at!
kernel level SQL data
There's no such thing. SQL data is stored in tables. You use queries to get at it. Period.
Also, your story doesn't make any sense. The article says Office 11 is in Beta already. IIRC, the SQL Server and Palladium stuff in the OS doesn't come until Longhorn. Do you think they will actually release a version of Office which won't work until their next OS (who knows when that will be) is released and adopted? How will they make money off all the people who recently upgraded to Windows XP then?
Mmmm.. Donuts