Tim Bray on Microsoft Office
jgeelan writes "The co-inventor of XML, Tim Bray, has been talking about the newly XML-enabled version of Microsoft Office, code-named 'Office 11' and tells XML-Journal that 'when the huge universe of MS Office documents becomes available for processing by any programmer with a Perl script and a bit of intelligence, all sorts of wonderful new things can be invented that you and I can't imagine.'"
Wow, I was way off when I predicted that Microsoft would further obfuscate their Word format. This seems to be in all respects a Good Thing.
StarOffice has used XML for their native file formats for some time now; I wonder if this means we'll see an even better-quality translator between the two formats?
The opinions stated herein do not necessarily represent those of anybody at all. Deal with it.
.... I guess it's just MSXML rather than THE standard XML. But we can figure it out with some "intelligent guesswork" now because the file would be human-readable.
--
Error 500: Internal sig error
The most important question, besides whether the MS Word XML format will be documented well enough, is whether it will be the default save format. Most MS Office users simply don't care enough to save MS Word documents in RTF, for example, even though it's more than good enough for the vast majority of documents.
Not the main issue of the article, but it is unfair to single out someone as the inventor of XML, which is just a streamlined version of SGML, which in turn evolved from IBM's GML.
Leandro Guimarães Faria Corcete DUTRA
DA, DBA, SysAdmin, Data Modeller
GNU Project, Debian GNU/Lin
I really have my doubts about whether Microsoft will allow "any programmer with a Perl script and a bit of intelligence" to muck around with Office documents.
I'm guessing their XML document format will be just as hard to decipher as the current Office formats.
Life is too short to proofread.
One small such point is when IBM gave out the specs to their hardware for PC allowing everyone to clone it, while Apple did not.
This could be such a point. Maybe in 10 years we'll look back at this and ask ourselves "Why the heck did MS XML-enable their Office app, releasing the hold that they had?"
Only time will tell I guess.
I Play Hattrick
You are not entitled to your opinion. You are entitled to your informed opinion. -- Harlan Ellison
I beg your pardon? Smelly programmers can keep their hands off my documents. If I wanted you to have them, I'd have emailed them to you as plaintext. I wasn't aware that the Office license meant my documents were common property....
... and today's pet project has
MS is trying to time this right.
Right now they are seeing diminishing sales and possibly a shrinking market share. Most of the Danish public sector is looking to save money using OpenOffice/StarOffice.
MS needs to increase their compatibility with other options, as they would otherwise force customers to convert every single user away from MS at once, instead of OpenOffice coming in slowly.
They can also hope that their format sets the standard, and that the other companies will have to play catch-up rather than the other way around.
...all sorts of wonderful new things can be invented that you and I can't imagine...
When will MS ever learn that we don't WANT to imagine how wonderful the MS Office Universe is?
When will I end this grieving ? When will my future begin ?
WTF!? XML shouldn't need to be documented. The whole point is to create a human-readable file that is parseable by computer. If MS Word delivers an XML file that I can't figure out, it's not XML.
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
As far as I can tell, one of the major reasons many businesses refuse to change over from Microsoft Office to cheaper options is file compatibility. As our company's IT admin put it recently on the suggestion of using OpenOffice, "I get sent hundreds of Microsoft Word, Excel and Access documents a week. I need to know that I can open and access every single one of those without problems". An example of proprietary file formats helping Microsoft keep the monopoly.
However, if Microsoft Office documents become "built around an open, internationalized standard", i.e. XML, would this not enable the people behind OpenOffice, StarOffice etc. to achieve total 100% file compatibility and thus negate Microsoft's largest advantage with Office?
Of course, this could be yet another Microsoft "embrace and extend" tactic, à la Kerberos. Incorporate the standard in a bastardised form, claim standards compatibility, then pollute it so you must be using Microsoft technology to properly interact with it.
Janie took my gun...
Just look at an HTML file exported from Word2k. I would not call that compatible with any HTML I've ever learned. Most probably the XML file exported from Office 11 will be a Microsoft specific file, specifying lots of Office specific ActiveX (aka OLE) info that cannot be emulated. And, hey, they can probably store binary data in XML. The only change is that most competing products will emit files that Word can easily read, i.e. M$ will get the biggest benefits.
Just because the file format, instead of binary, is "human readable", does not make it more open.
For "any programmer with a Perl script and a bit of intelligence" it doesn't make a difference if you read bytes (binary) or XML structures.
As long as you don't get a DTD with extensive comments on how to interpret the elements, along with some promise/guarantee that the DTD won't change every minor release, there is no real improvement at all.
The fact that XML is human readable is irrelevant, since no human will read the files; programs such as Perl scripts will. For them it makes hardly any difference; it is only marginally easier since you can use an existing XML parser instead of rolling your own (which is no big deal using the right tools such as YACC).
This 'openness' comes at a good time for Microsoft. They suggest openness at a time when they are criticized and attacked because of file-format lock-in. Many 'advisors' will be misled, blinded by buzzwords such as XML as they are, and actually believe that this solves the issue.
Because it doesn't matter if everyone is able to read, modify and generate Office-compatible files. People will use Office products in future. Opening the file formats doesn't change anything.
XML makes it easy to create programs that will depend on MS Office. So this only makes it easier to create programs which depend on Microsoft products.
Maybe they need a migration path away from the win32-based format they use now. .NET also seems to follow that path. Remember that MS needs access to other platforms than the i386/desktop in the future - mobile devices for instance. Keeping a format that is basically a binary image from a PC is good for locking out competition, but not when you have to start competing with yourself.
Perhaps these announcements of XML compatible office file formats are just stalling tactics? MS has done it before.
MS now has a serious competitor in StarOffice/OpenOffice.org. And that competitor has two compelling advantages - it's cheaper/free, and open XML file formats. So when clued-up IT people say to their Pointy-Haired Bosses that they should use StarOffice/OpenOffice.org, PHBs can respond "but MS is doing that next year. We can avoid all the disruption of changing office suites just by waiting a bit and upgrading to the next version of MS Office. Besides, we're already paying for it." Then when MS actually releases Office 11, they will have used all sorts of devious and subtle devices to keep their lock-in of the file format, and MS and PHBs will be happy.
- Spreadsheet::WriteExcel
- Spreadsheet::ParseExcel
(There are also simpler interfaces if you want them too.) Or you could go the whole hog and use a SAX writer like XML::SAXDriver::Excel to create the documents from XML yourself.
(This is not to say I don't think XML native formats are cool and will have many uses, I'm just pointing out what you can do now.)
-- Sorry, I can't think of anything funny to say here.
I think maybe it was the CEO of Microsoft Denmark. I'm NOT sure though
<uueWord2kDocument>
M"@D)("!'3E
M("`@(%9E7)I9VAT("A#*2`Q.3DQ
M($9R9
M92!V97)B87
</uueWord2kDocument>
Yes, the point of XML files is that their _syntax_ is simple and easily parseable by computers. But that doesn't tell you anything about the _semantics_ of a document. And as long as there is no proper documentation on what the mess of tags in your XML file means, there's hardly any way for you to hack together a Perl script to, say, extract plain text, or convert the Word XML file to an OpenOffice.org XML file, or whatever else comes to mind.
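To make the syntax-versus-semantics point concrete, here is a minimal Python sketch. The tag names (doc, x1, x2) are invented for illustration and are not taken from any real Word format. A stock parser walks the tree without trouble, but nothing in the file says which elements hold body text and which hold something else entirely.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are made up, not a real Word schema.
SAMPLE = "<doc><x1>Hello world</x1><x2>deleted draft text</x2></doc>"

def all_text(xml_string):
    """Return every non-empty text node in document order; no schema needed."""
    root = ET.fromstring(xml_string)
    return [t.strip() for t in root.itertext() if t.strip()]

# The parser returns both strings. Whether <x2> is visible body text or
# leftover revision data is a question only documentation can answer.
print(all_text(SAMPLE))
```

Getting the strings out is the easy part; deciding which of them belong in a "plain text" export is where the missing documentation hurts.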
Office's MS-XML will be even less compatible with the spec than MS-Kerberos or MS-Java/J++. Office is their cash cow. It brings in 30-40% of their revenues all by itself.
If you think there is even a remote chance in he-double L that MS will loosen their grip on this revenue stream, I have a bridge to sell you.
You can call this flamebait if you want, but what in MS's history would lead me to believe they are suddenly going to change their historic behavior pattern AND risk a huge amount of revenue at the same time?
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
SQL Server has had an XML web gateway since version 2000. You can run any query and output it as xml or have an xml template pull the query and transform the results with XSL, all without one line of server side script.
ASP.net uses XML for all the human-readable files, and the IIS in windows.net server finally uses Apache-style configuration files which are also XML.
Pedro
----
The Insomniac Coder
I doubt it. XML is specifically designed around interoperability, and I don't think MS can charge for use of a standard they don't own. That's why I think that they will break standards compatibility somehow.
XML is a format with nearly infinite possibilities for obfuscation, convolutedness and poorly defined standards. The most we can expect is the possibility to validate a file to absolutely certainly determine if it is compliant with the new Word format or not.
Contrary to the popular belief, there indeed is no God.
I'm working with that weedy Word 2k at the office. And we use Outlook as a standard communication platform. Believe me, the fact that their software is so often a pain isn't part of some greater plan to rule the world; it's more a flat-out inability to deliver products with any conceptual consistency.
Looking at Frontpain and Word HTML and extrapolating XML from that tells me they're gonna do just as crappy a job as usual and really think they've done a great thing.
Just like the people sending me source code additions and DB content as Word files. Nothing but simple ineptitude, I say.
Not that my System of choice, Linux, is that much more consistent. Mind you. With a bazillion Font methods, every single one of them looking crappier than the next and QT, GTK+, Motif, Lesstif, Inbetweentif, Swing, TK and whatnot and none of them following the same Clipboard behaviour it's just as weedy. Only it is under *my* control to change it.
That way, the bottom line is: With OSS if it doesn't work, there's another way. With M$ it's 'Game Over' with the first "Error in module [fill in random hexcode here]".
That's the simple difference.
We suffer more in our imagination than in reality. - Seneca
code-named 'Office 11'
awesome. Apparently the next version of the linux kernel is code named 2.6! Wow!
I've recently been reviewing a dozen different programs that convert Word to XML.
So far the best tool I found is upCast (free for personal use) from http://www.infinity-loop.de/
To convert a Word file:
* Use Word's AutoFormat feature to convert visual formatting to Word styles
* Redefine all the text as Word styles
* Run upCast to convert to XML using the "XML (content, no DTD)" filter
* Run HTML Tidy from http://tidy.sourceforge.net/ with the parameters -xml -utf8 -clean -bare .
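If you want to script the last step, a rough Python wrapper might look like the following. This is only a sketch: the flags are the ones listed above, while the file names, the -o output option and the treatment of exit codes are my assumptions about a typical Tidy install, so check them against your version.

```python
import subprocess

def tidy_command(src, dst):
    """Build the HTML Tidy command line with the flags from the list above."""
    # "-o dst" (write to a separate output file) is an assumption about your Tidy build.
    return ["tidy", "-xml", "-utf8", "-clean", "-bare", "-o", dst, src]

def run_tidy(src, dst):
    """Run Tidy on src; return True unless it reports errors."""
    # Assumption: Tidy exits with 1 for warnings and 2 for errors.
    result = subprocess.run(tidy_command(src, dst))
    return result.returncode < 2
```

Calling run_tidy("upcast-output.xml", "clean.xml") (both names hypothetical) would then clean up whatever upCast produced.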
Other tools that might be worth a second look:
* Majix (Open Source) - http://www.tetrasix.com/
* WorX SE - http://www.xyvision.com/
* XML MarkupKit (in German) - http://www.eds.schema.de/download/MarkupKit/
* DocSoft LLC Word-to-XML - http://www.docsoft.com/w2xml.htm
The thread a couple of weeks ago about the death of META headers will apply 1000 times worse for semantic tags-- if the semantic web is going to work at all it needs to start from headers describing the webpage as a whole.
(Also, what's with XML-Journal's claim the article has three pages when it only has two?)
Look at the bigger picture of where Microsoft is heading. They're diversifying their line of business.
In the past, MS Office was the cash cow at Microsoft, but the market for office packages is rather saturated... companies and governments are looking for cheaper alternatives etc. Not much room to grow. Now they can afford playing the good guys by opening up their file formats, since they got new markets to capture... mobile phones, handheld computers, home entertainment etc.
The OpenOffice group should get together with the rest of the guys (AbiWord, KOffice and maybe WordPerfect) and work out a format that can be submitted to the ISO. Possibly based on the OpenOffice format.
Then governments and corporations will adopt it for official documents so they can read their own documents in ten years.
When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
any programmer with a Perl script and a bit of intelligence
and I thought intelligence was a prerequisite to be able to handle perl ? :)
Except I will look to xml.openoffice.org to write some xslt transformations to take Microsoft office documents and liberate them once and for all.
Once I can move my team of 20 people to open office with no real worries or complaints about 'interchanging' files with lusers still using Microsoft, I will.
BUT, have you ever looked at an HTML file generated by Microsoft word? It is a GREAT example of how they can pollute a standard into something unreadable.
I suspect that they will copyright or otherwise lock up their DTD/Schema, and try to lash out at anyone that uses them in other than 'approved' ways.
What's wrong with HTML and CSS2 for all your word processing?
I don't think the new XML format is meant for documents you wish to publish on the web. Office already supports the HTML format pretty well (with some extensions.. ahem) since Office 2000. HTML support works even better in Office XP since it allows you to save the document as "filtered HTML", where Office filters most of the Office-specific tags and attributes at the cost of losing some information in the document.
I think the XML format is being added since XML represents the document with a much more meaningful structure that's easier to parse by third party software for use in electronic commerce and other automated systems, something that's inappropriate to use HTML code for, as it was designed to make pretty layouts, not to describe content for easy parsing.
I think it's pretty obvious why MS would want to add XML support - to spread their Office document format and make Office useful in places such as web services where it wouldn't be as useful before.
Beware: In C++, your friends can see your privates!
It is simply not what others are claiming: <?xml version="1.0"?><data>blahblah</data>
Microsoft is switching from a proprietary file format, to XML, and the first 100 comments are all flaming MS. WTF does it take to make you people happy?
They've already shown with .NET that they can make an entire programming framework (and at least 3 associated languages) into an open standard and even have them ratified by the ECMA and maybe even ISO. Because of this people have already managed to port Perl, Python and many other languages to this framework before it even came out of beta! The guys at Ximian have even managed to port quite a bit of the framework itself as part of the Mono Project.
So perhaps instead of perpetually slating Microsoft, you could get off your arse and do something useful instead.
Nick...
Unfortunately, Microsoft won't let it happen. The data may be "in XML", but that doesn't mean you can read it or generate it well. Instead, Microsoft will give you just enough to serve their business interests and nobody else's.
How? Office will probably stick undocumented base64 encoded binary stuff into the output, containing formatting information. You can use the document content, for example, with a database, but you can't load it into another word processor and preserve all the formatting. And in the other direction, sure, you can generate simple documents that Office will import, but you can't generate arbitrary Word documents--they will, again, have weird, undocumented tags and binary stuff.
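A quick Python sketch of what that would look like from the outside. Everything here is hypothetical: the element names (doc, text, fmt) and the four payload bytes are invented, and nobody outside Microsoft knows what the real thing will contain.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical element names; the payload stands in for undocumented binary data.
blob = base64.b64encode(bytes([0x01, 0x00, 0xFE, 0x7F])).decode("ascii")
SAMPLE = f"<doc><text>Quarterly report</text><fmt>{blob}</fmt></doc>"

def split_content(xml_string):
    """Return (readable text, raw bytes of the opaque formatting blob)."""
    root = ET.fromstring(xml_string)
    text = root.findtext("text")
    raw = base64.b64decode(root.findtext("fmt"))
    return text, raw

text, raw = split_content(SAMPLE)
print(text)       # the content is easy to get at
print(raw.hex())  # the formatting is just bytes with no published meaning
```

The file is perfectly well-formed XML and any parser will load it, but the interesting half of the document is exactly as closed as it was in the binary format.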
In short: don't hold your breath. Microsoft isn't stupid.
Doing XML stuff with OpenOffice is supergreat. It took me half-an-hour to study the format enough to write a XSLT parser that extracts all strings from an OO document.
Now I wrote, just for demonstration, the following XSLT example in just a few minutes, useable directly with xsltproc in Linux.
The example prints all the Heading paragraphs in a OO Writer document, indented according to the header level.
<?xml version='1.0'?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:office="http://openoffice.org/2000/office"
xmlns:style="http://openoffice.org/2000/style"
xmlns:text="http://openoffice.org/2000/text"
xmlns:table="http://openoffice.org/2000/table"
xmlns:draw="http://openoffice.org/2000/drawing"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:number="http://openoffice.org/2000/datastyle"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:chart="http://openoffice.org/2000/chart"
xmlns:dr3d="http://openoffice.org/2000/dr3d"
xmlns:math="http://www.w3.org/1998/Math/MathML"
xmlns:form="http://openoffice.org/2000/form"
xmlns:script="http://openoffice.org/2000/script"
version='1.0'>
<xsl:output method="text" encoding="ISO-8859-1"/>
<!-- Print all headings, indented. -->
<xsl:template match="text:h">
<xsl:value-of select="substring('                    ', 1, (@text:level - 1) * 2)"/>
<xsl:text>* </xsl:text>
<xsl:value-of select="text()"/>
<xsl:text>
</xsl:text>
</xsl:template>
<!-- Don't output any other text. -->
<xsl:template match="text()">
</xsl:template>
</xsl:stylesheet>
The result would be something like:
* Top-level heading such as a chapter
  * Second-level heading (section)
  * Another section
    * Subsection
      * Subsubsection
  * Yet another section
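For anyone without xsltproc handy, roughly the same heading dump can be sketched in Python with only the standard library. This assumes you have already pulled content.xml out of the Writer file, and that headings are text:h elements with a text:level attribute as in the stylesheet above; the sample string is a made-up stand-in for a real document.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://openoffice.org/2000/text"

def headings(content_xml):
    """Return heading lines, indented two spaces per level below the first."""
    root = ET.fromstring(content_xml)
    lines = []
    for h in root.iter(f"{{{TEXT_NS}}}h"):
        level = int(h.get(f"{{{TEXT_NS}}}level", "1"))
        title = "".join(h.itertext()).strip()
        lines.append("  " * (level - 1) + "* " + title)
    return lines

# Tiny made-up stand-in for a content.xml extracted from a Writer file.
SAMPLE = (
    '<office:document-content xmlns:office="http://openoffice.org/2000/office" '
    f'xmlns:text="{TEXT_NS}"><office:body>'
    '<text:h text:level="1">Chapter</text:h>'
    '<text:p>Body text is skipped.</text:p>'
    '<text:h text:level="2">Section</text:h>'
    '</office:body></office:document-content>'
)
print("\n".join(headings(SAMPLE)))
```

One small difference: itertext() also picks up text inside nested spans within a heading, whereas text() in the stylesheet only takes the heading's own text nodes.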
Look up at this. Putting information in XML makes the first baby step of reverse engineering easier, nothing else.
XML helps only if the creator of the document wants the information to be easily accessible by programs other than their own.
To a Lisp hacker, XML is S-expressions in drag.
I've seen the native Word XML format (alpha mind you, so it might get changed). It isn't exactly pretty, and if I had to write code to extract all the paragraphs that contained the word "foo" in bold it would give me a bit of a headache, but I could do it.
The word "foo" in bold single-underline looks something like
<r>
<rf>
<rp class="bold"/>
<rp class="underline" lines="1"/>
</rf>
foo</r>
Yeah, it's pretty verbose.
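For the curious, the "paragraphs containing foo in bold" exercise might go something like this in Python. Treat it as a sketch against the alpha snippet only: the r/rf/rp names come from the example here, but the p paragraph wrapper and the doc root are pure guesses, and the shipping format may look nothing like this.

```python
import xml.etree.ElementTree as ET

def run_text(r):
    """Text of a run: anything directly inside <r>, including after </rf>."""
    parts = [r.text or ""]
    parts += [child.tail or "" for child in r]
    return "".join(parts).strip()

def is_bold(r):
    """True if the run's format block carries a class="bold" property."""
    return any(rp.get("class") == "bold" for rp in r.findall("rf/rp"))

def paragraphs_with_bold_word(xml_string, word):
    """Return the text of each paragraph that has `word` inside a bold run."""
    root = ET.fromstring(xml_string)
    hits = []
    for p in root.iter("p"):  # <p> is a guess at the paragraph element
        if any(is_bold(r) and word in run_text(r).split() for r in p.iter("r")):
            hits.append(" ".join(run_text(r) for r in p.iter("r")))
    return hits

SAMPLE = """<doc>
<p><r>plain</r><r><rf><rp class="bold"/><rp class="underline" lines="1"/></rf>
foo</r></p>
<p><r>foo but not bold</r></p>
</doc>"""
print(paragraphs_with_bold_word(SAMPLE, "foo"))
```

The mild headache is visible in run_text: the words sit after the closing rf tag rather than in an element of their own, so you have to collect tail text by hand.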
Near as I can tell, it is 100% round-trip-able, i.e. you save as that file format, you read it in again, you hit ctl-S and it saves again; about as good as a native format. Now someone needs to write some script-ware to run Word in batch mode to xml-ify server directories with zillions of office docs.
I think the reason MS is doing this is obvious. Look at their financials - they *really* need people to upgrade to the new version of Office. End-users don't buy Office any more, CIOs and the like do. These people are just not gonna be impressed by another new word-processing feature, but they might be motivated to upgrade if they thought that they were opening up all their data to re-use by other programs.
I expect that with any luck we'll get a secondary industry built around doing cool unexpected stuff to Office docs. Don't want to sound over-excited here, but a huge amount of all the intellectual capital in the world is sitting around in Office docs, and this makes it noticeably more re-usable. Has to be a good thing.
Cheers, Tim
MS Office saving its data in XML format is a great start.
But will this really be enough?
Previous complaints about how versions of Office didn't disclose the format were often referred to a specification that Microsoft made available to describe what was in a Word document.
The key problem, IIRC, was that the description was not sufficient for one to predict how the Word document was actually formatted and rendered on the page.
Because XML is very much like SGML or TeX, it has the potential for much more exhaustively describing document structure. But whether the new Word XML format (or OpenOffice format, for that matter) contains sufficient information for developers to reproduce the "right" format is a different issue.
I hope I'm wrong and that the format is specified comparably to the level you'd find in say PostScript or PDF.
Maybe MS is willing to let rendered Office documents change, just as HTML rendered documents change whenever one resizes the browser window.
But I doubt it.
"Provided by the management for your protection."
I think the reason that they are switching over is probably the trend in emerging foreign markets, Peru being a prime example. Countries are starting to enact legislation that requires any government procurement of software to be only for software that uses an open file format, due to the long-term storage problems.
Tie this to the fact that US sales are going to slow down, or already have, because the PC market is saturated: they need new markets, and unless they use an open format they won't be able to get them. I'd be panicked too: Linux and Java are eroding their server market, and governments are eroding their Office market. The only way they can grow is to add value.
Which then by virtue of market share becomes standard. It is actually in their best interest to publish it clearly. Then the other potential competitors will feel strong pressure to fit their software to match MS and have no real excuse why they can't. If MS waited there would be some other standard emerging and MS would be pressured by customers to adopt it. Then it would be MS having to shoehorn its document logic into some other form and not the other way around.
While other potential competitors are playing catch-up with making their documents fit into the MS schema MS can be busy thinking about the next thing to do.
So frankly I expect the word document xml (and excel and the rest) to actually be quite clear and documented but very aligned to how MS Word sees a document, which will likely impress others as obtuse.