Can XML Replace Proprietary Document Formats?
Pauly asks: "My former profession of Technical Writer was made very painful by my customers' requirement to have their documents delivered in MS Office formats. PDF/FrameMaker was not acceptable, as they needed to be able to edit the documents as well. Let me tell you, it is painful watching a 3,000+ page Word97 manuscript, the fruit of weeks of hard labor, rendered into rubbish by my customer's Word95. I've missed deadlines, lost money, and will never forgive Microsoft for their abuse of me and my kind. My question: is it possible that XML-based standard file formats suitable for word processor, spreadsheets, etc. could be created that forever do away with proprietary binary formats and inadequate file conversion routines? This notion seems to be working for the graphics crowd in the form of SVG. The benefits are obvious, what are the drawbacks?"
This is exactly what SGML has been doing for documents for years. The government and military have been using SGML to ensure that document structure is maintained and that documents are always readable.
Of course SGML is pretty complex, so XML was created to simplify it. XML is now being used to accomplish the same thing.
sigs are a waste of space
Man, all those people using SGML must be imagining these benefits then!
Seriously, having a DTD is VERY helpful, because it allows you to edit a document using ANY SGML (or nowadays XML) compliant editor and ensure that you will be producing something which can be loaded back into the original editor 100% cleanly (and without blowing away half of the structure that the original editor had set up). This is the specific functionality that the question was referring to.
DTDs do indeed have their semantics documented; indeed, most of the more common ones have their semantics documented MUCH more extensively than ANY proprietary format out there. Easy and obvious examples would be HTML and XHTML, and just about anything produced by the W3C. Better examples would include DocBook, TEI, MIL-STD-38784, and ISO 12083. I would argue that these are all documented much more extensively than most proprietary file formats. Certainly, being proprietary doesn't mean that the file format defines semantics any better than something with a DTD.
Sure the semantics aren't enforced by the DTD, but they can be enforced by the end user, something which is typically not true when you're editing a proprietary format using a foreign tool.
This kind of stuff is done by the US government on a daily basis.
sigs are a waste of space
In short, I expect to see this sort of tool become a reality in this season's software releases.
Other barriers to this also include decent formatting. We have reasonable XSLT styles for DocBook, but completely modifying these to make a custom look and feel is still pretty hard. Someone is going to release an XSLT WYSIWYG editor real soon now and make another killing in the market.
So in summary, I think yes, XML can and will replace proprietary formats, and will ultimately be easier to work with.
Want to deliver XML with Apache to varying media devices in different styles? Get AxKit
Matt. Want XML + Apache + Stylesheets? Get AxKit.
There are other document formats which deliver the same power, have been around longer, have not *radically* changed, and are open to implementation by other vendors. HTML and XML-based grammars are only one example of this. PostScript would be an even better example.
Just one nit: PostScript is actually a pretty bad example of this, because while it's reasonably easy to generate, it's horrendously hard to extract any useful information from.
Tools that take PostScript as input tend to be fairly fragile if they're trying to do anything beyond just rendering the document. "2up" converters often fail on PostScript generated from certain sources. Many graphics packages that allow insertion of EPS simply can't render the EPS on-screen unless there's an embedded TIFF "preview". PostScript to text converters rarely, if ever, work.
PostScript is a nice language for talking to printers. It isn't a good language for talking to software though. The fact that it's Turing complete means a lot of the analyses that would be useful to do on documents simply can't be done with PostScript without actually executing it, and there's no way you can tell if it'll ever halt. PostScript documents also tend to just be filled with low-level rendering information, not the high-level semantic information required for things like searching, translation, conversion into other formats, etc.
XML is far superior in this respect. XML documents can encode semantic information, and they're easy to analyze. They're also a heck of a lot easier to parse. There are many XML parsers available. I can only think of one PostScript parser that isn't built into a printer (Ghostscript). XML isn't a panacea though. Even if every application vendor switched to XML, they'd probably all use different DTDs. That's still better than unreadable binary formats though, because it's a lot easier to reverse engineer the file format if it isn't published.
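To give a rough idea of how little code that kind of analysis takes, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The sample document and its element names (report, section, title, para) are invented for illustration and do not come from any real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are invented for illustration.
SAMPLE = """<report>
  <section>
    <title>Installation</title>
    <para>Unpack the archive and run the installer.</para>
  </section>
  <section>
    <title>Troubleshooting</title>
    <para>Check the log file if the installer fails.</para>
  </section>
</report>"""


def section_titles(xml_text):
    """Return the text of every <title> directly inside a <section>."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall("./section/title")]


def sections_mentioning(xml_text, word):
    """Return titles of sections whose paragraphs contain the given word."""
    root = ET.fromstring(xml_text)
    hits = []
    for section in root.findall("section"):
        text = " ".join(p.text or "" for p in section.findall("para"))
        if word.lower() in text.lower():
            hits.append(section.findtext("title"))
    return hits


if __name__ == "__main__":
    print(section_titles(SAMPLE))
    print(sections_mentioning(SAMPLE, "log"))
```

Doing the same search against PostScript would mean executing the program and guessing which drawn strings were headings.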
The beauty of XML is that it allows people to focus on the data and not the formatting. If the word processor "industry" were to get together to support a single DTD (Document Type Definition) so that everyone would know how to react to specific tags, then you could have a format that any WYSIWYG editor would render correctly. And it would also allow people to do TeX-style editing as well, using their favorite text editor (xemacs of course!)
/ZL
Word is not the worst case here, Excel is even worse -- it has changed in almost every new release of MSOffice.
As for why this happens -- peer pressure, and that's exactly what Pauly talks about. If your client uses it, so will you (or at least you will have to convert to your customer's format before exchanging documents). In the recent past it was not even so much a question of tolerance, rather of no choice. Look at any of the Office Productivity Suites reviews at ZDNet or C|Net -- MS is almost always a clear-cut winner, even though an average consumer will NEVER use most of the bells and whistles (as a side note, wouldn't you think that most users could happily live with the functionality of Word 2.0?).
As for what could be done to resolve it, I think that trying (whenever possible) to exchange HTML docs could be one solution, but you lose some control over the layout and won't be able to do any sort of document automation. And when it comes to a 3000+ page document -- you just gotta convince that customer not to use Word for this.
A few people had mentioned TeX and LaTeX, as well as SGML here, but I guess this is not the answer for Pauly, as his customers are not happy with it. OTOH, slowly educating them could help a lot. FrameMaker would be the best choice then: you don't need UNIX to run it (unless you'd want to try to convert your customer completely), get great documents, can convert them into SGML (with FrameMaker-SGML).
--AP
The XML approach is much better from a technical point of view. With XML you can specify the structure of documents in the DTD, and you simply need one of the many XML libraries to actually parse the data (and even detect errors). If word processor vendors do not agree on a single DTD but create their own (which is the most probable thing to happen), you can specify a conversion scheme using a query language and even convert XML word processor documents between the DTDs automatically -- if every element in the source DTD has an equivalent element in the destination DTD.
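A conversion of that kind might look roughly like the sketch below. It uses a plain Python dictionary in place of a query language, and both vocabularies (doc/heading/p/em on one side, document/title/para/emphasis on the other) are hypothetical names made up for this example. Elements with no equivalent are reported rather than silently dropped, which is the caveat in the last clause above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from vendor A's element names to vendor B's.
TAG_MAP = {
    "doc": "document",
    "heading": "title",
    "p": "para",
    "em": "emphasis",
}

# Hypothetical source document in vendor A's vocabulary.
SOURCE = (
    "<doc><heading>Intro</heading>"
    "<p>Some <em>marked</em> text.</p>"
    "<sidebar>Extra</sidebar></doc>"
)


def convert(xml_text, tag_map):
    """Rename elements per tag_map; return (converted_xml, unmapped_tags)."""
    root = ET.fromstring(xml_text)
    unmapped = set()
    for elem in root.iter():
        if elem.tag in tag_map:
            elem.tag = tag_map[elem.tag]
        else:
            # No equivalent in the destination vocabulary: leave it and report.
            unmapped.add(elem.tag)
    return ET.tostring(root, encoding="unicode"), sorted(unmapped)


if __name__ == "__main__":
    converted, leftovers = convert(SOURCE, TAG_MAP)
    print(converted)
    print("no equivalent for:", leftovers)
```

A real converter would also have to map attributes and restructure content where the two DTDs nest things differently; a pure rename only works when the vocabularies line up one to one.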
With products such as XMill it is even possible to compress XML documents very well, so that the additional markup won't result in bloated files.
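XMill is a specialized XML compressor; as a stand-in, the sketch below uses the general-purpose zlib module from Python's standard library to show that repetitive markup compresses well even without XML-aware tricks. The generated document and its element names are hypothetical.

```python
import zlib


def build_document(paragraphs):
    """Build a hypothetical, markup-heavy document with many similar elements."""
    body = "".join(
        '<para style="body">Paragraph %d of the manual.</para>\n' % i
        for i in range(paragraphs)
    )
    return "<document>\n" + body + "</document>\n"


def compressed_sizes(text):
    """Return (raw_bytes, compressed_bytes) using zlib at its highest level."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)
    return len(raw), len(packed)


if __name__ == "__main__":
    raw_size, packed_size = compressed_sizes(build_document(1000))
    print("raw: %d bytes, compressed: %d bytes" % (raw_size, packed_size))
```

The exact ratio depends on the content, so no figure is claimed here; the point is only that repeated tags are cheap once compressed.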
Binary proprietary formats are only good for keeping the structure secret and competitors out of the race. I wonder why Microsoft opened up theirs... Maybe it has become complicated enough so that nobody tries to create a filter! Or the descriptions do not contain 100 percent of the file format or wrong information... Yes, that's a bit paranoid, I know. Anyone from the KOffice team here to give us some insight?!
Guess what? Microsoft (I know - don't bug me) already does it! The New Word 2K format is XML based
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
I'm usually very sceptical of new buzzword technologies. When I first heard about XML and did a little research I was floored by the elegant simplicity of the model. XML at its most basic is a set of parser (HTML or whatever) tags that allow the representation of structured data. Much like simple HTML tables are constructed of tags, XML extends this to define more complex structures.
For instance, a simple dataset containing haircolor, eyecolor and name for a group of people could be represented with tags like .... This idea is not only a boon for people trying to translate complex information across the web but it also allows for greater complexity in documents viewed on the web.
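As a minimal sketch of such a dataset: the sample tags from the original post did not survive, so the element names below (people, person, name, haircolor, eyecolor) are assumptions chosen to match the fields it mentions, and the records are made up. The parsing uses Python's standard-library xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element names are assumed from the fields named above.
PEOPLE = """<people>
  <person>
    <name>Alice</name>
    <haircolor>brown</haircolor>
    <eyecolor>green</eyecolor>
  </person>
  <person>
    <name>Bob</name>
    <haircolor>black</haircolor>
    <eyecolor>blue</eyecolor>
  </person>
</people>"""


def load_people(xml_text):
    """Turn each <person> element into a dict of field name -> text."""
    root = ET.fromstring(xml_text)
    return [
        {field.tag: field.text for field in person}
        for person in root.findall("person")
    ]


if __name__ == "__main__":
    for record in load_people(PEOPLE):
        print(record["name"], record["haircolor"], record["eyecolor"])
```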
Here's the part where things went awry. The W3C (World Wide Web Consortium) is the official standards organization for the web. Their biggest problem is that as a standards body they are trying to maintain stability and conformity of standards. This makes them rather slow at approving and implementing new standards. In the past this resulted in companies like Netscape and Microsoft integrating new technologies into their browsers long before they became standards. JavaScript and ActiveX are just two examples. Can't really blame them; they have to compete in the marketplace, and he who gives the consumer what they want soonest usually wins and gets to set the de facto standards. In a nutshell, the W3C has become little more than an R&D organization.
So, then we get to XML. Initially proposed over six years ago, it was at first rejected by the W3C. Many outside the W3C liked the proposal, though, so many groups started developing and testing different variations of XML. XML and similar technologies like XSL (Extensible Style Language), SVG (Scalable Vector Graphics) and a plethora of others began to appear. See OASIS to get an idea of how far this has gone.
Today there are so many standards for XML variants that there are actually groups with competing standards for XML formats as specific as data exchange between banks. Kind of like a modern day tower of Babel.
So, to answer your question, yes, XML holds a lot of promise for document and data interchangeability among different software products, but between here and that goal is one huge civil war among competing groups and technologies. Giants of the software industry like IBM and Microsoft have already staked out their ground. Recent patent rule changes and the passage of UCITA in several states have complicated matters by allowing companies to patent abstract things like database structure and parsing rules. Hopefully, as in the war between VHS and Beta, a clear winner emerges quickly; more importantly, the winner must be an open standard.
Well, yes. With 1996 standards even. It might of course be hard to find a popular browser that is even remotely up to date, but you can't blame HTML or stylesheets for that. And XML isn't a magic wand that suddenly makes browser authors do something "advanced" instead of going for the mass market appeal.
-- Abigail
I'm not going to get into the debate about "open" standards, XML vs proprietary format, or whether Microsoft is somehow evil.
I will say that if you prepared 3000 pages in a format that your client wasn't able to use, it's your fault. Stand up and do the legwork to understand your client's needs. If your client had the same version of Word, or you started with a copy of their version of Word, it wouldn't have mangled your "weeks of hard work." If you need critical compatibility, preview using exactly the same set of operating system, software, fonts, video drivers, printer drivers, paper and ink cartridges that they will use.
Applications extend their format all the time. I can't load a Photoshop 5.0 document into version 1.0 without problems. I can't load an HTML 2.0 compliant page into an HTML 1.0 compliant browser without problems.
The same thing would happen even if Microsoft was 100% XML 1.0 compliant, as soon as people made XML 2.0 documents.
It's your responsibility to provide the results for your client; stop blaming the tools. Get tools that will provide the results your clients want. "Gee, my hammer's left-handed, that's why I need to start your kitchen cabinets all over again."
(New file formats are not new. =anagram>
Lament, now refine software.)
XML offers quite a few benefits, not least of which being that it forces the author to think of a document in terms of a tree. It by no means will enable everyone to just start talking overnight, magically.
I'm trying to teach myself to set people on fire with my mind... Is it hot in here?
Have you ever tried saving a complex Word 97 document in Word 95 format? If it's just text with some bullets, italics and bolding it's no big deal. If you have graphics, Wordart(tm) or anything more complex than "Left Justify" you're screwed. If you like I can e-mail anyone who asks a sample of what I'm talking about. When I started here, some moron opened a WP file and saved it as a Word file. It took 10 hours to reformat the document because of the arcane features that had been used in the original.
Matthew Miller,
"Live Free or Die." Don't like it? Then keep out of the USA
Good idea, Cliff. Unfortunately this would prevent M$ from breaking things whenever they needed to, hence it is unlikely to occur, at least as far as M$ Office is concerned. If the proposed remedies in the current M$ anti-trust case include (as they should) measures to force (even temporarily) M$ to open its file formats, then the situation may change. The downside is that M$ will then attempt to co-opt whatever the standard becomes, and voila, there we are back at square one.
my god, at least you realized it was a joke. what are the requirements to moderate? here they are:
1) Recent lobotomy (credit for ECT)
2) Totally humorless (credit for cluelessness)
3) Blind
4) Stupid
5) Poke at keyboard with cane.
it was soooo obviously a joke...oh well. knowing my luck i'll probably end up trying to teach one of these chumps to program one day...take a deep breath...start over at the beginning...keep trying to break through...arrrrgh.
Treatment, not tyranny. End the drug war and free our American POWs.
See my user info for links.
I've got friends claiming that XML is the panacea for computing... for everything from e-commerce to a replacement for SAP to a standard for documents.
The only problem is that the applications have to be created to support the XML standard. So, unless you have a word processor that supports XML, and the people you're sending your documents to can read XML with their software ('cause you know MS will make an MSXML bastardisation...) you might still be out of luck.
BlackNova Traders
The reason we do this is simple. The United States government. Nope, I'm not claiming conspiracy, but look at what one has to do to do business with the US government. You must submit your specs in Word.
Now all the businesses that want to do business with the government switch to Word. So what happens next? The businesses that do business with those businesses switch to Word. It's recursive.
Personally, I think the government could do much to open up the playing field by making it so all documents sent to the government had to be in some openly documented file format (XML based if you like to pretend that XML solves all problems, or just some random binary format or what not.)
This simple move would smack Microsoft far harder, and more fairly than most any DoJ action.
----------------------------
I spend my working time (and then some) as a Web Designer and have recently been trying to read up on XML and XHTML. (is that slightly redundant?)
It is turning out to be quite a difficult task. While everything I read tells me that it will be replacing all those proprietary document formats, it doesn't tell me exactly how that is supposed to work in a real world scenario. I believe that it does have that potential, but I am stuck in exactly the same place as the poster... not being able to find the answer to what seems to me a rather basic and obvious question. Is it worth my time to learn XML for future use or is it just another wild dream of a select few people?
If Microsoft's Word 95 and Word 97 document formats were XML based, there is no guarantee that you could seamlessly down-convert a Word 97 document to a Word 95 document. What if your Word 97 document uses a few features that are specific to or changed in Word 97? The XML converter would have to approximate the Word 95 equivalent and would probably botch the job, the same way the existing 97->95 converter did. The bottom line is that the file format changed between Word 95 and Word 97, and it doesn't matter how the format is stored; things will go wrong when you attempt to down-convert.
In addition, XML only affects how the file is stored on disk. Internally, Microsoft Word will represent your document the same way regardless of whether it's stored as XML or in a binary format. If it wants to create a binary version of your document, Word will simply write your document's raw internal data structures to the disk; if it wants to create an XML version of the document, it will first convert its internal binary version to XML and write it to disk. The only case where an XML based file format is better is for third parties who don't know the internal structure of Word's file formats, but still want to read its files. Microsoft has intimate knowledge of its own file formats, so storing documents as XML gives no advantage to Microsoft applications.
Sig goes here
"Your example of Word formats changing is a perfect one. If Word95 used XML, Word97 could still be incompatible if it used different elements and attributes."
You're overlooking a fundamental feature of XML. If Word21 needs to add additional elements or attributes to support new features, they simply create new tags. If the document is loaded in Word20 (ignorant of those tags) it won't look quite right (whatever feature was implemented with those tags will be skipped) but it will still display. If M$ wanted to try to maintain its current upgrade-4-compatibility approach, they could change all the tags with every version, but such obvious and outlandish behavior would only serve to destroy whatever fragment of reputation they still have.
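Here is a minimal sketch of that "skip what you don't know" behavior, assuming an older reader that understands only a few elements. All element names (document, para, bold, and the newer sparkle) are hypothetical, and the reader keeps the text inside an unknown element while ignoring the feature it stands for, much as old browsers treat unknown HTML tags.

```python
import xml.etree.ElementTree as ET

# Elements the hypothetical "older" reader understands.
KNOWN = {"document", "para", "bold"}

# Document written by a hypothetical newer version that added <sparkle>.
NEWER_DOC = (
    "<document><para>Totals are "
    '<sparkle speed="fast">up</sparkle> this <bold>quarter</bold>.'
    "</para></document>"
)


def render(elem, known, skipped):
    """Flatten an element to text, noting any tags the reader does not know."""
    if elem.tag not in known:
        skipped.append(elem.tag)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, known, skipped))
        parts.append(child.tail or "")
    return "".join(parts)


def read_document(xml_text, known=KNOWN):
    """Return (plain_text, unknown_tags) for a document."""
    skipped = []
    text = render(ET.fromstring(xml_text), known, skipped)
    return text, skipped


if __name__ == "__main__":
    text, skipped = read_document(NEWER_DOC)
    print(text)
    print("ignored features:", skipped)
```

This only works if new versions add tags rather than changing the meaning of existing ones, which is the condition the comment above depends on.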
"XML can't replace proprietary document formats. That's like asking if ASCII could replace proprietary document formats."
I must not be understanding what you mean when you refer to ASCII, since simple text formats replace proprietary document formats all the time. TeX, CSV, RTF, HTML, PS: all are human-readable text files. Certainly XML is only part of the solution; it stores the content while the format is handled elsewhere. In that sense it differs from the traditional mixed approach.
The most important thing about the transition from mixed formatting/content to clearly delineated content vs. formatting is that the author isn't (ultimately) going to have any control over formatting. Relax and give it a little thought. The format of a document should be determined (or at least be determinable) by the person reading it. If I counted the times I've read the source of someone's HTML because their background is obnoxious, it would add up to a lot of wasted time.
No, it hasn't (already happened).
Microsoft wants you to believe that they are buzzword compliant, but in reality the output from Microsoft's "Save As HTML" looks like XML, smells like XML, but isn't. Try parsing it.
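"Try parsing it" can be taken literally. The sketch below checks well-formedness with Python's standard-library XML parser; the two snippets are invented examples of XML-looking markup, not actual Word output.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Invented examples: the first is well-formed, the second is HTML-style
# tag soup with an unclosed <br> and an unclosed <b>.
GOOD = "<p>Hello <b>world</b><br/></p>"
SOUP = "<p>Hello <b>world<br></p>"

if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(SOUP))
```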
See the recent Byte article "The cup is half full" for more details. I'm surprised you haven't heard about this. MS is using its proprietary XML Islands inside an HTML document. That means you have to get an HTML parser to be able to parse it. The content of the XML is just as proprietary. It's basically a conversion of their OLE Document objects into XML.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
Why have you gotten so offended? If you don't like what I have to say then at least be polite, after all, it only reflects badly on you and hence Slashdot as a whole. I have commonly found an amazing resistance to different opinions amongst the "open" source community, which seems to me to be the antithesis of what you stand for.
Comments such as "XML should be in the kernel" betray a lack of understanding as to the proper function of the kernel. Worse yet, (unlike, say, khttpd), putting an XML parser in the kernel wouldn't provide any benefit. All you're doing is encouraging the kind of useless feature bloat that Microsoft is rightly loathed for. That's why people get upset about remarks like this; they don't want this attitude to spread further than it already has.
Anyway, what you are clearly unaware of is that the perception of performance and stability is far more important in the corporate domain than the actuality of the situation. By integrating XML into the kernel, you have provided Linux with a major marketing point for the people who are actually in charge of what their company uses.
You won't be able to maintain the perception of performance and stability if the actuality is the opposite. Even Microsoft, with its legendary marketing might, has begun to pick up on this fact a little. (Note how stability has become a marketing point for them; why would it need to be, but for the constant crashing of their existing products?)
The exact breakdown of an operating system varies from one OS to another. In general, the purpose of any "operating system" is to arbitrate and manage hardware resources. Anything else is basically fluff. XML parsing is an application support issue, and detracts from the core function of managing hardware resources. Occasionally, an application function may be put in the kernel for good reasons, usually related to huge performance advantages gained by an in-kernel implementation. (khttpd is an example of this.) Even this is resisted strongly, because it "pollutes" the most critical code in the entire system, and poses an inherent risk to the stability, integrity and maintainability of the system as a whole.
Basically, to add an application-specific function to the kernel, you had better have a really good reason to be suggesting it, one that can be justified (and defended) on a technological basis. If Linus were to allow marketing considerations (such as this) to drive kernel development, not only would he lose the respect of most of his supporters, but the end result would be just as crappy as Windows, sooner or later.
Given that Linus himself has talked about "world domination", doesn't it seem short-sighted to ignore a major selling point in favour of your petty-minded arguments?
Keep in mind that "world domination" remarks are somewhat tongue-in-cheek. Yes, he's half-serious, but only half. He wants people to use Linux over Windows because it's a better system. It wouldn't remain better if this approach to kernel development were adopted. Keeping the kernel pure isn't a "petty-minded argument"; it's a critical element of good design.
All that said, you would have received a much different response had you suggested that Linux systems (as a whole) start integrating XML support, using XML for system configuration and providing XML services for applications. There's a good argument to be made for that, and the marketing value should be similar. There are also technological arguments to be made in favor of it. The distinction here is that this support would all be in "user space" rather than the kernel, even though it might be an integral part of the operation of the system as a whole. The kernel is the core of the system, and the idea of integrating XML into Linux does not imply that it belongs in the kernel.
Deven
"Simple things should be simple, and complex things should be possible." - Alan Kay
The question is: Why do software consumers tolerate this?
The compatibility breaking between different versions of Word is well-known and oft-maligned. I have a hard time seeing it as anything more than a forced upgrade cycle, where Word users MUST buy the latest version in order to exchange documents.
There are other document formats which deliver the same power, have been around longer, have not *radically* changed, and are open to implementation by other vendors. HTML and XML-based grammars are only one example of this. PostScript would be an even better example.
So why have business environments settled on a standard which seems clearly to not be in their best interests? Why do they blindly pay for new versions every few years when their current versions do everything they need and more?
I'm all for letting the free market determine the best product, but Word strikes me as a solid example of the free market failing in this regard. Perhaps poor consumer education is preventing software from being a truly free market. The feature set of Word is nice, but the upgrade-insuring file format should cause people to run away. I would be skeptical of a car that used non-standard gasoline and forced me to buy an engine upgrade each year to handle new gas.
How has this been allowed to happen?
Save the whales. Feed the hungry. Free the mallocs.
What's the downside? Simple. Lack of tool support. There are lots of portable document formats out there already. MIF is published, WordPerfect doc format is published, even RTF is supposedly for portability, etc. Why not send your customers docs in these formats? Because the word processor that has 94% of the market has no incentive to enable competitors by supporting them, and even has a great deal of incentive to minimize compatibility between its own generations (as you found out.)
Assuming that any open document standard emerges, you can pretty well bet that saving from the market leader to that format will be an ugly process (have you looked at the HTML that that turkey produces? Blech!) You can also bet that imports from it will be better but still a pain. For real fun, try repetitive translations between the native format and the portable one and compare the starting and end results.
The sad fact is that monopolists have a huge stake in incompatibility (read the Halloween Documents) and every reason to maintain it. The rest of us will just have to survive in that environment until it changes. Changing it is another topic entirely, but for once I'll say, Vive la France!
Lacking <sarcasm> tags,
XML can't replace proprietary document formats. That's like asking if ASCII could replace proprietary document formats. XML and ASCII are not really file formats. They simply don't do the same job as file formats.
If you have ever used lex or yacc, then you'll know what I mean when I say that XML parsers essentially do the job of lex, but not of yacc. An XML parser is little more than a scanner which breaks a file into chunks to simplify the next level of processing. The XML parser gives the illusion of hierarchical processing that lex can't do, but it's an illusion nonetheless.
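One way to read this claim is sketched below, using the event-style xml.parsers.expat module from Python's standard library. The parser hands the application a stream of start, text, and end events and is satisfied as long as the tags nest; deciding what an element means, or whether its content is sensible, is left to code the application supplies. The invoice and total element names and the total_is_numeric check are hypothetical.

```python
import xml.parsers.expat


def tokenize(xml_text):
    """Return the parser's event stream as a list of (kind, value) tuples."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of text as a single event

    def on_text(data):
        if data.strip():
            events.append(("text", data.strip()))

    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)
    return events


def total_is_numeric(events):
    """Application-level check the XML parser knows nothing about."""
    for i, event in enumerate(events):
        if event == ("start", "total") and i + 1 < len(events):
            kind, value = events[i + 1]
            return kind == "text" and value.isdigit()
    return False


if __name__ == "__main__":
    # Both inputs are well-formed, so the parser accepts both.
    for doc in ("<invoice><total>12</total></invoice>",
                "<invoice><total>twelve</total></invoice>"):
        events = tokenize(doc)
        print(events, total_is_numeric(events))
```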
Your example of Word formats changing is a perfect one. If Word95 used XML, Word97 could still be incompatible if it used different elements and attributes.
So no, XML will not replace proprietary file formats. XML + proprietary DTD specifications + proprietary semantics could replace proprietary file formats. Is this an improvement? Probably. Will it make backward (or forward, or sideways) compatibility problems go away? Nope.
--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
First off, while there's a place for MS Word, a 3000-page document ain't it. In my experience it tends toward severe breakage in this situation.
Office2K will already save docs in a kind of bastardized HTML++ format which truly sucks because it is neither rules-following HTML nor well-formed XML, and it could have been without much trouble. A little bird has told me that a not-too-distant future release of Office will have a *real* XML save format, which would be cool. I mean, a lot of the tags will still be proprietary MS gibberish, but at least you can parse 'em, and it'll be way less susceptible to inter-version breakage.
A basic part of the XML dream was the notion that proprietary data formats for software packages are just as silly as the 80's notion that computer networks should have proprietary per-wire data formats (remember DECnet, Wangnet, SNA?). So what Pauly wants is exactly what XML is trying to do.
Having said that, a lot of the infrastructure we need to make it easy to author and deliver XML isn't here yet.
What I'm doing these days for complex documents is writing them in HTML++, by which I mean mostly well-formed HTML to which I add my own tags (e.g. , ) whenever I need to; because you can display what you've written in old browsers, which helpfully ignore the non-HTML tags, and you can write perl scripts or use XSL to turn it into RTF if you want to publish paper, and with Mozilla you can write a CSS stylesheet and dress up your own tags the way you want.
Cheers, Tim Bray