Dark Corners of the OpenXML Standard
Standard Disclaimer writes "Most here on Slashdot know that Microsoft released its OpenXML specification to counter ODF and to help preserve its market position, but most people probably aren't aware of all the interesting legacy code the OpenXML specification has brought to light. This article by Rob Weir details many of the crazy legacy features in the dark corners of OpenXML. As it concludes after analyzing specification requirements like suppressTopSpacingWP, 'so not only must an interoperable OOXML implementation first acquire and reverse-engineer a 14-year old version of Microsoft Word, it must also do the same thing with a 16-year old version of WordPerfect.'"
So what can you do?
The solution is simple. Create a job description that is written specifically to your friend's background and skills. The more specific and longer you make the job description, the fewer candidates will be eligible. Ideally you would write a job description that no one else in the world except Guillaume could possibly match. Don't describe the job requirements. Describe the person you want. That's the trick.
So you end up with something like this:
* 5 years experience with Java, J2EE and web development, PHP, XSLT
* Fluency in French and Corsican
* Experience with the Llama farming industry
* Mole on left shoulder
* Sister named Bridgette
Although this technique may be familiar, in practice it is usually not taken to this extreme. Corporate policies, employment law and common sense usually prevent one from making entirely irrational hiring decisions or discriminating against other applicants for things unrelated to the legitimate requirements of the job.
But evidently in the realm of standards there are no practical limits to the application of the above technique. It is quite possible to write a standard that allows only a single implementation. By focusing entirely on the capabilities of a single application and documenting it in infuriatingly useless detail, you can easily create a "Standard of One".
Of course, this raises the question of what is essential and what is not. This really needs to be determined by domain analysis, requirements gathering and consensus building. Let's just say that anyone who says that a single existing implementation is all one needs to look at is missing the point. The art of specification is to generalize and simplify. Generalizing allows you to do more with less, meeting more needs with fewer constraints.
Let's take a simplified example. You are writing a specification for a file format for a very simple drawing program, ShapeMaster 2007. It can draw circles and squares, and they can have solid or dashed lines. That's all it does. Let's consider two different ways of specifying a file format for ShapeMaster.
In the first case, we'll simply dump out what ShapeMaster does in the most literal way possible. Since it allows only two possible shapes and only two possible line styles, and we're not considering any other use, the file format will look like this:
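As a rough sketch of what such a literal format could look like (the element names here are invented for illustration and are not taken from the original), each of the four shape and line-style combinations gets its own dedicated element, and nothing else can be expressed:

```python
# Hypothetical illustration only: all element names are invented.
# The "literal" format hard-codes every shape/line-style combination.
import xml.etree.ElementTree as ET

LITERAL_ELEMENTS = {
    ("circle", "solid"): "solidCircle",
    ("circle", "dashed"): "dashedCircle",
    ("square", "solid"): "solidSquare",
    ("square", "dashed"): "dashedSquare",
}

def write_literal(shapes):
    """Serialize (shape, line_style) pairs, one dedicated element per combination."""
    root = ET.Element("shapeMasterDocument")
    for shape, style in shapes:
        # Anything outside the four hard-coded combinations raises KeyError:
        # the format simply has no way to say it.
        ET.SubElement(root, LITERAL_ELEMENTS[(shape, style)])
    return ET.tostring(root, encoding="unicode")

literal_xml = write_literal([("circle", "solid"), ("square", "dashed")])
print(literal_xml)
```

Asking this writer for a triangle, or for a dotted line, fails outright, which is the whole problem with a format that mirrors one application.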
Although this format is very specific and very accurate, it lacks generality, extensibility and flexibility. Although it may be useful for ShapeMaster 2007, it will hardly be useful for anyone else, unless they merely want to create data for ShapeMaster 2007. It is not a portable, cross-application, open format. It is a narrowly-defined, single-application format. It may be in XML. It may be reviewed by a standards committee. But it is, by its nature, closed and inflexible.
How could this have been done in a way which works for ShapeMaster 2007 but also is more flexible, extensible and considerate of the needs of different applications? One possibility is to generalize and simplify:
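One way to picture the generalized approach, again as a sketch with invented names rather than the original example: a single generic shape element whose kind and dash pattern are plain attributes, so ShapeMaster's circles and squares fit, and so does anything another application might want to add.

```python
# Hypothetical illustration only: element and attribute names are invented.
import xml.etree.ElementTree as ET

def write_general(shapes):
    """Serialize shapes as one generic <shape> element described by attributes."""
    root = ET.Element("drawing")
    for shape in shapes:
        # An empty dash pattern means a solid line; any other pattern is allowed.
        dash = " ".join(str(n) for n in shape.get("dash", []))
        ET.SubElement(root, "shape", {"kind": shape["kind"], "dash": dash})
    return ET.tostring(root, encoding="unicode")

def read_general(xml_text):
    """Parse the generic format back into a list of shape dictionaries."""
    root = ET.fromstring(xml_text)
    return [
        {"kind": el.get("kind"), "dash": [int(n) for n in el.get("dash", "").split()]}
        for el in root.iter("shape")
    ]

shapes = [
    {"kind": "circle", "dash": []},        # solid line
    {"kind": "square", "dash": [4, 2]},    # dashed: 4 units on, 2 off
    {"kind": "triangle", "dash": [1, 1]},  # not a ShapeMaster shape, still expressible
]
general_xml = write_general(shapes)
round_trip = read_general(general_xml)
```

ShapeMaster 2007 would only ever write circles and squares with two dash patterns, but nothing in the format itself ties it to that one program.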
After having written some tools on OS X that do stuff with RTF:
RTF is well documented and you can make an RTF document on all manner of platforms (I've done it in Ruby and Cocoa), but many platforms have extended RTF in their own way in order to support special features. OS X has added a few special methods to RTF files to support Mac OS X typography, and I've noticed that different versions of Word handle document attributes (like headers and page numbers) in different ways.
RTF is great if you want to make up something quick that is ONLY formatted text, but readers have all manner of different ways of interpreting the exact appearance of tables, page layouts and margins, and there doesn't seem to be any manageable common mechanism for including images or other documents, something Word and OO.org excel (pun) at. Even HTML seems to be better at this.
I use RTF output in a few little in-house tools I have, so people can get the text+attributes they create and open them in a text editor of their choice for touching-up and delivery. When my tools have to create something that is supposed to be finished, they make PDFs.
RTF is great for interoperability, but I never expect an RTF file to contain a "finished product," unless the recipient expects quality on par with a Selectric. It is merely a relatively-open serialization format for strings with attributes.
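To make the "strings with attributes" point concrete, here is a minimal sketch in Python (not the Ruby or Cocoa tools mentioned above, which aren't shown). The run structure, the function names and the choice of Helvetica are my own assumptions; it only handles bold and italic and nothing like tables or images.

```python
import struct

# Assumed mapping from attribute names to RTF control words.
ATTR_WORDS = {"bold": "\\b", "italic": "\\i"}

def rtf_escape(text):
    """Escape text for RTF: backslash and braces, newlines, and non-ASCII."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\line ")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit value per UTF-16 code unit;
            # the trailing "?" is the fallback shown by readers without \u support.
            raw = ch.encode("utf-16-le")
            units = struct.unpack("<%dh" % (len(raw) // 2), raw)
            out.extend("\\u%d?" % u for u in units)
    return "".join(out)

def runs_to_rtf(runs):
    """Turn a list of (text, attribute set) runs into one RTF document string."""
    body = []
    for text, attrs in runs:
        controls = "".join(ATTR_WORDS[a] for a in sorted(attrs))
        if controls:
            # A group scopes the formatting to this run only.
            body.append("{" + controls + " " + rtf_escape(text) + "}")
        else:
            body.append(rtf_escape(text))
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0 "
    return header + "".join(body) + "\\par}"

rtf = runs_to_rtf([
    ("Plain, ", set()),
    ("bold", {"bold"}),
    (" and ", set()),
    ("both {braces}", {"bold", "italic"}),
])
```

Writing `rtf` to a file with an .rtf extension should give something most word processors will open, though how each one renders it is exactly the part you can't count on.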
Don't blame me, I voted for Baltar.
Once it is ratified as an ISO Standard, the standard is locked up and anyone who does want a copy has to buy it from ISO. These are copyrighted. They're not cheap; thousands of dollars. Out of the reach of the average hobbyist, and not listed anywhere on the Internet. That 6,000 page draft will vanish into the mists of time.
You mean like the C++ standard (ISO/IEC 14882) which can be downloaded as a PDF for $32 or purchased in hardcopy for something like $300, and for which there are multiple sources for drafts?
Call me a shill if you want, but I've seen NineNine post in a lot of different threads not related to MS. I think he's acting in good faith.
After all, I am strangely colored.
No. See OpenDocument and OpenDoc. Two different things. Sort of...
Have you ever considered piracy? You'd make a wonderful Dread Pirate Roberts.
Nobody takes ECMA seriously anyway.
You probably know that JavaScript has been standardized as "ECMAScript" by ECMA; everybody just ignores that standard.
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
Why would anyone want to implement support for really old versions if Microsoft does not do it themselves?
Nobody would. That's the point of it.
KFG
They wouldn't get too far gouging you for a C++ manual. Here are some examples of what I am saying:
* ISO/IEC TR 9126 "Software engineering -- Product quality": US$153 each volume * 4 volumes = US$612
* IEEE 1278: US$151 each volume * 6 volumes = US$906
The problem is that when you are told your software has to comply with one of these, these are the only shops in town. They prohibit copying or sharing the information. Anyone who wants to meet the standard has to send I$O or IEEE money, and there are many, many of these standards. A typical tender might list twenty.
No. Everyone shortens the expansion of ODF to "Open Doc Format". As you note, there was an "OpenDoc" long before OpenOffice.org and the OpenDocument Format, but they have nothing to do with one another. ODF is based on the StarOffice format that has been around a while, that much is true. Just a sad case of mixing up names...
Eh? Isn't that why M$ made this supposedly "open" format? Because governments were tired of paying through the nose for secret formats that broke between versions? The purpose of an archive is to read it later. Governments and companies have already moved to PDF for archives. They are going to move their working documents to reasonable formats next.
But MS opened their own format, thus leveling the playing field so that you must again compete on features ...
You must not have read the 6000 page spec, which includes lots of sections like this:
That's neither open, nor a standard.
Microsoft is hoping people believe what you say, but everyone knows better. Shit like OOXML only proves that they have not changed. It's just another, more elaborate and more expensive lie. Even the name, by using "OO", is intentionally confusing. The New Office is everything the old Office was and always will be. Vista and Office 2007 are non-starters.
DMCA, Hollings, Palladium. What might have sounded like paranoia is now common sense.
Documents are worth far more than software, and they outlive the applications used to create them. See the comment to the original article - reading documents after 5, 20, 30, 100 years or more is not optional.
Which is why medical, legal and military records are often not held in word processor formats. For instance, the military records I have dealt with (NATO mostly) are held in SGML, conforming to carefully designed MIL DTDs that preserve structure rather than presentation. These files can be translated faithfully into HTML, PDF and so on without losing their meaning.
Sadly, as MS Word has become all too pervasive, more and more documents are stuck in a format where you cannot reliably extract data or convert it. For instance, with the processing system for scientific journal articles I worked on, a table of contents can't be generated from Word documents, as headings are usually just inline styling that isn't even consistent within a single document. Years ago, the journal publishers could insist on properly structured data (LaTeX for instance), but nowadays scientists are too lazy - all they care about is the cheap thrill of seeing their article in print, sod anyone who wants to index or cross reference the data for future use.
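A toy illustration of why structure matters here, in Python with a made-up block layout (this is not the journal system or any real DTD): when each block says what it is, a numbered table of contents falls out of a simple loop, whereas "14pt bold text" tells you nothing reliable.

```python
# Hypothetical structured document: each block is tagged by role, not appearance.
document = [
    {"type": "heading", "level": 1, "text": "Introduction"},
    {"type": "para", "text": "Why archival formats matter."},
    {"type": "heading", "level": 2, "text": "Background"},
    {"type": "para", "text": "Earlier work on structured markup."},
    {"type": "heading", "level": 1, "text": "Methods"},
]

def table_of_contents(blocks):
    """Build numbered TOC entries from heading blocks alone."""
    toc = []
    counters = []
    for block in blocks:
        if block["type"] != "heading":
            continue
        level = block["level"]
        # Trim deeper counters or pad missing ones so the stack matches this depth.
        counters = counters[:level] + [0] * (level - len(counters))
        counters[level - 1] += 1
        number = ".".join(str(n) for n in counters)
        toc.append(number + " " + block["text"])
    return toc

toc = table_of_contents(document)
```

The same heading blocks could just as easily be rendered to HTML or PDF; the point is that the information survives because it was recorded as structure in the first place.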
Actually, you can. RTF can express most (if not all) of what the Microsoft Word format can. Let me answer your objections using excerpts from the RTF 1.8 specification:
The \tc control word introduces a table of contents entry, which can be used to build the actual table of contents.
The \stylesheet control word introduces the style sheet group, which contains definitions and descriptions of the various styles used in the document.
An RTF file can include pictures created with other applications. These pictures can be in hexadecimal (the default) or binary format. Pictures are destinations and begin with the \pict control word.
\dgmt creates diagrams. \pict of subtype \*metafile supports vector drawings, in case that's what you meant by "diagrams".
Check out http://www.microsoft.com/downloads/thankyou.aspx?
I had a fun problem with a version of Word (for Windows 2.0, I think) many years back. Some friends came by to print a paper for a CS class, and the files they brought were made with the Brazilian Portuguese version of Word.
I had the English version of Word. When I tried to print, I discovered (after a lot of pages, of course) that I had to fix the formatting because some of the formatting was translated... And not even logical stuff like accents - page breaks, footnotes, etc.