Microsoft Releases Office Binary Formats

← Back to Stories (view on slashdot.org)

Microsoft Releases Office Binary Formats

Posted by kdawson on Wednesday February 20, 2008 @01:04AM from the this-way-lies-madness dept.

Microsoft has released documentation on their Office binary formats. Before jumping up and down gleefully, those working on related open source efforts, such as OpenOffice, might want to take a very close look at Microsoft's Open Specification Promise to see if it seems to cover those working on GPL software; some believe it doesn't. stm2 points us to some good advice from Joel Spolsky to programmers tempted to dig into the spec and create an Excel competitor over a weekend that reads and writes these formats: find an easier way. Joel provides some workarounds that render it possible to make use of these binary files. "[A] normal programmer would conclude that Office's binary file formats: are deliberately obfuscated; are the product of a demented Borg mind; were created by insanely bad programmers; and are impossible to read or create correctly. You'd be wrong on all four counts."

16 of 259 comments (clear)

One possible reason for releasing the specs now by Stan+Vassilev · 2008-02-20 01:20 · Score: 5, Insightful

One may wonder, why release the documentation now?

If you read Joel's blog you'll see the formats are very old, and consist primarily of C-structs dumped to OLE objects, dumped directly to what we see as an XLS, DOC and so on files.

There's almost no parsing/validation at load time.

Having this in a well laid documentation may reveal quite a lot of security issues with the old binary formats, which could lead to a wave of exploits. Exploits that won't work on Microsoft's new XML Office formats.

So while I'm not a conspiracy nut, I do believe one of Microsoft's goals here are to assist the process of those binary formats becoming obsolete, to drive Office 2007/2008 adoption.
1. Re:One possible reason for releasing the specs now by Chief+Camel+Breeder · 2008-02-20 01:55 · Score: 5, Informative
  
  Actually, I think they're releasing it now because they were ordered to in a (European?) court settlement, not because they want to.
2. Re:One possible reason for releasing the specs now by Stan+Vassilev · 2008-02-20 02:32 · Score: 5, Insightful
  
  Come on. You really think Microsoft wants to increase the vulnerability of old versions of Office (which are still the vast majority in corporate America). This not only makes their software looks bad, it increases the amount of work they have to do to support the older versions (yes, they still support Office 2003). You don't sell new cars by convincing people the last model was rubbish. I think your tin-foil hat fits a little to tight.
  
  Let me break your statement in pieces:
  
  - that would increase the vulnerability of old Office
  - the majority of corporate America is stuck on old Office
  - you don't sell old cars by convincing old ones are rubbish
  
  You know, have you seen those white-papers by Microsoft comparing XP and Vista and trying to put XP-s reliability and security in bad light?
  
  Or have you seen those ads where Microsoft rendered people using old versions of office as... dinosaur-mask wearing suits?
  
  If the majority of corporate America uses the old Office, then the only way for Microsoft to turn in profit would be to somehow convince them this is not good for them anymore, and upgrade. You're just going against yourself there.
Office Doc Generation on the Server by VosotrosForm · 2008-02-20 01:25 · Score: 5, Informative

I would like to point out another good option Joel doesn't have on his list. It's a software called OfficeWriter, from a company named SoftArtisans in Boston. When I last checked/worked there, it was capable of generating Excel and Word docs on the server, and I believe Powerpoint was probably coming relatively soon. Creating a product that can write office documents isn't quite as impossible in terms of labor as Joel is saying.... but it's still way beyond any hobby project. Plus, he is suggesting that you use Excel automation or the like through scripts to create documents on the server, which is a decent suggestion, if you want Excel or Word to constantly crash and lock up your server, and you enjoy rebooting them every day. If you want to do large scale document generation on a server you are going to need something like Officewriter. -Vosotros/Matt
Promise not a license by G0rAk · 2008-02-20 01:26 · Score: 5, Insightful

As PJ pointed out over on Groklaw, MS are giving a "Promise" not to sue but this is very very far from a license. Careful analysis suggests that any GPL'd software using these binaries could easily fall foul of the fury of MS lawyers.

--

Nothing to see here. Move along.
1. Re:Promise not a license by morgan_greywolf · 2008-02-20 01:38 · Score: 5, Interesting
  
  As PJ pointed out over on Groklaw, MS are giving a "Promise" not to sue but this is very very far from a license. Careful analysis suggests that any GPL'd software using these binaries could easily fall foul of the fury of MS lawyers. Correct.
  
  Here's my suggestion: someone should use these specs to create a BSD-licensed implementation as a library. Then, of course, (L)GPL programs would be free to use the implementation. Nobody gets sued, everybody is happy.
  
  --
  My blog
Re:patent promise doesn't sound very good by Ed+Avis · 2008-02-20 01:36 · Score: 5, Interesting

Basically, Microsoft reserves the right to sue you for software patent infringements. So do thousands of other big software companies and patent troll outfits. The new thing now is that Microsoft likes to generate FUD by producing partial waivers and promises that apply to some people in limited circumstances (Novell customers, people 'implementing a Covered Specification', and so on). The inadequacy of this promise draws attention to the implicit threat to tie you up in swpat lawsuits, which was always there - but until this masterstroke of PR the threat wasn't commented on much.

Ignore the vague language and develop software as you always have.

--
-- Ed Avis ed@membled.com
Re:patent promise doesn't sound very good by ContractualObligatio · 2008-02-20 01:43 · Score: 5, Informative

If there are any optional parts of the spec, those parts aren't covered.

RTFA. That's in the FAQ. Yes they are.

If the spec refers to another spec to define some part of the format, that part isn't covered.

In other words - if you do something related to a spec that isn't covered, it isn't covered. How could it be any different?!

I'm not saying that there aren't any flaws, but this kind of ill informed, badly thought out comment (a.k.a. "+5 Insightful", of course) has little value.
Don't Adopt. Convert. by Doc+Ruby · 2008-02-20 02:00 · Score: 5, Insightful

Spolsky's advice explains that the format code is extremely bad code from the POV of a programmer picking it up to use starting now. Because it grew like a coral reef, starting so long ago that interoperability with anything else but the app's codebase at the time was not in the designs. And every new feature was thrown in as a special case, rather than any general purpose facility for kinds of features or future expansion. The Microsoft legacy that leverages every year's market position into expansion the next year.

But we're not Microsoft, and we don't have the requirements MS had when making these formats. So we should by no means perpetuate them. We should do now what MS never had reason to do: upgrade the code and drop the legacy stuff that makes most of the code such a burden, but doesn't do anything for the vast majority of users today (and tomorrow).

That's OK, because Microsoft has done that, too, already. The MS idea of "legacy to preserve" is based on MS marketing goals, which are not the same as actual user requirements. So that legacy preservation doesn't mean that, say, Office 2008 can read and write Word for Windows for Workgroups for Pen Computing files 100%. MS has dropped plenty of backwards compatibility for its own reasons. New people opening the format for modern (and future) use can do the same, but based on user requirements, not emphasis on product lines if that's not a real requirement.

So what's needed is just converters that use this code to convert to real open formats that can be maintained into the future. Not moving this code itself into apps for the rest of all time. Today we have a transition point before us which lets us finally turn our back on the old, closed formats with all their code complexity. We can write converters that can be used to get rid of those formats that benefited Microsoft more than anyone else. Convert them into XML. Then, after a while, instead of opening any Word or Excel formats, we'll be exchanging just XML, and occasionally reaching for the converter when an old file has to be used currently. MS will go with that flow, because that's what customers will pay for. Soon enough these old formats will be rare, and the converters will be rare, too.

Just don't perpetuate them, and Microsoft's selfish interests, by just embedding them into apps as "native" formats. Make them import by calling a module that can also just batch convert old files. We don't need this creepy old man following us around anymore.

--
--
make install -not war
doing the right thing by carou · 2008-02-20 02:03 · Score: 5, Insightful

From Joel's FA:
There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904. Excel supports both because the first version of Excel, for the Mac, just used that operating system's epoch because that was easy, but Excel for Windows had to be able to import 1-2-3 files, which used 1/1/1900 for the epoch. It's enough to bring you to tears. At no point in history did a programmer ever not do the right thing, but there you have it. Nonsense.

When Excel started importing 1-2-3 documents, the right way to do that would be to create an importer to your own native format. Not to munge a new slightly different format into your existing structures. Yes, you'd have had to convert some dates between 1900 and 1904 formats (and maybe, detect cases where the old 1-2-3 bug could have affected the result) but at least you wouldn't be trying to maintain two formats for the rest of time.

If this is an example of programmers throughout history always doing exactly the right thing, I'd hate to see an example of code where the original author regretted some mistakes that had been made.
I will gladly pay anyone by flanders123 · 2008-02-20 02:20 · Score: 5, Funny
...to take this spec and create an identical .doc format, circumventing Word's bullet AI.
- it
- - never
  - ever
- ever
- - works
The file format is not really important by wrook · 2008-02-20 02:28 · Score: 5, Interesting

I've worked on some of these file formats quite a bit (I was the text conversion guy when WP went to Corel -- don't blame me, it was legacy code! ;-) ) Anyway, while the formats are quite strange in places, they aren't really that difficult to parse. I would be willing to speculate that this was never really much of a problem in writing filters for apps (or at least shouldn't have been).

No, the difficulty with writing a filter for these file formats is that you have no freaking clue what the *formatter* does with the data once it gets it. I'm pretty sure even Microsoft doesn't have an exact picture of that. Hell, I barely ever understood what the WP formatter was doing half the time (and I had source code). File formats are only a small part of the battle. You have all this text that's tagged up, but no idea what the application is *actually* doing with it. There are so many caveats and strange conditions that you just can't possibly write something to read the file and get it right every time.

In all honesty I have at least a little bit of sympathy for MS WRT OOXML. Their formatter (well, every formatter for every word processor I've ever seen) is so weird and flakey that they probably *can't* simply convert over to ODF and have the files work in a backwards compatible way. And lets face it, they've done the non-compatible thing before and they got flamed to hell for it. I honestly believe that (at some point) OOXML was intended to be an honest accounting of what they wanted to have happen when you read in the file. That's why it's so crazy. You'd have to basically rewrite the Word formatter to read the file in properly. If I had to guess, I'd say that snowballs in hell have a better chance...

I *never* had specs for the word file format (actually, I did, but I didn't look at them because they contained a clause saying that if I looked at them I had to agree not to write a file conversion tool). I had some notes that my predecessor wrote down and a bit of a guided tour of how it worked overall. The rest was just trial and error. Believe it or not, occasionally MS would send up bug reports if we broke our export filter (it was important to them for WP to export word because most of the legal world uses WP). But it really wasn't difficult to figure out the format. Trying to understand how to get the WP formatter (also flakey and weird) to do the same things that the Word formatter was doing.... Mostly impossible.

And that's the thing. You really need a language that describes how to take semantic tags and translate them to visual representation. And you need to be able to interact with that visual representation and refer it back to the semantic tags. A file format isn't enough. I need the glue in between -- and in most (all?) word processors that's the formatter. And formatters are generally written in a completely adhoc way. Write a standard for the *formatter* (or better yet a formatting language) and I can translate your document for you.

The trick is to do it in both directions too. Things like Postscript and PDF are great. They are *easy* to write formatters for. But it's impossible (in the general case) to take the document and put it back into the word processor (i.e. the semantic tags that generated the page layout need to be preserved in the layout description). That also has to be described.

Ah... I'm rambling. But maybe someone will see this and finally write something that will work properly. At Corel, my friend was put on the project to do just that 5 times... got cancelled each time ;-) But that was a long time ago...
Re:patent promise doesn't sound very good by jsight · 2008-02-20 02:34 · Score: 5, Informative

Hurr hurr. The Microsoft implementation of Java wasn't buggy: far from it, it was actually superior to the Sun implementation. It was faster and integrated better with Windows.

Among other issues, borderlayoutmanager did not behave properly in MS's implementation. It was buggy in incompatible ways, but your right, that in and of itself wasn't the big problem. The big problem was their insistence on both not fixing the bugs, and not going along with major initiatives (such as JFC/Swing).
But back in the day, the Microsoft J++ development environment was far superior to anything Sun had to offer. We're talking a good 10 years ago. Sun has finally managed to catch up in the past two or three years, but still, Sun's problem wasn't that the Microsoft implementation was worse: their problem was that it was better.

If by "2 or 3 years" you mean about 5 years, then I'd agree. Java development tools didn't really reach maturity until things like Eclipse came onto the scene about 5 years ago.

--
Throw the bums out!
Chunky File Format by mlwmohawk · 2008-02-20 02:55 · Score: 5, Interesting

While I was a contractor for a now defunct contracting company, we did a contract for Microsoft. This was pre windows 3.1. We did some innovations which I think became the bases for some of the OLE stuff, but I digress, Microsoft had a spec for its "Chunky File Format."

The office format based on the chunky file format does not have a format, per se' It is more similar to the old TIFF format. You can put almost anything in it, and the "things" that you put in it pretty much define how they are stored. So, for each object type that is saved in the file, there is a call out that says what it is, and a DLL is used to actually read it.

It is possible for multiple groups within Microsoft to store data elements in the format without knowledge of how it is stored ever crossing groups or being "documented" outside the comments and structures in the source code that reads it.

This is not an "interchange" format like ODF, it is a binary application working format that happens to get saved and enough people use it that it has become a standard. (With all blame resting squarely on M$ shoulders.)

It is a great file format for a lot of things and does the job intended. Unfortunately it isn't intended to be fully documented. It is like a file system format like EXT2 or JFS. Sure, you can define precisely how data is stored in the file system, but it is virtually impossible to document all the data types that can be stored in it.
Re:patent promise doesn't sound very good by msuarezalvarez · 2008-02-20 03:01 · Score: 5, Insightful

Hurr hurr. The Microsoft implementation of Java wasn't buggy: far from it, it was actually superior to the Sun implementation. It was faster and integrated better with Windows.

If their `implementation' different from the specs, then it was not a correct implementation. If it was supposed to be a Java implementation, then by definition it was buggy. If wasn't suppose to be one, then it had no business being called Java. That is why Sun sued them.
L&O: sFoo by poot_rootbeer · 2008-02-20 03:30 · Score: 5, Insightful

"Apps Hungarian", which adds semantic meaning (dx = width, rwAcross = across coord relative to window, usFoo = unsafe foo, etc) to the variable, not typing, is what is good and what he is advocating.

What is the justification for putting that semantic meaning into a variable name, instead of incorporating it into class definitions?

For example, if a string can be "safe" or "unsafe", why not have "SafeString" and "UnsafeString" classes that extend String, and use instances of those, instead of having instances of the base String class names 'sFoo' and 'usFoo'?