Microsoft Releases Office Binary Formats

Posted by kdawson on Wednesday February 20, 2008 @01:04AM from the this-way-lies-madness dept.

Microsoft has released documentation on their Office binary formats. Before jumping up and down gleefully, those working on related open source efforts, such as OpenOffice, might want to take a very close look at Microsoft's Open Specification Promise to see if it seems to cover those working on GPL software; some believe it doesn't. stm2 points us to some good advice from Joel Spolsky to programmers tempted to dig into the spec and create an Excel competitor over a weekend that reads and writes these formats: find an easier way. Joel provides some workarounds that render it possible to make use of these binary files. "[A] normal programmer would conclude that Office's binary file formats: are deliberately obfuscated; are the product of a demented Borg mind; were created by insanely bad programmers; and are impossible to read or create correctly. You'd be wrong on all four counts."

14 of 259 comments

patent promise doesn't sound very good by Timothy+Brownawell · 2008-02-20 01:17 · Score: 4, Insightful
Microsoft irrevocably promises not to assert any Microsoft Necessary Claims against you for making, using, selling, offering for sale, importing or distributing any implementation to the extent it conforms to a Covered Specification ("Covered Implementation"), subject to[...]
If your implementation is buggy, does that mean you're not covered?
To clarify, "Microsoft Necessary Claims" are those claims of Microsoft-owned or Microsoft-controlled patents that are necessary to implement only the required portions of the Covered Specification that are described in detail and not merely referenced in such Specification.
This sounds like:
- If there are any optional parts of the spec, those parts aren't covered.
- If the spec refers to another spec to define some part of the format, that part isn't covered.
1. Re:patent promise doesn't sound very good by msuarezalvarez · 2008-02-20 03:01 · Score: 5, Insightful
  
  Hurr hurr. The Microsoft implementation of Java wasn't buggy: far from it, it was actually superior to the Sun implementation. It was faster and integrated better with Windows.
  
  If their `implementation' differed from the specs, then it was not a correct implementation. If it was supposed to be a Java implementation, then by definition it was buggy. If it wasn't supposed to be one, then it had no business being called Java. That is why Sun sued them.
One possible reason for releasing the specs now by Stan+Vassilev · 2008-02-20 01:20 · Score: 5, Insightful

One may wonder, why release the documentation now?

If you read Joel's blog you'll see the formats are very old, and consist primarily of C structs dumped into OLE objects, which are in turn dumped directly into what we see as XLS, DOC and similar files.

There's almost no parsing/validation at load time.
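
A minimal sketch of what "C structs dumped to disk" with no load-time validation can look like. The record layout below is hypothetical and is not taken from Microsoft's documentation; the names `CELL_RECORD`, `dump_cell` and `load_cell` are made up for illustration.

```python
import struct

# Hypothetical record layout, for illustration only (NOT from the real spec).
# It mimics a packed C struct:
#   uint16 record_type; uint16 flags; uint32 row; double value;
CELL_RECORD = struct.Struct("<HHId")  # little-endian, no padding, 16 bytes


def dump_cell(record_type: int, flags: int, row: int, value: float) -> bytes:
    """Serialize the fields exactly as they would sit in memory."""
    return CELL_RECORD.pack(record_type, flags, row, value)


def load_cell(raw: bytes) -> tuple:
    """Read the bytes straight back into fields, with no validation at all."""
    return CELL_RECORD.unpack(raw)


raw = dump_cell(0x0203, 0, 41, 3.5)
fields = load_cell(raw)

# Overwrite the row field with garbage: the loader accepts it without complaint.
corrupted = raw[:4] + b"\xff\xff\xff\xff" + raw[8:]
bad_fields = load_cell(corrupted)
```

The loader happily returns a row number of 4294967295 for the corrupted record. Any sanity checking has to be added by whoever consumes the fields, which is the kind of gap the comment is worried about.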

Having this laid out in clear documentation may reveal quite a lot of security issues with the old binary formats, which could lead to a wave of exploits: exploits that won't work on Microsoft's new XML Office formats.

So while I'm not a conspiracy nut, I do believe one of Microsoft's goals here is to help those binary formats become obsolete, in order to drive Office 2007/2008 adoption.
1. Re:One possible reason for releasing the specs now by Stan+Vassilev · 2008-02-20 02:32 · Score: 5, Insightful
  
  Come on. You really think Microsoft wants to increase the vulnerability of old versions of Office (which are still the vast majority in corporate America)? This not only makes their software look bad, it increases the amount of work they have to do to support the older versions (yes, they still support Office 2003). You don't sell new cars by convincing people the last model was rubbish. I think your tin-foil hat fits a little too tight.
  
  Let me break your statement in pieces:
  
  - that would increase the vulnerability of old Office
  - the majority of corporate America is stuck on old Office
  - you don't sell new cars by convincing people the old ones are rubbish
  
  You know, have you seen those white papers by Microsoft comparing XP and Vista and trying to put XP's reliability and security in a bad light?
  
  Or have you seen those ads where Microsoft portrayed people using old versions of Office as... suits wearing dinosaur masks?
  
  If the majority of corporate America uses the old Office, then the only way for Microsoft to turn a profit would be to somehow convince them that this is not good for them anymore and that they should upgrade. You're contradicting yourself there.
Re:first post? by Timothy+Brownawell · 2008-02-20 01:21 · Score: 3, Insightful

I'd assume it has something to do with the antitrust action the EU was taking. Didn't they order that Microsoft had to open all their protocols/formats?
Promise not a license by G0rAk · 2008-02-20 01:26 · Score: 5, Insightful

As PJ pointed out over on Groklaw, MS are giving a "Promise" not to sue but this is very very far from a license. Careful analysis suggests that any GPL'd software using these binaries could easily fall foul of the fury of MS lawyers.

--

Nothing to see here. Move along.
Re:Joel by AKAImBatman · 2008-02-20 01:30 · Score: 4, Insightful

If you actually read the article, he's right. His point is that the use of Hungarian notation has been bastardized beyond belief. Programmers didn't understand why Simonyi originally used his famous notation, and thus tend to make an error every time they attempt to replicate his work. That's why we have tons of Java programs that look like crap due to some foolish programmer mindlessly following Hungarian Notation.

On the subject of the Office Document format, I believe that everything he says is also true; but with a few caveats. The first is the subject of Microsoft intentionally making Office Documents complicated. I fully accept (and have accepted for a long time) that Office docs were not intentionally obfuscated. However, I also accept that Microsoft was 100% willing to use the formats' inherent complexity to their advantage to maintain lock-in. The unnecessary complexity of OOXML proves this.

The other caveat is that I disagree with his workarounds. He suggests that you should use Office to generate Office files, or simply avoid the issue by generating a simpler file. There's no need to do this, as it's perfectly possible to use a subset of Office features when producing a file programmatically. Libraries like POI can produce semantically correct files, even if they aren't the most feature-rich.
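
For comparison, the "simpler file" workaround can be sketched in a few lines. This assumes CSV as the simpler format, which the comment itself does not name, and `rows_to_csv` is a made-up helper that uses only the Python standard library.

```python
import csv
import io


def rows_to_csv(rows):
    """Render an iterable of rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # default dialect quotes fields only when needed
    writer.writerows(rows)
    return buffer.getvalue()


report = rows_to_csv([
    ["item", "quantity", "note"],
    ["widget", 3, "ships Monday, maybe"],
])
```

The trade-off is the one the comment points at: this is trivial to write, but it carries no formatting, formulas or multiple sheets, which is where a library that writes a subset of the real format earns its keep.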

--
Javascript + Nintendo DSi = DSiCade
Don't Adopt. Convert. by Doc+Ruby · 2008-02-20 02:00 · Score: 5, Insightful

Spolsky's advice explains that the format code is extremely bad code from the POV of a programmer picking it up today, because it grew like a coral reef, starting so long ago that interoperability with anything but the app's own codebase at the time was not in the design. Every new feature was thrown in as a special case, rather than through any general-purpose facility for kinds of features or future expansion. That is the Microsoft legacy: leveraging every year's market position into expansion the next year.

But we're not Microsoft, and we don't have the requirements MS had when making these formats. So we should by no means perpetuate them. We should do now what MS never had reason to do: upgrade the code and drop the legacy stuff that makes most of the code such a burden, but doesn't do anything for the vast majority of users today (and tomorrow).

That's OK, because Microsoft has done that, too, already. The MS idea of "legacy to preserve" is based on MS marketing goals, which are not the same as actual user requirements. So that legacy preservation doesn't mean that, say, Office 2008 can read and write Word for Windows for Workgroups for Pen Computing files 100%. MS has dropped plenty of backwards compatibility for its own reasons. New people opening the format for modern (and future) use can do the same, but based on user requirements, not emphasis on product lines if that's not a real requirement.

So what's needed is just converters that use this code to convert to real open formats that can be maintained into the future. Not moving this code itself into apps for the rest of all time. Today we have a transition point before us which lets us finally turn our back on the old, closed formats with all their code complexity. We can write converters that can be used to get rid of those formats that benefited Microsoft more than anyone else. Convert them into XML. Then, after a while, instead of opening any Word or Excel formats, we'll be exchanging just XML, and occasionally reaching for the converter when an old file has to be used currently. MS will go with that flow, because that's what customers will pay for. Soon enough these old formats will be rare, and the converters will be rare, too.

Just don't perpetuate them, and Microsoft's selfish interests, by just embedding them into apps as "native" formats. Make them import by calling a module that can also just batch convert old files. We don't need this creepy old man following us around anymore.

--
make install -not war
doing the right thing by carou · 2008-02-20 02:03 · Score: 5, Insightful

From Joel's FA:
There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904. Excel supports both because the first version of Excel, for the Mac, just used that operating system's epoch because that was easy, but Excel for Windows had to be able to import 1-2-3 files, which used 1/1/1900 for the epoch. It's enough to bring you to tears. At no point in history did a programmer ever not do the right thing, but there you have it.
Nonsense.

When Excel started importing 1-2-3 documents, the right way to do that would have been to create an importer into your own native format, not to munge a new, slightly different format into your existing structures. Yes, you'd have had to convert some dates between 1900 and 1904 formats (and maybe detect cases where the old 1-2-3 bug could have affected the result), but at least you wouldn't be trying to maintain two formats for the rest of time.
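
A rough sketch of the conversion such an importer would need, for whole-day serial numbers only. It assumes the commonly documented behaviour of the two systems: in the 1900 system serial 1 is 1 January 1900 and serial 60 is the nonexistent 29 February 1900, while in the 1904 system serial 0 is 1 January 1904. The function names are made up for illustration.

```python
from datetime import date, timedelta

# 1 January 1904 is serial 1462 in the 1900 system and serial 0 in the
# 1904 system, so the two systems differ by a constant 1462 days.
EPOCH_OFFSET_DAYS = 1462


def serial_1900_to_date(serial: int) -> date:
    """Convert a whole-day serial from the 1900 date system to a real date."""
    if serial < 1:
        raise ValueError("serials in the 1900 system start at 1")
    if serial == 60:
        raise ValueError("serial 60 is 29 February 1900, a day that never existed")
    # Serials after 60 are one too high because of the phantom leap day.
    adjust = 1 if serial > 60 else 0
    return date(1899, 12, 31) + timedelta(days=serial - adjust)


def serial_1904_to_date(serial: int) -> date:
    """Convert a whole-day serial from the 1904 date system to a real date."""
    return date(1904, 1, 1) + timedelta(days=serial)


def serial_1904_to_1900(serial: int) -> int:
    """Re-express a 1904-system serial in the 1900 system."""
    return serial + EPOCH_OFFSET_DAYS
```

Only serial 60 needs special handling, since it is the one value with no real calendar date; everything from 1 March 1900 onward is a fixed shift.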

If this is an example of programmers throughout history always doing exactly the right thing, I'd hate to see an example of code where the original author regretted some mistakes that had been made.
Re: "compound documents." oh no, run away! by ContractualObligatio · 2008-02-20 02:20 · Score: 4, Insightful

It's interesting you give a nicely egotistical critique of a well-regarded expert's article, but don't suggest a single alternative to how M$ could have met their design goals, nor explain why the no-interoperability assumption was unreasonable at the time. If you can't appreciate the design goals, nor suggest a way to meet them, what's the point of the rest of your post?
Re:Worst. Workaround. Ever. by ContractualObligatio · 2008-02-20 02:42 · Score: 3, Insightful

I fail to see the problem with using the specification Microsoft released to write a program that can read and write this binary format

That is almost the stupidest thing I've read today (RTFA with respect to development costs to figure out why), except for this:

If Microsoft didn't want it to be used, they would not have released it.

We can ignore the shockingly poor logic inherent to this statement and just take it at face value: doing something just because M$ wants you to would easily make the Top 10 Stupid Things To Do In IT list. It's particularly bizarre to hear it on Slashdot.
Microsoft marketing by Comboman · 2008-02-20 02:58 · Score: 3, Insightful

You don't sell new cars by convincing people the last model was rubbish.
You're kidding, right? That's been exactly Microsoft's marketing strategy for the last ten years. Remember the Win9X BSOD ads for Windows XP? Microsoft is in the difficult position where their only real competition is their own previous products.

--
Support Right To Repair Legislation.
L&O: sFoo by poot_rootbeer · 2008-02-20 03:30 · Score: 5, Insightful

"Apps Hungarian", which adds semantic meaning (dx = width, rwAcross = across coord relative to window, usFoo = unsafe foo, etc) to the variable, not typing, is what is good and what he is advocating.

What is the justification for putting that semantic meaning into a variable name, instead of incorporating it into class definitions?

For example, if a string can be "safe" or "unsafe", why not have "SafeString" and "UnsafeString" classes that extend String, and use instances of those, instead of having instances of the base String class named 'sFoo' and 'usFoo'?
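
A minimal sketch of that class-based alternative, assuming "unsafe" means text that has not been HTML-escaped. `encode` and `write_to_page` are hypothetical helpers, not anything from the article.

```python
import html


class UnsafeString(str):
    """Text straight from the user; must not be written to a page as-is."""


class SafeString(str):
    """Text that has already been HTML-escaped."""


def encode(text: UnsafeString) -> SafeString:
    """Escape user text and tag the result as safe."""
    return SafeString(html.escape(text))


def write_to_page(text: SafeString) -> str:
    """Emit a paragraph, refusing anything that is not tagged as safe."""
    if not isinstance(text, SafeString):
        raise TypeError("refusing to emit a string that was not encoded")
    return f"<p>{text}</p>"


page = write_to_page(encode(UnsafeString("<b>hi</b>")))
```

One caveat with this sketch: ordinary string operations such as concatenation return a plain `str`, so the tag is lost unless those operations are wrapped as well. That is part of the cost of moving the distinction from the name into the type.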
Re: "compound documents." oh no, run away! by ContractualObligatio · 2008-02-20 07:39 · Score: 3, Insightful

I think the design goals were flawed. That's my point.

And I think your ability to assess another's work is flawed courtesy of an oversized ego. That was my point.

You have yet to provide an alternative solution to the problem. Given that one constraint is memory, your inability to be concise suggests you're not capable of coming up with one either. Certainly your "squeeze out a few extra microseconds" comment suggests you have absolutely no clue what you are talking about. Yet you persist in calling it bad design. You are strangely smug about what was quite possibly an implicit assumption forced by tough constraints, with no actual interoperability requirements, at a time when they were rarely offered let alone expected. I would stop using "IMHO" - clearly there is nothing humble about your opinion.

Why the bit about metadata, out of interest? It's as if you think the more irrelevant things you can fit into the post, the more we're supposed to be impressed.