Microsoft Releases Office Binary Formats

Posted by kdawson on Wednesday February 20, 2008 @01:04AM from the this-way-lies-madness dept.

Microsoft has released documentation on their Office binary formats. Before jumping up and down gleefully, those working on related open source efforts, such as OpenOffice, might want to take a very close look at Microsoft's Open Specification Promise to see if it seems to cover those working on GPL software; some believe it doesn't. stm2 points us to some good advice from Joel Spolsky to programmers tempted to dig into the spec and create an Excel competitor over a weekend that reads and writes these formats: find an easier way. Joel provides some workarounds that render it possible to make use of these binary files. "[A] normal programmer would conclude that Office's binary file formats: are deliberately obfuscated; are the product of a demented Borg mind; were created by insanely bad programmers; and are impossible to read or create correctly. You'd be wrong on all four counts."

19 of 259 comments

Why not ODF or OOo? by jfbilodeau · 2008-02-20 01:26 · Score: 2, Interesting

Why does the author avoid any mention of ODF or OpenOffice as alternatives to work with MS Office docs? He seems stuck on 'old' formats like WKS or RTF.

I know OOo is not a perfect Word/Excel converter, but it has served me marvelously since the StarOffice days. I wish there were a simple command-line tool that could convert .doc or .xls files to ODF or PDF using the OOo code. Does anyone know of such a tool?
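
One way such a tool could look is sketched below. It assumes a LibreOffice-style `soffice` binary is on the PATH and accepts the `--headless`, `--convert-to` and `--outdir` options; the helper names `build_convert_command` and `convert` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, target_format="pdf", outdir=".", soffice="soffice"):
    """Build the argument list for a headless conversion (nothing is run here)."""
    return [
        soffice,                      # assumed to be on the PATH
        "--headless",                 # no GUI
        "--convert-to", target_format,
        "--outdir", str(Path(outdir)),
        str(Path(src)),
    ]


def convert(src, target_format="pdf", outdir="."):
    """Run the conversion; raises CalledProcessError if soffice exits non-zero."""
    subprocess.run(build_convert_command(src, target_format, outdir), check=True)
```

Calling `convert("report.doc", "odt", "out")` would then hand the real work to the office suite's own import filters, which is the point: the script never touches the binary format itself.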

--
Goodbye Slashdot. You've changed.
Retaliation? by ilovegeorgebush · 2008-02-20 01:33 · Score: 2, Interesting

Is this retaliation to the impending doom of the OOXML format requesting ISO standard status? Is MS's thinking: "Right, ISO has failed us, so we'll release the binaries so everyone keeps using the office formats anyway"?

--
ilovegeorgebush
Re:patent promise doesn't sound very good by Ed+Avis · 2008-02-20 01:36 · Score: 5, Interesting

Basically, Microsoft reserves the right to sue you for software patent infringements. So do thousands of other big software companies and patent troll outfits. The new thing now is that Microsoft likes to generate FUD by producing partial waivers and promises that apply to some people in limited circumstances (Novell customers, people 'implementing a Covered Specification', and so on). The inadequacy of this promise draws attention to the implicit threat to tie you up in swpat lawsuits, which was always there - but until this masterstroke of PR the threat wasn't commented on much.

Ignore the vague language and develop software as you always have.

--
-- Ed Avis ed@membled.com
Re:Promise not a license by morgan_greywolf · 2008-02-20 01:38 · Score: 5, Interesting

As PJ pointed out over on Groklaw, MS are giving a "Promise" not to sue but this is very very far from a license. Careful analysis suggests that any GPL'd software using these binaries could easily fall foul of the fury of MS lawyers.

Correct.

Here's my suggestion: someone should use these specs to create a BSD-licensed implementation as a library. Then, of course, (L)GPL programs would be free to use the implementation. Nobody gets sued, everybody is happy.

--
My blog
"compound documents." oh no, run away! by radarsat1 · 2008-02-20 01:42 · Score: 4, Interesting

You see, Excel 97-2003 files are OLE compound documents, which are, essentially, file systems inside a single file.

I don't see why just because something is organized filesystem-like (not such an awful idea) means it has to be hard to understand. Filesystems, while they can certainly get complicated, are fairly simple in concept. "My file is here. It is *this* long. Another part of it is over here..."
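
To make the "file system inside a file" idea concrete, here is a minimal sketch that reads a few fields from the first 512 bytes of an OLE compound document. The offsets follow my reading of the published compound-file header layout and should be treated as assumptions; `parse_cfb_header` is an illustrative name, not anyone's real API.

```python
import struct

# Magic bytes at the start of every OLE compound document.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header):
    """Pull a few fields out of the first 512 bytes of a compound file."""
    if len(header) < 512:
        raise ValueError("need at least 512 bytes of header")
    if header[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE compound document")
    # Offset 24: minor version, major version, byte order, sector shifts.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    # Offset 44: number of FAT sectors, then the first directory sector.
    num_fat_sectors, first_dir_sector = struct.unpack_from("<2I", header, 44)
    return {
        "major_version": major,
        "sector_size": 1 << sector_shift,        # "It is *this* long"
        "mini_sector_size": 1 << mini_shift,
        "num_fat_sectors": num_fat_sectors,
        "first_directory_sector": first_dir_sector,  # "My file is here"
    }
```

From here a reader would follow the FAT chain from `first_directory_sector` to list the streams, much as one walks a FAT volume.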
They were not designed with interoperability in mind.

Wait, I thought you were trying to convince us that this doesn't reflect bad programming...
That checkbox in Word's paragraph menu called "Keep With Next" that causes a paragraph to be moved to the next page if necessary so that it's on the same page as the paragraph after it? That has to be in the file format.

Ah, I see, you're trying to imply that it's the very design of the Word-style of word processor that is inherently flawed. Finally we're in agreement.

Anyways, it's no surprise that it's all the OLE, spreadsheet-object-inside-a-document, stuff that would make it difficult to design a Word killer. (How often do people actually use that anyway?) It would basically mean reimplementing OLE, and a good chunk of Windows itself (libraries for all the references to parts of the operating system, metafiles, etc), for your application. However, it certainly can be done. I'm not sure it's worth it, and it can't be done overnight, but it's possible. However you'll have a hard time convincing me that Microsoft's mid-90's idea of tying everything in an application to inextricable parts of the OS doesn't reflect bad programming. Like, what if we need to *change* the operating system? At the very least, it reflects bad foresight, seeing as they tied themselves to continually porting forward all sorts of crud from previous versions of their OS just to support these application monstrosities. This is a direct consequence of not designing the file format properly in the first place, and just using a binary structure dump.

It reminds me of a recovery effort I tried last year, trying to recover some interesting data from some files generated on a NeXT cube from years ago. I realized the documents were just dumps of the Objective C objects themselves. In some ways this made the file parseable, which is good, but in other ways it meant that, even though I had the source code of the application, many of the objects that were dumped into the file were related to the operating system itself instead of the application code, which I did _not_ have the source code to, making the effort far more difficult. (I didn't quite succeed in the end, or at least I ran out of time and had to take another approach on that project.)

In their (MS's) defense, I used to do that kind of thing back then too, (dumping memory structures straight to files instead of using extensible, documented formats), but then again I was 15 years old (in 1995) and still learning C.
Re:patent promise doesn't sound very good by julesh · 2008-02-20 01:51 · Score: 4, Interesting

If your implementation is buggy, does that mean you're not covered?

That is my primary concern with the entire promise. None of this bullshit not-tested-in-court crap that came up the other day: it doesn't cover implementations with slight variations in functionality.

This, it seems, is intentional. MS don't want to allow others to embrace & extend their standards.
Worst. Workaround. Ever. by organgtool · 2008-02-20 01:52 · Score: 4, Interesting

FTA:
There are two major alternatives you should seriously consider: letting Office do the work, or using file formats that are easier to write.
His first workaround is to use Microsoft Office to open the document and then save that document in a non-binary format. Well that assumes that I already have Microsoft Windows, Microsoft Word, Microsoft Excel, Microsoft PowerPoint, etc. Do you see the problem here?

The second "workaround" is the same as the first, only a little more proactive. Instead of saving my documents as binary files and then converting them to another format, I should save them as a non-binary format from the start! Mission accomplished! Oh wait - how do I get the rest of the world to do the same? That could be a problem.

I fail to see the problem with using the specification Microsoft released to write a program that can read and write this binary format. If Microsoft didn't want it to be used, they would not have released it. Even if Microsoft tried to take action against open source software for using the specs that they opened, how could Microsoft prove that the open source software used those specs as opposed to reverse engineering the binary format on their own? I think this is a non-issue.
Seems that these aren't the full specs by amazeofdeath · 2008-02-20 02:23 · Score: 3, Interesting

Stephane Rodriguez comments:
"I first gave a cursory look at BIFF. 1) Missing records: examples are 0x00EF and 0x01BA, just off the top of my head. 2) No specification: example is the OBJ record for a Forms Combobox," Rodriguez wrote. "Then I gave a cursory look at the Office Drawing specs. And, again, just a cursory look at it showed unspecified records." http://www.zdnet.com.au/news/software/soa/Microsoft-publishes-incomplete-OOXML-specs/0,130061733,339286057,00.htm
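
A check like the one described can be sketched in a few lines. A BIFF stream is a run of records, each a 2-byte id and a 2-byte length (little-endian) followed by the payload. The `documented_ids` argument is a hypothetical set you would have to build by hand from the published spec; the function names are illustrative only.

```python
import struct


def iter_biff_records(stream):
    """Yield (record_id, payload) pairs from a raw BIFF workbook stream."""
    pos = 0
    # Stop when fewer than 4 bytes remain (no room for another record header).
    while pos + 4 <= len(stream):
        record_id, length = struct.unpack_from("<HH", stream, pos)
        pos += 4
        payload = stream[pos:pos + length]
        if len(payload) < length:
            raise ValueError("truncated record 0x%04X" % record_id)
        yield record_id, payload
        pos += length


def undocumented_ids(stream, documented_ids):
    """Return the sorted record ids present in the stream but absent from the spec list."""
    seen = {record_id for record_id, _ in iter_biff_records(stream)}
    return sorted(seen - set(documented_ids))
```

Run over a corpus of real workbooks, anything this returns is a record the documentation would still need to explain.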

--
U+F8FF
The file format is not really important by wrook · 2008-02-20 02:28 · Score: 5, Interesting

I've worked on some of these file formats quite a bit (I was the text conversion guy when WP went to Corel -- don't blame me, it was legacy code! ;-) ) Anyway, while the formats are quite strange in places, they aren't really that difficult to parse. I would be willing to speculate that this was never really much of a problem in writing filters for apps (or at least shouldn't have been).

No, the difficulty with writing a filter for these file formats is that you have no freaking clue what the *formatter* does with the data once it gets it. I'm pretty sure even Microsoft doesn't have an exact picture of that. Hell, I barely ever understood what the WP formatter was doing half the time (and I had source code). File formats are only a small part of the battle. You have all this text that's tagged up, but no idea what the application is *actually* doing with it. There are so many caveats and strange conditions that you just can't possibly write something to read the file and get it right every time.

In all honesty I have at least a little bit of sympathy for MS WRT OOXML. Their formatter (well, every formatter for every word processor I've ever seen) is so weird and flakey that they probably *can't* simply convert over to ODF and have the files work in a backwards compatible way. And let's face it, they've done the non-compatible thing before and they got flamed to hell for it. I honestly believe that (at some point) OOXML was intended to be an honest accounting of what they wanted to have happen when you read in the file. That's why it's so crazy. You'd have to basically rewrite the Word formatter to read the file in properly. If I had to guess, I'd say that snowballs in hell have a better chance...

I *never* had specs for the Word file format (actually, I did, but I didn't look at them because they contained a clause saying that if I looked at them I had to agree not to write a file conversion tool). I had some notes that my predecessor wrote down and a bit of a guided tour of how it worked overall. The rest was just trial and error. Believe it or not, occasionally MS would send up bug reports if we broke our export filter (it was important to them for WP to export Word because most of the legal world uses WP). But it really wasn't difficult to figure out the format. Trying to understand how to get the WP formatter (also flakey and weird) to do the same things that the Word formatter was doing.... Mostly impossible.

And that's the thing. You really need a language that describes how to take semantic tags and translate them to visual representation. And you need to be able to interact with that visual representation and refer it back to the semantic tags. A file format isn't enough. I need the glue in between -- and in most (all?) word processors that's the formatter. And formatters are generally written in a completely ad hoc way. Write a standard for the *formatter* (or better yet a formatting language) and I can translate your document for you.

The trick is to do it in both directions too. Things like Postscript and PDF are great. They are *easy* to write formatters for. But it's impossible (in the general case) to take the document and put it back into the word processor (i.e. the semantic tags that generated the page layout need to be preserved in the layout description). That also has to be described.

Ah... I'm rambling. But maybe someone will see this and finally write something that will work properly. At Corel, my friend was put on the project to do just that 5 times... got cancelled each time ;-) But that was a long time ago...
Re:Joel by Anonymous Coward · 2008-02-20 02:33 · Score: 3, Interesting

Hungarian should not be used in any language which has a reasonable typing system;

That's "Systems Hungarian" in the original article, and you are correct.

"Apps Hungarian", which adds semantic meaning (dx = width, rwAcross = across coord relative to window, usFoo = unsafe foo, etc) to the variable, not typing, is what is good and what he is advocating. It is exactly "good variable naming". You can see that you shouldn't be assigning rwAcross = bcText, because why would you assign a byte count to a coordinate even though they're both ints. The article is quite good really. How relevant it is in a .NET/Java world is another discussion entirely.
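
A small sketch of the us/s idea, with the prefix carrying meaning the type system cannot. The function and variable names here are made up for illustration; only `html.escape` is a real library call.

```python
import html


def s_from_us(us_text):
    """Convert an unsafe (us) string into a safe (s) one by HTML-escaping it."""
    return html.escape(us_text)


def write_page(s_body):
    """Only ever meant to receive 's' values; the parameter name documents the contract."""
    return "<p>" + s_body + "</p>"


us_comment = "<script>alert(1)</script>"   # raw user input: unsafe
s_comment = s_from_us(us_comment)          # escaped: safe to emit
page = write_page(s_comment)               # s flows into s, so it looks right
# write_page(us_comment) would run without error, but it *looks* wrong on the
# page: a us value flowing into an s parameter. Both are plain str objects.
```

The interpreter sees two strings either way, which is exactly why the prefix earns its keep here.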
Re:I thought it was pretty well known by erroneus · 2008-02-20 02:35 · Score: 4, Interesting

It's a DOCUMENT format. You know, you put words and pictures in there? Things you type in with your own keyboard with your fingers? There should be no need to have API calls in a document format. The same is true for WMF. WMF was very exploitable as a result, so not only is it bad style, it's dangerous.
Re:Joel by mike_sucks · 2008-02-20 02:52 · Score: 1, Interesting

All design patterns are workarounds for missing language features. See GTK's use of an object oriented pattern in C, for example. Hungarian is a design pattern (well, naming convention, but same thing) for the same weakly typed language: C.

Modern languages are strongly typed and hence will tell the programmer when they've screwed up, at compile time or later. So there's no need for Hungarian in these languages, just as there's no need for a hand-rolled object-oriented pattern in C#, Java, and maybe even C++, which now have built-in support for object-oriented programming.

So again, Joel spins something that was useful historically as being something that is still essential, even though it is now completely redundant. This man is a living, breathing excuse for poor practices based on historical, obsolete artifice.

Now, to get to your point - it is moot that some programmers are using some bastardised version of Hungarian, because even if done correctly it is now a waste of time when using a modern programming language. It only contributes by making a program harder to read, hence increasing complexity and reducing maintainability rather than providing any actual benefit.

/Mike

--
-- "So, what's the deal with Auntie Gerschwitz et all?"
Chunky File Format by mlwmohawk · 2008-02-20 02:55 · Score: 5, Interesting

While I was a contractor for a now defunct contracting company, we did a contract for Microsoft. This was pre-Windows 3.1. We did some innovations which I think became the basis for some of the OLE stuff, but I digress. Microsoft had a spec for its "Chunky File Format."

The office format based on the chunky file format does not have a format, per se. It is more similar to the old TIFF format. You can put almost anything in it, and the "things" that you put in it pretty much define how they are stored. So, for each object type that is saved in the file, there is a call out that says what it is, and a DLL is used to actually read it.

It is possible for multiple groups within Microsoft to store data elements in the format without knowledge of how it is stored ever crossing groups or being "documented" outside the comments and structures in the source code that reads it.
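
As a rough illustration of that design (not the actual Microsoft layout, which I'm not reproducing here), assume a hypothetical container where each chunk is a 4-byte tag, a 4-byte little-endian length, and a payload. The `handlers` dictionary stands in for the per-type DLLs.

```python
import struct


def read_chunks(data):
    """Yield (tag, payload) pairs from the hypothetical tag/length/payload layout."""
    pos = 0
    while pos < len(data):
        # Raises struct.error if fewer than 8 bytes remain for a chunk header.
        tag, length = struct.unpack_from("<4sI", data, pos)
        pos += 8
        yield tag, data[pos:pos + length]
        pos += length


def load(data, handlers):
    """Decode chunks that have a registered handler; keep the rest as raw bytes."""
    decoded, opaque = [], []
    for tag, payload in read_chunks(data):
        handler = handlers.get(tag)
        if handler is None:
            opaque.append((tag, payload))   # container is readable, contents are not
        else:
            decoded.append(handler(payload))
    return decoded, opaque
```

The container itself is trivial to document; everything that ends up in `opaque` is only understood by whichever group wrote the handler.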

This is not an "interchange" format like ODF, it is a binary application working format that happens to get saved and enough people use it that it has become a standard. (With all blame resting squarely on M$ shoulders.)

It is a great file format for a lot of things and does the job intended. Unfortunately it isn't intended to be fully documented. It is like a file system format like EXT2 or JFS. Sure, you can define precisely how data is stored in the file system, but it is virtually impossible to document all the data types that can be stored in it.
Outlook by c00rdb · 2008-02-20 03:31 · Score: 2, Interesting

Why is Outlook missing from the released formats? I've spent some time reverse engineering meeting requests myself and I'd love to see the complete .msg file specification. You could find some useful information on MSDN already but it was nowhere near as complete as these releases appear to be.
Re:doing the right thing by Schnapple · 2008-02-20 04:04 · Score: 3, Interesting

When Excel started importing 1-2-3 documents, the right way to do that would be to create an importer to your own native format. Not to munge a new slightly different format into your existing structures.
Well, ignoring the fact that the article elaborates on why they made some of the technical decisions early on, Joel, who was at one point a program manager for Microsoft Excel, actually has an article on this very thing. Basically, this is exactly what they did - Excel initially opened 1-2-3 documents, but it could not write to them. You could open up your Lotus 1-2-3 document but you'd have to save it in Excel format. Excel 4.0 introduced the ability to write to Lotus 1-2-3 documents, and Excel 4.0 was the version that served as the "tipping point" - it was the version that businesses started buying in mass numbers and it was the version that signaled the end for Lotus 1-2-3.

Why? Because, as the article states, Excel 4.0 was the first version that would let you go back. You could just try out Excel and if it didn't work no big deal, just go back to Lotus 1-2-3. It seems completely counter-intuitive to do so, and it apparently wasn't the easiest thing to convince Microsoft management to do so, but it worked and now everyone uses Excel and Lotus 1-2-3 is ancient history.

The programmers did both the right thing and the thing which would be successful. With all due respect to the OpenOffice folks, they're not in the business of selling software. If people don't move to OpenOffice in mass numbers it doesn't spell doom for the company, because there is no company. Doing what you suggest might be the right thing from a programmer's perspective (and I agree), but it's not compatible with a company that is trying to make a product to take over the market with. This is why Microsoft is so successful - they're staffed by a large number of people (like Joel) who get this.

--
Schnapple
Re:One possible reason for releasing the specs now by orra · 2008-02-20 04:33 · Score: 2, Interesting

One may wonder, why release the documentation now?

I would say it's because they get good PR for pretending to be transparent/friendly, whilst not actually giving away any new information.

Look at page 129 of the PDF specifying the .doc format. (The page is actually labelled 128 in the corner, but it's page 129 of the PDF). You will see there's a bit field. One of the many flags that can be set in this bit field: "fUseAutospaceForFullWidthAlpha".

The description?:
Compatibility option: when set to 1, use auto space like Word 95

Gee, thanks. That's helpful. You know, an earlier Slashdot article said Microsoft were going to release a BSD licensed converter to convert from .doc to .docx. But this will never help anyone further understand either of the two formats: binary .doc files which are auto spaced like Word 95 will be converted to "XML" files which are auto spaced like Word 95.
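
To underline the point: reading or writing such a flag is the easy part. In the sketch below the bit position is hypothetical (the real one is whatever the spec's table assigns), and the helper names are illustrative.

```python
# Hypothetical bit position, chosen only for this example.
F_USE_AUTOSPACE_FOR_FULL_WIDTH_ALPHA = 1 << 5


def has_flag(bitfield, mask):
    """True if every bit in mask is set in bitfield."""
    return (bitfield & mask) == mask


def set_flag(bitfield, mask, enabled):
    """Return bitfield with the mask bits set (enabled) or cleared (not enabled)."""
    return (bitfield | mask) if enabled else (bitfield & ~mask)
```

None of this says what a layout engine should actually do when the bit is 1, which is the information the one-line description leaves out.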
Re:patent promise doesn't sound very good by mhall119 · 2008-02-20 07:18 · Score: 2, Interesting

Besides, how do you envisage that a file format which is essentially a detailed description of the actual binary data structure is going to be missing something?

Because I've read the MSOOXML spec, and that's exactly what they did in there. Since MSOOXML seems like a simple translation of the binary format into XML, I would assume that the same important parts of the spec will be missing here.

--
http://www.mhall119.com
Re:MS definately don't know how their formatter wo by guardian-ct · 2008-02-20 07:58 · Score: 2, Interesting

This is caused by the "WYSIWYG" feature. Your HP printer driver is probably set to choose fonts that are "close" to the ones Windows uses, but are instead native fonts for the HP printer. Your PDF uses the Windows, and/or Adobe, fonts directly. Word uses the printer driver settings while you're editing, and if you change printers, the document repages with any different native fonts.

In Windows 2000, you can open the printers control panel, choose "printing preferences" on your HP, poke the "Advanced..." button, and tell it to "Download as SoftFont". This should make changing between PDF and printer less painful, at the expense of increased memory usage and time to print with the HP. For the real advanced version, you can try and find which Adobe fonts are exactly the same size as the HP native ones, and tell the PDF writer to use those.
Re:I thought it was pretty well known by James+McGuigan · 2008-02-20 09:55 · Score: 2, Interesting

From MS's perspective it's not a document format, it's just another component in the "user experience" that is MS Office. They trade clean data formats for tightly integrated software designed for an MS-only environment. Part of the trade-off may be weak security, which may be unacceptable to you, but may be acceptable to the MS marketing department, which considers the lack of certain frivolous features to be unacceptable.