Microsoft Releases Pre-2007 Binary File Format Specs
An anonymous reader writes "Microsoft has released the specifications for the binary file formats used by pre-2007 Microsoft Office applications. They're accurate this time! Honest! While the documents are enormous (Word alone requires 533 pages; Excel runs over 1000 plus another 850 pages for the Office 2007 binary format), they hopefully will be useful to developers trying to create or extract information from Microsoft Office files (which despite their flaws, have been the de facto standard in many fields for some time now)."
I know it's old hat by now, but back in the Office 98 days, file corruption was a big deal.
I wonder what was going on, but it occurs to me that now I could conceivably back out the errors and figure the thing out.
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Personally, the VBA .pdf is the most interesting of the lot.
Wouldn't want to sound ungrateful about some of the tasty bits not present, so let me hope that this is yet another positive step that encourages follow-on.
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
A far cry from the 6,000 pages for OOXML ..
...to finally share proper doc of the old standards. This just means they feel confident that MS Office 2007 will take firm enough root to ensure that the old game of catch up for FOSS projects will stay the same.
And wasn't it just yesterday some twits had an article about how MS is changing/will change? I sure wouldn't hold my breath!
Caveat Utilitor
Did anyone else notice this is coming out on the first business day at MS that is Gates free...?
is WHEN are they going to release the source code to the Flight Sim in Excel 98?
Word alone requires 533 pages; Excel runs over 1000 plus another 850 pages for the Office 2007 binary forma ... So I'm solid on the bed time story front for some time! Gee, thanks Microsoft!
Wow... this is great! A decade or so late, but... great.
EP
It should not really be noteworthy that they have 500+ specs.
the "license" conditions no doubt will contain several pitfalls for anyone who actually wants to use it to implement a file input/output filter in conjunction with free software... and the other problem is once having seen the specification, you'll never be able to safely work on other free software projects again...
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Isn't this old news? I mean, it's been covered on Slashdot at least twice now. (Dear timothy, I'd like to introduce you to my friend Google.)
Yes, the formats are large and complicated, but for a variety of good, if antiquated, reasons. I'd suggest anyone interested read Joel Spolsky's blog post on it (which, being posted last February, isn't news either but hey, this is Slashdot).
"What do you despise? By this are you truly known." --Princess Irulan, Manual of Muad'Dib
/)
And to think, this happens the day after Gates steps down...
I honestly believe that they are trying to give out complete information. It's just that they have 20 years of spaghetti code to somehow shape into an API document. I doubt if anyone at Microsoft really knows how the code works.
With a 1000 page document describing how to list off spreadsheet information, I shudder to think about how organized their kernel is.
The released specifications are in a pre-2007 MS Office binary file format.
I can't understand the negativity. Sure Microsoft has an unpleasant past, but this is a good move on their part and should be met with nothing less than praise.
We want to encourage more behavior like this.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
The only problem? They released them in Word format...
(Okay, not really -- someone must have realized that that would be silly.)
533 pages, 1000+ pages, 850+ pages! Holy Hannah! Is it my imagination, or are these comparatively the most unusually bloated specs ever seen? 533+ pages just to describe how text is saved??! Even accounting for formatting and tables and whatever, I'm thinking that just maybe MS has inflated the docs so as to make any real work with them by competing interests nigh impossible.
I recall having a copy of the spec for ATAPI/IDE a few years back and it was a toilet-read length in comparison at roughly 250 pages. Trying hard here not to bash MS but I figure this falls under the "let's follow the word not the spirit of the law" kind of thing to meet DoJ and Euro govs rulings...
Wait ... what did I just say? ...
I don't think I'm feeling well. I'm gonna go lie down now.
Or is it Wholly Crap?
I guess we'll see. I'm rather shocked by this. This is a kind of "giving in" gesture that is MOST uncharacteristic of Microsoft. Is this what the "Post-Gates" Microsoft will be like? How much more cooperative spirit will the community enjoy?
It's always a trap.
Help stamp out iliturcy.
From here -> You or anyone else has nothing to worry about. Microsoft has changed its tune:
Microsoft irrevocably promises not to assert any Microsoft Necessary Claims against you for making, using, selling, offering for sale, importing or distributing any implementation to the extent it conforms to a Covered Specification ("Covered Implementation"), subject to the following. This is a personal promise directly from Microsoft to you, and you acknowledge as a condition of benefiting from it that no Microsoft rights are received from suppliers, distributors, or otherwise in connection with this promise. If you file, maintain or voluntarily participate in a patent infringement lawsuit against a Microsoft implementation of such Covered Specification, then this personal promise does not apply with respect to any Covered Implementation of the same Covered Specification made or used by you. To clarify, "Microsoft Necessary Claims" are those claims of Microsoft-owned or Microsoft-controlled patents that are necessary to implement only the required portions of the Covered Specification that are described in detail and not merely referenced in such Specification. "Covered Specifications" are listed below.
Descriptive specifications alone are never good enough.
Trusted Computing FAQ | Free Dawit Isaak!
a) Does this mean the standard GNU response is now invalid?
b) If someone writes a FOSS implementation of a .doc/.xls viewer, does that mean MSFT could more easily throw their weight behind declaring .doc a standard? (Since a standard ought to have multiple implementations, although maybe Office 2003 and 2007 count as two, or Office and the Word/Excel/PowerPoint viewers :p )
I have known about this since August 2007 and even submitted it to Slashdot twice, although it didn't get picked for the front page. See http://developers.slashdot.org/~mastropiero/journal/
This is definitely useful for app developers of free software.
Raymond Chen (well known Microsoft blogger) linked to Joel on Software today about Why the MS Office file formats are so complicated
You are right. This is a great step forward. However, I think the Slashdot community, with its cynical eye on Microsoft, is reminding us to take this in the proper context. It remains to be seen whether this is the beginning of a slow but steady change of course for the world's largest software company, or whether this is a fake-out to fool people into thinking that Microsoft is nice.
Personally, I suspect that this reflects internal conflict within Microsoft, with some portions of the behemoth trying to do something good, while another faction is still trying to squeeze money out of Microsoft's unique position in the software world.
In any case, remember how some people would say, "You always complain about Microsoft! What would it take for you to admit that Microsoft is doing something good?"
#2 on the list was: Stop hijacking the HTML standard and make a compliant browser! Then they put out IE7. (Not perfect, but a heckuva lot better than IE6!)
#1 on the list was: Open up the Word document file format. Okay, so they've done that. (Again, not perfect, but a heckuva lot better than what went on before!)
Congrats, Microsoft. You did it. A little late in coming, and you really didn't impress us with your OOXML fiasco waving that money around, but I'm willing to adopt a wait-and-see attitude to see whether it's still those same money-grubbing upper level managers that are in control, or whether this really is a new day at Microsoft.
404555974007725459910684486621289147856453481154 in hex is "You sank my Battleship?"
[GPG key in journal]
Where is Visio ?
"In addition to posting this documentation, Microsoft also published a list indicating which of the published protocols built into the following products are covered by Microsoft patents or patent applications"
"Some of the Microsoft protocols include patented inventions, and others do not. You may benefit from a patent license if you are distributing implementations of these protocols commercially or if you use an implementation of any of the protocols covered by Microsoft patents"
davecb5620@gmail.com
This might be Microsoft's way to help combat global warming, by freezing Hell. :-)
home
"This is definitely useful for app developers of free software"
You mean as in you work on the implementation for free and Microsoft benefits from any commercial developments.
davecb5620@gmail.com
I'm sure this move was somewhat forced to please the European Union or something.
In any case, I'm sure this would be just what Sun needs to make OpenOffice(.org) more compatible with MS Office than MS Office itself :)
Crystal clear to me .. :)
davecb5620@gmail.com
20 years ago, at what was the world's largest software project, we used to joke that if we wanted to ruin our competition, we would send them a copy of our specs. It looks to me that Microsoft got the same idea.
"To those who are overly cautious, everything is impossible. "
Microsoft releases API/protocol specs | Feb. 2008
http://www.theregister.co.uk/2008/02/21/microsoft_goes_open/
Microsoft releases further specs | April 2008
http://www.theregister.co.uk/2008/04/08/microsoft_posts_protocol_documents/
And they state that more will come after gathering feedback between then and June.
Between now and June it will garner feedback from the developer community. Then, at the end of June, Microsoft will publish the final versions of technical documentation - along with definitive patent licensing terms.
Hehe. I love how the PDF was produced by Microsoft Word 2007, nice little dig at Adobe after they kicked up a fuss about it being installed by default.
This means, as far as I know, that GPL implementations are not allowed. So it's an even worse situation than before, because Free Software developers can't even look at this documentation to verify any of the conclusions of their reverse engineering.
Could somebody explain to me the "flaws" of the office documents format? Besides not being open format, that is. This is a genuine question for a genuinely interested person.
Mod parent up! His link to an article in Joel Spolsky's blog is very relevant, and the article puts this whole code release into perspective!
is that they're ready with their new "standard", and they're confident that that won't be Reverse Engineered....
Exceeding the recommended torque is not recommended.
Just look at the simple .msg file format!
Just because Lucy has always jerked the football away doesn't mean Charlie Brown won't get to kick it this time.
Have they learned nothing from Bill? And before his body^H^H^H^Hchair was even cold too!
I can't resist the flame bait. The suite certainly has flaws but I'm having a hard time thinking of a better suite. I can think of a free one that no one can seem to give away :)
Go ahead. You know you want to flame me...
/LabMonkey09
This action will buy Microsoft enough time to come up with some new crap for the masses to decode.
1. This is VERY important: it means that, inter alia, all documents published up to now are no longer hostage; for a fee they can be recovered.
Nor can anyone seriously claim that, if you migrate to FOSS you will be __locked_out__.
2. We need the same for Exchange and Active Directory aka Kerberos+LDAP
That would really level the Enterprise playing field.
Congratulations and keep going, Frau Neelie Kroes
I would disagree that the protocol shouldn't be patentable. Especially in contract-driven programming, the protocol is really the only thing that matters.
This is my sig.
Or you could NOT be a fucking retard and just use CSV.
CSV is crap. These days, customers that want to use Excel in an application want all the formulas and formatting. Generating Excel XML is rather popular, but the idea of being able to work with an Excel chart and injecting VBA code into an Excel document is downright intoxicating.
Quite honestly, this move to open file formats will entrench Microsoft Office even -more-, as corporations that love Excel will find themselves building applications that use it.
Ultimately, what's going to happen is that Microsoft, or one of its partners, will wind up releasing Excel file builders for .NET, and then you'll see an explosion of Excel content in corporate intranets. There won't be that much CSS and DHTML any more; it will all be just Excel, binary.
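For anyone wondering what "generating Excel XML" looks like in practice, here is a minimal sketch that emits an XML Spreadsheet 2003 (SpreadsheetML) document with one formula cell. The helper names `cell` and `workbook` are made up for this example, the output has not been validated against Excel here, and it covers only values and formulas, not styles, charts or VBA.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace used by the XML Spreadsheet 2003 format.
SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"


def cell(value, formula=None):
    """Render one <Cell>; `formula`, if given, uses R1C1 notation."""
    attrs = f" ss:Formula={quoteattr(formula)}" if formula else ""
    kind = "Number" if isinstance(value, (int, float)) else "String"
    return (f'<Cell{attrs}><Data ss:Type="{kind}">'
            f'{escape(str(value))}</Data></Cell>')


def workbook(sheet_name, rows):
    """rows: a list of rows, each a list of (value, formula-or-None) pairs."""
    body = "".join(
        "<Row>" + "".join(cell(v, f) for v, f in row) + "</Row>"
        for row in rows
    )
    return (
        '<?xml version="1.0"?>\n'
        '<?mso-application progid="Excel.Sheet"?>\n'
        f'<Workbook xmlns="{SS_NS}" xmlns:ss="{SS_NS}">'
        f"<Worksheet ss:Name={quoteattr(sheet_name)}><Table>{body}</Table>"
        "</Worksheet></Workbook>\n"
    )


# Two numbers and a cell that sums the two cells above it.
doc = workbook("Totals", [
    [(2, None)],
    [(3, None)],
    [(5, "=SUM(R[-2]C:R[-1]C)")],
])
print(doc)
```

The cached value (5) is written alongside the formula so a reader that does not recalculate still has something to show.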
This is my sig.
t they can do that despite using 20 year old spaghetti code
Spaghetti code can be good stuff. Remember, OOP only happened when computers got fast enough to deal with the concept of objects; OOP was never the fastest way to do things. The -fastest- way for a program to run is to have all sorts of nasty pointer tricks: pack memory together as tightly as possible, watch the alignment, twiddle bits, and use tricks with the virtual memory space (aka the address) to help classify what the pointer is pointing to without having to dereference it explicitly. All of that stuff, my friend, makes for some incomprehensible code, but you can get systems to go very fast that way. Remember too that Microsoft culture used to be about being faster and more flexible than mainframes. DOS and early Windows were written in straight assembly language, and it wasn't until WinNT that the true realm of big bloat started kicking in.
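One example of classifying a pointer without dereferencing it is low-bit tagging: if every object is 8-byte aligned, the bottom three bits of its address are always zero and can carry a type tag. The sketch below models addresses as plain integers. The tag values and function names are invented for illustration, and nothing here is a claim about what Office's code actually does.

```python
TAG_BITS = 3                    # 8-byte alignment leaves the low 3 bits free
TAG_MASK = (1 << TAG_BITS) - 1  # 0b111

# Hypothetical type tags, for illustration only.
TAG_INT, TAG_STRING, TAG_LIST = 1, 2, 3


def tag_pointer(address: int, tag: int) -> int:
    """Pack a small type tag into the unused low bits of an aligned address."""
    if address & TAG_MASK:
        raise ValueError("address is not 8-byte aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the free bits")
    return address | tag


def tag_of(pointer: int) -> int:
    """Read the type tag without touching the memory the pointer refers to."""
    return pointer & TAG_MASK


def address_of(pointer: int) -> int:
    """Strip the tag to recover the real, aligned address."""
    return pointer & ~TAG_MASK


p = tag_pointer(0x1000, TAG_STRING)
print(hex(p), tag_of(p), hex(address_of(p)))
```

The price is exactly the kind of code the comment describes: every dereference has to remember to mask the tag off first.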
This is my sig.
You're welcome to spend the next 5 years finding out all the stuff we did wrong. Heck, on the way, maybe you'll find a security hole that forces all corporations to upgrade to the latest software, since we'll be dropping all support and maintenance for this crap now.
whatever happened to that lump of shit? did he finally get flushed?
He's busy shilling for Barry-O.
*Patents*. Microsoft has patents that may cover your implementations of the formats. Neither this notice nor Microsoft's delivery of the documentation grants any licenses under those or any other Microsoft patents. However, the formats may be covered by Microsoft's Open Specification Promise (available here: http://www.microsoft.com/interop/osp). If you would prefer a written license, or if the formats are not covered by the OSP, patent licenses are available by contacting iplg@microsoft.com
Are they hoping enough OSS developers will read and work off these specs then cry patent violation?
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
They're just doing this so that a format they created becomes de facto, and any use of other formats will greatly diminish. It'll just enable them to make more money.
Only 30 years to go until all the nuances of WMV & AVI are known. What combination of 2 & 0x2000 means mp3?
...to see the file formats for Outlook's PST files.
K
--- I was far from home, and the spell of the Eastern sea was upon me. -Lovecraft-
We want to encourage more behavior like this.
I just saw Rush in concert on Friday and was reminded by this MS crap (yes, crap!) of a phrase in Tom Sawyer: "those who wish to seem."
MS wants people to think they are opening up but this has been pointed out before:
It is important to note that open source developers, whether commercial or non-commercial, will not need a patent license for the development of implementations of these protocols or for the non-commercial distribution of these implementations, according to Microsoft's Patent Pledge for Open Source Developers.
You can develop implementations of these protocols all you want, JUST DON'T DISTRIBUTE THEM COMMERCIALLY. Isn't that pretty useless for most open source developers? Note also how it's worded, to make you THINK it's open, when in reality it is not. They just wish to seem "open" but not to be.
Unless something changed, "those who wish to seem" is in Limelight, not Tom Sawyer =)
Offtopic - I'm gonna go see Rush the 19th.
I'm starting to think GNU is the problem with "GNU/Linux" these days.
So, they released the spec, great.
Unfortunately, everyone else already reverse engineered it, and *basically* had perfect support anyway.
I fear the Y2038 bug
The Word format document is interesting. Much of this has previously been reverse engineered, but it's good to see the documentation. First, there are several layers of packaging and encapsulation before you get to the actual content (the "WordDocument stream"). The actual content is an indexed collection of "characters", but they're not stored in sequence. Section 2.4.1 (page 37) describes the algorithm for retrieving character N. It's clearly essential to do some caching to read the document efficiently. This seems to be a mechanism to allow "fast save", where only parts of the document file are updated.
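To make the "characters not stored in sequence" point concrete, here is a minimal sketch of a piece-table lookup with a cached index. It is a simplified model and not the layout from the specification: the `Piece` fields, the `PieceTable` class, and the use of cp1252 for one-byte text are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    """One run of consecutive characters (hypothetical, simplified layout)."""
    cp_start: int     # index of the first character this piece covers
    offset: int       # byte offset of the piece's text inside the stream
    compressed: bool  # True: 1 byte per character, False: 2 bytes (UTF-16LE)


class PieceTable:
    def __init__(self, stream: bytes, pieces: List[Piece], cp_end: int):
        self.stream = stream
        self.pieces = sorted(pieces, key=lambda p: p.cp_start)
        if not self.pieces or self.pieces[0].cp_start != 0:
            raise ValueError("pieces must start at character 0")
        # Cache the start positions once so each lookup is a binary search
        # instead of a walk over every piece.
        self.starts = [p.cp_start for p in self.pieces]
        self.cp_end = cp_end  # one past the last valid character index

    def char_at(self, cp: int) -> str:
        """Return character number `cp` of the logical text."""
        if not 0 <= cp < self.cp_end:
            raise IndexError(cp)
        piece = self.pieces[bisect_right(self.starts, cp) - 1]
        rel = cp - piece.cp_start
        if piece.compressed:
            pos = piece.offset + rel
            return self.stream[pos:pos + 1].decode("cp1252", errors="replace")
        pos = piece.offset + 2 * rel
        return self.stream[pos:pos + 2].decode("utf-16-le")

    def text(self) -> str:
        return "".join(self.char_at(cp) for cp in range(self.cp_end))


# The logical text "Hello, world" is stored out of order: the second half
# sits first in the stream as 1-byte text, the first half follows as UTF-16.
stream = b"world" + "Hello, ".encode("utf-16-le")
table = PieceTable(stream, [Piece(0, 5, False), Piece(7, 0, True)], cp_end=12)
print(table.text())
```

A layout like this would also explain how a "fast save" can work: new text can be appended to the end of the stream and only the table of pieces needs rewriting.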
Within Word files, there are lists of properties which apply to a range of characters. This is the basic structure of formatting information. It's not, though, the only form of formatting information. Some info, like paragraph boundaries, table cell boundaries, and section boundaries, is stored as character values in the character stream.
This should be a big help to Open Office's import filter, which has trouble getting correct positioning info from Word documents.
I'd like to know if rendering directions are included. There are a few reasons why after using OpenOffice as my main office software for a long time I had to buy MS Office.
1) Drawings get all messed up
2) Tried to edit proposals that got passed between Office and OpenOffice. Do that a few times and you will see an unholy mess. Hint, styles were involved.
In other words OOXML says "like Word97", etc., so I wonder: is the "like" part included in the spec, or is that "just implementation"? If it is not included, then it may not help OOo achieve equivalence.
Of course there are also the OOo devs who do not like equivalence. I have submitted lots of bug reports and enhancement requests. What about a client who requires you use the password function, even though it is broken (and I wrote a letter to them about it), because it is seen as a minimal attempt at security in the eyes of auditors. Using OOo means it looks like I am intentionally removing security. I would prefer an OpenOffice that achieves complete equivalence and then adds value on top of that, instead of including holy battles. Give me the option to draw charts just like MS Office, and use the stupid broken password function. The best part of OpenOffice is the autocomplete, which although not as enhanced as I would wish it saves me many keystrokes, also Draw is quite nice. But most people in business cannot use OOo because it wrecks the documents you have to share with people. Drawback? The other guy still uses OpenOffice exclusively and I got more work to do - the stuff OOo can't handle. I still have it (as PortableApps) and use it for heavy typing but not for editing MS Office documents anymore
Why is the Access DB file format (*.mdb files) nowhere to be found? I hate it when I have a "corrupted" file and all my data is still there...
Do these licensing terms leave room for OpenOffice to improve support for Microsoft formats? I think it could be debated whether OpenOffice is "commercially distributed".
Or will it mean a fork of OpenOffice? One fork not implementing stuff from these specifications, fit to be "commercially distributed", and one implementing the specifications and not fit to be in any paid-for distribution?
Quark XPress, the former de facto standard in DTP software, regularly had and has corrupted files. But that's no wonder - they're really worse than Microsoft in about every aspect. They were a monopolist, delivered (and still deliver) a crappy product with lots of bugs and are arrogant as hell. They won't even *talk* to you about giving out the file format specs, even if you tell them "here's the checkbook, how much do you want?"
Nowadays almost everything we get is Adobe InDesign, but Adobe's not much better, also in almost every aspect. Installation-corrupting auto-update anyone? Legit-license-disabling DRM anyone? Go Adobe, go. You're the champion. Oh, and it would be nice if we could get working PDF printers in 10.5. It's not as if it weren't out for a while now or as if the 10.5 release had been a sudden surprise.
Who is General Failure and why is he reading my hard disk?
I for one hope this will finally allow the OpenOffice.org dev team to enable Draw to open Publisher files and to save to them. That is, to my knowledge, the last bit of major compatibility missing from OpenOffice.org at this time. That would also allow for people who use Publisher a lot to finally be able to make the switch to Draw without re-creating all of their old documents and for them to share with people who still use Publisher.
That would be a major boon for OO.org if you ask me. :)
~Petaris "The world is open. Are you?"
It seems lately that Microsoft can't keep up with the competition:
"Linux on Asus EEE? Uh, we'll throw WinXP at it, until we come up with something decent. Crap, we'll have to implement ODF before our OOXML into Office! Well, drop old document specs to our competitors, that'll keep them busy implementing our standards."
These specs are nothing but bait that will keep F/LOSS off course from delivering the best open document standards to the world. The only thing that we need is a mass converter from .doc to .odt or .pdf, not another office app that creates even more non-standard documents.
Ceterum censeo Microsoft esse delendam.