Microsoft Releases Pre-2007 Binary File Format Specs
An anonymous reader writes "Microsoft has released the specifications for the binary file formats used by pre-2007 Microsoft Office applications. They're accurate this time! Honest! While the documents are enormous (Word alone requires 533 pages; Excel runs over 1000 plus another 850 pages for the Office 2007 binary format), they hopefully will be useful to developers trying to create or extract information from Microsoft Office files (which, despite their flaws, have been the de facto standard in many fields for some time now)."
I know it's old hat by now, but back in the Office 98 days, file corruption was a big deal.
I wonder what was going on, but it occurs to me that now I could conceivably actually back out the errors, and figure the thing out.
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Personally, the VBA .pdf is the most interesting of the lot.
Wouldn't want to sound ungrateful about some of the tasty bits not present, so let me hope that this is yet another positive step that encourages follow-on.
Get thee glass eyes, and, like a scurvy politician, seem to see things thou dost not.--King Lear
A far cry from the 6,000 pages for OOXML ..
...to finally share proper doc of the old standards. This just means they feel confident that MS Office 2007 will take firm enough root to ensure that the old game of catch up for FOSS projects will stay the same.
And wasn't it just yesterday some twits had an article about how MS is changing/will change? I sure wouldn't hold my breath!
Caveat Utilitor
Did anyone else notice this is coming out on the first business day at MS that is Gates free...?
is WHEN are they going to release the source code to the Flight Sim in Excel 98?
Wow... this is great! A decade or so late, but... great.
EP
the "license" conditions no doubt will contain several pitfalls for anyone who actually wants to use it to implement a file input/output filter in conjunction with free software... and the other problem is once having seen the specification, you'll never be able to safely work on other free software projects again...
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Isn't this old news? I mean, it's been covered on Slashdot at least twice now. (Dear timothy, I'd like to introduce you to my friend Google.)
Yes, the formats are large and complicated, but for a variety of good, if antiquated, reasons. I'd suggest anyone interested read Joel Spolsky's blog post on it (which, being posted last February, isn't news either but hey, this is Slashdot).
"What do you despise? By this are you truly known." --Princess Irulan, Manual of Muad'Dib
/)
And to think, this happens the day after Gates steps down...
I honestly believe that they are trying to give out complete information. It's just that they have 20 years of spaghetti code to somehow shape into an API document. I doubt if anyone at Microsoft really knows how the code works.
With a 1000 page document describing how to list off spreadsheet information, I shudder to think about how organized their kernel is.
The released specifications are in a pre-2007 MS Office binary file format.
I can't understand the negativity. Sure Microsoft has an unpleasant past, but this is a good move on their part and should be met with nothing less than praise.
We want to encourage more behavior like this.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
The only problem? They released them in Word format...
(Okay, not really -- someone must have realized that that would be silly.)
Wait ... what did I just say? ...
I don't think I'm feeling well. I'm gonna go lie down now.
Or is it Wholly Crap?
I guess we'll see. I'm rather shocked by this. This is a kind of "giving in" gesture that is MOST uncharacteristic of Microsoft. Is this what the "Post-Gates" Microsoft will be like? How much more cooperative spirit will the community enjoy?
It's always a trap.
Help stamp out iliturcy.
From here -> You or anyone else has nothing to worry about. Microsoft has changed its tune:
Microsoft irrevocably promises not to assert any Microsoft Necessary Claims against you for making, using, selling, offering for sale, importing or distributing any implementation to the extent it conforms to a Covered Specification ("Covered Implementation"), subject to the following. This is a personal promise directly from Microsoft to you, and you acknowledge as a condition of benefiting from it that no Microsoft rights are received from suppliers, distributors, or otherwise in connection with this promise. If you file, maintain or voluntarily participate in a patent infringement lawsuit against a Microsoft implementation of such Covered Specification, then this personal promise does not apply with respect to any Covered Implementation of the same Covered Specification made or used by you. To clarify, "Microsoft Necessary Claims" are those claims of Microsoft-owned or Microsoft-controlled patents that are necessary to implement only the required portions of the Covered Specification that are described in detail and not merely referenced in such Specification. "Covered Specifications" are listed below.
Descriptive specifications alone are never good enough.
Trusted Computing FAQ | Free Dawit Isaak!
a) Does this mean the standard GNU response is now invalid?
b) If someone writes a FOSS implementation of a .doc/.xls viewer, does that mean MSFT could more easily throw their weight to declaring .doc a standard? (Since a standard ought to have multiple implementations, although maybe office 2003 and 2007 counts as two, or office and word/excel/powerpoint viewer :p )
If you think Word is only dealing with "saving text" you need to spend some time learning what it can do. The format specs are big because their users' needs are big.
Good News:
MS is releasing specs on Word & Excel!
Bad News:
The documentation can only be opened using WordStar 2.0...
(But I hear they're working on a version for TROFF)
Perhaps they have a few of:
:-)
"This Page Intentionally Left Blank"
I have known about this since August 2007 and even submitted it to Slashdot twice, although it didn't get picked for the front page. See http://developers.slashdot.org/~mastropiero/journal/
This is definitely useful for app developers of free software.
Raymond Chen (well known Microsoft blogger) linked to Joel on Software today about Why the MS Office file formats are so complicated
You are right. This is a great step forward. However, I think the Slashdot community, with its cynical eye on Microsoft, is reminding us to take this in the proper context. It remains to be seen whether this is the beginning of a slow but steady change of course for the world's largest software company, or whether this is a fake-out to fool people into thinking that Microsoft is nice.
Personally, I suspect that this reflects internal conflict within Microsoft, with some portions of the behemoth trying to do something good, while another faction is still trying to squeeze money out of Microsoft's unique position in the software world.
In any case, remember how some people would say, "You always complain about Microsoft! What would it take for you to admit that Microsoft is doing something good?"
#2 on the list was: Stop hijacking the HTML standard and make a compliant browser! Then they put out IE7. (Not perfect, but a heckuva lot better than IE6!)
#1 on the list was: Open up the Word document file format. Okay, so they've done that. (Again, not perfect, but a heckuva lot better than what went on before!)
Congrats, Microsoft. You did it. A little late in coming, and you really didn't impress us with your OOXML fiasco waving that money around, but I'm willing to adopt a wait-and-see attitude to see whether it's still those same money-grubbing upper level managers that are in control, or whether this really is a new day at Microsoft.
404555974007725459910684486621289147856453481154 in hex is "You sank my Battleship?"
[GPG key in journal]
Where is Visio?
Or because it uses the highly popular "Big Ball of Mud" software architecture so prevalent among Windows developers.
Deleted
"In addition to posting this documentation, Microsoft also published a list indicating which of the published protocols built into the following products are covered by Microsoft patents or patent applications"
"Some of the Microsoft protocols include patented inventions, and others do not. You may benefit from a patent license if you are distributing implementations of these protocols commercially or if you use an implementation of any of the protocols covered by Microsoft patents"
davecb5620@gmail.com
This might be Microsoft's way to help combat global warming, by freezing Hell. :-)
home
"This is definitely useful for app developers of free software"
You mean as in you work on the implementation for free and Microsoft benefits from any commercial developments.
davecb5620@gmail.com
I'm sure this move was somewhat forced to please the European Union or something.
In any case, I'm sure this would be just what Sun needs to make OpenOffice(.org) more compatible with MS Office than MS Office itself :)
Crystal clear to me .. :)
davecb5620@gmail.com
20 years ago, at what was the world's largest software project, we used to joke that if we wanted to ruin our competition, we would send them a copy of our specs. It looks to me that Microsoft got the same idea.
"To those who are overly cautious, everything is impossible. "
Microsoft releases API/protocol specs | Feb. 2008
http://www.theregister.co.uk/2008/02/21/microsoft_goes_open/
Microsoft releases further specs | April 2008
http://www.theregister.co.uk/2008/04/08/microsoft_posts_protocol_documents/
And they state that more will come after gathering feedback between then and June.
Between now and June it will garner feedback from the developer community. Then, at the end of June, Microsoft will publish the final versions of technical documentation - along with definitive patent licensing terms.
Hehe. I love how the PDF was produced by Microsoft Word 2007, nice little dig at Adobe after they kicked up a fuss about it being installed by default.
This means, as far as I know, that GPL implementations are not allowed. So it's an even worse situation than before, because Free Software developers can't even look at this documentation to verify any of the conclusions of their reverse engineering.
Could somebody explain to me the "flaws" of the office documents format? Besides not being open format, that is. This is a genuine question for a genuinely interested person.
is that they're ready with their new "standard", and they're confident that that won't be Reverse Engineered....
Exceeding the recommended torque is not recommended.
Just because Lucy has always jerked the football away doesn't mean Charlie Brown won't get to kick it this time.
Have they learned nothing from Bill? And before his body^H^H^H^Hchair was even cold too!
I can't resist the flame bait. The suite certainly has flaws but I'm having a hard time thinking of a better suite. I can think of a free one that no one can seem to give away :)
Go ahead. You know you want to flame me...
/LabMonkey09
I would disagree that the protocol shouldn't be patentable. Especially to contract driven programming, the protocol is really the only thing that matters.
This is my sig.
t they can do that despite using 20 year old spaghetti code
Spaghetti code can be good stuff. Remember, OOP only happened when computers got fast enough to deal with the concept of objects, not that OOP was ever the fastest way to do things. The -fastest- way for a program to run is to have all sorts of nasty pointer tricks, pack memory as tightly together as possible, watch the alignment, twiddle bits, and use tricks with the virtual memory space (aka the address) to help classify what the pointer is pointing to without having to dereference it explicitly. All of that stuff, my friend, makes for some incomprehensible code, but you can get systems to go very fast that way. Remember too that Microsoft culture used to be about being faster and more flexible than mainframes. DOS and early Windows were written in straight assembly language, and it wasn't until WinNT that the true realm of big bloat started kicking in.
This is my sig.
You're welcome to spend the next 5 years finding out all the stuff we did wrong. Heck, on the way, maybe you'll find a security hole that forces all corporations to upgrade to the latest software, since we'll be dropping all support and maintenance for this crap now.
Wise man say building all corporate data on Excel spreadsheets is building a house of cards.
Seriously, there are issues with using Excel for data. Corruption is one. Companies often have no clue what they are doing as they build this house of cards. Links to files include hard-coded server names, so if you replace a server or move any data, things break. Yes, it can be prevented. But still, not everyone knows how to do this.
Corporations that build everything on Excel will experience Microsoft Vendor Lock-In (tm). That includes being tied to Microsoft Vista and Microsoft Windows 7. Depending on how bad that is, it may be no fun at all trying to move to a new platform. And if they think it is bad trying to convert 5 years of data locked up in Excel format, think of all the fun of converting 10 years' worth of data.
vi +
*Patents*. Microsoft has patents that may cover your implementations of the formats. Neither this notice nor Microsoft's delivery of the documentation grants any licenses under those or any other Microsoft patents. However, the formats may be covered by Microsoft's Open Specification Promise (available here: http://www.microsoft.com/interop/osp). If you would prefer a written license, or if the formats are not covered by the OSP, patent licenses are available by contacting iplg@microsoft.com
Are they hoping enough OSS developers will read and work off these specs then cry patent violation?
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife
They're just doing this so that a format they created becomes de facto, and any use of other formats will greatly diminish. It'll just enable them to make more money.
Only 30 years to go until all the nuances of WMV & AVI are known. What combination of 2 & 0x2000 means mp3?
...to see the file formats for Outlook's PST files.
K
--- I was far from home, and the spell of the Eastern sea was upon me. -Lovecraft-
The Exchange protocol specs HAVE been released.
For a site about things like basic rights, Slashdot users sure do like to censor "dissent".
We want to encourage more behavior like this.
I just saw Rush in concert on Friday and was reminded by this MS crap (yes, crap!) of a phrase in Tom Sawyer: "those who wish to seem."
MS wants people to think they are opening up, but this has been pointed out before:
It is important to note that open source developers, whether commercial or non-commercial, will not need a patent license for the development of implementations of these protocols or for the non-commercial distribution of these implementations, according to Microsoft's Patent Pledge for Open Source Developers.
You can develop implementations of these protocols all you want, JUST DON'T DISTRIBUTE THEM COMMERCIALLY. Isn't that pretty useless for most open source developers? Note also how it's worded, to make you THINK it's open, when in reality it is not. They just wish to seem "open" but not to be.
Wise man say building all corporate data on Excel spreadsheets is building a house of cards.
I couldn't agree with you more, but the more recent trend is to use Excel as the presentation layer, which is much, much safer. You build a web site that pumps the data out of the database, create Excel sheets dynamically, and you got a lot of happy Excel junkies.
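To make that concrete, here is a rough sketch of the "database in, spreadsheet out" idea, not anyone's production code. It assumes an in-memory SQLite database and writes CSV (which Excel opens) instead of a native workbook. The `sales` table, its rows and the `export_query_as_csv` helper are all made up for illustration.

```python
import csv
import io
import sqlite3


def export_query_as_csv(conn, sql):
    """Run a query and return the result as CSV text with a header row.

    Hypothetical helper: a web handler could send this text as a download
    that Excel opens directly.
    """
    cur = conn.execute(sql)
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    # Column names come from the cursor description.
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue()


# Made-up data standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, total INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("East", 120), ("West", 95)])

report = export_query_as_csv(conn, "SELECT region, total FROM sales ORDER BY region")
```

The point of the design is that the database stays the source of truth and the sheet is regenerated on every request, so nothing important lives only in a spreadsheet file.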
This is my sig.
Unless something changed, "those who wish to seem" is in Limelight, not Tom Sawyer =)
Offtopic - I'm gonna go see Rush the 19th.
I'm starting to think GNU is the problem with "GNU/Linux" these days.
So, they released the spec, great.
Unfortunately, everyone else already reverse engineered it, and *basically* had perfect support anyway.
I fear the Y2038 bug
The Word format document is interesting. Much of this has previously been reverse engineered, but it's good to see the documentation. First, there are several layers of packaging and encapsulation before you get to the actual content (the "WordDocument stream"). The actual content is an indexed collection of "characters", but they're not stored in sequence. Section 2.4.1 (page 37) describes the algorithm for retrieving character N. It's clearly essential to do some caching to read the document efficiently. This seems to be a mechanism to allow "fast save", where only parts of the document file are updated.
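For anyone trying to picture the lookup, here is a minimal sketch of a piece-table style "get character N" routine. It is a simplification under my own assumptions, not the spec's actual structures: the `Piece` fields, the `char_at` name, and the idea that a piece stores either 1 byte per character (decoded here as cp1252) or 2 bytes (UTF-16LE) are all illustrative.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    start_cp: int     # first character position covered by this piece
    offset: int       # byte offset of the piece's text in the stream
    compressed: bool  # assumed: True = 1 byte per char, False = UTF-16LE


def char_at(stream: bytes, pieces: List[Piece], end_cp: int, cp: int) -> str:
    """Return character number `cp`, wherever its piece lives in the stream."""
    if not 0 <= cp < end_cp:
        raise IndexError(cp)
    # Find the last piece whose start is <= cp (pieces are sorted by start_cp).
    starts = [p.start_cp for p in pieces]
    piece = pieces[bisect_right(starts, cp) - 1]
    delta = cp - piece.start_cp
    if piece.compressed:
        pos = piece.offset + delta
        return stream[pos:pos + 1].decode("cp1252")
    pos = piece.offset + 2 * delta
    return stream[pos:pos + 2].decode("utf-16-le")


# Text stored out of order: "world" sits first in the stream, "Hello " after it.
stream = b"world" + "Hello ".encode("utf-16-le")
pieces = [Piece(0, 5, False), Piece(6, 0, True)]
text = "".join(char_at(stream, pieces, 11, cp) for cp in range(11))
```

A real reader would build the `starts` list once and cache it rather than rebuilding it per character, which is the caching point made above.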
Within Word files, there are lists of properties which apply to a range of characters. This is the basic structure of formatting information. It's not, though, the only form of formatting information. Some info, like paragraph boundaries, table cell boundaries, and section boundaries, are stored as character values in the character stream.
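A small sketch of those two mechanisms side by side, under loud assumptions: the `(start_cp, props)` run list, the `props_at` and `paragraphs` names, and the use of a carriage return as the paragraph mark are illustrative and not taken from the document's actual property structures.

```python
from bisect import bisect_right

# Assumed for illustration: a paragraph boundary is a carriage-return character.
PARAGRAPH_MARK = "\r"


def props_at(runs, cp):
    """Look up the formatting that covers character position `cp`.

    `runs` is a sorted list of (start_cp, props) pairs; each run is assumed
    to extend until the next run starts.
    """
    starts = [start for start, _ in runs]
    i = bisect_right(starts, cp) - 1
    return runs[i][1] if i >= 0 else {}


def paragraphs(text):
    """Split the character stream on in-band paragraph marks."""
    return [p for p in text.split(PARAGRAPH_MARK) if p]


runs = [(0, {"bold": False}), (6, {"bold": True})]
body = "Hello world\rSecond paragraph\r"
```

Here `props_at(runs, 3)` falls in the first run and `props_at(runs, 6)` in the second, while `paragraphs(body)` recovers structure from the characters themselves, with no property list involved.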
This should be a big help to Open Office's import filter, which has trouble getting correct positioning info from Word documents.
I'd like to know if rendering directions are included. There are a few reasons why after using OpenOffice as my main office software for a long time I had to buy MS Office.
1) Drawings get all messed up
2) Tried to edit proposals that got passed between Office and OpenOffice. Do that a few times and you will see an unholy mess. Hint, styles were involved.
In other words, OOXML says "like Word97", etc., so I wonder whether the "like" part is included in the spec or whether that is "just implementation". If it is not included, then it may not help OOo to achieve equivalence.
Of course there are also the OOo devs who do not like equivalence. I have submitted lots of bug reports and enhancement requests. What about a client who requires you to use the password function, even though it is broken (and I wrote a letter to them about it), because it is seen as a minimal attempt at security in the eyes of auditors? Using OOo means it looks like I am intentionally removing security. I would prefer an OpenOffice that achieves complete equivalence and then adds value on top of that, instead of including holy battles. Give me the option to draw charts just like MS Office, and use the stupid broken password function. The best part of OpenOffice is the autocomplete, which, although not as enhanced as I would wish, saves me many keystrokes; Draw is also quite nice. But most people in business cannot use OOo because it wrecks the documents you have to share with people. Drawback? The other guy still uses OpenOffice exclusively and I get more work to do - the stuff OOo can't handle. I still have it (as PortableApps) and use it for heavy typing but not for editing MS Office documents anymore.
Quark XPress, the former de facto standard in DTP software, regularly had and has corrupted files. But that's no wonder - they're really worse than Microsoft in about every aspect. They were a monopolist, delivered (and still deliver) a crappy product with lots of bugs and are arrogant as hell. They won't even *talk* to you about giving out the file format specs, even if you tell them "here's the checkbook, how much do you want?"
Nowadays almost everything we get is Adobe InDesign, but Adobe's not much better, also in almost every aspect. Installation-corrupting auto-update anyone? Legit-license-disabling DRM anyone? Go Adobe, go. You're the champion. Oh, and it would be nice if we could get working PDF printers in 10.5. It's not as if it weren't out for a while now or as if the 10.5 release had been a sudden surprise.
Who is General Failure and why is he reading my hard disk?
I for one hope this will finally allow the OpenOffice.org dev team to enable Draw to open Publisher files and to save to them. That is, to my knowledge, the last bit of major compatibility missing from OpenOffice.org at this time. That would also allow for people who use Publisher a lot to finally be able to make the switch to Draw without re-creating all of their old documents and for them to share with people who still use Publisher.
That would be a major boon for OO.org if you ask me. :)
~Petaris "The world is open. Are you?"
Excel as the presentation layer, which is much, much safer. You build a web site that pumps the data out of the database, create Excel sheets dynamically, and you got a lot of happy Excel junkies.
It should be possible to use OpenOffice Calc or Gnumeric as the client for this presentation layer.
vi +
It seems lately that Microsoft can't keep up with the competition:
"Linux on Asus EEE? Uh, we'll throw WinXP at it, until we come up with something decent. Crap, we'll have to implement ODF before our OOXML into Office! Well, drop old document specs to our competitors, that'll keep them busy implementing our standards."
These specs are nothing but bait that will keep F/LOSS off course from delivering the best open document standards to the world. The only thing that we need is a mass converter from .doc to .odt or .pdf, not another office app that creates even more non-standard documents.
Ceterum censeo Microsoft esse delendam.