Microsoft Releases Office Binary Formats
Microsoft has released documentation on their Office binary formats. Before jumping up and down gleefully, those working on related open source efforts, such as OpenOffice, might want to take a very close look at Microsoft's Open Specification Promise to see if it seems to cover those working on GPL software; some believe it doesn't. stm2 points us to some good advice from Joel Spolsky to programmers tempted to dig into the spec and create an Excel competitor over a weekend that reads and writes these formats: find an easier way. Joel provides some workarounds that render it possible to make use of these binary files. "[A] normal programmer would conclude that Office's binary file formats: are deliberately obfuscated; are the product of a demented Borg mind; were created by insanely bad programmers; and are impossible to read or create correctly. You'd be wrong on all four counts."
The original post is brought to you by the Microsoft corporation
Joel's articles are a joy to read. No matter what time I receive the email about a new article by Joel, it will be read on the spot.
that the hell would rather freeze over - well, looks like Satan is now skating on frozen magma lakes...
This is Slashdot. Common sense is futile. You will be modded down.
Except... we don't all have this OLE thing on our computers, nor do we all talk it more easily than the languages we deal with now.
But let's say you do. Now you have to find an API to do it for you. As an everyday guy, I can write my own HTTP parser, IP connection manager and so forth, w/o requiring a special API to do it. As a smarter guy, I'd look for the libraries that can do some of the heavy lifting for me. It's flexibility. The document structure is going to affect how I write code to work with it.
W/ office docs, Joel is arguing, I have to know the one way to interact with them. There's no TIMTOWTDI about it. There's no intuitive way to do it either. Were the format simple, be it "sanely" constructed CSV, XML, RTF, etc., I'd have more choices. I'd rather use the most well known, bestest of the best, but sometimes it's not intuitive and just hampers work. It shuts out programmers who would think open(file); readSomeData(); construct_a_structure();. Now it's structure = oneOfAHandfulOfParsersThatWillEverWork().
The worst part is that there is no way *I* can choose how to mess with documents. I have to either a) spend more time figuring out the native format, unless I'm a genius or have an MS crony behind me, or b) parse it incorrectly, and then go back and fix any number of things, including my methodology. Remember how the various encodings affected document format? I.e. UTF-8, 16, Latin-1, Unicode, etc etc etc..
Joel, you're not right.
One may wonder, why release the documentation now?
If you read Joel's blog you'll see the formats are very old, and consist primarily of C structs dumped to OLE objects, which are in turn dumped directly to what we see as XLS, DOC and similar files.
There's almost no parsing/validation at load time.
Having this in well laid out documentation may reveal quite a lot of security issues with the old binary formats, which could lead to a wave of exploits. Exploits that won't work on Microsoft's new XML Office formats.
So while I'm not a conspiracy nut, I do believe one of Microsoft's goals here is to assist the process of those binary formats becoming obsolete, to drive Office 2007/2008 adoption.
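To make the "C structs dumped to disk" point concrete, here is a minimal C++ sketch of that style of save and load, plus the kind of load-time check the comment above says is largely missing. The `CellRecord` layout, the `kCellMagic` tag and the function names are hypothetical, made up for illustration; nothing here is taken from the actual Office specifications.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <type_traits>

// Hypothetical record, invented for this sketch (not from any Office spec).
struct CellRecord {
    std::uint32_t magic;  // arbitrary tag identifying the record type
    std::uint16_t row;
    std::uint16_t col;
    double        value;
};
static_assert(std::is_trivially_copyable<CellRecord>::value,
              "raw byte dumps only make sense for trivially copyable types");

constexpr std::uint32_t kCellMagic = 0x43454C4Cu;  // arbitrary value for this sketch

// Save by writing the in-memory bytes as-is. The on-disk layout is whatever
// the compiler and CPU chose (padding, endianness), which is the portability
// problem with this approach.
bool save_cell(const char* path, const CellRecord& rec) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    const bool wrote = std::fwrite(&rec, sizeof rec, 1, f) == 1;
    const bool closed = std::fclose(f) == 0;
    return wrote && closed;
}

// Load the bytes back, then validate before trusting any field.
bool load_cell(const char* path, std::uint16_t max_rows, std::uint16_t max_cols,
               CellRecord& out) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    CellRecord tmp{};
    const bool read_ok = std::fread(&tmp, sizeof tmp, 1, f) == 1;
    std::fclose(f);
    if (!read_ok) return false;                                    // short file
    if (tmp.magic != kCellMagic) return false;                     // wrong record type
    if (tmp.row >= max_rows || tmp.col >= max_cols) return false;  // out of range
    out = tmp;
    return true;
}
```

Dropping the three checks in `load_cell` gives the fast, trusting loader; keeping them is what lets a program reject a bad file instead of indexing past the end of a table.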
I'd assume it has something to do with the antitrust action the EU was taking. Didn't they order that Microsoft had to open all their protocols/formats?
I would like to point out another good option Joel doesn't have on his list. It's a software called OfficeWriter, from a company named SoftArtisans in Boston. When I last checked/worked there, it was capable of generating Excel and Word docs on the server, and I believe PowerPoint was probably coming relatively soon. Creating a product that can write Office documents isn't quite as impossible in terms of labor as Joel is saying.... but it's still way beyond any hobby project. Plus, he is suggesting that you use Excel automation or the like through scripts to create documents on the server, which is a decent suggestion, if you want Excel or Word to constantly crash and lock up your server, and you enjoy rebooting them every day. If you want to do large scale document generation on a server you are going to need something like OfficeWriter. -Vosotros/Matt
Only took something like 5 years*, eh? :P
* I can't actually remember how long ago it was
which is totally what she said
As PJ pointed out over on Groklaw, MS are giving a "Promise" not to sue but this is very very far from a license. Careful analysis suggests that any GPL'd software using these binaries could easily fall foul of the fury of MS lawyers.
Nothing to see here. Move along.
Why does the author avoid any mention of ODF or OpenOffice as alternatives to work with MS Office docs? He seems stuck on 'old' formats like WKS or RTF.
I know OOo is not a perfect Word/Excel converter, but it has served me marvelously since the StarOffice days. I wish that there was a simple command-line driven tool that could convert .docs or .xls to ODS or PDF using the OOo code. Anyone know of such a tool?
Goodbye Slashdot. You've changed.
How to look nice and offload some work in one shot.
With this M$ can shut off critics that say proprietary formats are evil, especially those using the long-term viability argument.
Now that the formats are documented, hordes of open source hobbyists can develop (for free) code and tools to read / convert the old Office formats. Then M$ will say "See, we do not lock out anybody, there are myriads of ways to read our old crap".
Smart indeed. And anyway these formats do not hold any competitive advantage anymore since most users are coping with the new ones now.
Is this retaliation to the impending doom of the OOXML format requesting ISO standard status? Is MS's thinking: "Right, ISO has failed us, so we'll release the binaries so everyone keeps using the office formats anyway"?
ilovegeorgebush
"[A] normal programmer would conclude that Office's binary file formats: are deliberately obfuscated; are the product of a demented Borg mind; were created by insanely bad programmers; and are impossible to read or create correctly. You'd be wrong on all four counts..." ...It's something far more sinister.
(Sorry, sometimes ya just gotta get it out)
I'm a she-slashdotter... but I make up for it by living with my folks.
Just as OOXML files and WMF make references to Windows or Office programming APIs, I think it would come as no surprise to anyone that Office binary formats would also make similar references. The strategy behind it would be obvious -- to tie the data to the OS and to the software as closely as possible.
I don't see why just because something is organized filesystem-like (not such an awful idea) means it has to be hard to understand. Filesystems, while they can certainly get complicated, are fairly simple in concept. "My file is here. It is *this* long. Another part of it is over here..."
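The "my file is here, it is this long, another part is over here" idea can be sketched in a few lines of C++. This is a deliberately simplified model, not the real OLE compound file layout: the `Container` struct, the `kEndOfChain` sentinel and `read_stream` are all hypothetical names for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::int32_t kEndOfChain = -1;  // hypothetical sentinel for this sketch

struct Container {
    std::size_t sector_size;            // nominal size of each sector
    std::vector<std::string> sectors;   // sector payloads
    std::vector<std::int32_t> next;     // next[i] = sector after i, or kEndOfChain
};

// Follows the chain from `start` and collects `length` bytes into `out`.
// Returns false on a malformed chain: bad index, loop, or chain too short.
bool read_stream(const Container& c, std::int32_t start, std::size_t length,
                 std::string& out) {
    out.clear();
    if (c.next.size() != c.sectors.size()) return false;
    std::size_t hops = 0;
    std::int32_t cur = start;
    while (out.size() < length) {
        if (cur < 0 || static_cast<std::size_t>(cur) >= c.sectors.size()) return false;
        if (++hops > c.sectors.size()) return false;  // guards against cycles
        const std::string& sector = c.sectors[static_cast<std::size_t>(cur)];
        out.append(sector, 0, length - out.size());   // append clamps to sector size
        cur = c.next[static_cast<std::size_t>(cur)];
    }
    return true;
}
```

The concept really is small; most of the lines are there to refuse broken chains rather than to follow good ones.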
Wait, I thought you were trying to convince us that this doesn't reflect bad programming...
Ah, I see, you're trying to imply that it's the very design of the Word-style of word processor that is inherently flawed. Finally we're in agreement.
Anyways, it's no surprise that it's all the OLE, spreadsheet-object-inside-a-document, stuff that would make it difficult to design a Word killer. (How often do people actually use that anyway?) It would basically mean reimplementing OLE, and a good chunk of Windows itself (libraries for all the references to parts of the operating system, metafiles, etc), for your application. However, it certainly can be done. I'm not sure it's worth it, and it can't be done overnight, but it's possible. However you'll have a hard time convincing me that Microsoft's mid-90's idea of tying everything in an application to inextricable parts of the OS doesn't reflect bad programming. Like, what if we need to *change* the operating system? At the very least, it reflects bad foresight, seeing as they tied themselves to continually porting forward all sorts of crud from previous versions of their OS just to support these application monstrosities. This is a direct consequence of not designing the file format properly in the first place, and just using a binary structure dump.
It reminds me of a recovery effort I tried last year, trying to recover some interesting data from some files generated on a NeXT cube from years ago. I realized the documents were just dumps of the Objective C objects themselves. In some ways this made the file parseable, which is good, but in other ways it meant that, even though I had the source code of the application, many of the objects that were dumped into the file were related to the operating system itself instead of the application code, which I did _not_ have the source code to, making the effort far more difficult. (I didn't quite succeed in the end, or at least I ran out of time and had to take another approach on that project.)
In their (MS's) defense, I used to do that kind of thing back then too, (dumping memory structures straight to files instead of using extensible, documented formats), but then again I was 15 years old (in 1995) and still learning C.
Still missing the binary format for Access. Still, never mind, it's not that hard to work out.
thank God the internet isn't a human right.
I'd assume it has something to do with the antitrust action the EU was taking. Didn't they order that Microsoft had to open all their protocols/formats?
As far as I remember, they only insisted on protocols (it was on the basis of a complaint from server OS vendors that MS was tying their market-leading desktop OSs to their server OSs and gaining an unfair advantage).
The second "workaround" is the same as the first, only a little more proactive. Instead of saving my documents as binary files and then converting them to another format, I should save them as a non-binary format from the start! Mission accomplished! Oh wait - how do I get the rest of the world to do the same? That could be a problem.
I fail to see the problem with using the specification Microsoft released to write a program that can read and write this binary format. If Microsoft didn't want it to be used, they would not have released it. Even if Microsoft tried to take action against open source software for using the specs that they opened, how could Microsoft prove that the open source software used those specs as opposed to reverse engineering the binary format on their own? I think this is a non-issue.
Joel is being awfully apologetic. I understand why they are bad formats, but it doesn't change the fact they are bad.
If an officer ever threatens to taze you, say you have a pacemaker.
Spolsky's advice explains that the format code is extremely bad code from the POV of a programmer picking it up to use starting now. Because it grew like a coral reef, starting so long ago that interoperability with anything else but the app's codebase at the time was not in the designs. And every new feature was thrown in as a special case, rather than any general purpose facility for kinds of features or future expansion. The Microsoft legacy that leverages every year's market position into expansion the next year.
But we're not Microsoft, and we don't have the requirements MS had when making these formats. So we should by no means perpetuate them. We should do now what MS never had reason to do: upgrade the code and drop the legacy stuff that makes most of the code such a burden, but doesn't do anything for the vast majority of users today (and tomorrow).
That's OK, because Microsoft has done that, too, already. The MS idea of "legacy to preserve" is based on MS marketing goals, which are not the same as actual user requirements. So that legacy preservation doesn't mean that, say, Office 2008 can read and write Word for Windows for Workgroups for Pen Computing files 100%. MS has dropped plenty of backwards compatibility for its own reasons. New people opening the format for modern (and future) use can do the same, but based on user requirements, not emphasis on product lines if that's not a real requirement.
So what's needed is just converters that use this code to convert to real open formats that can be maintained into the future. Not moving this code itself into apps for the rest of all time. Today we have a transition point before us which lets us finally turn our back on the old, closed formats with all their code complexity. We can write converters that can be used to get rid of those formats that benefited Microsoft more than anyone else. Convert them into XML. Then, after a while, instead of opening any Word or Excel formats, we'll be exchanging just XML, and occasionally reaching for the converter when an old file has to be used currently. MS will go with that flow, because that's what customers will pay for. Soon enough these old formats will be rare, and the converters will be rare, too.
Just don't perpetuate them, and Microsoft's selfish interests, by just embedding them into apps as "native" formats. Make them import by calling a module that can also just batch convert old files. We don't need this creepy old man following us around anymore.
--
make install -not war
When Excel started importing 1-2-3 documents, the right way to do that would be to create an importer to your own native format. Not to munge a new slightly different format into your existing structures. Yes, you'd have had to convert some dates between 1900 and 1904 formats (and maybe, detect cases where the old 1-2-3 bug could have affected the result) but at least you wouldn't be trying to maintain two formats for the rest of time.
If this is an example of programmers throughout history always doing exactly the right thing, I'd hate to see an example of code where the original author regretted some mistakes that had been made.
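For anyone wondering what the 1900/1904 conversion mentioned above would involve, here is a small C++ sketch. It assumes the commonly documented 1462-day offset between Excel's two date systems and the phantom February 29, 1900 (serial 60) inherited from the 1-2-3 bug; the function names are invented for this example.

```cpp
// Excel's two date systems differ by a fixed 1462 days.
constexpr long kDateSystemOffset = 1462;
// Serial 60 in the 1900 system is February 29, 1900, a day that never
// existed; it is kept for compatibility with the 1-2-3 leap-year bug.
constexpr long kPhantomLeapDay = 60;

// Plain offset conversions. Serials below 1462 in the 1900 system come out
// negative here, because the 1904 system cannot represent those dates.
long to_1904_serial(long serial_1900) { return serial_1900 - kDateSystemOffset; }
long to_1900_serial(long serial_1904) { return serial_1904 + kDateSystemOffset; }

// Converts a 1900-system serial to a true count of days after 1899-12-31,
// removing the phantom leap day. Returns false for serials that name no
// real date (anything below 1, or the phantom day itself).
bool real_day_count(long serial_1900, long& days_out) {
    if (serial_1900 < 1 || serial_1900 == kPhantomLeapDay) return false;
    days_out = (serial_1900 > kPhantomLeapDay) ? serial_1900 - 1 : serial_1900;
    return true;
}
```

An importer built this way would do the conversion once at load time, so only one date system has to be carried through the rest of the program.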
Joel is usually spot on, but the advice he gave in the article is actually pretty terrible if you are going to have to generate any volume of Excel reports. Automating Excel is slow and unwieldy, and should not be hooked up to a server. You will be limited to a few workbook generation requests per second, and if you need to handle more, buying another Windows/Office license and load balancing is pretty awful. The only way that this might be workable is to set up a process that sits in the background with a "pool" of automated excel instances launched and waiting for work, so that when there is a high volume of requests, they get forwarded to different instances. Still not very scalable.
There are companies out there that have reverse engineered the file format (the one I have experience with is SoftArtisans ExcelWriter, which is buggy), but overall there will be no clean, scalable solution for this until Excel 2007/the Excel 2003 compatibility pack are more prevalent and you can just generate the XML to represent the workbook.
Unfortunately, I think I BSD released it... :-(
:-)
I know! I'll get Theo to rant at you!!!
It's interesting you give a nicely egotistical critique of a well-regarded expert's article, but don't suggest a single alternative to how M$ could have met their design goals, nor explain why the no-interoperability assumption was unreasonable at the time. If you can't appreciate the design goals, nor suggest a way to meet them, what's the point of the rest of your post?
In their (MS's) defense, I used to do that kind of thing back then too, (dumping memory structures straight to files instead of using extensible, documented formats), but then again I was 15 years old (in 1995) and still learning C.
Except for the "1995" part, wasn't that pretty much how Microsoft got started?
They haven't advanced from that point by much....
U+F8FF
Then what's with that 2Gb limit ? Or what's with the decision to use such formats for mail-storage and databases ?
Religion is what happens when nature strikes and groupthink goes wrong.
I've worked on some of these file formats quite a bit (I was the text conversion guy when WP went to Corel -- don't blame me, it was legacy code! ;-) ) Anyway, while the formats are quite strange in places, they aren't really that difficult to parse. I would be willing to speculate that this was never really much of a problem in writing filters for apps (or at least shouldn't have been).
;-) But that was a long time ago...
No, the difficulty with writing a filter for these file formats is that you have no freaking clue what the *formatter* does with the data once it gets it. I'm pretty sure even Microsoft doesn't have an exact picture of that. Hell, I barely ever understood what the WP formatter was doing half the time (and I had source code). File formats are only a small part of the battle. You have all this text that's tagged up, but no idea what the application is *actually* doing with it. There are so many caveats and strange conditions that you just can't possibly write something to read the file and get it right every time.
In all honesty I have at least a little bit of sympathy for MS WRT OOXML. Their formatter (well, every formatter for every word processor I've ever seen) is so weird and flakey that they probably *can't* simply convert over to ODF and have the files work in a backwards compatible way. And let's face it, they've done the non-compatible thing before and they got flamed to hell for it. I honestly believe that (at some point) OOXML was intended to be an honest accounting of what they wanted to have happen when you read in the file. That's why it's so crazy. You'd have to basically rewrite the Word formatter to read the file in properly. If I had to guess, I'd say that snowballs in hell have a better chance...
I *never* had specs for the word file format (actually, I did, but I didn't look at them because they contained a clause saying that if I looked at them I had to agree not to write a file conversion tool). I had some notes that my predecessor wrote down and a bit of a guided tour of how it worked overall. The rest was just trial and error. Believe it or not, occasionally MS would send up bug reports if we broke our export filter (it was important to them for WP to export word because most of the legal world uses WP). But it really wasn't difficult to figure out the format. Trying to understand how to get the WP formatter (also flakey and weird) to do the same things that the Word formatter was doing.... Mostly impossible.
And that's the thing. You really need a language that describes how to take semantic tags and translate them to visual representation. And you need to be able to interact with that visual representation and refer it back to the semantic tags. A file format isn't enough. I need the glue in between -- and in most (all?) word processors that's the formatter. And formatters are generally written in a completely ad hoc way. Write a standard for the *formatter* (or better yet a formatting language) and I can translate your document for you.
The trick is to do it in both directions too. Things like Postscript and PDF are great. They are *easy* to write formatters for. But it's impossible (in the general case) to take the document and put it back into the word processor (i.e. the semantic tags that generated the page layout need to be preserved in the layout description). That also has to be described.
Ah... I'm rambling. But maybe someone will see this and finally write something that will work properly. At Corel, my friend was put on the project to do just that 5 times... got cancelled each time
Wait, I thought you were trying to convince us that this doesn't reflect bad programming...
Wholly out of context, Batman! They made a design decision to ignore interoperability and optimized towards small memory space. What part of that is hard to understand? You think everything should be designed up front for interoperability, regardless of context? In the mid to late 80s, there just wasn't a huge desire for this feature, as Joel states.
but then again I was 15 years old (in 1995) and still learning C.
Ah, now your post makes sense. You completely lack perspective. The Word/Excel doc formats were around 10 years before you. You lack the knowledge about why dumping C data structures directly to disk was necessary--even though Joel spells it out. You don't understand what OLE truly solved (not just embedding spreadsheets inside of word, by the way). And most importantly, you seem to lack the ability to understand design trade-offs.
I think the design goals were flawed. That's my point. Their design goals should have included, how can we ensure that our customers' data will be (usefully) readable in the future? Sure, back then maybe it was worth it to skimp on validation in order to squeeze out a few extra microseconds of processing time, because the competition would avoid doing this and beat you with claims of efficiency. I guess we've all learned a lot about how to deal with data since the 90's. A big part of that was learning the importance of metadata. (i.e., tagged, extensible formats)
Anyways, just because it was done years ago, under different conditions, doesn't mean it wasn't bad programming. Maybe everyone else would have done it the same way, maybe I would have too. Still doesn't mean it wasn't bad programming. (I shouldn't say "bad programming" of course, the code could be fine for all I know.. I should say "bad design", in hindsight. Like a lot of things.)
By the way, "the no-interoperability assumption" is _always_ unreasonable. (IMHO of course.)
...or at least, not much, encoding type info is not what he intended. And you've just demonstrated AKAImBatman's point that "Programmers didn't understand why Hungarian originally used his famous notation". That's not to say Hungarian Notation is necessarily good or bad (I'm not arguing about it! heh), but you're not making your judgment on the facts.
In their (MS's) defense, I used to do that kind of thing back then too, (dumping memory structures straight to files instead of using extensible, documented formats), but then again I was 15 years old (in 1995) and still learning C.
Well, in 1995 Microsoft wasn't much older than you, so it's kind of understandable.
1. You have a web-based application that needs to output existing Word files in PDF format. Here's how I would implement that: a few lines of Word VBA code loads a file and saves it as a PDF using the built in PDF exporter in Word 2007. You can call this code directly, even from ASP or ASP.NET code running under IIS. It'll work. The first time you launch Word it'll take a few seconds. The second time, Word will be kept in memory by the COM subsystem for a few minutes in case you need it again. It's fast enough for a reasonable web-based application.
2. Same as above, but your web hosting environment is Linux. Buy one Windows 2003 server, install a fully licensed copy of Word on it, and build a little web service that does the work. Half a day of work with C# and ASP.NET.
So if you are on a Linux system, you are screwed. I think this article is written by some M$ fanboy. Nothing wrong here. But saying that Linux users should just dump their software and go for Microsoft stuff, just because "It's very helpful of Microsoft to release the file formats for Microsoft and Office, but it's not really going to make it any easier to import or save to the Office file formats"? I think it's wrong wrong wrong.
this is a good discussion. compatibility is important to us, not only from one system to another but also across time.
it has often seemed to me that proprietary solutions should be avoided for this reason.
i recently converted my Win 3.11 computer to XP. quite a move, but look how much i saved not doing all the interim updates!
i did have some documents in the old WordPerfect 5.1 format but I managed to acquire a program that will read these and write them as .rtf
I like .rtf and would like to see it become an ISO/ANSI standard
but think how many libraries are loaded with .xls and .doc files that will need to be converted to OOXML or risk becoming un-usable
hmmm
So that XP get exploited and thus puts Vista in better light...
While I was a contractor for a now defunct contracting company, we did a contract for Microsoft. This was pre-Windows 3.1. We did some innovations which I think became the basis for some of the OLE stuff, but I digress, Microsoft had a spec for its "Chunky File Format."
The office format based on the chunky file format does not have a format, per se. It is more similar to the old TIFF format. You can put almost anything in it, and the "things" that you put in it pretty much define how they are stored. So, for each object type that is saved in the file, there is a call out that says what it is, and a DLL is used to actually read it.
It is possible for multiple groups within Microsoft to store data elements in the format without knowledge of how it is stored ever crossing groups or being "documented" outside the comments and structures in the source code that reads it.
This is not an "interchange" format like ODF, it is a binary application working format that happens to get saved and enough people use it that it has become a standard. (With all blame resting squarely on M$ shoulders.)
It is a great file format for a lot of things and does the job intended. Unfortunately it isn't intended to be fully documented. It is like a file system format like EXT2 or JFS. Sure, you can define precisely how data is stored in the file system, but it is virtually impossible to document all the data types that can be stored in it.
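The "call out that says what it is, then a reader for that type" design described above can be sketched as a tagged-chunk walker. This is only a guess at the general shape, in the spirit of TIFF-like containers; the 4-byte tag, the little-endian length field, and the names `walk_chunks` and `Handler` are hypothetical and are not Microsoft's actual layout. A `std::function` stands in for the per-type DLL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Stand-in for "a DLL that knows how to read this object type".
using Handler = std::function<void(const std::uint8_t* data, std::size_t size)>;

static std::uint32_t read_u32le(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Assumed layout: [4-byte tag][4-byte little-endian length][payload], repeated.
// Returns the number of chunks that had a registered handler, or -1 if the
// buffer is truncated. Chunks with unknown tags are skipped, which is why a
// container like this can hold data that no single reader fully understands.
int walk_chunks(const std::vector<std::uint8_t>& buf,
                const std::map<std::string, Handler>& handlers) {
    std::size_t pos = 0;
    int handled = 0;
    while (pos < buf.size()) {
        if (buf.size() - pos < 8) return -1;            // header cut short
        const std::string tag(reinterpret_cast<const char*>(&buf[pos]), 4);
        const std::uint32_t len = read_u32le(&buf[pos + 4]);
        pos += 8;
        if (len > buf.size() - pos) return -1;          // payload cut short
        const auto it = handlers.find(tag);
        if (it != handlers.end()) {
            it->second(buf.data() + pos, len);
            ++handled;
        }
        pos += len;
        assert(pos <= buf.size());
    }
    return handled;
}
```

The container itself is easy to document; what each handler does with its payload is the part that lives only in source code.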
You're kidding right? That's been exactly Microsoft's marketing strategy for the last ten years. Remember the Win9X BSOD ads for Windows XP? Microsoft is in the difficult position where their only real competition is their own previous products.
Support Right To Repair Legislation.
At my company, our users do that every day. Excel spreadsheets embedded in Word or PowerPoint, Microsoft Office Chart objects embedded in everything. It's what made the Word/Excel/PowerPoint "Office Suite" a killer app for businesses. MS Office integration beat the pants off the once best-of-breed and dominant Lotus 1-2-3 and WordPerfect. When you embed documents in Office, instead of a static image, the embedded doc is editable in the same UI, and can be linked to another document maintained by somebody else and updated automatically. It saves tremendous amounts of staff time.
I didn't say this. I said I don't see why OLE documents being like file systems (according to TFA) means that they must necessarily be complex. i.e., I'm saying file systems aren't necessarily complex concepts, and therefore it's not an excuse for a convoluted file format. Anyways, maybe it's straining his analogy further than he intended, so I'll give you that.
What makes you think I don't understand it? It's still bad programming. Not that I have statistics, but there were plenty of examples of software that used the same or less memory than Word but managed to have better document formats.
No, I understand them. I just don't think they made the right trade-offs. It's not like they had no competition at the time, other companies that a lot of people other than me still claim had better software. Anyways it's sort of a moot argument, since what's done is done. We don't really need to write these formats any more, just read them.
It reminds me of a recovery effort I tried last year, trying to recover some interesting data from some files generated on a NeXT cube from years ago. I realized the documents were just dumps of the Objective C objects themselves.
IMO the powerful serialisation formats of modern languages are even worse than just dumping out C structs. If an app just dumps out C structs then you can probably figure out the binary format pretty quickly with just the source for the app and a pageful or so of information on the C compiler used. The application designer still has to pay some attention to file format design because structures containing pointers can't be saved directly.
For a modern serialisation format things are typically far worse: the app developer is less likely to pay attention to KISS when he can serialise any arbitrary graph of objects, and you need both the app's code and a load of information on how the language serialises stuff.
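The remark that structures containing pointers can't be saved directly can be illustrated with a short C++ sketch. One common workaround, assumed here for illustration, is to replace each pointer with an index into the saved array; `Node`, `DiskNode`, `flatten` and `rebuild` are hypothetical names, not anything from the Office formats.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {            // in-memory form: holds a raw pointer
    int   value;
    Node* next;
};

struct DiskNode {        // on-disk form: pointer replaced by an array index
    std::int32_t value;
    std::int32_t next_index;  // -1 means "no next node"
};

// Flattens a singly linked list. Assumes the list has no cycles.
std::vector<DiskNode> flatten(const Node* head) {
    std::vector<DiskNode> out;
    for (const Node* n = head; n != nullptr; n = n->next) {
        const std::int32_t idx = static_cast<std::int32_t>(out.size());
        out.push_back({n->value, n->next ? idx + 1 : -1});
    }
    return out;
}

// Rebuilds the list inside `nodes`, turning indices back into pointers.
// Returns false if any index points outside the array.
bool rebuild(const std::vector<DiskNode>& disk, std::vector<Node>& nodes) {
    nodes.assign(disk.size(), Node{0, nullptr});
    for (std::size_t i = 0; i < disk.size(); ++i) {
        nodes[i].value = disk[i].value;
        const std::int32_t nx = disk[i].next_index;
        if (nx == -1) continue;
        if (nx < 0 || static_cast<std::size_t>(nx) >= disk.size()) return false;
        nodes[i].next = &nodes[static_cast<std::size_t>(nx)];
    }
    return true;
}
```

Having to write `flatten` by hand is exactly the moment where a designer is forced to think about the file format, which a generic object-graph serialiser lets them skip.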
note: i'm known as plugwash most places but i screwed up registering that here somehow in the past and now can't register
The Hungarian thing - no, I still don't see it. Hungarian should not be used in any language which has a reasonable typing system;
A "typing" system doesn't help you read and understand the code. It doesn't give you any clues to the types of data being acted upon in a section of code. While I never bought in to the whole hungarian notation thing, at the time it was an "ism" that people went nuts about, it did address a specific problem with code readability. The concepts addressed by hungarian notation are still valid and some of the naming techniques are still also valid.
One can look at code and see "szKeyName" and know, without having to find the declaration, that it is a zero terminated character string used as a key. That's the crux of hungarian notation, but IMHO Microsoft went crazy with it and focused more on the notation and less on the naming, which actually made things harder to read. Like I said, I didn't go crazy, but even today I still try to incorporate some clue to the type of thing a variable represents in its name.
Hungarian notation is an example of a good idea in moderation that completely destroys itself when overused.
people ... of course it's impossible for anyone, including MS, to produce perfect code, structures or output, in anticipation of future developments. Clearly, coding is evolving in response to a weakness. Wasn't a new standard, XML, engineered for just this exact reason? If everyone would look beyond complaining and just implement engineering standards, we would all be ok. After all, it is what it is, just deal with it.
"Apps Hungarian", which adds semantic meaning (dx = width, rwAcross = across coord relative to window, usFoo = unsafe foo, etc) to the variable, not typing, is what is good and what he is advocating.
What is the justification for putting that semantic meaning into a variable name, instead of incorporating it into class definitions?
For example, if a string can be "safe" or "unsafe", why not have "SafeString" and "UnsafeString" classes that extend String, and use instances of those, instead of having instances of the base String class named 'sFoo' and 'usFoo'?
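The SafeString/UnsafeString idea in the question above could look roughly like this in C++. It is a sketch under assumptions: the classes wrap `std::string` rather than extend it (the standard string is not designed as a base class), "safe" is taken to mean HTML-escaped, and `render_paragraph` is a made-up consumer.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Raw, untrusted text (what Apps Hungarian would call usFoo).
class UnsafeString {
public:
    explicit UnsafeString(std::string raw) : raw_(std::move(raw)) {}
    const std::string& raw() const { return raw_; }
private:
    std::string raw_;
};

// Text that has been HTML-escaped (what Apps Hungarian would call sFoo).
// The only way to obtain one is through encode().
class SafeString {
public:
    static SafeString encode(const UnsafeString& in) {
        std::string out;
        for (char ch : in.raw()) {
            switch (ch) {
                case '&': out += "&amp;";  break;
                case '<': out += "&lt;";   break;
                case '>': out += "&gt;";   break;
                case '"': out += "&quot;"; break;
                default:  out += ch;       break;
            }
        }
        return SafeString(std::move(out));
    }
    const std::string& str() const { return text_; }
private:
    explicit SafeString(std::string text) : text_(std::move(text)) {}
    std::string text_;
};

// The compiler, not a naming convention, rejects passing raw input here.
static_assert(!std::is_convertible<UnsafeString, SafeString>::value,
              "an UnsafeString must be encoded explicitly");

std::string render_paragraph(const SafeString& s) {
    return "<p>" + s.str() + "</p>";
}
```

With this in place, `render_paragraph(UnsafeString("..."))` fails to compile, whereas with prefixes alone a misplaced `usFoo` is only caught by a careful reader.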
Why is Outlook missing from the released formats? I've spent some time reverse engineering meeting requests myself and I'd love to see the complete .msg file specification. You could find some useful information on MSDN already but it was nowhere near as complete as these releases appear to be.
Ok, I was going to respond to this but I will not get dragged into another one of these discussions. It's worse than tabs vs. spaces, I tells ya.
Since you're talking about C/C++ code though, I'm going to assert that that doesn't fall into the class of language I was talking about anyway. You're playing with essentially-untyped data there a lot more.
Interoperability goes both ways; this is often (and often deliberately) forgotten. There are a lot of programs that offer you the ability to import all manner of files or settings from other competing programs (just look at your favourite mail clients), but have no decent support for exporting the full data, as well. Same with web services and whatnot. You might just be trading in something bad for something worse if there is no avenue provided to export all the data into a standardized format, or at least a well-known one.
Ok, I was going to respond to this but I will not get dragged into another one of these discussions. It's worse than tabs vs. spaces, I tells ya.
I have to disagree, tabs and spaces are easily handled with an "indent" program.
On VERY LARGE projects where there are hundreds of include files and hundreds of source files, it is not convenient or even possible in all cases to find the definition of an object that may be in use.
Context and type information in the name makes it easier to quickly read a section of code:
for(int ndx=0; ndx < nLimit; ndx++)
{
pnUsrData[ndx] = pnReceived[ndx];
}
To anyone versed in your prefixing, it is easy to see pnUsrData is an array of integers, and we are assigning values from another array of integers.
However:
for(int ndx=0; ndx < nLimit; ndx++)
{
pnUsrData[ndx] = foobar[ndx];
}
In the above, it is clear we are assigning data to elements in an integer array from a subscript on an object, but what kind of object? Where do we find its definition?
Now, renamed it looks like this:
for(int ndx=0; ndx < nLimit; ndx++)
{
pnUsrData[ndx] = mytypeFoobar[ndx];
}
Now we can see it is a "mytype" object and we can easily find its reference and declaration.
That's what Hungarian notation provides, and it is not useless. IMHO, it was its overzealous use that made code less readable. Rather than give hints, zealous proponents attempted to create a whole new language for specifying variable and function names that was virtually impenetrable.
It's funny, I've argued in the past that Java's very verbose typing has advantages in exactly the way you list in your post. In the case of Java, in fact, you wouldn't need the type warts since the types would be readily available.
No, XML is indeed that wonderpill. Not because it's some magic format, but because it's open and human readable, not some obfuscated binary format like .DOC . The apps doing the import will be open as well. And if they "die off" later, it's because no one is using them, so who cares? The rare need in the distant future for reading whatever does get left behind in those formats will be served for whichever archivist needs it by the more recent open converter apps that should still be archived somewhere, too.
But really we agree. That "standardized format" you prefer is XML. ODF is XML. That's what I'm talking about, and it seems what you're talking about, too.
--
make install -not war
If you reply, do so only to what I explicitly wrote. If I didn't write it, don't assume or infer it.
Hard to believe the programmers who did it that way were doing exactly the right thing. Separating data and representation is a basic programming skill.
You better believe it costs Microsoft quite a bit to keep it around. At the lowest level, having the codebase that big means the tools and practices needed to manage it have to be equal to the task. Here's a hint: MS does not use SourceSafe for the Office codebase. (They use the Team tools in Visual Studio, so they do eat their own dogfood, but not the lite food).
Far more insidious is the technical debt incurred by carrying around that backwards compatibility with Version-1-which-supported-123-bugs-and-all. Interdependencies that mean a bug either can't be fixed without introducing regressions, or can only be fixed by dint of a complex scheme involving things like the 1900 vs. 1904 epoch split that Joel discusses.
Oh yes, it costs a small fortune to carry around that baggage, and only a company as big as Microsoft with Microsoft's revenues can afford it. The price might seem like 'nothing' in the billions of dollars that flow in and out of Microsoft, but ignoring the elephant in the room doesn't make the elephant go away.
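For anyone who hasn't run into the epoch split, here is a rough C++ sketch of what it means in practice. It assumes the commonly described behavior: the 1900 date system counts from a base that includes a phantom 1900-02-29 (kept for Lotus 1-2-3 compatibility), while the 1904 system treats 1904-01-01 as day 0. The function names and the constant are mine, not anything from the released specs.

```cpp
#include <cassert>

// Days since 1970-01-01 for a proleptic Gregorian date
// (the well-known days-from-civil algorithm by Howard Hinnant).
long days_from_civil(long y, unsigned m, unsigned d) {
    y -= (m <= 2) ? 1 : 0;
    const long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<long>(doe) - 719468;
}

// Serial number in the 1900 date system. Only valid from 1900-03-01 on:
// earlier serials are shifted by the phantom leap day, which this
// sketch does not model.
long serial_1900(long y, unsigned m, unsigned d) {
    const long days = days_from_civil(y, m, d);
    assert(days >= days_from_civil(1900, 3, 1));
    return days - days_from_civil(1899, 12, 30);
}

// Serial number in the 1904 date system, where 1904-01-01 is day 0.
long serial_1904(long y, unsigned m, unsigned d) {
    return days_from_civil(y, m, d) - days_from_civil(1904, 1, 1);
}

// The two systems differ by a constant for any date both can represent.
const long kEpochOffsetDays = 1462;
```

The arithmetic itself is trivial. The expensive part is that every reader and writer has to know which system a given workbook uses, forever.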
I've seen multiple links to Joel's advice. It's bad advice. He is talking out of his ass.
Do not under any circumstances run a server that automates Microsoft Office unless you can afford to pay an intern or maybe a homeless person to babysit the server 24/7. They will have to close dialog boxes when it gets stuck waiting for user input and reboot the server occasionally because of memory leaks. Anyone that has tried to do anything with the file formats has gone this route and given up.
There is a very robust and competitive market for third-party developer components that read and write Office file formats for most popular development platforms. This is the way to go. Or use Apache POI if you can put up with the missing features.
Actually, when possible, you should do both. Hungarian notation is a grammar. In the same way that English has rules for writing which include capitalizing the first letter of a sentence, proper names, and so on, Hungarian notation provides visual cues to programmers that make certain types of semantic errors "sTanD oUt." There's nothing particularly unusual about the text "sTanD oUt," and its meaning does not change by writing it that way, but it violates the English grammar and your brain's pattern recognition identifies it as an outlier. So too with Hungarian notation. Code that does not use at least some form of Hungarian notation looks devoid of the meta content I expect my fellow programmers to provide, namely what decisions they've made, and whether the code conforms to those decisions. To someone accustomed to Hungarian notation, finding "double fValue;" or "if (uCount < 0)" in the code prompts the eye to linger, the brain to reparse. Ultimately, many conceptual errors are identified and resolved this way, even if the compiler fails to catch them.
Also, like any grammar, the rules depend on the circumstance and should be followed in order to resolve an existing problem or ambiguity. Fully qualifying a variable name "caiIndex" to imply "constant array index" is silly. That is cargo cult mentality. Any of the following would be fine according to the guidelines at my company and each reflects a different decision by the coder: "int nIndex;" "unsigned int uIndex;" "index_t index;". The first works best if the index will be used backwards and the loop constraint is that the index is non-negative. The second works best if the index is random access, so that functions that use it can check the range with one comparison rather than two. The last case indicates that the semantics and nature of the index could be dependent on a variety of factors including processor architecture, and care should be taken. Therefore, the code "--nIndex," "++uIndex," and "next_index(&index)" look correct while "for (uIndex = 4; uIndex >=0; --uIndex)" looks very bad, and "++index" should make one immediately recognize that any of the following are possible: 1) the ++operator has been overridden, 2) index_t is typecast to an integer type, or 3) this won't compile, as would be the case if index_t were a struct.
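To spell out why "for (uIndex = 4; uIndex >=0; --uIndex)" looks very bad: an unsigned value is never negative, so the test is always true and the decrement from 0 wraps around. Here is a small C++ sketch of the two countdown styles; the function names are hypothetical and only there to make the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Signed index: the "nIndex >= 0" test can fail, so the loop ends.
std::vector<int> reversed_signed(const std::vector<int>& v) {
    // Precondition: the size must fit in an int for this style to be valid.
    assert(v.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    std::vector<int> out;
    for (int nIndex = static_cast<int>(v.size()) - 1; nIndex >= 0; --nIndex) {
        out.push_back(v[static_cast<std::size_t>(nIndex)]);
    }
    return out;
}

// Unsigned index: test first, then decrement, so the loop stops at 0.
// Writing "uIndex >= 0" here would always be true and the loop would
// never terminate.
std::vector<int> reversed_unsigned(const std::vector<int>& v) {
    std::vector<int> out;
    for (std::size_t uIndex = v.size(); uIndex-- > 0; ) {
        out.push_back(v[uIndex]);
    }
    return out;
}

// Demonstrates the wraparound that makes "uIndex >= 0" a tautology.
bool unsigned_zero_wraps() {
    unsigned int uIndex = 0;
    --uIndex;
    return uIndex == std::numeric_limits<unsigned int>::max();
}
```

Many compilers will warn about the always-true comparison if warnings are turned up, but the prefix makes it visible without relying on that.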
And so, after 28 years of programming, dealing with all different styles of C and C++, I've come to recognize that understanding and using Hungarian notation correctly is a skill. Your productivity increases as you use it, eventually you don't even notice it, and the benefits come later, particularly when refactoring, or making changes to older code, especially if written by someone else. Like syntax highlighting for your brain, if you use it long enough, you'll know when there's an error in the code without having to compile it because it will look wrong. Supposedly for lisp programmers, the same epiphany comes when you no longer see the parentheses.
Happy Programming,
-Hope
OK, *you* design a system that allows incremental saves to a floppy disk that are fast enough for people to do it often.
dom
No, XML is indeed that wonderpill...because it's open and human readable, not some obfuscated binary format like .DOC
ASCII does not mean human-readable. Instead of an obfuscated binary format, XML documents end up in an obfuscated text format.
i'd hit it so hard, if you pulled me out you'd be the king of britain [bash.org]
And some humans are illiterate.
XML is human readable because it doesn't require a machine (or superhuman skills) to read its meaning. Its field names and structure are embedded in the data, not only in a decoder context. Of course it's up to the person specifying the XML dialect to make those tags and structure comprehensible by a normal person, but that's not the format's defect. Any format can be obfuscated by design or carelessness. XML is harder to do that in. And even the most basic tools that render XML data according to their embedded schema make most XML self-evident. And that's not cheating: even this post in English requires a reader app, as does all stored data.
--
make install -not war
I'm pretty sure MS doesn't know what its Word formatter does, and I even have proof for it:
:-(
If I switch printers between the Adobe PDF and the HP printers at the office, the layout of my documents I edit in MS Word 2003 changes slightly (line lengths, row breaks, distance between rows, etc). This has been a major issue when I've had to submit papers etc. and switched to the PDF printer from the HP printers (I like to read drafts on paper as it is easier to correct), just to see the paper that I had just crafted to barely fit under the 8 page limit now is 11 pages long (with 4 of them left half blank due to formatting issues).
I have a really elegant proof for Fermat's last theorem. If this sig was only a bit longer...
Weren't these specs released in response to criticisms about unspecified aspects of OOXML? It makes reference to legacy behaviors as implemented in various Microsoft (and, in a few cases, non-Microsoft) products, and I suppose these specs were supposed to help since they more or less specify some of that stuff.
But Joel basically tells us not even to bother trying to implement them. They were designed to be fast and to rely on Windows libraries, they're burdened by decades of legacy, and they were never intended to provide interoperability, he says. We should just use Office.
What does that say about OOXML? When you take these lock-in document formats and just translate them to XML, how does that help anyone? As OOXML's opponents have said time and again, it is a "standard" that will be meaningfully implemented by exactly one party, Microsoft, and it will do nothing to promote interoperability.
It's a pity Joel didn't address this, but it's not hard to connect the dots.
And I think your ability to assess another's work is flawed courtesy of an oversized ego. That was my point.
You have yet to provide an alternative solution to the problem. Given that one constraint is memory, your inability to be concise suggests you're not capable of coming up with one either. Certainly your "squeeze out a few extra microseconds" comment suggests you have absolutely no clue what you are talking about. Yet you persist in calling it bad design. You are strangely smug about what was quite possibly an implicit assumption forced by tough constraints, with no actual interoperability requirements, at a time when they were rarely offered let alone expected. I would stop using "IMHO" - clearly there is nothing humble about your opinion.
Why the bit about metadata, out of interest? It's as if you think the more irrelevant things you can fit into the post, the more we're supposed to be impressed.
That is not a bug, it's a feature. I once helped someone with a problem where Word would crash (segfault) a few seconds after displaying the empty starting page. I tracked the actual faulting to a buggy HP driver that couldn't deal with printers that were connected to a powered off machine on a wireless network. Switching to a different printer fixes it.
Yes, MS Word makes calls to the printer driver while you're working and has its pagination algorithm adapt to its characteristics.
This is caused by the "WYSIWYG" feature. Your HP printer driver is probably set to choose fonts that are "close" to the ones Windows uses, but are instead native fonts for the HP printer. Your PDF uses the Windows, and/or Adobe, fonts directly. Word uses the printer driver settings while you're editing, and if you change printers, the document repages with any different native fonts.
In Windows 2000, you can open the printers control panel, choose "printing preferences" on your HP, poke the "Advanced..." button, and tell it to "Download as SoftFont". This should make changing between PDF and printer less painful, at the expense of increased memory usage and time to print with the HP. For the real advanced version, you can try and find which Adobe fonts are exactly the same size as the HP native ones, and tell the PDF writer to use those.
Yeah, I missed the part that it's supposedly a feature in my comment. Still it's quite obvious that the behaviour of MS' formatter is heavily dependent on how HP's printing driver works. Of course MS doesn't know how all printer makers' drivers work, so that makes the behavior of the formatter unspecified. If MS calls it a "bug" or a "feature" I don't really care - to me it's an obvious design error, nothing else.
I have a really elegant proof for Fermat's last theorem. If this sig was only a bit longer...
Well, wait.
This is just OOXML without the angle brackets, isn't it?
My blog: http://www.seebs.net/log/ --- My iPhone/iPad app: http://www.seebs.net/seebsfrac/
They should rename it to WYSIWYGALAYDSP, methinks (What you see is what you get as long as you don't switch printers)... :-(
I have a really elegant proof for Fermat's last theorem. If this sig was only a bit longer...
Now, why would you preserve crufty code and file formats if you didn't have to? The only reason *I* can think of is to preserve backwards compatibility, which is a *major* user requirement!
How about this interpretation of MS's actions? They've offered an open, XML-based format for document storage. They've also just shared with us the old, crufty, proprietary formats they've historically used for document storage. I think they actually *want* to move away from their old proprietary methods and use an open format. They've clearly been reading the tea leaves and realize that a universal office document standard is in the works. All they're trying to do is make sure they're ready for it.
As far as OOXML vs. ODF, my guess is that they're pushing their own open document format because the alternative doesn't offer the functionality that their apps require, and if a universal standard is going to be adopted, they simply want it to be expressive enough to work with their applications.
Maybe MS does now want to move away from their old proprietary formats into new open ones. But the old formats were built over decades without that goal.
If you read Spolsky's analysis linked from this Slashdot story summary, you'll see how the formats "evolved" ("devolved" more like) within MS goals often dictated by their unique marketing position.
The summary also points out with links to why this release might not actually indicate MS is really releasing their formats to break with that past after all.
--
make install -not war
That's why I often print to PDF, then print the PDF...
Benford's Corollary to Clarke's Law: "Any technology distinguishable from magic is insufficiently advanced."
No argument there.
The summary also points out with links to why this release might not actually indicate MS is really releasing their formats to break with that past after all.
No. The article doesn't make that claim. That's your own interpretation. The overall intent of the article is simply to convey a few simple points:
1) Why the MS office document format is so crufty (minus conspiracy theories).
2) How to work *with* the Windows OS to use those documents.
3) How to use better, more open, alternatives to creating office documents.
Nothing in the article contradicts anything I said earlier.
--
make install -not war
Yeah, I've also made sure to use that silly workaround. Fortunately, my work computer is equipped with Adobe...
I have a really elegant proof for Fermat's last theorem. If this sig was only a bit longer...
>> Separating data and representation is a basic programming skill
Since when? I have to say that in over 25 years of this stuff I never heard that as a basic programming skill.
There are applications (like html/web) where it is a good idea, but most of those are fairly recent (like the last 10 years? Even HTML was originally designed to be all together).
But for a word document, what do you think is stored in the file? Data or presentation?
I'll give you a little hint: if you only want the data, store it in a text file. If you want the document formatted, then store both so it is available.
Before jumping up and down gleefully, those working on related open source efforts, such as OpenOffice, might want to take a very close look at Microsoft's Open Specification Promise to see if it seems to cover those working on GPL software; some believe it doesn't.
From MS's own mouth - and mind you that these quotes probably had to be vetted by a billion lawyer-types to ensure that MS wouldn't incur any sort of bizarre liability fifty years down the road by saying them. Based on what is said here, the only other thing that MS reserved is the ability to sue anyone who sues them for violating the patents that they already own, and are releasing to the public. That would be kind of like placing a legal disclaimer on your Halloween candy bowl: "Attention: You can all take as much candy from this bowl as you want, and I legally give up my right to prosecute anyone taking candy from this bowl of Theft, forever. But if any of you accuses me of Theft for eating candy from *my own candy bowl,* then I reserve the right to accuse that person (and *only* that person) of Theft, too." Here's a few pertinent excerpts:
Q: Is the Open Specification Promise intended to apply to open source developers and users of open source developed software?
A: Yes. The OSP applies directly to all persons or entities that make, use, sell, offer for sale, imports and/or distributes an implementation of a Covered Specification. It is intended to enable open source implementations, and in fact several parties in the open source community have specifically stated that the OSP meets their needs. Moreover there are already a significant number of implementations of Covered Specifications that have been created and/or distributed under a variety of open source licenses as well as under proprietary software development models. Because open source software licenses can vary you may want to consult with your legal counsel to understand your particular legal environment.
Q: Is this Promise consistent with open source licensing, namely the GPL? And can anyone implement the specification(s) without any concerns about Microsoft patents?
A: The Open Specification Promise is a simple and clear way to assure that the broadest audience of developers and customers working with commercial or open source software can implement the covered specification(s). We leave it to those implementing these technologies to understand the legal environments in which they operate. This includes people operating in a GPL environment. Because the General Public License (GPL) is not universally interpreted the same way by everyone, we can't give anyone a legal opinion about how our language relates to the GPL or other OSS licenses, but based on feedback from the open source community we believe that a broad audience of developers can implement the specification(s).
Q: I am a developer/distributor/user of software that is licensed under the GPL, does the Open Specification Promise apply to me?
A: Absolutely, yes. The OSP applies to developers, distributors, and users of Covered Implementations without regard to the development model that created such implementations, or the type of copyright licenses under which they are distributed, or the business model of distributors/implementers. The OSP provides the assurance that Microsoft will not assert its Necessary Claims against anyone who make, use, sell, offer for sale, import, or distribute any Covered Implementation under any type of development or distribution model, including the GPL. As stated in the OSP, the only time Microsoft can withdraw its promise against a specific person or company for a specific Covered Specif
DocRuby = poop
Love,
your slashstalker
Sorry, but no, XML is not "open" and "human readable". Without proper format documentation, it's every bit as opaque as a binary format.
XML is not a standardized document format. ODF is a standardized document format with documentation and semantics. XML is just the language it's expressed in.
For instance, the following is a proper XML tree :
aa
Without any documentation on what those tags mean, it's every bit as opaque as
$$!51%5g33F1 (admittedly a bad analogue to a truly "binary" format, but you get the idea).
Sure, you can build a tree out of that. That is still useless, especially considering that you can put arbitrary linking formats into attribute or element values.
Hell,
qualifies.
Without a proper XML Schema to go with your XML document, you have nothing. And even IF you have a XML Schema, without documentation, you can only use it to validate stuff against it. And even IF you have documentation, it will have to be accurate. XML alone is not a silver bullet. OOXML all but proved that already.
Of course, Slashdot ate my markup.
<a><b><c /><d e="f" g="h" /> </b></a>
and
<a> <![CDATA[ SOMETHING REALLY SCARYLOOKING HERE ]]></a>
would be the code snippets.
No, XML is indeed that wonderpill...because it's open and human readable, not some obfuscated binary format like .DOC
ASCII does not mean human-readable. Instead of an obfuscated binary format, XML documents end up in an obfuscated text format.
--
"So wait, who protects the people from their government?"
"Terrorists."
"...oh."
Re:Don't Adopt. Convert. (Score:2)
by Doc Ruby (173196) on Wed Feb 20, '08 01:26 PM (#22491236) Homepage Journal
And some humans are illiterate.
XML is human readable because it doesn't require a machine (or superhuman skills) to read its meaning. Its field names and structure are embedded in the data, not only in a decoder context. Of course it's up to the person specifying the XML dialect to make those tags and structure comprehensible by a normal person, but that's not the format's defect. Any format can be obfuscated by design or carelessness. XML is harder to do that in. And even the most basic tools that render XML data according to their embedded schema make most XML self-evident. And that's not cheating: even this post in English requires a reader app, as does all stored data.
--
make install -not war
Clearly, you are retarded and can't spell worth shit. No wonder you drool. Most institutionalized head cases do.
DOC is still not open. You can't get it by asking for it with an email. Not sure where you got that piece of information.
You are very much mistaken, sorry. XML is not that wonderpill. It's a tool. It can be abused.
Structural properties are WORTHLESS if you do not know what they mean. It's cool to build a tree, but without semantic meaning, you really have nothing. Nothing is self-evident if the designer hasn't taken care to make their format transparent.
Clearly you don't see that point so it does not bear discussing further. I wish you luck trying to figure out what a,b,c,d,e,f,g, k,m, i,l, x,y, and z mean as tags. Really, I do.
Both of you forgot Borland Java Builder was the better Java Windows RAD
Look, I'm sitting at a fucking computer. I know how to use a hex editor. I'm a programmer, I know how to write programs to do what I want. If, as the article states, MS office formats are designed to be copied directly into C structs, then that makes parsing simpler, not harder. I'm not going to load that fucking office document into my brain, so human-readable means absolutely nothing to me. I'm going to load it into a computer. And unless the file-format is designed with interoperability in mind, making it XML won't help one single bit. All XML would mean is that in addition to all the other work I have to do, I also need an XML parser.
Yo, XML is for fucking permanent storage, not your fucking transitory C struct. Dumping your fucking C struct to your fucking computer will leave you fucking scratching your head in 5 years when you try to fucking decipher the fucking thing. Or you can use fucking XML and have a fucking clue later when you try to use that data on some other fucking platform where your other code won't fucking run and you don't want to fucking pore through the source to decipher the fucking one-off format you made to import Word data into that fucking program you never used again. Got it, fucker?
--
make install -not war
Of course it's up to the person specifying the XML dialect to make those tags and structure comprehensible by a normal person, but that's not the format's defect. Any format can be obfuscated by design or carelessness. XML is harder to do that in. And even the most basic tools that render XML data according to their embedded schema make most XML self-evident. And that's not cheating: even this post in English requires a reader app, as does all stored data.
Of course, my post was human readable, but since you can't even understand written English enough to stop making the spurious point that a human can make even a readable format unreadable, I don't expect you to accept that there is even such a thing as human readable.
Sorry you couldn't benefit from that simple insight no matter how easy to read I made it. Your loss.
--
make install -not war
When you're posting at -1, you can't afford to waste the limited number of posts you get per day. I've noticed that challenges to your assertion that non-Open software manufacturers upload and read the search indexes of users' PCs have gone unanswered.
You might want to respond to that sort of thing if you want to get out of this deep hole you're in. Or maybe apologize for lying to your fellow Open Source advocates.
You do realize that Slashdot karma is earned, right?
If the problem is the complexity of the format itself, embedding human-readable names for each field into the file-format isn't going to reduce the complexity one bit. And if you already have a specification (even if it's reverse engineered), human-readability or embedded field-names are not of importance. Granted, XML doesn't make it any more complicated, so it doesn't hurt much, but a straightforward translation of word or excel file-formats into XML is not particularly helpful. XML has its uses. Relying on it like it was magic is not one of them. First and foremost because it isn't magic.
I've written tons of library code for reading and writing old proprietary binary file-formats on newer incompatible computers with different byte-order, different floating point formats, etc... It's not particularly hard. Anyone can do it. It's code-monkey stuff, or a computing 101 exercise, not something you even need real programmers to tackle. The problem with office formats isn't that they're binary, it's that they're complicated.
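For what it's worth, the byte-order part really is routine. Here is a minimal C++ sketch of the usual approach: assemble integers from individual bytes instead of copying the file into a struct, so the code behaves the same on any host. The RecordHeader layout (a 16-bit type followed by a 32-bit length, both little-endian) is a made-up example, not a claim about any actual Office record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a 16-bit unsigned integer stored least-significant byte first.
std::uint16_t read_u16_le(const std::vector<std::uint8_t>& buf, std::size_t offset) {
    assert(offset + 2 <= buf.size());
    return static_cast<std::uint16_t>(buf[offset] | (buf[offset + 1] << 8));
}

// Read a 32-bit unsigned integer stored least-significant byte first.
std::uint32_t read_u32_le(const std::vector<std::uint8_t>& buf, std::size_t offset) {
    assert(offset + 4 <= buf.size());
    return  static_cast<std::uint32_t>(buf[offset])
         | (static_cast<std::uint32_t>(buf[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(buf[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(buf[offset + 3]) << 24);
}

// Read a 32-bit unsigned integer stored most-significant byte first.
std::uint32_t read_u32_be(const std::vector<std::uint8_t>& buf, std::size_t offset) {
    assert(offset + 4 <= buf.size());
    return (static_cast<std::uint32_t>(buf[offset])     << 24)
         | (static_cast<std::uint32_t>(buf[offset + 1]) << 16)
         | (static_cast<std::uint32_t>(buf[offset + 2]) << 8)
         |  static_cast<std::uint32_t>(buf[offset + 3]);
}

// Hypothetical record header, used only for illustration.
struct RecordHeader {
    std::uint16_t type;
    std::uint32_t length;
};

// Parse the hypothetical header field by field; no struct padding or
// host byte order is involved.
RecordHeader parse_header_le(const std::vector<std::uint8_t>& buf, std::size_t offset) {
    RecordHeader h;
    h.type = read_u16_le(buf, offset);
    h.length = read_u32_le(buf, offset + 2);
    return h;
}
```

None of this touches the hard part, which is knowing what the thousands of record types mean and how they interact.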
All these problems have been solved a long time ago, really. Unless one is coding in Notepad (but why??), Hungarian notation serves no purpose whatsoever.
No, it's not. You hover the mouse over the variable name in your IDE
Noob, listen. Being able to read something without having to hover the mouse is far, far easier. Otherwise, every time I come across a variable I have to reach out and grab the mouse, hover over the variable and hope that the IDE can find it, which doesn't always work with abstract types. It's like actually having a vocabulary when you read a book instead of having to consult a dictionary every time you get a word with more than 6 letters.
Secondly, Visual Studio is probably the LEAST productive environment I have ever worked in. It is nothing more than the GUIfication of Microsoft's PWB which was universally referenced as "Programmers Waste Basket."
Lastly, most software development in the world is NOT ON Windows.
look at some binary formats you do not know in decent viewers
XML does not have embedded schemas. It is semi-structured, true -- but that does not a schema make. You get a tree of nodes. That, in and of itself, does not make the contents self-evident. By that logic, I should be able to deduce what a btree stores if only I know that it's a btree.
And that's not cheating: even this post in English requires a reader app, as does all stored data.
You keep coming back to this analogy, and I fail to see how it applies. Sure, if you get a data format containing this post in one of its nodes (or even a nice tag soup with markup in it), you can probably deduce that it's a post and you can read what it says. Same is true of a pure binary format containing this text. The issue becomes less clear when you are talking about data that does not appear as utf8-text in its "natural" form.
Of course, my post was human readable, but since you can't even understand written English enough to stop making the spurious point that a human can make even a readable format unreadable, I don't expect you to accept that there is even such a thing as human readable.
If in doubt, go ad hominem, eh? I simply don't agree with your premise.
Sorry you couldn't benefit from that simple insight no matter how easy to read I made it. Your loss.
Getting defensive, are we? Ah well, it seems the art of discourse is lost on you. Your loss, as it were.
It may be faster, but you waste time prefixing the variables in the first place.
There is no logic in this statement, typing 2 or 3 additional characters is hardly even measurable.
You don't. It's the easiest way, but not the fastest. Every IDE I've seen has some shortcut for that. Some also have the "Definition" window at the bottom, which is always synced with whatever is under the cursor.
As opposed to simply looking at it and knowing? There is no rational argument that can claim that an alternate contextual lookup of information is easier than just seeing it in context. No one with any intellectual integrity can pursue such an argument.
It does work just fine with all kinds of types short of specialized templates of templates and the like.
I have had many situations where it does not and can not correlate the variable with the definition. Surely you are not saying that it works 100.00% of the time, are you?
"Lastly, most software development in the world is NOT ON Windows."
References?
Bureau of Labor Statistics. Excluding unaccounted open source Linux people, the majority of software jobs are system, web, embedded, and scientific, with applications programming far down on the list. Remember, every electronic device that has a blinking LED has a computer in it, and that takes software development. There are far more cell phones than there are P.C.s. There are far more microwave ovens and far more automobiles (which have computers in them). There are far more of every kind of electronics than P.C.s, and WinCE is a small percentage of that.
Like they said on T.V. "In Mayberry, he's world famous."