It occurred to me on the drive home that perhaps the reason people don't realise what a bad idea the 32 bit long is, is that they don't understand the implications of the difference. Both sizes for a long can require changes to code if the code has always been 32 bit, but there is a difference between the types of errors that occur with one approach as against the other.
A key thing to remember is that when undertaking a porting exercise you want any errors to show up at the earliest possible stage. It's better to catch the error in development than in QA. It's better to catch it in QA than to have the customers encounter it in a released version and then have to cop the help desk load and issue patches.
Having a 64 bit long means that you get two types of errors:
Structure alignment errors - because a structure that uses a long will change in size, and because any elements after the long will move. These occur predominantly in persistent data (save to disk) and IPC - in fact they are the types of errors the document you linked to mentions as being the impetus for the 32 bit long.
Sign extension errors - when an unsigned value that has been held in an int is transferred to a long, and then to an unsigned long, the result will have the high 32 bits set whenever the top bit of the original value was set, giving a different number. There are other sequences that can produce similar problems, but they all involve having the high 32 bits be set or unset such that a later transition between signed and unsigned gives an incorrect result.
Structure alignment problems show up early on. Usually in development, but if not in development then any reasonable QA program should find them, and if that doesn't then the beta almost certainly will. They show up early because they will normally cause variables to be populated from data that has next to nothing to do with the variable. If they don't cause a hardware exception when interacting with the 32 bit version, they're very likely to produce anomalous output or to cause a failure reading saved files.
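A minimal sketch of the layout shift, assuming a made-up two-field record. The struct names are hypothetical, and int32_t / int64_t stand in for "long" under the two models so that both layouts can be compared in a single build:

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record as a build with a 32 bit long would lay it out.
   int32_t stands in for "long id". */
struct record_long32 {
    int32_t id;
    int32_t flags;
};

/* The same declaration as seen by a build where long is 64 bits wide. */
struct record_long64 {
    int64_t id;
    int32_t flags;
};

/* Offset of the field that follows the long in each layout. */
size_t flags_offset_long32(void) { return offsetof(struct record_long32, flags); }
size_t flags_offset_long64(void) { return offsetof(struct record_long64, flags); }

/* Total size of each layout, including any tail padding. */
size_t size_long32(void) { return sizeof(struct record_long32); }
size_t size_long64(void) { return sizeof(struct record_long64); }
```

A check such as assert(flags_offset_long32() == 4) and assert(flags_offset_long64() == 8) shows the second field moving. A file written with one layout and read with the other puts unrelated bytes into flags, which is why this kind of breakage tends to be noticed almost immediately.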
Sign extension problems, where they turn into manifested bugs (a lot of the time the bug will be entirely latent due to other side-effects taking hold) tend to show up quickly because they result in values radically different to what is expected. They're also (in my experience) the rarest of the bugs caused in ports to 64 bit environments.
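One sequence of this kind, as a sketch rather than a catalogue. The function names are hypothetical, int64_t / uint64_t stand in for a 64 bit long and unsigned long, and the conversion of an out-of-range value to int32_t is implementation-defined in C (it wraps on the common compilers, which is what this assumes):

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

/* An unsigned 32 bit value is parked in a signed int, widened to a
   64 bit long, and finally treated as unsigned again. */
uint64_t widen_via_signed(uint32_t raw)
{
    int32_t as_int = (int32_t)raw;  /* implementation-defined above INT32_MAX; assumed to wrap */
    int64_t as_long = as_int;       /* sign extension happens here */
    return (uint64_t)as_long;       /* high 32 bits are now all 1s if bit 31 was set */
}

/* The same value widened without passing through a signed type. */
uint64_t widen_direct(uint32_t raw)
{
    return (uint64_t)raw;           /* zero extension, value preserved */
}
```

With these, assert(widen_direct(0x80000000u) == 0x80000000ULL) holds, while widen_via_signed(0x80000000u) comes back as 0xFFFFFFFF80000000 on compilers that wrap. Small values such as 5 pass through both paths unchanged, but once the top bit is set the result is so far from the expected value that it is hard to miss.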
One thing this model won't cause is loss of bits on transfer from a long to an int, because any additional bits in the 64 bit integer weren't being used in the 32 bit integer anyway.
Having a 32 bit long gives rise to one type of error:
Bits get lost transferring from a 64 bit value to a 32 bit value.
The problem with this type of error is that it will only manifest when the high 32 bits are neither all 1s nor all 0s. Such errors may not, and indeed probably will not, show up at all during development, or during testing, or during QA. They are the most likely errors to make it into released software.
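A small sketch of why this stays hidden, using hypothetical helper names. int64_t stands in for the 64 bit value and int32_t for a 32 bit long, and the final conversion back to a signed type is implementation-defined in C (assumed to wrap, as it does on the common compilers):

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Keep only the low 32 bits, as happens when a 64 bit value is
   assigned to a 32 bit long. */
int32_t narrow_to_32(int64_t wide)
{
    return (int32_t)(uint32_t)wide;  /* uint32_t step is well defined; signed step assumed to wrap */
}

/* True when the value comes back unchanged after narrowing,
   i.e. when the bug would stay invisible. */
bool survives_narrowing(int64_t wide)
{
    return (int64_t)narrow_to_32(wide) == wide;
}
```

Typical test data such as 1000 or -1 satisfies survives_narrowing, so nothing looks wrong. A value like 0x100000005 does not: narrow_to_32 quietly returns 5, and that only happens once real data grows past 32 bits, which is usually after release.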
Given the choice, the errors caused by a 64 bit long are to be preferred to those caused by a 32 bit long because the ones caused by a 64 bit long will be discovered earlier.
The 32 bit long may give an initial appearance of convenience because you don't see as many errors during development, QA and beta, but this is giving a false sense of security - you want the errors to be there, not in the field.
Forgive me, but I can't think of any examples of this in Win32
SendMessage is the most obvious example. What's passed in lParam for LB_ADDSTRING? You'll notice that LPARAM is defined as a 64 bit integer. Similarly, LRESULT had to be defined as a 64 bit integer (for things like LB_GETSTRING).
The reason Microsoft chose LLP64 was to ensure transparent portability between 32 and 64 bit compiles of Win32 apps.
They may claim that as the reason, but the choice really doesn't assist that at all. In fact if anything it impedes it.
An application is far less likely to break due to a larger than expected capacity in an integer type than a smaller than expected capacity. And the places where breakage is likely (such as in structures that require a specific size due to being written directly to disk, for instance) are generally more likely to be found quickly through testing than places where being too narrow is a problem.
After living through the 16 to 32 bit migration, I favor int32, int64, size_t, wchar_t, etc.
I also use such types, but it doesn't guarantee that every developer on a complex project has always done so religiously, and there are bound to be places where this won't have helped.
All Microsoft APIs, and most third party APIs already use "Windows" types (BYTE, WORD, DWORD, QUADWORD, LPDWORD, etc) that are a fixed size. This is partly due to the fact that these DLLs are linked from many languages (VB, Pascal, C). They are not very C centric. Also, we are not used to recompiling everything to use a library: we likely don't have the source. We have to link carefully. Ints and longs won't do.
I don't see how you think these are arguments in favour of a shorter long.
a 64 bit [long]... would just be wasted space most often
Space optimisation is a matter for the application developer - the OS design should not select a model based on a preference for data size over source code compatibility.
a 64 bit [long]... complicates portability between 32 and 64 bit compiles.
It really doesn't - if this was their reasoning I would have to suspect the people involved in the decision didn't have much (if any) experience with porting to 64 bit environments at the time they made their decision.
can you recommend any good online resources that give pointers on how to write clean code that can be easily ported between 32 bit and 64 bit systems and various operating systems?
I'm not aware of any off the top of my head. Most of my experience with such things is empirical, having done 16-32bit and 32bit-64bit ports in the past. It also helps to have a background in assembly language so you can fully understand the implications of a lot of the constructs used in higher level languages. One of the great shames these days is that new programmers tend to have no assembly language skills, couldn't read a disassembly listing to save themselves, and don't realise the cost of some of the high level constructs they use.
But doesn't MS discourage you from using, for example, the long datatype and tell you to use the macro LONG instead? I thought those macros were designed just for situations like this, where you are porting your app from a 32 bit to a 64 bit environment.
There are many, many valid reasons for using "long" rather than "LONG" - especially if Windows is only one of your platforms. Even if you were doing that as a rule, it would not prevent you from going through all the steps required for checking. Also LONG is defined as a 32 bit integer even for a 64 bit target - you probably mean LONG_PTR.
So you had existing code which wasn't 64 bit clean. And you "fixed" it by assuming that on any 64 bit system sizeof(long) == sizeof(void*)? I'd say you deserve what you get if that's the case.
I'm not saying anything about my own code, but you are making many assumptions you don't even realise. Not every piece of code that is affected by this moronic design decision would have been visited in a prior 32bit-64bit port (or a 16bit-32bit port). In anything other than a trivial application, you're going to have to re-examine potentially a large number of pieces of code that you wouldn't have had to deal with elsewhere - and you certainly couldn't say that just because you did the things you indicate that there are not going to be any issues created by this.
For instance, it has always been true that sizeof(long) >= sizeof(void *) (leaving aside systems with strange pointer formats like the AS/400 where you can't realistically port anything even moderately complex anyway).
Passing a pointer through an integer is very common in C, and not unheard of in C++, and the integer of choice for this has been the "long" for the past 15 or so years. In fact the Windows API requires passing pointers around in integers, and because they have made long 32 bits wide, they have then had to make the API depend extensively on non-standard integer type declarations (no, it doesn't depend on a particular non-standard type, but it does require non-standard types to work).
The fact remains that making sizeof(long) < sizeof(void*) means people porting to 64 bit will have to do more checking, and more testing than if sizeof(long) were >= sizeof(void *). That's going to operate as a disincentive to vendors to move to 64 bits.
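To make the pointer-through-integer point concrete, here is a hedged sketch. The function names are hypothetical, uintptr_t is the optional C99 type intended for this job, and uint32_t models a long that is only 32 bits wide:

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* The assumption a lot of existing code makes about the platform. */
bool long_can_hold_pointer(void)
{
    return sizeof(long) >= sizeof(void *);
}

/* Round trip through an integer type that is wide enough. */
void *round_trip_uintptr(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}

/* Would this pointer value survive being stored in a 32 bit long? */
bool fits_in_32bit_long(uintptr_t bits)
{
    uint32_t narrowed = (uint32_t)bits;  /* drops everything above bit 31 */
    return (uintptr_t)narrowed == bits;
}
```

On a platform where long is at least pointer-sized, long_can_hold_pointer() returns true and the old idiom keeps working. Where long is 32 bits and pointers are 64, fits_in_32bit_long is true only for addresses below 4GB, so every pointer-through-long site has to be found and re-examined, and the ones that are missed fail only when an allocation happens to land high in the address space.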
When Microsoft moved from win16 to win32, everyone had to upgrade all their apps to take advantage of Windows 95, Win32S and NT 3.5. It was quite a money grab for the application developers; many simply had to re-compile against the 32-bit libraries and do some minor tweaks to release their preemptive-enabled applications.
There was far more to it than that. When you're writing C or C++ code you often make implicit assumptions about the size of many objects. Also MS changed the layout of values passed to Windows messages in many cases, and that required extensive code changes.
I don't see many apps being ported to 64 bit though - only apps that have very heavy memory requirements. MS made a mindbogglingly stupid choice when they made long 32 bits wide in their 64 bit data model. Every other 64 bit operating system made long 64 bits wide. That means that even if you've ported to 64 bits before (because you're a server app that works on things other than Windows), you're up for porting work.
5 - The biggest objection of all is "we need to teach them what they'll encounter in the real world."
Never mind that for most of the school curriculum that's apparently not important - they'll claim it is. So you need some nice graphs showing OpenOffice use versus MS Office use so you can point out that by the time the students get into the real world OpenOffice will be at least as common as MS Office, and then be able to tell them why.
Their faces looked like this: Each of the four had the face of a man, and on the right side each had the face of a lion, and on the left the face of an ox; each also had the face of an eagle.
And Ra thought the helmets he had on his Jaffa were fancier than on those of all of the other system lords.
A license binds the licensor as long as the licensee is in compliance.
That is not quite true. It is possible to revoke a license given in the past (even if done in breach of contract* - although this will have the consequence of liability for damages in contract), however as you point out there may be an estoppel. But to get an estoppel you have to show that you have relied on the license in a way that makes it unconscionable for the licensor to retract the license.
If you have only used the software, you will not normally be able to show this. But if you have created derivative works (provided the derivative is non-trivially different), you most likely will be able to show this and then get the benefit of the estoppel.
* Not all licenses are (or even need to be) contracts.
They are probably talking about the trademark law....
My reaction was a little stronger, and began with "for crying out loud...", or something semantically equivalent.
But that's what you get when you use the term intellectual property, more confusion.
Actually, that's what you get when you have people using technical terms they don't understand. Perhaps /. should consider hiring somebody who has legal qualifications (as well as technical ones) to edit the "Your rights online" section.
Connecting to 66.194.210.2:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 611,616,768 [text/plain]
0% 622,640 1.28K/s ETA 30:42:43
Do I really want to spend 40 hours downloading a 600MB file only to risk finding out it got munged because its MIME type is given as "text/plain"? I do not think so.
That's the cool thing about Democracy - we may not always elect the best candidate, but no one dares cross that line where no amount of advertising will fool people any more.
Luckily for those in power, the people are easily fooled. That is shockingly evident in United States politics today. That the people are easily fooled is nothing new - but the extent to which they can be fooled has reached new heights. They have been fooled about things that would have seemed unimaginable last century. Then the people of the United States could look at other nations and say "Why are you so easily manipulated? How can you believe what your corrupt regime is telling you." America now knows how - although most do not know that they know.
Well then... it's time for the return of my shotgun to active duty.
Either that or your lawyer. Deliberately bypassing a popup blocker is clearly a trespass. Whether it is also an actionable trespass depends on where you are, but according to the Supreme Court of California in Intel v Hamidi, the fact that a trespass without damage or the threat thereof is not actionable there is not the same as saying it's legal.
Even if you're in a place where trespass without damage isn't actionable, you could argue that there is "damage" in that they are causing the computer to behave in a way that you have configured it not to and that its value to you is thereby diminished.
They're not just being annoying now - they're breaking the law.
I would contend that any law that is that unevenly enforced is completely unreasonable. I agree that a sports car is a work of art, as is the architecture of a building.
In fact most laws are selectively enforced, and most laws if enforced 100% of the time would have unreasonable results.
Taking a photo of your family in front of the thing is a newsworthy event, as far as your family is concerned.
Yet it is different to media reporting of an event, and courts can see that it is different just as well as anybody else can.
So how about... Charge them a fee for every letter they send addressed to you, again, because you own the copyright to "joe blow, 123 anystreet, anytown, whereever".
Then they will lock you up in the same asylum they use to store the people who think they can create a new nation by declaring that their farm secedes from the United States.
To be fair, Microsoft's terraserver is just a bunch of computers with lots of disk space. It doesn't take pictures -- that's done by the US government, the geological survey people, most likely. Microsoft just puts the pictures online.
This of course raises an interesting issue. Could somebody create a piece of art easily viewable from space, and then sue the US government for breach of copyright if they take satellite pictures of it? That sounds like it might be a good place to hide militarily sensitive stuff.
The Space Needle in Seattle - once part of the Seattle World's Fair - is now privately owned and its image is trademarked by the Space Needle Corporation, for just about every class of goods to which it conceivably could be applied. So when you're up there on Queen Anne Hill taking your pics of downtown Seattle, you're violating their trademark.
Please write out 1000 times, "Trademark is not the same thing as copyright."
If it's copyright you are worried about, then the restriction is on making copies, including derivative works.
If it's trademark you are worried about, then the restriction is on using the trademark or something close to it on or in association with goods or services in the course of trade.
Taking a photo of a trademarked thing is OK. Taking a photo and putting it on a T-Shirt to be sold is not, if the trademark is in any area vaguely related to T-shirts. Taking a photo of a trademark and putting it on your own T-shirt for personal use is also OK.
In the case of copyright, taking a photo is (subject to fair use and some specific exemptions) restricted. Putting it on a T-shirt for sale is restricted. Putting it on a T-shirt for personal use is restricted.
It should be apparent that copyright is far more restrictive than trademark.
Since in all these cases the structures are prominent public facts, this all seems an incredible violation of the right of the public to all of the visual space present from public vantages.
No contest. If universally enforced, copyright law would make many everyday uses of photography illegal without the permission of, in many cases, more than one person.
The design of a building is artistic and is probably copyrighted by the architect.
Not just probably. The design of a building is subject to copyright.
Taking this to its absurd conclusion, anyone who photographs a city's skyline would have to pay royalties to the architects for each building in the shot.
Subject to fair use exceptions, yes, technically. But this is very rarely enforced.
What if a news event would happen next to this sculpture? Could they deny coverage? If not then who decides what is newsworthy?
Reporting of events falls within fair use in the US, and within related exemptions elsewhere. Taking a photo of your family in front of the thing may be fair use (I haven't looked into that). Taking a photo of the sculpture itself for the purpose of getting an image of the sculpture is not fair use, and technically requires the permission of the copyright holder.
Incidentally, taking a photo of a sports car is also likely to breach somebody's copyright. There are many absurd results that can arise from copyright law - the only reason they don't arise more frequently is that most people are somewhat more reasonable about when they enforce their copyrights.
What the heck is in that other 99.9% of the document?
Word documents are stored in DocFiles (also known by some other names including "LAOLA" and "Microsoft Compound Files"). DocFiles are made up of blocks of 512 bytes, and an empty DocFile has two blocks - one for the header, and one for the root directory, for a minimum of 1K.
Word documents are never empty though - they have streams for the "property" information you get when you right-click on them, sub-storages for OLE objects (even if there are no objects in the document), and a stream for the core data of the file.
Then on top of that there's the other stuff people are mentioning - fonts, styles and the like. Things necessary to convey information about how the thing has been formatted.
If you want to see this, there is a tool that comes with the MS SDK called "dfview" (it runs OK in WINE) - open a Word Document in that and you will soon figure out why the files are so large.
Running Microsoft programs is the hardest for Wine because they use secret function calls
Current CVS versions of Wine can install and run the major MS applications, including MS office and Internet Explorer. Why would you do such a thing, I hear you ask? Because users still use Windows and as developers we still have to write code that interfaces with those applications. Absent that, OpenOffice and Konqueror or Mozilla work perfectly well.
It appears you forgot the "-m64" flag (assuming GCC):
Poopers. Relates to what they're doing in the sand pit.
For the purposes of copyright law, a photo is a copy of a sculpture, since it is a derivative work.