Why Can't We Reverse Engineer .DOC?
Good question.
I wonder if it has something to do with the mentality of the players involved. I don't think Sun, Corel or Lotus ever thought that they might be able to get together so that they could compete on the Office market, I think they all looked to carve out pieces of the market with their own suites, making such collaboration impossible. Despite popular misperception, Applix does not convert DOC, it converts RTF (which may be close enough for some people). Star Office is striving toward this holy grail, but they aren't quite there yet. So maybe it's not too late for folks to pool resources and finally get the job done. In fact, with the eyes of the court on Microsoft, now might be the perfect time.
On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer? I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed, but I wouldn't say that it's impossible ... not for three big corporations, nor for thousands of loosely organized coders. It's one thing to have control of a file format, but it's another to be put into the position of having to change the format constantly in order to stay in the game. If Microsoft is placed in this situation, the onus would be on them to either concede the format until the next major release is made, or shorten the upgrade cycle on Office. How many businesses would stick with an office suite which forced users to upgrade every eight weeks just to remain compatible? If something like this were to happen, we might finally be able to put a dent in the everpresent Office monopoly.
So why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format. Have we tried, or has Microsoft's reputation, both professionally and legally, kept people from really thinking about it?
The first posts from the last 136 stories:
- 81 posts: Anonymous Coward
- 2 posts: Coma of Souls
- 2 posts: Sicknal 11
- 2 posts: Signal 12
- 1 post:
/ - 1 post: addbo
- 1 post: Anonymous Cowart
- 1 post: bapya
- 1 post: BgJonson79
- 1 post: bitchslapboy
- 1 post: BlowChunx
- 1 post: CardiacArrest
- 1 post: chandler
- 1 post: crazy_speeder
- 1 post: DavidOgg
- 1 post: Decklin Foster
- 1 post: dJOEK
- 1 post: Doofus
- 1 post: Dr Caleb
- 1 post: DrEldarion
- 1 post: erik umenhofer
- 1 post: FascDot Killed My Pr
- 1 post: flipppy
- 1 post: fluxrad
- 1 post: gdulli
- 1 post: gkAndy
- 1 post: gt_croz
- 1 post: jims
- 1 post: JKR
- 1 post: LinuxFreak12
- 1 post: Machina
- 1 post: MalaclypseJr
- 1 post: MaximumBob
- 1 post: mr_biggs
- 1 post: nerdling
- 1 post: Old Wolf
- 1 post: Ophidian Jones
- 1 post: osm
- 1 post: Paradox`
- 1 post: philipm
- 1 post: QBasic_Dude
- 1 post: qbasicprogrammer
- 1 post: rjamestaylor
- 1 post: rms
- 1 post: session
- 1 post: sheriff_p
- 1 post: Signal l1
- 1 post: Spameroni
- 1 post: stokessd
- 1 post: Stskeeps
- 1 post: tealover
- 1 post: Tim_F
- 1 post: TRoLLaXoR
I already took two firsts away from an Anonymous Coward.
- .DOC is an OLE Document
- OLE Document parsers are available for most platforms. There's even one for Perl
- The .DOC format is documented on the MSDN CDs - where else would you expect this documentation to appear? - So no reverse engineering is needed. Just follow the spec
What truth remains is that the doc format changes from release to release of MS Word. So developers have to track these changes. The format is also large and complex, so it has remained fairly niche in the open source world.
Matt. Want XML + Apache + Stylesheets? Get AxKit.
DOC isn't a difficult file format. It's pretty well documented in various places around the web.
The thing is DOC is a compound file format. Meaning it is made up of various serialized data streams from embedded components. Word itself won't even know what many parts of a DOC file means, it'll just pass it on to Visio, Excel, Photoshop etc to read and understand.
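For what it's worth, the compound-file container can at least be recognized cheaply. Here is a minimal Python sketch, assuming a Word 97-era binary .doc: such files begin with the 8-byte OLE compound-file signature D0 CF 11 E0 A1 B1 1A E1. The function names are made up for illustration, and a match only tells you the container is OLE, not which application wrote the streams inside it.

```python
# Sketch only: recognise the OLE compound-file container by its signature.
# Function names are hypothetical, chosen for this example.

# The 8-byte signature found at offset 0 of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the buffer starts with the compound-file signature."""
    return data[:8] == OLE_MAGIC


def file_looks_like_ole(path: str) -> bool:
    """Read only the first 8 bytes of a file and check the signature."""
    with open(path, "rb") as handle:
        return looks_like_ole(handle.read(8))
```

A check like this is only the first step; walking the streams inside the container is where the real work starts.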
DOC is a hugely extensible file format, and you can't support everything DOC can, because DOC can theoretically support just about anything...especially Windows applications.
And no that was not done through evil intent. Believe it or not, integration of applications is very much something that good software engineers strive for.
If you have a problem with it, just wait a few years (or maybe a decade) for KOffice etc to mature, and watch people complain as documents created on the Linux version of KOffice won't work because someone decided to embed in their document some python code, or an xpaint image.
Here's another "try to act professional, but bash Microsoft at the same time" type of post. Pretty typical of Linux users...
On the other hand, we have DWG, which is a fairly rich format that deals with the description of 3D objects. Could decoding a file format that deals with text and its presentation really be that much more difficult to reverse engineer?
Well, considering DOC can store ANYTHING, including the description of 3D objects: yes.
I'd guess this depends more on the design behind said file format. If one of the main goals of the .DOC format is obfuscation, this could be difficult indeed
I see, Microsoft == Evil, so DOC must have been created to obfuscate. Very smart of you.
Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read? I guess Microsoft will go out of its way next to obfuscate their source code to make it more difficult for the OSS community to read their source?
but I wouldn't say that it's impossible ... not for 3 big corporations, nor for thousands of loosely organized coders.
Yes, those poor, poor companies like SUN with their open software like Java and Corel Office need to band together and blow up Microsoft. Resistance is not futile!
Please.
DOC isn't going to be very important in a few years anyway; Microsoft is moving to XML-based everything. Serialization of COM services will be XML based rather than binary based as it is today as well. Just don't complain when your documents are 100MB.
Um, this is lame. DOC format is specified on MSDN. I remember in a C programming course, learning how to read and display MS Word 6 .DOC files.
How do you explain the various programs for Linux that all read MS Word .DOC files perfectly well? For example, Corel's word processor, and all those DOC -> PS convertors.
This article seems to be just FUD.
http://www.wotsit.org
| Ceci n'est pas une pipe.
Wouldn't this be made illegal under the DMCA? After all, we can only hack the .doc format by circumventing its encryption scheme.
OK, OK, you are going to say that .doc's aren't encrypted. But even their encoding scheme could be regarded as a form of data hiding.
Oh, I forgot one minor detail. The government is in bed with the movie industry, not the software industry. So it's ok to bypass the encryption on anything except mp3s and dvd.
nuclear cia fbi spy password code encrypt president bomb
Friends don't let friends misuse the subjunctive.
Response2: Yeah! Let's just do it!
This question misses the whole point. The problem (from following the AbiWord list for a while) is not that the .doc file format needs to be reverse engineered, it's that the format is such a piece of crap that you can't implement the spec.
Basically, you have to emulate all of Word's bugs in handling its own file format to get the expected results. And trying to copy 65,000 bugs is non-trivial. :)
--
-- Slashdot sucks.
When I used StarOffice I saw horribly broken formatting that was magically cured when I installed Microsoft/Monotype fonts on my Linux box. This suggests that Word formatting is very inflexible about changes in the parameters of the media (as opposed to, say, TeX, which will adapt to any size of anything as long as it makes sense), and that every slight difference in the algorithms (the never-documented ones, not the "packaging") can cause horrendous miscalculation of the formatting.
Contrary to the popular belief, there indeed is no God.
Might become a problem when companies start patenting file formats, like this ASF patent
LAOLA looks like a good solution.
___
Apple used to bundle MacLinkPlus with MacOS, so any Mac user could open any file from any program -- PC or Mac. (I used to annoy PC users by using my Mac PowerBook to translate files for them that they couldn't open, from programs that they didn't have and that weren't even available for the Mac, e.g., Lotus AmiPro. The stuff works.) Apple doesn't bundle it any more (?!) for their own inscrutable reasons.
There is no Linux version (yet) of DataViz's translator package, but they do offer translation packages for Palm users, so there's some indication that they're open to addressing "non-traditional" platforms if they see a market. I have hope.
The problem with MS Word is that the way to see how a command or comment would work is to try it on the screen. Will inserting this graphic cause the rest of the page to lose alignment? Not sure? Try it out!
This is fine for people using the software, but a nightmare for other people trying to write compatible software. Try it! Take out your old copy of MS 5.0 and write a fairly complex document (a 5-6 page research paper with graphs and annotation is a good example). Take that and use MS 6.0 to read it. Even MS themselves can't maintain consistency of conversion. That's because they basically made the document format up as they went along - no formal software engineering specs were ever written. If they were, then they obviously weren't detailed enough.
Contrast that to TeX. You don't have to copy a single line of TeX source to create your own TeX compiler. All you have to do is examine the picky formatting tests, and ensure that your code reproduces the desired results. And the specs were designed sensibly, if a little idiosyncratically.
The binary format of the .doc file is hardly the issue!!
Microsoft Word is made for lawyers. Try writing anything scientific with it and you're screwed. LaTeX is still the way to go.
The answer for why the big office suite vendors haven't banded together in the same manner as the OpenDWG Alliance seems pretty self-evident to me. I'm sure that each of these software manufacturers have at one time or another signed an NDA with regards to the MS Office file formats. Once they did that, they were precluded from sharing that information amongst themselves. End of question.
As for why they signed those NDAs? Again self-evident: early access. If Corel or Lotus wanted to be able to support the new file formats in a timely fashion, they need to know what the spec is well in advance -- TechNet doesn't get that sort of new information fast enough. For that matter, when you subscribe to TechNet, you're signing a limited NDA with Microsoft; I'd check the fine print before I depended upon TechNet information...
Are you moderating this down because you disagree with it,
We call it art because we have names for the things we understand.
... is http://www.opendwg.org, not http:///www.opendwg.org as is given in the article.
there should be a policy of the minimum number of cups of coffee the poster has to drink before posting...
For my purposes, and the purposes of the company for which I work: what good are reverse-engineered DWG file formats if you still can't get a good, affordable CAD package on anything but Ms-Win? My company is presently standardizing on applications. And (I'm sure MS would be overjoyed to hear this) it looks like the Unix boxen are on their way out. Why? One of the reasons is AutoCrap. It's available only on Ms-Win. Our customers and vendors demand files in AutoCrap format. There are no price-competitive CAD packages available for Unix anymore. (Bentley has dropped support for MicroStation on Unix--in case you didn't know. Note to Bentley: you screwed up! By dropping MicroStation for Unix you removed any incentive for us to consider your product.) So bye-bye to our reliable, low-TCO Unix workstations and X-terminals :-(.
So even though we're evaluating StarOffice to use instead of MS Office, and even though we're evaluating non-MS email clients and other non-MS client apps: even if these pan out the Unix environment is still probably doomed because of AutoCrap :-(. (Then there's Visio and other stuff.)
IMO many vendors, by not making their apps available on non-MS platforms, are missing the boat by failing to differentiate themselves from the run-of-the-mill "Me too! I do Microsoft" crowd. With things happening like the surge of interest in Linux as potentially a viable workstation platform, Solaris for free and Sun hardware getting quite affordable: this seems to me to be narrow-minded. Particularly wrt to vendors like Bentley--who already had Unix versions of their products.
Sigh...
As far as reverse engineering the file format goes, it's all but impossible. Now that UCITA is here it will get even tougher. I just hope AutoCAD knows not to shoot itself in the foot by suing its own users. If the problem ever amounted to a threat to AutoCAD's market share there would probably be quite a backlash.
Wordpad supports Word 97/2000 files, and so does MFC.
The source code for MFC comes with Visual Studio and the Windows Platform SDK.
You can also download the complete source code for Wordpad from MSDN (Under sample applications).
All you have to do is implement large portions of Windows, COM and Windows Apps to make it work. It uses OLE Structured Storage. OLE (COM/ActiveX) is a Windows thing. To make OLE Structured Storage work on other OSes, you have to make COM available, and use it to read and write the doc. Microsoft did this for the Macintosh, for example.
So, to properly read and write .doc files, you either have to:
1) run Windows and Word
2) run MacOS and Word
3) port COM to another OS and write a Word-alike
Yummy. Anyone written COM for Linux lately? TummyX's "it's open, it's open, stop whining" aside, .DOC is not open because the technology it depends on is not open. I'm sure the fellow who wrote a Word viewer in his C programming course did it on Windows, where COM and other Windows APIs are available.
If he did it on Unix or BeOS or something, he should speak up.
Open file formats are important for interoperability and choice. Non-open ones are important for limiting choice and maintaining control. Knowledge shared is power lost, as Aleister Crowley said.
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
Applix does not convert DOC, it converts RTF
Not true. Applix does convert DOC. It doesn't write DOC, however.
At least the version that I use daily.
Roger.
The troth is that those specifications are inaccurate and incomplete with regards to word 97 and 2K. Every person who has tried to implement an import filter has ran into that problem. The end result is that you sit down and create word documents on one PC ( or virtual PC with VMWare ) then go through with a hex editor to figure out what symbol dose what.
To put that all in perspective the two paragraphs above save to 1 KB ( minimum displayed file size on Win98 ) in HTM or text format. In MSWord
Everybody who dose this reverse engineering has to start from scratch. Every company that tries to read *.doc files has to put people to work doing it. A combining of efforts would be very prudent. Let's start by getting The Open Source teems together on this then we can invite IBM, Corel, Sun, etc... to join.
We need someone to advocate the benefits of an LGPL or even BSD licensed library set to corps who must otherwise do it all themselves ? This is what ESR is useful for so go and call him.
--= Isn't it surprising how badly I spell ?
1. .DOC is documented, this question is lame FUD. Quit bashing Microsoft.
Well, if it's so well documented, then why can't I open a Word document in WordPerfect? And please don't tell me it's because the Word document can contain embedded things like Excel and Access parts. I'm just talking about a regular word processing document with text and a little formatting. Our MIS guys tell me it does work, but they apparently received this information from the WordPerfect 8 packaging rather than from experimentation, because it doesn't work on my computer and they have been unable to show me where it works on theirs.
2. Why are you picking on poor Microsoft? Do you really think they would purposely obfuscate their own code and make it difficult not only on the rest of the world, but themselves as well? Do you really think they're purposely trying to make it difficult for other companies to use the .DOC format?
Um, well yes, that's exactly what I think. What planet have you people been living on for the last 20 years? Of course Microsoft wants to make it difficult for other word processors to use its format. They pretty much have a monopoly in the Office arena and they want to keep it. If you could go out and buy WordPerfect for $100 less than Word and still be able to use the .DOC format perfectly, how would that help Microsoft? They have done things like this in the past and they will continue to do them as long as they can.
On a more positive note, I'll say that I do think that Microsoft Office is a good product. I mean it works and it does a lot of cool stuff (even though that makes it bloated). The problem is in the way Microsoft has used the power that Office has given them, not in the product itself. And I'm not just bashing Microsoft. I fully believe that if Sun or Corel were in their place they'd be doing the same thing. The bottom line is that consumers are suffering because of proprietary formats. This is one of the big reasons why computers have not made us more productive (or at least as productive as we could be). I can't count the number of hours I've spent simply trying to convert documents from one format to another.
Check out AbiWord.
Likewise, feeding broken documents to the "strings" program to recover the text, or doing to a list with sed(/gawk) in one line and two seconds what would take a Word(/Excel) user weeks by hand. Highly amusing. (-:
Got time? Spend some of it coding or testing
So because MS wants to keep out competitors, it is entitled to make you find another job simply because you wanted to exercise your choice in software. In my book, hurting innocent people is EVIL!
There was a company (I believe they were called INSO) that once upon a time made a set of Windows DLL's for file conversion. One of the things they supported was converting to/from .DOC '95 and '97 formats. '97 was the last one I saw, so they may have disappeared.
I don't know if they had some sort of agreement with Microsoft or they came up with the converters on their own, but they were indeed out there.
The analogy is actually more apt than you'd think.
The .doc file format is fairly well documented, as these things go, although there are some proprietary aspects, like the VBA streams. It's not that tough to open up a Word doc in your own program and parse the file correctly.
The tough part comes when you actually want to display the document. Now all sorts of little details that aren't in the file format but are idiosyncrasies of MS Word pop up. And, as anyone who's used Office extensively knows, Word will display the document differently depending on which version you're using, what printer you have connected, phases of the moon, etc.
Parsing and display are two different things. While half a million apps can parse HTML, no two of them seem to display it in quite the same way. The question here is a bit like pointing out that no browser displays things like (IE|Netscape). Well, no they don't, but that has nothing to do with an inability to reverse engineer the file format.
So... they not only get the last word, but the previous few words as well?
Got time? Spend some of it coding or testing
Now that's what I call following your convictions...
Got time? Spend some of it coding or testing
I have no problem with storing an image (standard bitmap or scalable, e.g. (E)PS) in a document, plus a reference to the application that created it and the source file. Then you can have your convenient little diagram in a portable format, and when you double-click (or, right-click->edit) on the image the originating application is started with the sourcefile as the first parameter.
That way no application means no editing rather than no picture, which is how dear old "we know what you want" MS have done it.
Got time? Spend some of it coding or testing
Why sit here and waste the energy on 10 million different formats? Why not create a standard? I know MS would not like that, but that's their problem. Yet another reason why people should think about not using their products. I think a standard is long overdue for documents, spreadsheets, etc. Remember 10 years ago, when there were 10,000 different types of databases? FoxPro, dBase, etc. Now we have SQL, which makes life sooo much easier. I hate it when people use .doc for sending stuff to me. I had a company send me their price list in that format. I told them that unless they sent it to me in plain text format, I refused to buy from them. After a number of emails back and forth, I was told that I am using an OS that is garbage because it's not MS. Don't get me wrong, I could have converted it. But it's not worth my time. PDF is another example. I won't use it, because it's a stupid proprietary format owned by 1 company. If it were an open standard, I would use it. If the file format goes with their product only, I can understand that. But if there are 10 products very similar to it, I think it's a waste of my time and everyone else's. Ok, I have stepped off my soap box now...
until (succeed) try { again(); }
... by at least one project, WvWare which has a very functional word to html converter available online, and the routines behind it are all open source.
The problems remaining are two - the .doc format keeps changing every release, and second, honestly, it sucks. Even converting it to a real format can be interpreted as giving it credence. I have used the link above in a couple of cases where it was really necessary, but generally, when I get sent a .doc document I reply please send me the data in a standard format. This usually gets the point across. It isn't like word can't output to rtf or txt formats, but for the rare occasion when you don't dare insist some PHB converts his data to a real format, this is a viable converter. And of course if you are writing a GPL Word Processor you are free to use the routines published here to create your own conversion filter...
There are also links on the page to all sorts of resources related to the ms word document format.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
"why hasn't .DOC been reverse engineered? I would think that if this can happen to the DWG format then it can happen to any proprietary format."
Not necessarily true. A format can be encrypted with PGP and a connection to the Internet may be required to read a document encoded with this format. Try and reverse engineer that.
You can't handle the truth.
You know, I've been thinking about this.
The obfuscation isn't actually in the .DOC format; it's in the fact that Word itself reads the statements contained within the .DOC format in confusing and illogical ways.
Yet, this readability has been maintained from Office 95 thru Office 97 to Office 2000. (Let's not even talk about Word for Mac!)
This just isn't possible unless Microsoft has internal conformance specifications that they follow from revision to revision.
We know the specs exist because it literally would have been impossible for Microsoft to have functioned without them.
98% of Word documents don't use any advanced Word features. In fact, 98% of Word documents should be saved in RTF format, and lose nothing of value in the translation. With these specifications, the #1 thing companies could do would be to implement a DOC->RTF filter *at the mail gateway* and be done with 98% of Macro Virii.
Will it happen? Nah. The Word Monopoly is just too critical to Microsoft's success. It really is.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
What's wrong with the wv library - also known as wordview which was packaged with many Linux distributions. www.wv.com IIRC.
The default translation is DOC to HTML - which sucks. However, it internally generates XML which can be transformed into anything - one of the examples is TeX.
Ross
Now that I've got that off my chest:
Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?
That's a good question I'd like to know the answer to. Having tried in the past to plumb its depths in order to output database reports in it, I really don't know.
Are these formats really open, however, or are they like the shrinkwrap on the Kerberos license? Back when I got the format, I had to sign an NDA and promise not to use the format in a word processing program.
Microsoft are moving to XML based everything.
Yes, and have you looked at how they use it to save HTML? Biggest bunch of spaghetti known to man.
Look, I'm not an agnostic here: I use Windows 98 everyday, and Office, too, and am an MCSE. But it is futile to defend a company who attempts to maintain their monopoly by making things complicated. I would defend them more if they merely trusted the quality of their own products, instead of doing their best to lock you in by means that go against computing's best practices: keeping things simple, and allowing you to move your data however you like.
I think the quote on the OpenDWG Alliance page sums it up:
Who should have control of your DWG files?
You should.
It is very scary that a good chunk of word processed documents are stored in an overly complicated binary format.
stored on computers from birth to the grave
These other suppliers (Lotus, Sun-Staroffice, Applixware and other opensource projects) all try to offer their own proprietary formats (ok, maybe not the opensource projects, but certainly the big boys) and also offer to support MS formats.
An alternative solution that I believe may be the answer is to create an open community equivalent to the W3C (is that right, the body that maintains the HTML standard, whether or not it is followed, (it's late and my brain is getting foggy)) for office document formats. This way there would be an open, universal specification for document formats that all the main products could conform to and would be available to any newcomers. The bigger suites could support open standards and proprietary if they wished (for advanced formatting that they thought was important), but it would mean that if you wanted to ensure that your documents could be read by anyone, you could use the universal format. Also a document format such as this would (hopefully) be outside the control of any one developer.
It's not a perfect solution by any means, but it could just be a step in the right direction .
"I'll take the red pill. No! Blue! AAAaaaahhhhhhhhh"
- Monty Python meets the Matrix
"Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"
;-)
It's very simple really. Unlike Autodesk, which uses some form of logic to create their file formats, Microsoft uses heavy encryption seeded with a semi-random number.
This number is based on the millions of dollars Bill Gates is worth at the year of release. In fact, the file formats for Office 95, 97, and 2000 are identical - it's just that Bill Gates has been worth more at the time of their release, so the file was encrypted differently.
This is why it's so important for the Microsoft stock price to jump around. If it stood still then the file formats wouldn't change, which means people wouldn't buy the latest version of Office, which means the stock price wouldn't change, and so forth in an infinite loop.
strings file.doc ;)
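For anyone without strings handy, roughly the same trick as a Python sketch. Assumptions: the function name and the 4-character minimum are my own choices, and it only finds runs of printable 8-bit ASCII. Text that Word stored as 16-bit characters will not show up this way, so treat it as a salvage tool and not a converter.

```python
# Sketch of a strings-like scan: pull runs of printable ASCII out of a blob.
# ascii_strings and min_len are hypothetical names chosen for this example.
import re
from typing import List


def ascii_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return every run of at least min_len printable ASCII bytes."""
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


def file_strings(path: str, min_len: int = 4) -> List[str]:
    """Convenience wrapper: scan a whole file read in binary mode."""
    with open(path, "rb") as handle:
        return ascii_strings(handle.read(), min_len)
```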
It's called TeX - check it out.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
--
I think you are 100% right, and if I hadn't used up my last moderator point on a good post yesterday, I'd bump you a point myself. Not that your comment was particularly informative (no links to back up your point, shame shame) but it definitely qualifies as insightful. MSDoc format is an abomination, while it is a good thing there are in fact decent converters available (see WvWare) for those occasions when we just have to read a .doc file, but the goal should not be conversion of this disgusting format, but elimination of it in favour of open standards (text, TeX and html, depending on the document.)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
Win98's Wordpad doesn't read most of the Office 2000 .DOC files I've been sent, at least not properly. Usually, these files will contain many spurious characters and the formatting will be lost.
"I love my job, but I hate talking to people like you" (Freddie Mercury)
go take a look here: http://www.winehq.com/Apps/details.cgi?id=2097
New things are always on the horizon
Have "we" managed to get Netscape and IE and Opera (etc etc) to display HTML exactly the same yet? Of course there might be some minor disagreements on what each tag is exactly supposed to do, but it still must be easier than reverse engineering the DOC spec. I'll bet Microsoft couldn't even create an app to read the DOC spec just as Word does!
So close and yet so far from the world's perfect ID number
i dont know what the big fuss is anyways.. because StarOffice 5.1 DOES decode .doc files. I dont know how, and i dont care. All i know is i can read my pesky marketing peoples files in Linux. boind...goinbb..
pavementrocks.
Why can't we reverse engineer .doc?
The answer is 43.
jungle is massive
jungle is massive
Mmmm...send them files in TeX format?
Seriously though, knowing what the person can read on the other end and sending them that is courteous. Unfortunately, too many people think that everyone uses what they do.
A firewall can not protect you from yourself. Turn off what you do not need. Do not use the firewall to do your work.
Enough said. Corel WP is really the best WP out there, and it bothers me that people say, "Oh, that feature is in Word so it must only be in Word."
I can't stand it any longer. Please, go help Corel out and buy a copy of WordPerfect Office 2000 -- it's better than Word (Reveal Codes!!!) and it's cheaper. Not to mention OEMs don't force it down your throat....
|/usr/games/fortune
I have the misfortune of having to work with Word on occasion, and my wife uses it in her consulting business. Formatting problems crop up constantly when trading files with other Word users. Wasted time, wasted paper are just part of using Word. If m$ can't make two versions of Word that format the same document the same way, or even one version of Word that runs on two platforms (MacOS and Windows) and formats the same document the same way, why is anyone surprised that no one else can get it to work? It should be nuked from space and forgotten.
It's a daunting task to reverse-engineer the .doc format because even MS's developers would be hard pressed to re-implement their engine.
The best way to deal with .doc files, IMHO, is to make them irrelevant. That's why I totally agree with the people that say we should have one office documents format.
I believe this format should be based on XML, so it would be very easy to extend documents. You might even put a whole site inside your doc! Doc size wouldn't be a problem: we just have to compress it in the end with a standard method.
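To make the idea concrete, here is a rough Python sketch of "XML, then compress with a standard method". Everything about the layout is an assumption made up for illustration: the element names (document, title, body, para) and the helper names do not come from any existing office format, and gzip is just one standard compressor that happens to be in the standard library.

```python
# Sketch only: a made-up XML document layout, compressed with gzip.
# Element names and helper names are hypothetical, not an existing standard.
import gzip
import xml.etree.ElementTree as ET
from typing import List


def build_document(title: str, paragraphs: List[str]) -> bytes:
    """Serialise a title and a list of paragraphs as UTF-8 XML."""
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for text in paragraphs:
        ET.SubElement(body, "para").text = text
    return ET.tostring(root, encoding="utf-8")


def pack(xml_bytes: bytes) -> bytes:
    """Compress the XML with gzip before writing it to disk or mailing it."""
    return gzip.compress(xml_bytes)


def unpack(blob: bytes) -> bytes:
    """Reverse of pack(): recover the original XML bytes."""
    return gzip.decompress(blob)
```

Repetitive markup compresses well, which is the whole point: the verbose tags cost little once the file is packed, and anyone with a decompressor and a text editor can still read the contents.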
I have had to put a great deal of research into this because I'm doing a project for a client right now that requires converting .DOC files to HTML and inserting them into a MySQL db. So far I've found plenty of worthy solutions for converting the text, but none of them will handle the linked TIFF graphics in the documents.
Here are a few of my bookmarks:
WVWare - GPL library for reading .doc files, used by AbiWord, currently incomplete
W3C's list of converters
HyperNews' list of converters - really old
Filtrix - Good commercial, closed-source converter, now available for Linux, great price, but doesn't handle linked TIFF files :P
InfoAccess - Makers of HTML Transit, the Cadillac of closed-source commercial document converters, also exorbitantly expensive ($5000+) and AFAIK not avail for Linux
KOffice (KDE2) filters page - not much here, but AFAIK they intend to ship with MS-Word import capabilities
So, is anyone aware of any open-source MS-Word filter projects that I don't know about? Especially one that recognizes/converts linked graphics contained in the document?
- phutureboy
I personally don't see anything particularly amazing in .doc that I don't see in other document formats. The only prominent "feature" I notice in it is the fixed page width thing, making it truly WYSIWYG. However, over time I've begun to increasingly appreciate the LyX motto, WYSIWYM (the M stands for "mean"). I realized how rarely I need a document to be truly WYSIWYG. The most common reason I can think of that people need a rigid text/page appearance is for making flyers and the like, for which a program like GIMP or Photoshop may be better suited, since many flyers feature lots of images anyway. The argument for .doc may be that lots of people are using it, and that it would facilitate things if, say, a Windows user can read something from a Linux user, or vice versa. But the root of this problem is simply that big word again: "STANDARDS." Why reverse engineer a proprietary format, when one could spend the time promoting and developing open standards like HTML or XML, or even TeX? Another problem with .doc is that it is not a typesetting language. My favorite scenario is this: you make one typo in your resume. You are in a situation where all you have is a telnet program that allows you to connect to the server where your resume is stored. If your resume is done in something like HTML or TeX, you can fix the typo in no time. I'd love to know what I can do in this situation if my resume is done in .doc. Thanks for listening.
At one time, you could download the specs for the binary file format. Now, according to:
http://support.microsoft.com/support/kb/articles/Q211/6/41.ASP
You need to write to an e-mail address and explain why you want it. It also says that the formats for earlier versions of Word are no longer available.
For what it's worth.
Disclaimer: I am not a lawyer, and I don't know the laws of many countries, above is what I understand of Swedish and Norwegian law.
I believe the reason why HTML documents look different on any given browser is because the spec is too vague.
...
For example, in Netscape 4 you can't put something right in the top-left corner without using frames. Is this right? No. Is this wrong? Where is it in the spec?
Another example is the amount of space that the individual browsers use for frames themselves. If you specify a specific height for a horizontal frame at the bottom of a page (see http://www.wired.com for an example of this), it will be a different height in IE and NS. Where is this in the spec? Exactly.
Clearly, HTML isn't up to the task. We need to design a markup language specific enough so that it won't choke when a designer makes a neat and interesting design.
Graphic designers are increasingly becoming web developers. They shouldn't have to worry about the different versions of different browsers and why the same shit looks different all over the place.
If people keep falling back on the "if you don't like it, post a PDF" reply, then nothing will be done! We clearly need a new markup language.
rLowe
----- rL
Your loss.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Friends don't let friends enable ecmascript.
At Wotsit. Microsoft Word 6.0, 8.0, Word 97, and Palm Pilot doc files were all reverse engineered.
I believe there are actually 2 problems here:
1) As I think several people have touched on, the problem here isn't the documentation, since Microsoft through MSDN etc. has documented the Word file format. The problem is that the only specifications on how to correctly render the Word documents are the Word rendering engine itself. Without the ability to see the exact logic that Word uses to render certain formatting codes (read: source code), it is impossible to reverse-engineer a 100%-compatible converter/viewer. It is a similar situation to what the Samba team faces: the SMB/CIFS protocols have been documented by Microsoft, but the only implementation of those protocols is Windows NT/2000, so Samba in reality must be coded to re-implement NT, not implement the CIFS specifications. The difference here, of course, is that CIFS apparently has a complete spec that Microsoft simply ignores, rather than the Word situation where they purposefully keep people in the dark on how things should be done.
2) The reason that you can't just watch what the Word rendering engine does and duplicate it is that it's stupid. From my experience working with Word itself and wvWare to convert Word files to HTML, it's obvious that Word just throws odd formatting codes wherever it pleases, and never bothers to clean them up. Often tags to end bold formatting (converted to </b> by wvWare) are just randomly placed in the document, nowhere near where any bolding is supposed to occur. The same goes for font sizing/coloring: Word seems to place odd, irrelevant font codes in places, only to override them with the correct codes a few lines later (often without canceling the first codes). In other words, it's a mess. With the Word source code, one may be able to figure out the (supposed) logic behind the mess; without it, I fear anyone is simply grasping at straws, especially since MS's continuous changes to Office keep everyone guessing about what Word is actually doing underneath it all.
My US$0.02 of course.
--Mythos
Summing up, we have this situation:
1. MS uses a proprietary storage format: OLE Storage. Its structure is by now well known; one can retrieve the actual application-dependent documents easily.
2. The application-dependent documents are partially documented by MS, partially by others.
3. The documentation isn't complete anyway. The binary documents contain mostly relevant undocumented data portions. (It's obviously due to automatic serialization strategies applied by MS: easy to apply but practically not documentable, not even by MS themselves. This leads to the funny situation that people reverse engineering the file formats understand them better than MS ;-)). I have been working on Word, Excel and PowerPoint intensively for about six years now and can say: it is possible to understand all of these portions.
4. The WMF/EMF/PICT image formats are not sufficiently supported on alien platforms. Even this: on Macs xMF looks ugly, on Windows PICT drawings look ugly. Not too big a problem compared to the rest, but it's not yet solved.
5. MS XML support simplifies the understanding of the doc formats even more.
6. Quite a bunch of information is not stored in the documents but in the application; only the variations from default are stored in the documents. It requires quite some effort to rebuild this data yourself, but it is possible.
Summed up: The knowledge about document formats is no longer a problem. The problem is rather to get the knowledge focused on free applications. I'm afraid it requires management actions from this side.
PS: Did you know that MS stores GIF files as PNGs in their documents? :)
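Point 6 above (only the variations from the application's defaults are stored in the document) can be illustrated with a minimal Python sketch. The property names and default values are hypothetical, chosen only for illustration; they are not taken from any Word specification.

```python
from collections import ChainMap

# Hypothetical application-level defaults (NOT real Word defaults).
APP_DEFAULTS = {"font": "Times New Roman", "size_pt": 10, "bold": False}


def effective_properties(stored_overrides):
    """Overlay the overrides stored in a document on the application defaults.

    Keys found in stored_overrides win; everything else falls back to
    APP_DEFAULTS, which a converter would have to rebuild on its own.
    """
    return dict(ChainMap(stored_overrides, APP_DEFAULTS))


# A document that only records "this run is bold".
props = effective_properties({"bold": True})
print(props)
```

The consequence for a converter is that the file alone is not enough: the defaults table has to be reconstructed by whoever writes the reader.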
There is a lot of confusion here about whether or not the .DOC format has been documented, because there are two layers to the file format. First, there is the Word document format itself, which Microsoft has published in some MSDN CD versions. It is also available from places like www.wotsit.org. This specification is inaccurate in places but close enough to make Word document conversion possible. Caolan McNamara has a very good start on a Word-to-HTML converter at www.wvware.com. The Word document format changed in the transition from Word 6 to Word 97, and is the same in Word 2000.
However, Word documents since version 6 are wrapped in OLE Compound Documents, which Microsoft also uses for .XLS files. The Compound Document format is not officially documented anywhere in Microsoft documentation, as far as I can tell. (But see below for a patent that might disclose this structure...) The MSDN library samples invariably use Windows system calls to access data in Compound Documents, and reveal nothing about the file format.
There have been some efforts to reverse-engineer this format:
http://arturo.directmail.org/filtersweb/ and
http://snake.cs.tu-berlin.de:8081/~schwartz/pmh/guide.html
A Compound Document contains a tree structure of data streams, which seems like a simple enough structure but it is implemented using a very complex file format. The lack of complete documentation of this format is a major impediment to development of robust open-source code that will access the Microsoft Office file formats.
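As a small illustration of the outer layer described above, here is a minimal Python sketch that checks whether a file starts with the 8-byte signature found at the beginning of OLE Compound Document files. It only identifies the wrapper; it does not parse the tree of streams, and the function name is made up for this example.

```python
# 8-byte signature at the start of an OLE Compound Document file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_compound_document(data: bytes) -> bool:
    """Return True if data begins with the Compound Document signature."""
    return data[:8] == OLE_MAGIC


def file_is_compound_document(path: str) -> bool:
    """Read the first 8 bytes of a file and test them against the signature."""
    with open(path, "rb") as handle:
        return looks_like_compound_document(handle.read(8))
```

A file that passes this check may be a .DOC, an .XLS, or any other Compound Document; telling them apart requires walking the stream tree, which is the hard part the post describes.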
A second potential impediment is a nest of patents that Microsoft has built around the Compound Document format. These are just a few:
US5467472: Method and system for generating and maintaining property sets with unique format identifiers
US5715441: Method and system for storing and accessing data in a compound document using object linking
US5506983: Method and system for transactioning of modifications to a tree structured file
US5706504: Method and system for storing data objects using a small object data stream
There are a fair number of patents (IBM seems to have some possibly related ones as well). You can find them here: http://patent.womplex.ibm.com/home. A search for "((compound document) and microsoft)" lists 24 patents. It would not be surprising if a serious effort to provide open-source access to Microsoft Office documents ran into legal threats because of these patents.
Interestingly, the last one looks like it might disclose the Compound Document format, which Microsoft would have to disclose to satisfy the patent office. The description looks right, but the diagrams do not seem to be available from the IBM site. Looks like I'll have to dig some more -- anyone know how to get the full text and images for U.S. Patent 5,706,504?
I don't want a word processor that spits out files that are readable only with extraordinary effort. I do want a word processor that will produce files in a standard format readable on any machine. The much maligned EMACS is in this respect far superior to MS-Word. I don't believe MS would have any problem putting out a far superior product, but they refuse. They would rather keep using an ever-changing format whose only "virtue" is that it is NOT readable to anyone on a different platform. This is "progress?" This is "innovation?"
It is not, but the only way to turn Microsoft's admittedly great collective talent to doing things right is for the users of their software to send them a collective and loud message that we will not put up with this crap anymore.
Will this happen? I don't know. I'm not betting on it. I'm learning to use Linux. I'm about 90% there. Once I learn ITCL and figure out how to set up a linux box as a usable multimedia platform I'll be all the way there. Anyone want to help?
Microsoft developed an open specification for sharing of word processing documents years ago.
It is called Rich Text Format(aka RTF), the specifications are available from MSDN.
It is not uncommon for application vendors to support open specifications for data sharing, but not provide the details of their internal file formats. There are a lot of reasons for this, and not all of them are "evil".
For instance, searching the Corel website I can't seem to find the file format for WordPerfect.
Obviously Corel must be an evil corporation. Oh wait, they can't be, they're going bankrupt and supporting Linux.
The very nature of a component object model makes transferring documents across different platforms, even different computers on the SAME platform, aggravating at times.
You just HAVE to have some component that can interpret a stream available to completely decode a document. This is true of ANY component model. You want to see how difficult decoding a compound document is? Try grabbing the dead OpenDoc spec and look at their Bento container. Its design goal is exactly like *.doc. And that was designed from the get-go to be cross platform.
Think of it as component hell. And it is unavoidable no matter who does it. This goes for KOffice as well. Complexity is a run away train. I should say entropy. Since we're tending towards chaos here.
And while you're at it, take a crack at the Visio file format! (Which just recently got swallowed up by Microsoft, damn it all.)
Reverse engineering a format tailor made for a specific application to a point where even upgrades to said application break format compatibility may be futile or a waste of resources.
What is needed, IMHO, is an education campaign to use RTF for file exchange. If a few big corporations adopt a policy of only accepting RTF files for communications and only generating RTF files for communications, then it may start propagating in the corporate world. The only reason I think this is realistic is because there is clear financial incentive to do so for everyone except MS itself.
This is true. Pull in a TeX document which contains a figure which is in a format your 'puter doesn't understand, you'll see blank space. Same thing with a .doc file in the same situation, right?
Explain to me just how the proprietary .doc format is superior to the open TeX format then. This obviously is not it, because they both react the same to this situation.
Right you are, sir! In today's "free" market, there are a slew of businesses which wield monopoly power, but which don't want you to know about it. Consider:
Cisco Systems has a market value comparable to Microsoft's, and has even exceeded it at times, by maintaining a total stranglehold on the network hardware market. Although they would have us believe that Cisco's strategy is "providing a reliable, top-quality product and good support," a number of internal memos have recently been leaked indicating that Cisco plans to start including support for the "upgraded" IPv6 "extension," putting them in a position to use the "embrace and extend" strategy to leverage their large market share into an almost total monopoly on the Internet's physical infrastructure.
The Lego corporation has a long history of introducing new block designs which render the old blocks almost totally useless from an aesthetic perspective. "I spent all my lawn-mowing money on the medieval set," said a sniffling little boy who asked not to be identified, "but then the Technics came out, and all my spears and stuff wouldn't fit anywhere on the walking robot I built unless I mixed those brown spear-holder blocks in, and then my robot looks yucky." He also pointed out, as is well known, that Lego has broken Technics color-compatibility with their new Mindstorm upgrade, by switching red dye #5 for #8, and yellow #2 for #7. Alas, the legal hassles that await anyone foolish enough to reverse-engineer Lego's proprietary block-connection protocols have ensured that Lego has reigned unchallenged as the only source for toys you can build cool shit with, despite their inferior product. The "accidental" death of Abe Fromage and the subsequent collapse of Tinkertoys spelt the end of competition, even before Lego started blatantly cloning "CPU" and "robotics" technology from the computer industry for use in their "innovative" Mindstorm toys.
Furthermore, Red Lobster, Denny's, and other chain/corporation/restaurant/franchise establishments regularly use unconscionable terms in the dining agreements they make with their patrons. As large corporations, they play from a position of strength: With their high-priced lawyers and large bankrolls, they can freely impose their will on the consumer (commonly by the use of so-called "walk-through" agreements: the restaurant posts its dining agreement on its wall, and you are considered to have "agreed" simply by choosing to dine there, regardless of whether you have read or even noticed the sign). Examples of this include:
It is sad, but the powermongering megacorporations who really run our country also have merciless teams of wedgie-men and noogie-goons at their command, and they have bamboozled the media and the government into abusing Microsoft to benefit their own bottom line. What with communistic government interference, backlash from the misinformed public, and the software piracy that is rampant in today's industry, Microsoft can barely stay afloat, let alone research more of the innovative, professionally engineered products the software community has come to expect from them, like Microsoft Bob, the dancing Office paper clip, and email clients that do it all at the click of a mouse! Yay Microsoft! Go Bill! One world, one web, one program!
There are other formats, such as .XLS and .PUB which also lock people into using Microsoft stuff.
--
$x='S24;r)>63/* h@<5+oZ)32"5cz';$me='phroggy'x$];
$x=~y+ -xz+\0-Tx+;print$_^chop$me for split'',$x;
Not that I like .doc format or anything but the principle of the .doc format is the same as any component object container.
So, what makes .doc better is also what makes the KOffice format better or any other component storage format. If memory serves me, Tex is more closely related to Postscript than a document format. Sure it can create Postscript output, but it can just as easily drive a typesetting machine directly. The Tex engine is hardcoded to understand certain formats of resources like fonts or graphics. I think there are macros that get around this, but I'm not sure. (It's been a while so if this isn't true anymore, then I stand corrected.)
A compound file and the program that interprets the streams is generic. It doesn't know anything about the streams in the file. It defers that knowledge to the reader/writer, which may have a better way of storing its document than the generic format. The gist is that new formats can be added without changing any of the underlying mechanisms.
Functionally, there probably isn't any real difference between a Tex file and a compound doc file. I just think that with a compound file layout, things that you have to explicitly state in Tex are handled by the low level machinery automatically. Sort of like comparing C to C++. Example: because of the class of stream in a COM or OpenDoc file, your application menu changes to allow you to edit the stream.
It boils down to the fact that no matter what format you have, without the necessary renderers on your machine, you're out of luck. (As you indicated.) Tex is cool because documents are specified using Tex's macros. But it forces your application to think in frames and columns and runs when a tree or a graph might be better.
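The generic-container idea above, where the container knows nothing about its streams and defers to whatever reader is registered for each stream type, might be sketched like this in Python. This is not COM, OLE, or OpenDoc code; the registry, the stream type names, and the placeholder text are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: stream type name -> function that decodes that stream.
HANDLERS: Dict[str, Callable[[bytes], str]] = {}


def register(stream_type: str):
    """Decorator that adds a reader for one stream type to the registry."""
    def decorator(func: Callable[[bytes], str]):
        HANDLERS[stream_type] = func
        return func
    return decorator


@register("text")
def read_text(payload: bytes) -> str:
    return payload.decode("utf-8")


def render(streams: List[Tuple[str, bytes]]) -> List[str]:
    """Render each (type, payload) stream, or a placeholder if no reader exists."""
    output = []
    for stream_type, payload in streams:
        handler = HANDLERS.get(stream_type)
        if handler is None:
            # The "out of luck" case: the container is readable, the content is not.
            output.append(f"[no reader for {stream_type}]")
        else:
            output.append(handler(payload))
    return output


result = render([("text", b"hello"), ("chart", b"\x00\x01")])
print(result)
```

Adding a new stream type means registering one more reader; `render` itself never changes, which is the point made about the underlying mechanism staying the same.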
But you can't save it in another format, can you?
Nor can you edit it. All you can do is view, and only if you own windows. So once somebody has done their work in Office the only choice is to either a)commit to the scarily expensive and frighteningly short upgrade cycle and be doomed forever to keep draining out cash or b) do that work over in another program.
Note dumb-ass m$ defenders that none of us would give a shit if they wouldn't keep changing the format and forcing upgrades. But evidently bill needs more money than he already has so he can buy another house or something.
Ever get the impression that your life would make a good sitcom?
Ever follow this to its logical conclusion: that your life is a sitcom?
"I don't care about the Constitution!" --Bill O'Reilly, November 17, 2009
That pretty much sums it up, and from a Microsoft VP, no less. You can pretend that Microsoft is a benevolent company all you want, but that doesn't change the facts.
The wheel is turning but the hamster is dead.
I worked on Bento. I was not the designer. Jed Harris was the designer (Ira Reuben the coder). Jed said Bento was an experimental first cut prototype that was pushed into production, and I agree with this view.
The design goal was only rather similar to *.doc. Unfortunately, since Bento was a version one prototype, it never had a redesign for ease in reading and writing until I designed one.
Gat1024: And that was designed from the get-go to be cross platform.
It was technically cross platform, but Bento was very unfriendly as a clearly understandable format. Its big mistake was to use physical stream embedding instead of logical embedding, so the recursive flow of control was a nightmare to analyze. The format had physically discontiguous streams embedded inside other physically discontiguous streams, which would give almost anyone the shudders.
Gat1024: Think of it as component hell. And it is unavoidable no matter who does it. This goes for KOffice as well. Complexity is a run away train. I should say entropy. Since we're tending towards chaos here.
You are correct that every open format can embed opaque content that cannot be understood, so all component systems suffer from the risk of component hell.
I would not accept any amount of money to reverse engineer the Office doc format as a regular job, because it would tend to be too hard and frustrating to deal with the complexity under ongoing changes.
Furthermore, I would not trust any junior engineer who did accept such a job, so I would avoid the product based on such work, under the theory it would be fragile and buggy. Am I a pessimist, or what?
David McCusker, former Bento guy
Values have meaning only against the context of a set of relationships.
That ruled. I would mod you up but I have no points at the moment...
For that matter, there's also text format! That's perfectly well documented and 100% open!
Truth be told, RTF is only about half a step up from text. In fact I prefer to send and receive text because I can deal with it in emacs or ae. That, and when I tried d/ling abiword its rtf filter wouldn't work.
I really think this was my entire point. They both do the same thing. The only difference is that one (TeX) is an open format dating to the early 80s, while the other (.doc) is a proprietary format that changes every 2-3 years. They both do the same thing, so what possible justification could there be for using the second? Assuming, for a moment, that M$ is, as they claim, concerned with producing real benefits to their customers, I don't see any point to .doc. Do you? If so, please explain it.
in it, even if it's just blank. And blank lines are always double height because a carriage return becomes
 
instead of just. The placement of their tags is just bizarre!
I usually just use that filter only for long documents in order to not have to retype all the text (they hand us things in
MS already does this, with HTML^H^H^H^Hrubbish that gets spat out from Word2k, when you do a "save as html". It's rather frightening, actually, to see the actual code.
Well. You can always use Catdoc to view doc files in console. Just catdoc and it outputs it in plaintext.
Just making a note of a neat little utility I found.
Cheers,
SlapAyoda.
# SlapAyoda
# SlapAyoda@yahoo.com
# wrote sig.txt, 23 lines, 31337 chars
Does the format provide any real benefit to the customer? I don't know. The format seems to be geared more towards programmers (MS's) than customers. Which (in an ideal world) would lead to more efficiently designed applications.
I do know that the format is a red herring. Look at how MS is so excited about XML. It's all the components that need to be duplicated. An Office2000 document in TeX would be no more translatable than the .doc one it produces right now.
If it's compatibility we are looking for here, why would we expect MS to do it?
Because their customers expect them to make decisions that make their software better for the user, particularly when those decisions would come at little or no (or negative, in the case of maintaining a consistent document format) cost to Microsoft. The fact that Microsoft repeatedly changed the Word format costs them and their competitors money for additional programming work on filter and import/export code, and costs their users money for repeated unnecessary upgrades and incompatibility hassles with other programs. Looking at the Microsoft+competitors+users system as a whole, there is no benefit to anyone for Microsoft to use a poorly documented, convoluted format without an accurate public specification.
However, looking at MS, competitors, and users independently, it's obvious that while the value of the system as a whole is reduced by Microsoft's decisions, the handicap that it gives to competitors and the additional revenues it generates from users causes more of that value to end up as cash in Microsoft's hands.
This isn't the way a free market is supposed to work. If someone makes an inferior product, I'm supposed to be able to switch to a different producer and not be adversely impacted by said product (and as a side effect, my readily available choices encourage all producers not to produce inferior products). Unfortunately, when you add network effects, i.e. the requirement that my new product be compatible with the old, suddenly Microsoft has the ability to use an existing large market share as its own "benefit", to make it self-sustaining, to reduce or eliminate that choice.
I'm not saying that, after thinking about it, it doesn't make sense for Microsoft to do just that. I'm just saying that, to consumers used to having a wide selection of companies competing solely based on price and quality for their purchasing dollars, it certainly counts as "unexpected".
I don't believe this is quite true. If Office2000 was putting out TeX documents then any problems in translating them would be perfectly opaque. Programmer time could be allocated most effectively in that case, to real problem points, rather than to red-herrings.
This was a bug in Word 97 RTF handling, which has since been fixed.
--
JADBP
Why is the reverse engineering of DWG any different from the DeCSS? What are the legal ramifications?
... if you combine it with CSS.
All the replies here seem to say that not only do all browsers render HTML differently, but that's how it should be. However, that's not the case when CSS accompanies a document - in fact, it's just the opposite. CSS performs all the page-layout and style description that HTML wasn't meant to do. Also, there are specific standards for how HTML+CSS is meant to be graphically rendered; check the various test suites available at http://www.mozilla.org/newlayout/testcases/css/. (Yes, I know that different browsers have different levels of CSS conformance, but that's to do with buggy and incomplete implementations, and nothing to do with lack of clarity or general unsuitability of the standard. There's only one released browser that has full CSS1 apparently, and that's IE for Mac.)
Please bear this in mind when reading the other posts in this thread regarding graphic designers, especially those that suggest that we need a completely new format. We don't. We just need proper implementation of an existing one.
What's the difference between a mercenary and a prostitute?
The reason for this is probably pretty simple: no one has ported the .doc format entirely because it is fundamentally a DOS API. Unless you have the core code for Word, it would be theoretically impossible. Also, since they practically OWN the software industry (GNU excluded, of course), they can boss everyone around and assimilate the competition. If the tech industry was Star Trek, M$ would of course be the Borg. Windoze software boxes would be their cubes.
-----------------------------------------
Perversely greped and groped by PowerPenguin
I've been using Corel Office 2k at home and at work exclusively since it first came out. And I have not yet had one problem with its conversions. In fact, there are a number of errors that can occur that make the file unreadable in MS Office, but it still opens perfectly in Corel. So while the 2 might not offer identical renderings, I think Corel does if anything a better job at displaying .doc than M$ does.
-- Braeus Sabaco
Member of the Roman Legion
Customer/worker at Phenomenal Internet Solutions
This is SO educational! -- Kintaro Oe
Two steps:
1. Your .sig should say something to the effect of: "All files ending in .doc are deleted from my account before opening due to a. it being a proprietary file format which my machine does not understand, and b. the likelihood of viruses embedded in them. Take your pick. Please use .rtf when sending your data."
2. Filter out any emails with an attachment named *.doc. Hey, no more .docs! The "I'm ignoring you!" tactic. Children have used it for years. Because it works ;)
3. (I can't count) With access to a prompt, simply write a script to email your favorite .doc junky once a week asking them to post in another format (for instance, all those parts manuals that are in .doc: email the company). Be sure and be very polite, and be sure and check to make sure when they post in RTF or your favorite non-proprietary format. The idea is to rattle the can, not to ring a gong.
These are very simple pro-active steps that you can take that will help change the digital world.
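The suggestion to filter out any emails with an attachment named *.doc could be sketched with Python's standard `email` package. This is only one possible way to do the check, assuming the message has already been parsed; the function name and the demo messages are made up for illustration, and what you do with a match (bounce, delete, auto-reply) is up to you.

```python
from email.message import EmailMessage, Message


def has_doc_attachment(msg: Message) -> bool:
    """Return True if any part of the message carries a filename ending in .doc."""
    for part in msg.walk():
        filename = part.get_filename()
        if filename and filename.lower().endswith(".doc"):
            return True
    return False


# Demo messages, built in memory instead of read from a mailbox.
demo = EmailMessage()
demo.set_content("See attached.")
demo.add_attachment(
    b"placeholder bytes",  # not a real Word file, just something to attach
    maintype="application",
    subtype="msword",
    filename="Report.DOC",
)

plain = EmailMessage()
plain.set_content("No attachment here.")

print(has_doc_attachment(demo), has_doc_attachment(plain))
```

The comparison is case-insensitive because attachment names often arrive in upper case.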
Also, make sure you change email addresses once in a while on that script, so they don't just filter you. Spam is a tool for good, too ;) Just don't be obnoxious, or you won't get what you want.
Drop me a line at:
Key ID: 0x54D1D809
I must have missed something... StarOffice 5.2 beta PERFECTLY converts from almost any format known to mankind. You can open a M$ office 2000 document without any loss of formatting at all. There is only a slight loss of formatting when going from StarOffice formats to M$ formats, and that's only in the extremely advanced features that M$ office doesn't have (like rotated text, transition effects that powerpoint doesn't have) So if I missed something, let me know! But as far as i can tell, this case was closed before it opened.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
The problem isn't the format. It's the components. How a .doc file is created and read is pretty well documented and understood. The problem is that all the streams or virtual files inside of it are created by components that change from version to version.
Even in TeX you have to hunt down the correct macro includes (a "component" in the TeX sense) if they're not installed inside the document. 'Twould be a lot of duplicated macros and wasted space if every document carried with it the requisite macro libraries that it needed to draw itself. That's why TeX has a bunch of standard macros that you don't have to send along with the document.
So let's say you only embed the macros that the document uses -- not every single one in a library. What do you have? A document that is renderable but not 100% editable, because the macros that support some additional features are not present.
As soon as you start relying on external libraries (macros for TeX, components for COM or whatever) you run into component hell. The problem isn't figuring out the file, it's providing the required library/component functionality that a document needs. Each component has its own operating context that itself depends on the document context.
All of this is why you get blank glyphs in TeX. And why translators can only get 85% there for .doc files.
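To make the "container is understood, contents are the problem" point concrete, here is a rough Python sketch that decodes a few fields from the 512-byte header of an OLE2 compound file, the container that holds those streams. The field offsets are my reading of Microsoft's compound file documentation, so treat them as assumptions, and `make_sample_header` is a made-up helper that fabricates a header for demonstration only.

```python
import struct

# Magic bytes that open every OLE2 compound file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfb_header(header: bytes) -> dict:
    """Decode a few fixed-position fields from the start of a compound file."""
    if len(header) < 76:
        raise ValueError("need at least the first 76 bytes of the file")
    if header[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE2 compound file")
    # Offsets 24..33: minor version, major version, byte order, sector shifts.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24)
    # Offsets 44 and 48: number of FAT sectors, first directory sector.
    num_fat_sectors, first_dir_sector = struct.unpack_from("<2I", header, 44)
    # Offset 56: streams smaller than this live in the "mini stream".
    (mini_cutoff,) = struct.unpack_from("<I", header, 56)
    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
        "mini_stream_cutoff": mini_cutoff,
    }


def make_sample_header() -> bytes:
    """Fabricate a plausible header so the parser can be tried without a file."""
    header = bytearray(512)
    header[:8] = CFB_SIGNATURE
    struct.pack_into("<5H", header, 24, 0x003E, 3, 0xFFFE, 9, 6)
    struct.pack_into("<2I", header, 44, 1, 0)
    struct.pack_into("<I", header, 56, 4096)
    return bytes(header)


if __name__ == "__main__":
    print(parse_cfb_header(make_sample_header()))
```

Getting this far tells you where the directory and the streams are. It tells you nothing about what each component wrote into its stream, which is the part that changes from version to version.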
"Why would a company with the smartest people in the world make life more difficult on themselves by making their own formats hard to read?"
Why do you think Microsoft has the smartest people in the world? There is no convincing evidence of that. In fact they have spent many years and have not been able to produce a stable operating system.
They have, however, marketed the hell out of what they have managed.
Compared to places like IBM's Thomas J. Watson Labs or Lucent, Microsoft has done no significant research and produced nothing original. They have been very good at seeing computing trends and buying companies that have products that do what they want.
Really, I suspect you know nothing of recent computing history.
shit, micros~1 can't even get new versions of Word to read old doc files correctly. micros~2's own filters seem to fall off exponentially with time. I swear they don't even test those things.
Conversion % = exp(-dt)
A bit of background:
Word is a COM object and uses COM extensively. OLE was at the roots of COM, but these days OLE is just another set of COM objects that one either implements or uses. So, from the get-go, one needs to implement COM, and also IStorage/IStream, on Linux, to get at Word. This would be OK if COM were an open standard, but it ain't. Where in MSDN is the VTABLE format for COM? It isn't there.
Strike 1 against Microsoft. By strike I mean that they are doing the usual evil empire thing by not opening up COM.
Philosophically, IStorage / IStream are a set of COM objects (read: libraries) for divvying up a file into its own directory mechanism. The rationale for doing this is that end users want to copy documents as entire entities, and not deal with 200 or even 2000 subdirectories or small files that might comprise a total document. In Microsoft speak, a document must be a moveable entity, and in that regard, COM library based documents are entirely defensible. However, what goes into each of those subdirectory entries, or streams, is free to remain largely undocumented. It is the design intent of COM to ensure interoperability between closed interfaces. At this, COM does stunningly well. You can script against COM in any language... but the medium for interchange is an application that you must always have in order to view the document.
Strike 2 against Microsoft. COM IS an excellent piece of software engineering, but it is engineered to do the hypocritical thing. The easiest way to make things interoperable is to post the source...
Much ado has been made about Word changing file formats. The critics of Word say that it is unnecessary to change file formats between releases. This is nonsensical hogwash. New features mean new data requirements, and new data requirements mean new file formats. Every other application on the planet has versions of formats and downward compatibility problems. Have you tried looking at a style sheet page in Netscape 2.0? That Word changes file formats is reasonable.
Hit: Microsoft.
Some criticism has been made about how a Word document changes appearance based on the display or print device. This is in keeping with the philosophy of Windows - which is to enable software features only if the hardware is present to support them. This is radically different from Unix, but this hardware-centric approach of Windows IS defensible on many merits.
Hit: Microsoft.
Word has, in effect, an autoexec scripting mechanism with no sandbox and no security besides that which the user security context of the OS offers. Since Windows 98 effectively runs everyone as root, the vast majority of Windows Word users are flying blind into a cliff.
Strike Three: Microsoft.
The bottom line is this. The .doc format and the entire idea of files within files has a lot of merit, as does the concept of only dealing with content supported by one's hardware. However, given the lack of openness of the Word file formats and of COM, and the lack of security, Microsoft strikes out.
The bottom line is this:
If Microsoft had opened the Word file format, then Word files would have been the de facto web page of the Internet, not HTML. That we are doing HTML and HTML rendering engines is testimony to how badly Microsoft missed a golden opportunity with Word. To protect their word processing IP, they made sure a non-Word file format (HTML) would become the lingua franca of the Internet. That by itself is a compelling argument in favor of open file formats.
This is my sig.
As for a UNIX version of Windows, see WINE:
http://www.winehq.com/
My Web Page
You really didn't answer my one question, which is where is WordPerfect's file format located on Corel's website?
Similarly is the Lotus WordPro file format located on the IBM website?
I guess both Corel and Lotus are using monopolistic practices to...
Oh wait a minute, you really don't know what you're talking about do you?
Personally I've never liked Office, but I don't think my like or dislike for a product should influence this discussion. Unfortunately your dislike for a product has blinded you to the reality of the industry.
Oh, and I don't know what difference restrictions will make. Even if they were to survive the appellate court.
What do you mean by "opaque"?
Ben "You have your mind on computers, it seems."
By now we know that both the DWG and DOC format have been reverse engineered. We also know that it really does not matter. Autodesk/MS control the data formats. Their rendering of the data is the reference implementation -- and they both change the format at will. They both exploit run-time and new version peculiarities in their rendering of the data.
When it comes time for a company to decide which product to invest in, when it's time to choose if they want to use the proprietary product or some wannabe cheap-o competitor, the answer is always the same. Go with the standard bearer. And that really is the correct answer. The price differential is completely and totally irrelevant. Corporations invest a lot more in labor and data than they invest in any one version of a software product. The "open source" factor is -- if not irrelevant -- not appreciated. It is secondary at best.
Look at IntelliCAD. They attempted to commoditize R12 AutoCAD. Supposedly nobody wanted any of the features crammed into post-R12, post-multiplatform AutoCAD. R13 was a bitter pill for AutoCAD customers and loyalists. Supposedly IntelliCAD would allow drafter/designers to draw basic 2D engineering drawings just as well as R13++ for half the price. More importantly, they thought they had given companies that had huge investments in DWG data a viable alternative -- a way out. They could jump from the ship they were supposedly dissatisfied with and seek alternatives.
But you know what? Nobody took the offer.
Not before IntelliCAD was "open source" and not after.
It turns out that Autodesk was able to pull off R14 and salvage their reputation. Turns out customers were not all that dissatisfied with Autodesk -- which they correctly saw as a well entrenched, healthy (==rich) partner, committed to investing in both AutoCAD and other forward looking design products and technologies. Turns out AutoCAD is very capable of getting the drafting job done. Besides, IntelliCAD was for shit. Still is. And when Visio sacked the original IntelliCAD development team -- a very idealistic and motivated group -- because ICAD was released prematurely with bugs and feature gaps, any idealism or customer loyalty went out the window. ICAD was exposed for what it had become -- a cheap knock-off with no future. The so-called open sourcing of IntelliCAD was just window dressing. The fact was that Visio had interred its mistake in preparation for acquisition by MS. (It also parted ways with the folks that had inspired IntelliCAD, FWIW.)
So what does this have to do with .DOC?
You could come out with a .DOC compatible word processor without a super-human effort. But without the VBA, without the quirky rendering, without all the nuances and endless litany of features of Word it would be nothing more than a knock-off. It would have to beat Word on functional terms in order to be attractive. That would be a very tall order. Like it or not, Word and AutoCAD are very mature products. Maybe they attempt to do too much. Maybe they are bloated with features that any one customer does not want or need. But a whole lot of customers are well served by these products. They get the job done for a broad spectrum of customers.
They are both going to be very, very hard to dislodge.
It's their game to lose.
Beating them on the merits will be damned hard, and possibly not enough.
And, just to goad anyone still reading, being "open source" or not has nothing to do with it.
If open source is a strategic advantage, it will have to do with stamina and longevity. Eventually MS/Autodesk will find it hard to keep milking their cash cows. Eventually they will find it harder and harder to justify continued investment in these products. Eventually the WinX platforms both products are married to will fade. At that point, when Word and AutoCAD stagnate, they may be vulnerable to an open source community that can run endlessly on no cash, that can build bridges to newer, more current technologies.
I'm not holding my breath.
In fact, I've changed jobs to get out of the CAD industry. The action is elsewhere. I may not live long enough to see AutoCAD take a fall. It may never happen.
PS: In the CAD space, the most interesting open source activity is not IntelliCAD. The Matra folks have a more interesting offering. IntelliCAD is a corpse. OpenDWG may prove useful if and when the action moves beyond AutoCAD. If that future is to involve open source, it will more likely be centered on Matra than OpenDWG.
"it's obvious that Word just throws odd formatting codes where ever it pleases, and never bothers to clean them up"
That's almost just what it does do :-) I remember reading an article in a mag a while back where an anonymous MS code slave explained the basics of how .DOCs are created. This is vastly simplified and from a failing memory, so no flames please where I screw up :-)
From memory, he said that .doc is sort-of a diff file. You start with an (almost) empty file, into which Word inserts your text and formatting as you type and click. So far, so good, but when you go back to change things, it doesn't do the obvious thing and change it in the file, it actually appends your changes onto the end of the file.
When you reload the file, it sort-of starts from the beginning again, and applies your changes in the order they occurred. That's why you find the format commands all over the place: the file holds location details for the target and the action to be applied. This is also why old, frequently edited documents get so large (and so slow to load).
Try it. Create a document. Do lots of editing, make lots of changes. Save it. See how big it gets. Now use 'Save As'. Watch it shrink as Word goes through and tries to clean up the mess it's made.
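If the append-and-replay idea sounds abstract, here is a toy Python model of it. To be clear, this is NOT Word's actual on-disk layout, just an illustration of the behaviour described above: every edit is appended to a log, loading replays the log in order, and a "Save As" rewrites the file from the final text. All the names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Append-only edit log: a plain save appends, 'Save As' rewrites from scratch."""
    log: list = field(default_factory=list)

    def insert(self, pos: int, text: str) -> None:
        self.log.append(("insert", pos, text))

    def delete(self, pos: int, count: int) -> None:
        self.log.append(("delete", pos, count))

    def render(self) -> str:
        """Replay every logged change in the order it happened."""
        text = ""
        for op, pos, arg in self.log:
            if op == "insert":
                text = text[:pos] + arg + text[pos:]
            else:
                text = text[:pos] + text[pos + arg:]
        return text

    def stored_size(self) -> int:
        """Rough size: inserted text stays in the log even after it is deleted."""
        return sum(len(arg) if op == "insert" else 1 for op, _, arg in self.log)

    def save_as(self) -> "ToyDocument":
        """Write a fresh document that contains only the final text."""
        fresh = ToyDocument()
        fresh.insert(0, self.render())
        return fresh


if __name__ == "__main__":
    doc = ToyDocument()
    doc.insert(0, "Hello world")
    doc.delete(5, 6)
    doc.insert(5, ", Slashdot")
    compact = doc.save_as()
    print(doc.render(), doc.stored_size(), compact.stored_size())
```

The edited document and the 'Save As' copy render the same text, but the log of the edited one is bigger because the deleted words are still sitting in it.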
My favorite quote from the MSloth: "Word docs are mostly space". Sort of sums M$ up nicely, don't you think?
Of course, I could be wrong, it's a frequent occurrence...
John.
I myself developed a syntax directed editor in 1985 called ALICE -- see this page to download it for DOS or Linux -- which still 15 years later does more than Intellisense.
There are some MS innovations but this is also 20 year old stuff.
Has it been over a year since you last donated to the Electronic Frontier Foundation
I have noticed that all the pro MS posts always get moderated up pretty good. At least three, usually five. Something is going on. I think all the astroturfers are modding each other up.
War is necrophilia.
It is very difficult to have a setup with real time feedback as to changes you make to a TeX document.
Furthermore, it is even harder to write an editor for TeX that allows you to use the extensibility without the possibility of breaking the document (i.e. the TeX file the editor spouts needn't be able to fit through a TeX compiler)
The strict structure enforcement possible with XML together with the possibility of database backends makes for a far brighter future. (p.s. suppose you want to search through a bunch of TeX documents and extract all definitions, say. The ability to do this at all requires discipline from the document author, and so realistically limits itself to a single author set-up or a closely knit group. (SG/X)ML with DTD's doesn't suffer the same fate, and can still use a TeX formatter as the backend.)
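As a rough illustration of the "extract all definitions" case, here is a short Python sketch using the standard `xml.etree.ElementTree` module. The `definition` element and its `term` attribute are invented for this example; a real DTD would fix the actual names, which is exactly what makes the query reliable across authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <definition term="..."> is an assumed element name.
SAMPLE = """<article>
  <section title="Groups">
    <definition term="group">A set with an associative operation, an identity and inverses.</definition>
    <para>Some discussion.</para>
    <definition term="abelian group">A group whose operation is commutative.</definition>
  </section>
</article>"""


def extract_definitions(xml_text: str) -> list:
    """Return (term, text) pairs for every <definition> element, at any depth."""
    root = ET.fromstring(xml_text)
    return [(d.get("term"), (d.text or "").strip())
            for d in root.iter("definition")]


if __name__ == "__main__":
    for term, text in extract_definitions(SAMPLE):
        print(term, ":", text)
```

Doing the same over TeX sources means guessing which macro each author used for definitions, if they used one at all.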
I'll end with a quick point, very worthy of note.
John
John_Chalisque
Plus, the guys from NeXT (a software company) had a mild allergy to paying for and bundling software from other companies, and that hastened the termination of some of the preexisting bundling deals.
Good Idea.
But make them document the Win32 API as well, and make source licenses available for their products; it doesn't have to be free, but AVAILABLE, like in the Unix days.
Why the hell is the default formatting HTML for posting messages, what the hell is slashdot thinking?
Fear the government that fears your guns. Fear the government that fears your computers. Remove them from my email.
If they were to do that, then I'd like to see them put their money where their mouth is and open all their own proprietary formats first.
It's GPLed; granted, it needs work. So scoot onto the abiword mailing list and cvs down the latest version, get hacking on it and sort it out.
ole2 is fully sorted out with libole2, excel is being handled by gnumeric.
What is not handled by wv is not for lack of documentation or design, it's simply a matter of spending some time at it. Easy peasy. Info on the MSDN docs can be got from here. They can be gotten off the MSDN 1998 July cd, or you can get some of them from wotsit.org. I even wrote ivt2html for you to convert the office.ivt file into html. Like, what else do you need?
90% of all the hard work has been done. wv can parse fast and simple with no bother to it, which was a nightmare to do. It can construct the correct PAP (paragraph properties) and CHP (character properties) for a given run of text, and feed you the correct characters, charset and font, the TAP (table properties), graphic properties and a handle to graphics, the correct OLE handle for embedded objects, document properties, etc. There is an example html conversion program included for reference (wvHtml).
I put together libwmf to convert wmf files into something useful as well. There's a half done implementation of an Escher (the graphics for Office) importer floating around in there as well.
There's also an implementation of a Summary Stream displayer for all ole2 documents.
I even bust my ass and dragged together the right bunch of motivated people to help implement the decryption module for word 97, 95 and 6, and that was not fun at all, to say the least.
The hard work is done; if you want something improved you have a very very solid base to work from. Yes, the spec is confusing, yes, it's not a great format, yeah, it sort of moves over time, but in a fairly rational way that can be supported with some work. There are any number of equally crap formats with weak documentation supported in various tools.
There is just this false myth that the Microsoft formats are impenetrable and/or not available. Just download wv. Fair enough, there might be problem documents; if there are, just debug wv and get onto the abiword list and work it out with them. If something fails it can be fixed and improved, it's not a case of "ah well, it's a MS format, nothing can be done". If you truly want to handle Microsoft formats there are a number of people working on it that you can help.
So it's right there for the right bunch of motivated people to work on. C.
I sometimes write stuff
This is really a side issue. When you say wvware converts bold (style) to bold tags you're only looking at the html converter. Html conversion is always going to be poor because there isn't a one-to-one mapping to word features (which is no excuse for word 97 converting 'heading 1' to html font tags, and 'h1' to bold style!). However, wvware's real strength is the conversion to a neutral xml format which you can mess with to your heart's content. You're generally better off starting with the xml, then using a (XSL-T) stylesheet to get nice html out of it - and writing your own CSS.
BTW I've contributed code to wvware and there were, last time I looked, features of the spec which remain unimplemented (I was only doing optimisation patches so I don't know if the features really were undocumented - but Caolan had put in comments to the effect that he didn't know what some flags were for).
Frankly I don't care if wvware doesn't make the document look (in html or whatever) like the original, which a lot of people seem to want; its real job is to extract the data from that crazy format into one which mortals can use. If we can extract the style tags, then convert them to something sane in a.n.other tool, what more do you need?
This is the exact kind of attitude that should turn people away from MS. Why? Because it is Bill Gates's explicit goal (and he goes on TV to say this) that MS wants to bring computing to the masses.
Pretend you are him, and you want to achieve this goal. What means should you use? Closed file formats with lousy specifications? How does that bring computing to the masses when they are prevented from speaking to the Unix Priesthood?
If you, as a MS lackey and worshipper, believe that this is not MS's responsibility, then please go take it up with Bill, your prophet. He has stated publicly and many times that this is his goal. Remind him that MS's duty is to the stockholders and they should make as much money as possible. Please tell him that, and also tell him to STOP LYING to the American public.
"Now, why can't Corel, Lotus, Sun, etc. band together and reverse-engineer Microsoft's file formats properly?"
Because the formats suck...?
It's 10 PM. Do you know if you're un-American?
As someone who has spent much time trying to convert drawings between AutoCad and Microstation, I can tell you that there is not a single product on the market that comes close. None even attempt to guarantee they can convert the files.
Yes, the basics of the DWG files are well understood. But I guarantee you if you take a professional engineering AutoCad drawing with all the different layers and all the "smarts" attributed to the different elements, and you try to convert it to any format, you WILL lose most of the "smart" information stored in the DWG file. You will also probably have problems with the different layers and linewidths, colors, etc...
DWG is not a good example unless all you're trying to do with the .DOC file is read the plain text. However, if you wish to recreate the entire rich document with all annotations, footnotes, headers, footers, graphics, etc., you're going to be in for a major task. Microsoft even has difficulties converting large, rich Word files.
Quack
I think you are right. Completely in line with Microsoft's practices and ethics, or lack of same.
in short, by agreeing to "run" and "install" this application you are not permitted legally to reverse engineer it.
same thing applied to the "ellison challenge" at comdex a year or so ago: users of sql are not permitted to publish benchmarks without explicit permission from microsoft.
1. GRANT OF LICENSE. This EULA grants you the following rights:
Applications Software. You may install, use, access, display, run, or otherwise interact with
("RUN") one copy of the SOFTWARE PRODUCT, or any prior version for the same operating system,
on a single computer, workstation, terminal, handheld PC, pager, "smart phone," or other digital
electronic device ("COMPUTER"). The primary user of the COMPUTER on which the SOFTWARE PRODUCT
is installed may make a second copy for his or her exclusive use on a portable computer.
Limitations on Reverse Engineering, Decompilation, and Disassembly. You may not reverse
engineer, decompile, or disassemble the SOFTWARE PRODUCT, except and only to the extent that
such activity is expressly permitted by applicable law notwithstanding this limitation.
Separation of Components. The SOFTWARE PRODUCT is licensed as a single product. Its
component parts may not be separated for use on more than one COMPUTER.
- http://www.wvWare.com/, maybe the best open source Word converter? Formerly "mswordview", it's a library and a front-end app, which is currently AbiWord's converter.
- word2x
- AbiSource, a company producing an open source, cross platform, commercial office suite. Their motto was "SHOW ME THE SOURCE!!!", which we had to scream at the March 1999 Linuxworld Expo in order to get their t-shirt.
- Adobe FrameMaker for Linux -- Not sure if it does Office, but it's a commercial word processor!
- VistaSource / ApplixWare -- Cross platform, partially open source, complete office suite and integrated development environment in the form of either a local app, or as a Java-based thin client plus app server architecture. Compare to StarOffice. My experience has been that you can send an unconvertible Office document to Applix's closely-monitored community support mailing list, and they will attempt to modify Applixware's import filters around it, and send you a patch. How cool is that?
- Sun StarOffice. Very good as well. Complete office suite. StarOffice and Applixware are capable of replacing Microsoft Office for literally most people.
- Corel Wordperfect -- See also Corel's Linux distribution.
- KDE's KOffice -- Open source office suite.
- Freshmeat.net's index of office apps
Here is a list of how to buy books for tutoring you on how to use these products, including reviews and price comparisons, and free shipping from Buy.com. In order of my personal preference.-
- Special Edition Using StarOffice, replaces htt p://www.amazon.com/exec/obidos/ASIN/0789719932/re
f =sim_books/002-2291160-6260020. -
- http://www.us.buy.com/books/pr oduct.asp?sku=30400392 $14.99 ($1 less than amazon.com) Replaces: http://www.amazon.com/exec/obidos/ASIN/0672314126
/ ref=sim_books/002-2291160-6260020 -
Any others? Perhaps some that are embedded in the ton of entries in Freshmeat's office index? Let's hear some authors pipe up! Slashdot's html submitter seems to be busted, so try to fix the above urls by removing their spaces!
Fortunately, in the networking arena, customers value whether or not the products they are buying support open standards. That way they know that the switch they buy from Foo Inc. will be able to talk to switches from Bar Corp. in case Foo Inc. goes out of business.
This is not to say that there are not proprietary formats used in networking. There certainly are. However, proprietary functionality in the networking world usually comes in the form of additional features that are built on top of existing standards. If you have devices from Foo and Bar talking to each other, they just won't be able to use those extra features that Foo devices provide rather than causing the entire system to break.
In the case of word processors, we have a much different situation. None of the proprietary document formats are supersets of an open standard format. This means that in order to have absolute confidence that you will be able to read data saved by M$, you better have a M$ reader.
Network device providers of course have more of an incentive to follow standards because network admins can't take 20 minutes with every packet editing them by hand to convert from the Foo to the Bar format. ;)
there were a few that weren't. As for reverse engineering, I was referring not to a Linux implementation, but rather to a standalone implementation.
thanks
INSERT INTO comment VALUE('Doh!') WHERE user='you';
A lot of graphical designers are bad web designers, but that is only because they misunderstand the medium. Truly good overall designers will understand the medium they are presenting on and design something suitable for it.
...
In my opinion, design doesn't just include graphics as some people think, but also navigation and element placement on UIs and such. Careful design in these areas separates good designs from ones that are like "broken windows with nice curtains", so to speak.
Anyway, the point that I was trying to make in the original post is that HTML is unsuited for the "mainstream web of the future". You know, the one people use for e-commerce, news, sports, the TV and movie theatre of the future and all that. Geeks will still have HTML-based web sites, but the mainstream ones need something with a bit more kick and functionality built-in.
Plain text is boring and outdated. We need to look ahead a bit and formulate a new solution.
----- rL
Someone is doing something about free tools for the .doc file format, rather than whining, and he only gets a +2 (as of the time of this writing)? Have the moderators recently had a brain transplant operation, on the donor side?
Meept!
LILO boot: linux init=/usr/bin/emacs
This is a comment from our Product Manager for Filters, Joe Dunbar (joe@vistasource.com):
Applixware imports both MS Word formats: RTF (Rich Text Format), their easy-to-use ASCII format, and .doc, their native binary format (both DOS and Windows, all five versions from 2.0 - 2000). Both formats support all features, including embedded MS Office OLE objects, which our Applixware filters import. Applixware exports to RTF, but does not export to MS Word native .doc due to the complex proprietary format and our choice to go with Open Standards. Note RTF supports all data, layout and formatting information like the .doc binary format. Since there are no features to be gained by directly exporting to their proprietary .doc format, why do two when one will do?
Applix continues to work with Microsoft, who have made past versions of their file formats (or at least portions) available to us; however, we are still waiting for their latest 2000 formats, which they said would be available soon as of last January. While it would be helpful to have the latest Microsoft file formats, Applix is committed to open standards and shared file formats like HTML and XML which will be useful to all applications and users.
Joe Dunbar
VistaSource, Inc.,
subsidiary of Applix, Inc.