Linux Office Suites
Cowculator writes: "Sun Microsystems will release the beta version of StarOffice 6.0 in October, with the development version already available. This ZDNet article has some more details, including a link to the development version..." Other submitters sent in notes about Gobe Productive and Hancom Office 2.0, not to mention KOffice and the Gnome office applications. As far as I know all of these are lacking the single most important thing: a robust and complete set of import filters for Word, WordPerfect, Excel, PowerPoint, etc.
The lack of import filters is regrettable, but hardly surprising - as soon as they do work properly, Microsoft will make bloody sure to change the format again.
The other factor is that even if the Word/Excel/PowerPoint import is working, people act all surprised if their embedded Visio drawings, AutoCAD DXFs etc. don't work. It's pretty silly to expect them to work too, unless you've got some magical Linux version of AutoCAD (come home to Unix, AutoCAD!) or Visio installed. KDE's KParts framework is as capable as OLE on Windows (although I wish they hadn't dropped CORBA), but it can't embed applications that don't exist.
Export filters are pretty irrelevant for the majority of Word or Excel documents:
MS Word will silently load files saved as .html or .rtf anyway - even if you rename the extension to .doc. So the poor lusers don't even know it's not MS Word format...
Excel loads CSV fine, even CSV with embedded formulae in standard enough infix notation. Once again, this covers a large number of cases, although it's not as transparent as just renaming a .rtf to .doc, and "pretty graph" style applications need something more powerful.
PowerPoint is more problematic - although I've noticed that the flashier and more advanced the PowerPoint presentation, the less likely it is that it's saying anything useful. :-)
They've never seen any other mail clients, and don't understand why people outside the company can't read their HTML mail with embedded OLE objects and attached vCard files. I play games with them... they send me Rich Text email, I change it to plain text and send it back. Their client is set to send Rich Text by default, so it gets changed back. Then if I reply again, I change it back to plain text. They must wonder what the hell is going on.
Many people could get by just fine with an "alternative" office suite, if they didn't have to exchange files with the computer illiterate.
>>Further, the only really important Microsoft Office applications are Word, Excel and Access.
Actually I'd say, for most businesses this is the order: Outlook, Word, Excel, PowerPoint. For some (those without a decent IT staff), Access would come after Word. Do you wonder why people don't leave Outlook after numerous virus attacks? It's that useful, that's why.
DO NOT DISTURB THE SE
I for one would like to put aside the KDE & GNOME bias that pushes many to adopt this word processor or that.
Our fundamental problem to be solved is a lack of UNIVERSAL and fully functional MS-Office import *and* export filters. At this point, I would say it's the biggest problem Linux users must struggle with (emphasis on "users" here... the administrators must still struggle with Linux's crappy font management, etc).
RTF, HTML, and the "other" semi-formatted languages don't support popular features very well, such as tables and frames. Would YOU export your resume from a Linux app as HTML or RTF, and leave it to Office to render correctly? HR people are the most "clingy to Office types", and if your resume looks shitty - it's YOUR fault not theirs (world is not fair).
If your RTF resume looks bad in Office, *obviously* you are not a good candidate. You show little attention to detail if you allow your resume to overlap characters and corrupt text. I've seen Office mangle some RTF docs that look PERFECT elsewhere -- it's an anti-competitive feature of MS Office. RTF documents from Office re-import perfectly.
SO... to get to my point, we need good filters. The KDE Office and AbiWord folks should get together on the OpenOffice mailing list, and work to make sure the OpenOffice filters are exactly what they need. There's NO EXCUSE for not standardizing our I/O filters now.
As a great example of co-operation between KDE and GNOME applications, look at gPhoto. This started as a Gnome digital camera app, but the code became something better... a standard Linux API for cameras. Now there's a ton of KDE and Gnome apps, all of which run on top of gPhoto.
The fact that KDE and GNOME use fundamentally incompatible desktop libraries does not excuse these folks for not working together on EXTENSIONS to the desktop. We need more success stories like gPhoto, in areas like Printing, Font Management, pretty Wizards for Samba, etc.
I think about the lack of such examples in Linux, and the thought depresses me...
_Scott
When is Slashdot moderation going to favor less frequent "signal" posts, over "dozen posts a day" noise accounts?
Someone pleeeaasse set up a site dedicated to writing really _good_ MS Word 97+ serialization routines in ANSI C. I would, but I'm already sidetracked on a tangent of a subproject and the stack is just too high right now. This is not hard, folks. I know it sounds like a boring project but it's not!
Are you familiar with the principle of Recursive Composition (a.k.a. the Composite Pattern)? This is without a doubt my favorite programming construct. The key here is that you define an object that can be a child as well as potentially contain children itself. If you can uniformly parameterize the properties common to a set of these objects, you can use the principle of Recursive Composition to build a tree of these objects and then serialize it back using a preorder depth-first tree traversal.
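To make that concrete, here is a minimal sketch in C99 (it uses snprintf). The struct node type, the MAX_CHILDREN limit and the parenthesized output format are all made up for illustration; the only point is that one recursive function handles every node, whether or not it has children.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHILDREN 8   /* arbitrary limit, chosen for the sketch */

/* Hypothetical node type: a node has a label and may itself hold children. */
struct node {
    const char *label;
    struct node *children[MAX_CHILDREN];
    size_t nchildren;
};

/* Attach child to parent; returns 0 on success, -1 if the parent is full. */
static int node_add(struct node *parent, struct node *child)
{
    assert(parent != NULL && child != NULL);
    if (parent->nchildren >= MAX_CHILDREN)
        return -1;
    parent->children[parent->nchildren++] = child;
    return 0;
}

/* Preorder depth-first traversal: write the node itself, then each child.
 * Writes at dst + off and never past dst + len.
 * Returns the number of bytes written, or -1 if dst is too small. */
static int node_serialize(const struct node *n, char *dst, size_t off, size_t len)
{
    size_t start = off, i;
    int r;

    assert(n != NULL && dst != NULL && off <= len);
    r = snprintf(dst + off, len - off, "(%s", n->label);
    if (r < 0 || (size_t)r >= len - off)
        return -1;
    off += (size_t)r;
    for (i = 0; i < n->nchildren; i++) {
        r = node_serialize(n->children[i], dst, off, len);
        if (r < 0)
            return -1;
        off += (size_t)r;
    }
    if (off + 1 >= len)
        return -1;
    dst[off++] = ')';
    dst[off] = '\0';
    return (int)(off - start);
}
```

A root with children a and b, where b has a child c, serializes as (root(a)(b(c))).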
For example, a binary networking protocol might have a header, some parameters, and a data payload area. The header has an arbitrary block of security information, which in turn might have a DES encrypted key and an integer describing the length of the payload. So to encode this message using Recursive Composition, define a packet_t type that has the three sub-components, one of which is the arbitrary security block, which in turn has an encrypted DES block as a child component. See the tree? Now, if you can parameterize the temporal properties of these objects, you can delegate the responsibility of encoding certain areas of the network message to functions like enc_security_block(struct security_block *sb, char *dst, size_t off, size_t len), which would then call enc_des_key(struct des_key *dk, char *dst, size_t off, si ....
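A rough sketch of that delegation follows. The field layout (1 header byte, an 8-byte key, a 4-byte big-endian payload length, then the payload) is invented for the example, and I use unsigned char buffers and a size_t return value, which the signatures above do not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All types and layouts below are hypothetical, for illustration only. */
struct des_key { unsigned char bytes[8]; };

struct security_block {
    struct des_key key;          /* child component */
    unsigned long payload_len;   /* length of the payload that follows */
};

struct packet {
    unsigned char header;
    struct security_block sec;   /* child component */
    const unsigned char *payload;
};

/* Every encoder writes its own region at dst + off and returns the number
 * of bytes written, or 0 if the region does not fit within len. */
static size_t enc_des_key(const struct des_key *dk, unsigned char *dst,
                          size_t off, size_t len)
{
    assert(dk != NULL && dst != NULL);
    if (off > len || len - off < sizeof dk->bytes)
        return 0;
    memcpy(dst + off, dk->bytes, sizeof dk->bytes);
    return sizeof dk->bytes;
}

static size_t enc_security_block(const struct security_block *sb,
                                 unsigned char *dst, size_t off, size_t len)
{
    size_t n = enc_des_key(&sb->key, dst, off, len);   /* delegate to child */
    int i;

    if (n == 0 || len - off - n < 4)
        return 0;
    /* payload length as a 4-byte big-endian integer (low 32 bits only) */
    for (i = 0; i < 4; i++)
        dst[off + n + i] = (unsigned char)((sb->payload_len >> (8 * (3 - i))) & 0xFF);
    return n + 4;
}

static size_t enc_packet(const struct packet *p, unsigned char *dst,
                         size_t off, size_t len)
{
    size_t n;

    assert(p != NULL && dst != NULL);
    if (off >= len)
        return 0;
    dst[off] = p->header;
    n = enc_security_block(&p->sec, dst, off + 1, len);   /* delegate to child */
    if (n == 0 || len - off - 1 - n < p->sec.payload_len)
        return 0;
    memcpy(dst + off + 1 + n, p->payload, (size_t)p->sec.payload_len);
    return 1 + n + (size_t)p->sec.payload_len;
}
```

Note that enc_packet knows nothing about how a DES key is laid out; it only knows how many bytes its child consumed.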
The classic example of Recursive Composition is that of GUI components. You have an abstract object called, say, Component. Components can contain other components. Subtypes would be ButtonComponent, TextComponent, TableComponent, etc. These components might contain subcomponents as well (e.g. ButtonComponent might have a TextComponent for its label). See the tree again? Now, when it comes time to draw these components, you don't have one big block of spaghetti code that considers all of the different component types, but rather delegate that responsibility to a method of the component itself. This greatly reduces the complexity of the problem (actually making it feasible whereas it was not before). Again, we just have to parameterize *where* these components are to draw themselves, such as FrameComponent_draw(Window *win, int x, int y ...etc.
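In C you would do the "method of the component itself" part with a function pointer. This is only a sketch: struct canvas is a stand-in for a real window that just records what was drawn, and every name here (component_draw, draw_button, draw_text, canvas_note) is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 4   /* arbitrary limit, chosen for the sketch */

/* Stand-in for a real window: it only logs what was "drawn" and where. */
struct canvas { char log[256]; };

struct component {
    const char *name;
    int dx, dy;   /* position relative to the parent component */
    void (*draw_self)(const struct component *c, struct canvas *cv, int x, int y);
    struct component *kids[MAX_KIDS];
    size_t nkids;
};

/* Append one "what:name@x,y;" entry to the canvas log (truncates if full). */
static void canvas_note(struct canvas *cv, const char *what,
                        const char *name, int x, int y)
{
    size_t used = strlen(cv->log);
    snprintf(cv->log + used, sizeof cv->log - used, "%s:%s@%d,%d;", what, name, x, y);
}

/* Type-specific drawing lives with the type, not in one big switch. */
static void draw_button(const struct component *c, struct canvas *cv, int x, int y)
{
    canvas_note(cv, "button", c->name, x, y);
}

static void draw_text(const struct component *c, struct canvas *cv, int x, int y)
{
    canvas_note(cv, "text", c->name, x, y);
}

/* Generic traversal: draw this component, then let each child draw itself
 * at a position relative to this one. */
static void component_draw(const struct component *c, struct canvas *cv, int x, int y)
{
    size_t i;

    assert(c != NULL && cv != NULL && c->draw_self != NULL);
    x += c->dx;
    y += c->dy;
    c->draw_self(c, cv, x, y);
    for (i = 0; i < c->nkids; i++)
        component_draw(c->kids[i], cv, x, y);
}
```

Adding a TableComponent then means writing one more draw function; component_draw does not change.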
So what does this have to do with writing serialization and deserialization routines for Word documents? Microsoft Word's format (and just about every other sophisticated document format out there) is flattened by serializing an internal tree of nodes (like the GUI Components, and more so the network packet encoding, described above). The tree of nodes is no different from the trees used above to describe Recursive Composition. So by recursively delegating the responsibility of encoding/decoding a region of an MS Word document, you can parse it into a tree and then do a preorder DFS tree traversal to serialize it into any format, including .doc.
The hardest problem here by far is determining what the primitive types of the document are (e.g. like the security_block and the payload length integer in the network packet). If you don't know what the leaves of the tree look like, you cannot start to write a lexer. Find out everything you can about the format of each of Word's elements. There are several projects that claim to have decoded the format to a certain degree. These would be a great start. However, I have spoken to these guys and the problem is they are only interested in supporting their own product (the AbiWord and KOffice guys talked about a collaborative effort but got hung up on choosing libraries and language and other trite crap). A group independent of these organizations should be established so that the library is not tied to one product.
Once you have a good idea of the bits and bytes behind the layout of nodes in the format, you can write an (at first crude) lexer, or lexical analyzer. This is simply a piece of C that will break the format into tokens. It's simple in the respect that it doesn't have to worry about the logical layout of elements at all. It's only concerned with nibbling off the primitive elements (tokens) themselves. The interface might be as simple as init(char *filename), gettoken(struct lexer *lex).
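Something like the following, as a sketch. The record layout (one type byte, one length byte, then that many data bytes) is entirely made up and is NOT the real Word format; it also reads from a memory buffer instead of a filename to keep the example short, so lexer_init takes a buffer and a size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token types for an invented type/length/data record format. */
enum tok_type {
    TOK_EOF = 0,
    TOK_TEXT = 1,
    TOK_BOLD_ON = 2,
    TOK_BOLD_OFF = 3,
    TOK_ERROR = 255
};

struct token {
    enum tok_type type;
    const unsigned char *data;   /* points into the lexer's buffer */
    size_t len;
};

struct lexer {
    const unsigned char *buf;
    size_t size;
    size_t pos;
};

static void lexer_init(struct lexer *lex, const unsigned char *buf, size_t size)
{
    assert(lex != NULL && buf != NULL);
    lex->buf = buf;
    lex->size = size;
    lex->pos = 0;
}

/* Nibble off the next token. It knows nothing about nesting or layout.
 * After a TOK_ERROR the lexer is moved to the end, so the next call
 * returns TOK_EOF. */
static struct token gettoken(struct lexer *lex)
{
    struct token t = { TOK_EOF, NULL, 0 };
    size_t left, len;
    unsigned char type;

    assert(lex != NULL && lex->pos <= lex->size);
    left = lex->size - lex->pos;
    if (left == 0)
        return t;

    t.type = TOK_ERROR;
    if (left < 2) {                       /* truncated record header */
        lex->pos = lex->size;
        return t;
    }
    type = lex->buf[lex->pos];
    len = lex->buf[lex->pos + 1];
    if (len > left - 2 || type < TOK_TEXT || type > TOK_BOLD_OFF) {
        lex->pos = lex->size;             /* bad length or unknown type */
        return t;
    }
    t.type = (enum tok_type)type;
    t.data = lex->buf + lex->pos + 2;
    t.len = len;
    lex->pos += 2 + len;
    return t;
}
```

The caller just loops on gettoken until it sees TOK_EOF or TOK_ERROR; deciding whether a BOLD_OFF without a BOLD_ON is legal is the parser's job, not the lexer's.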
Now you have to write a parser. This is what bison/yacc is for. This is non-trivial, but there's a great book called _lex & yacc_ by John R. Levine that can describe how to write a yacc grammar in 200 lines that in conventional C would take several thousand lines, take twice as long, and still not work. Ahh, yacc grammars to me are like doughnuts to Homer Simpson.
Once you have a working lexer and parser (probably 1000 lines of code), you can start to build a tree. You need a tree structure. The W3C has written a specification for representing documents as a tree of nodes in memory called the Document Object Model (DOM). Mozilla uses the DOM. It's XML and HTML centric but it's really totally arbitrary. A DOM tree could easily be constructed by adding createNode, appendChild, etc. calls to the yacc parser. It just so happens that I have written a DOM implementation in ANSI C. It's called DOMC and it would be perfect for this task.
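For anyone who hasn't built one, the tree side needs very little. The sketch below is loosely modelled on the DOM's node idea, but the struct and the create_node/append_child/free_tree functions are hypothetical stand-ins written for this post, not DOMC's actual API and not the W3C interfaces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tree node, loosely in the spirit of a DOM Node. */
struct dnode {
    char *name;
    struct dnode *parent;
    struct dnode *first_child;
    struct dnode *last_child;
    struct dnode *next_sibling;
};

/* Allocate a detached node with a private copy of name; NULL on failure. */
static struct dnode *create_node(const char *name)
{
    struct dnode *n;

    assert(name != NULL);
    n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->name, name);
    n->parent = n->first_child = n->last_child = n->next_sibling = NULL;
    return n;
}

/* Link child in as the last child of parent and return it. */
static struct dnode *append_child(struct dnode *parent, struct dnode *child)
{
    assert(parent != NULL && child != NULL && child->parent == NULL);
    child->parent = parent;
    if (parent->last_child != NULL)
        parent->last_child->next_sibling = child;
    else
        parent->first_child = child;
    parent->last_child = child;
    return child;
}

/* Free a node and everything below it. */
static void free_tree(struct dnode *n)
{
    struct dnode *c, *next;

    if (n == NULL)
        return;
    for (c = n->first_child; c != NULL; c = next) {
        next = c->next_sibling;
        free_tree(c);
    }
    free(n->name);
    free(n);
}
```

A parser action for, say, a paragraph rule would then be little more than append_child(current, create_node("paragraph")).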
If you do this much you are sitting pretty. You can just traverse the tree and spit out whatever the analogous elements are for, say, ps, html, sgml, xml, etc.
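As a sketch of that last step, here is a traversal that spits out HTML. The node kinds and the kind-to-tag mapping are assumptions made up for the example, output goes into a fixed-size buffer to keep it self-contained, and escaping of characters like < and & is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node kinds for a parsed document. */
enum kind { K_DOC, K_PARA, K_BOLD, K_TEXT };

struct doc_node {
    enum kind kind;
    const char *text;                 /* used only by K_TEXT nodes */
    const struct doc_node *first_child;
    const struct doc_node *next_sibling;
};

/* Fixed-size output buffer; anything that does not fit is truncated. */
struct out { char buf[256]; size_t used; };

static void out_puts(struct out *o, const char *s)
{
    size_t n = strlen(s);
    size_t room = sizeof o->buf - 1 - o->used;

    if (n > room)
        n = room;
    memcpy(o->buf + o->used, s, n);
    o->used += n;
    o->buf[o->used] = '\0';
}

/* The analogous HTML element for each kind; NULL means "no element". */
static const char *html_tag(enum kind k)
{
    switch (k) {
    case K_DOC:  return "html";
    case K_PARA: return "p";
    case K_BOLD: return "b";
    default:     return NULL;
    }
}

/* Preorder walk: open tag, own text, children, close tag. */
static void emit_html(const struct doc_node *n, struct out *o)
{
    const struct doc_node *c;
    const char *tag;

    assert(n != NULL && o != NULL);
    tag = html_tag(n->kind);
    if (tag != NULL) {
        out_puts(o, "<");
        out_puts(o, tag);
        out_puts(o, ">");
    }
    if (n->kind == K_TEXT && n->text != NULL)
        out_puts(o, n->text);         /* no escaping in this sketch */
    for (c = n->first_child; c != NULL; c = c->next_sibling)
        emit_html(c, o);
    if (tag != NULL) {
        out_puts(o, "</");
        out_puts(o, tag);
        out_puts(o, ">");
    }
}
```

Targeting another output format means swapping html_tag (and the bracket syntax) for that format's equivalents; the walk itself stays the same.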