M. Dillon writes:
Open-source operates behind the scenes far more than it operates in the public eye, and it's hard to sell support to hackers who actually have *fun* trying to figure out a problem. In some respects Linux and the BSDs are poor commercialization candidates because they are *too* good... that they simply do not require the level of support that something like Windows-NT or Oracle might require in a back-office setting.
This sounds like sane reasoning but it is contradictory to quite a few "service and support" business models (e.g. Red Hat). It will be interesting to see who's right. Perhaps proprietary solutions built as userspace applications running on top of Free platforms would be better? Would that be frowned on by anyone? Not me.
Is that what you're talking about? Wondered what that was. Mine's a 75GXP 30G purchased about 9 months ago, but this happened after 3 months. I did a low-level format with IBM's utility and reinstalled, but recently I discovered I cannot make an isofs of my /home. I haven't formatted the upper 20G of the disk. I'm worried. I think I'm just going to go out and buy something different unless IBM wants to give me something other than more of the same.
Heads up, a Word document (at least 97) is wrapped in OLE streams. This is something these documents fail to mention (common Windoze knowledge). There are libraries for decoding the streams (libole). Once you do that you can start decoding the FIB and beyond. Good luck.
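If you want to see the OLE wrapping for yourself before pulling in a library, an OLE2 compound file starts with a fixed 8-byte signature. Here is a minimal sketch that checks only that signature; the helper name looks_like_ole2 is made up for illustration, and it does not decode any streams or touch the FIB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature found at the start of an OLE2 compound file. */
static const unsigned char OLE2_MAGIC[8] = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
};

/* Hypothetical helper: returns 1 if buf starts with the OLE2 signature,
 * 0 otherwise. It says nothing about whether the streams inside are Word. */
int looks_like_ole2(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < sizeof OLE2_MAGIC)
        return 0;
    return memcmp(buf, OLE2_MAGIC, sizeof OLE2_MAGIC) == 0;
}
```

Read the first 8 bytes of the file into a buffer and pass them in. A file that fails this check is not an OLE container at all, so there is no point in looking for a FIB in it.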
You misunderstood. I'm talking about the C library level. Take gnome for example. I was looking for a Document Object Model (DOM) implementation. There is one for gnome that looks fairly advanced. But I chose not to use it because it was tied to the gnome environment. They have typedef'd everything to use gthis and gthat. It uses gstrings and gints and on and on. No one except other gnome developers is going to use this because it locks them into one environment. They should have created a plain DOM implementation that was highly portable from Linux i386 to Mac to Windows. That's useful software.
1: The documentation is late, so of course filters for old versions can be done, but new versions are not publicly documented, yet.
No. The documentation has been around for a while (years). You can see here: http://www.wotsit.org/search.asp?s=text that there are references to the Word 6 format as well.
2: The documentation has some sort of licensing provisions that are unacceptable, therefore is effectively useless for building a competitive product.
No. There are no license restrictions to writing filters for MS Word file formats that I know of.
3: The only good programmers work for Microsoft. So even with documentation, nobody else can make import/export filters that work well.
Well, good programmers don't necessarily work for MS but it's a big format and it's not a task for a hobbyist coder. But I think the main problem is that there is a somewhat inappropriate focus on rendering the output. I think that an internal representation should be chosen such that it can be traversed like a tree and output in any format. Writing a converter is then a matter of interpreting the attributes of a node in the tree (a paragraph, an image, a sequence of characters) and generating the appropriate output, whether it be ps, html, or most importantly another internal representation of a document used by another office package such as StarOffice.
The CreateNamedPipe call creates a pipe that can be connected to a pipe potentially on another host addressed by UNC name. MS admits that this is slow and that sockets should be used instead if raw performance is desired. The benefits are that named pipes are authenticated and mediated by the CIFS networking layer (thus the slowdown).
To more accurately compare pipes as IPC mechanisms they should have used the CreatePipe call, which creates an anonymous pipe that only goes through the kernel and back. These should be quite fast by comparison. Of course a much more interesting comparison would be to compare shared memory -- a much more critical IPC mechanism used by high performance applications like databases.
BTW if you want to access NamedPipes and TransactNamedPipes in 100% Java, the http://jcifs.samba.org project has implemented everything necessary to interoperate with MS NamedPipe servers.
Microsoft understands that their market grip is in proprietary file formats and protocols.
Actually I used to believe that too. But surprisingly there is documentation on a lot of this stuff that's quite good. I have written a CIFS client (MS's networking proto) and I must say the spec is pretty good. People have argued it's not good enough but it's got the essentials in there. Also, there is a very nice spec on the MS Word binary file format. I started to implement a parser but got sidetracked. I didn't see anything earth-shatteringly complex about it. It's just a bunch of serialized tables, all of which are documented pretty well IMHO. Of course there is quite a bit of MS's stuff that is not documented. What we really need is an MSRPC implementation with DCE/RPC and NDR buffer libraries, etc. Then we need the IDL for all the different MS calls. Then you can talk to just about anything, such as Exchange, etc.
There are some sound ideas here for future directions in Linux development - and they've already been thought up for you here.
There's nothing innovative or clever about this article. This is old news. The problem with doing this stuff on Linux isn't with ideas, it's getting people on the wagon and implementation details. And when we do start to get something remotely like it they go and stick a 'g' or 'k' in front of everything, binding users to an 'environment'. Bahh. This sort of thing requires a tremendous amount of coordination. The statement "based on a few principles pervasively applied" is great. It's well known at this point that this sort of approach is good. But it requires that everyone agree what those principles are that will be applied. This is why working groups like the IETF, W3C, and other standards bodies are great. Unfortunately they are not thinking at this level because it's not very practical and likely to cave under customary skepticism. This is what you do need a Cathedral for. It's like saying "let's respecify libc". This wouldn't be such a bad idea. The C library is very simple about what it addresses. IMHO it could use some higher level standard functionality. But try asking that on comp.lang.c and see what happens :~)
The Boston Globe is reporting that a car was found at Boston airport (name?) with Arabic flight manuals. Apparently the occupants got into an altercation with someone in the parking lot. That person notified authorities about the incident after hearing of the tragedy when he landed at his destination, leading them to this car. They are fairly certain that passports, the flight training manuals, and possibly other information in the car link these people to Bin Laden's "base". One of the suspects was a trained pilot and a member of the Arabic something League (?).
Anyway, there's DOM-based XML parsers already in C, like gno...
Well, keep in mind my only point about the DOM is that it can be used without XML. It's really just a tree of nodes with operations to build/modify it. It can be used to represent a tree of nodes for an MS Word document as easily as it can an XML one. And once you have it as a DOM tree you can get to XML, ps, html, word, rtf, ...
...documents are not dead trees. They can do the darndest things,...Sounds, animations,
And you know how Word does it? Recursive Composition. Meaning Word doesn't do it at all! It delegates the responsibility of playing that sound to another component. Within the document that sound is probably represented as some arbitrary chunk of bytes flagged as 0x52{media-unknown/joebobssoundformat[TGS%%@Y@*(SJ ESIEW&*EY...]} which Word plucks out and creates a node in the tree for, and passes it to some subsystem function to return an OLE component to satisfy the blob. Do you think they completely refactor the .doc format to accommodate an animation? No! This is well known information guys, c'mon someone back me up here!
...if rendering was unambiguous, pages would look the same on IE and Netscape...
Well this is a different issue. They render stuff differently because the specifications for stuff like CSS were just getting started when NS4 was released. It's been a while. I believe Mozilla and IE should render things exactly the same way minus font metrics. That is if they both conform to the standards established by the W3C and friends. And I think they do. But this is incidental and I'm not talking about rendering (see other response to this thread).
First, if you want to know what the AbiWord and KWord folks are up to, look at:
Well, I have not checked lately, but when I looked into this problem the last time I didn't see a lot of interest on the various mailing lists, and I tried what I believe was considered to be the best working code and it didn't work too hot for me.
the lexing problem is being solved by *generating* the lexer from the specifications themselves
Ok. Good to hear. But the documentation I saw on MS's website didn't look like much of a "specification". Do you have a link?
you talk as though it would all be trivial if we had used compiler-compiler tools.
You obviously didn't read my analysis too carefully or you would have seen that I specifically stated: "This is what bison/yacc is for. This is non trivial but there's a great book...".
Word syntax requires a huge amount of semantic knowledge to drive the parse,
Well, I don't know for sure because I have yet to find any really good information on the actual format, but I find it extremely difficult to believe that it is not based on Recursive Composition. It may seem obfuscated because of backwards compatibility issues but MS's language support is very good. So to dismiss using a yacc grammar shows me you are either clueless about the topic or you wrote the filter for Abiword or KOffice and this is just hand-waving.
Figuring out how to render it on the screen so that it looks and acts like it looked and acted on Word is the problem.
Well, I'm not talking about rendering or actually editing (implied by the "acted" word). At the very least you could convert it fairly well to just about any format (e.g. postscript). But presumably the Office suite using the "filter" would have rendering capability that is flexible enough to render a Word document as it would appear in Word. If this is indeed true then the real problem is generating a suitable document tree for the viewer in question. This is simply a matter of traversing the tree generated by the filter and translating it into the tree the viewer uses. You don't need to do any rendering at all. You just have to get your node-for-node translation routines to tweak each node's attributes in the translation. If the viewer doesn't do a perfect job it should still be quite functional. If the viewer doesn't support some OLE mumbo jumbo you can quietly skip those nodes in the tree and you still have a functional document. You could then edit it and reverse the translation.
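To make the "quietly skip those nodes" part concrete, here is a rough sketch of a node-for-node translation that copies the filter's tree into a fresh tree for the viewer and drops anything it cannot handle. The node kinds and every function name here are invented for illustration; a real filter would have many more kinds and would also rewrite attributes during the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum kind { K_DOC, K_PARA, K_TEXT, K_OLE };   /* hypothetical node kinds */

struct node {
    enum kind kind;
    const char *text;              /* only meaningful for K_TEXT */
    struct node *first_child;
    struct node *next_sibling;
};

struct node *node_new(enum kind kind, const char *text)
{
    struct node *n = calloc(1, sizeof *n);
    assert(n != NULL);             /* sketch: abort instead of recovering */
    n->kind = kind;
    n->text = text;
    return n;
}

/* Append child as the last child of parent. */
void node_append(struct node *parent, struct node *child)
{
    struct node **p = &parent->first_child;
    while (*p)
        p = &(*p)->next_sibling;
    *p = child;
}

/* Copy src into a new tree for the viewer. A K_OLE node is dropped
 * together with its subtree; NULL is returned if src itself is K_OLE. */
struct node *translate(const struct node *src)
{
    struct node *dst;
    const struct node *c;

    if (src->kind == K_OLE)
        return NULL;
    dst = node_new(src->kind, src->text);
    for (c = src->first_child; c; c = c->next_sibling) {
        struct node *t = translate(c);
        if (t)
            node_append(dst, t);
    }
    return dst;
}

size_t node_count(const struct node *n)
{
    size_t total = 1;
    const struct node *c;
    for (c = n->first_child; c; c = c->next_sibling)
        total += node_count(c);
    return total;
}

void node_free(struct node *n)
{
    struct node *c = n->first_child;
    while (c) {
        struct node *next = c->next_sibling;
        node_free(c);
        c = next;
    }
    free(n);
}
```

The source tree is left untouched, so the skipped OLE nodes are still there if you later want to reverse the translation and write the document back out.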
HOWTO: Write A MS Word "Filter" on Linux Office Suites · Score: 3, Interesting
Someone pleeeaasse set up a site dedicated to writing really _good_ MS Word 97+ serialization routines in ANSI C. I would but I'm already sidetracked on a tangent of a subproject and the stack is just too high right now. This is not hard folks. I know it sounds like a boring project but it's not!
Are you familiar with the principle of Recursive Composition (a.k.a. The Composite Pattern)? This is without a doubt my favorite programming construct. The key here is that you define an object that can be a child as well as potentially contain children itself. If you can uniformly parameterize the properties common to a set of these objects you can use the principle of Recursive Composition to build a tree of these objects and then serialize it back using preorder depth-first tree traversal.
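A bare-bones sketch of the idea in C, assuming a made-up struct node with a fixed-size child array and a made-up text encoding of the form name(children). Nothing here is specific to any real format; it only shows one type that is both child and container, flattened with a preorder depth-first walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 8

/* A node can be a child and can also hold children of the same type. */
struct node {
    const char *name;
    struct node *children[MAX_CHILDREN];
    size_t nchildren;
};

void node_add(struct node *parent, struct node *child)
{
    assert(parent->nchildren < MAX_CHILDREN);
    parent->children[parent->nchildren++] = child;
}

/* Copy s to dst+off, keep dst NUL-terminated, return the new offset. */
static size_t put(char *dst, size_t off, size_t len, const char *s)
{
    size_t n = strlen(s);
    assert(off + n < len);         /* sketch: abort rather than truncate */
    memcpy(dst + off, s, n);
    dst[off + n] = '\0';
    return off + n;
}

/* Preorder depth-first traversal: write the node itself first, then each
 * child in order. Returns the offset just past what was written. */
size_t serialize(const struct node *n, char *dst, size_t off, size_t len)
{
    size_t i;
    off = put(dst, off, len, n->name);
    off = put(dst, off, len, "(");
    for (i = 0; i < n->nchildren; i++)
        off = serialize(n->children[i], dst, off, len);
    return put(dst, off, len, ")");
}
```

A doc node holding a para (which holds a text) followed by an image comes out as doc(para(text())image()). The serialize function never needs to know how deep the tree goes.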
For example, a binary networking protocol might have a header, some parameters, and a data payload area. The header has an arbitrary block of security information, which in turn might have a DES encrypted key and an integer describing the length of the payload. So to encode this message using Recursive Composition, define a packet_t type that has the three subcomponents, such as the arbitrary security block, which in turn has an encrypted DES block as a child component. See the tree? Now, if you can parameterize the temporal properties of these objects you can delegate the responsibility of encoding certain areas of the network message to functions like enc_security_block(struct security_block *sb, char *dst, size_t off, size_t len), which would then call enc_des_key(struct des_key *dk, char *dst, size_t off, si....
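Here is roughly what that delegation could look like. The struct layouts, the 4-byte header, the big-endian length field, and the full enc_des_key parameter list are all assumptions made for this sketch (I also added const to the pointers); it is not any real protocol's wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts, invented for this example. */
struct des_key { unsigned char bytes[8]; };   /* key is already encrypted */
struct security_block { struct des_key key; unsigned long payload_len; };
struct packet {
    unsigned char header[4];
    struct security_block sec;
    const unsigned char *payload;
};

/* Every encoder writes at dst+off, never past len, and returns the number
 * of bytes it wrote. */
size_t enc_des_key(const struct des_key *dk, char *dst, size_t off, size_t len)
{
    assert(off + sizeof dk->bytes <= len);
    memcpy(dst + off, dk->bytes, sizeof dk->bytes);
    return sizeof dk->bytes;
}

/* 32-bit big-endian integer; the byte order is an arbitrary choice here. */
static size_t enc_u32be(unsigned long v, char *dst, size_t off, size_t len)
{
    assert(off + 4 <= len);
    dst[off]     = (char)((v >> 24) & 0xFF);
    dst[off + 1] = (char)((v >> 16) & 0xFF);
    dst[off + 2] = (char)((v >> 8) & 0xFF);
    dst[off + 3] = (char)(v & 0xFF);
    return 4;
}

/* The security block encodes nothing itself; it delegates to its children. */
size_t enc_security_block(const struct security_block *sb,
                          char *dst, size_t off, size_t len)
{
    size_t n = enc_des_key(&sb->key, dst, off, len);
    n += enc_u32be(sb->payload_len, dst, off + n, len);
    return n;
}

size_t enc_packet(const struct packet *p, char *dst, size_t off, size_t len)
{
    size_t n = sizeof p->header;
    assert(off + n <= len);
    memcpy(dst + off, p->header, n);
    n += enc_security_block(&p->sec, dst, off + n, len);
    assert(off + n + p->sec.payload_len <= len);
    memcpy(dst + off + n, p->payload, p->sec.payload_len);
    return n + p->sec.payload_len;
}
```

Each function only knows its own region of the buffer. Adding a field to the security block changes enc_security_block and nothing above it, because callers just add up the returned byte counts.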
The classic example of Recursive Composition is that of GUI components. You have an abstract object called, say, Component. Components can contain other components. Subtypes would be ButtonComponent, TextComponent, TableComponent, etc. These components might contain subcomponents as well (e.g. ButtonComponent might have a TextComponent for its label). See the tree again? Now, when it comes time to draw these components you don't have one big block of spaghetti code that considers all of the different component types but rather delegate that responsibility to a method of the component itself. This greatly reduces the complexity of the problem (actually making it feasible whereas it was not before). Again, we just have to parameterize *where* these components are to draw themselves, such as FrameComponent_draw(Window *win, int x, int y ...etc.
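In C the "method of the component itself" is just a function pointer stored in the struct. A small sketch, with the big assumption that "drawing" means appending a line of text to a buffer instead of painting into a Window, so it can run anywhere. All names and the child offsets are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 4

struct component;
typedef void (*draw_fn)(const struct component *self,
                        char *out, size_t cap, int x, int y);

struct component {
    draw_fn draw;                  /* each type supplies its own draw code */
    const char *label;
    struct component *kids[MAX_KIDS];
    int nkids;
};

/* Stand-in for real drawing: append "what(label)@x,y;" to out. */
static void emit(char *out, size_t cap, const char *what,
                 const char *label, int x, int y)
{
    size_t used = strlen(out);
    assert(used < cap);
    snprintf(out + used, cap - used, "%s(%s)@%d,%d;", what, label, x, y);
}

void text_draw(const struct component *self,
               char *out, size_t cap, int x, int y)
{
    emit(out, cap, "text", self->label, x, y);
}

void button_draw(const struct component *self,
                 char *out, size_t cap, int x, int y)
{
    int i;
    emit(out, cap, "button", self->label, x, y);
    /* The button never inspects its children's types; it only tells each
     * one where to draw (here, an arbitrary small inset). */
    for (i = 0; i < self->nkids; i++)
        self->kids[i]->draw(self->kids[i], out, cap, x + 2, y + 1);
}
```

There is no switch over component types anywhere. A new TableComponent would bring its own table_draw and the existing code would not change.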
So what does this have to do with writing serialization and deserialization routines for Word documents? Microsoft Word's format (and the format of just about every other sophisticated document format out there) is flattened by serializing an internal tree of nodes (like the GUI Components and, more so, the network packet encoding described above). The tree of nodes is no different from the trees used above to describe Recursive Composition. So by recursively delegating the responsibility of encoding/decoding a region of an MS Word document you can parse it into a tree and then do a preorder DFS tree traversal to serialize it into any format, including .doc.
The hardest problem here by far is determining what the primitive types of the document are (e.g. like the security_block and the payload length integer in the network packet). If you don't know what the leaves of the tree look like you cannot start to write a lexer. Find out everything you can about the format of each of Word's elements. There are several projects that claim to have decoded the format to a certain degree. These would be a great start. However I have spoken to these guys and the problem is they are only interested in supporting their own product (Abiword and the KOffice guys talked about a collaborative effort but got hung up on choosing libraries and language and other trite crap). A group independent from these organizations should be established so that the library is not tied to one product.
Once you have a good idea of the bits and bytes behind the layout of nodes in the format you can write a (at first crude) lexer, or lexical analyser. This is simply a piece of C that will break the format into tokens. It's simple in the respect that it doesn't have to worry about the logical layout of elements at all. It's only concerned with nibbling off the primitive elements (tokens) themselves. The interface might be as simple as init(char *filename), gettoken(struct lexer *lex).
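A crude lexer along those lines might look like the sketch below. Big caveat: the token layout here (one tag byte, one length byte, then that many payload bytes) is invented purely so there is something to tokenize, and has nothing to do with Word's real layout. I also read from a memory buffer instead of a filename, and gave gettoken an extra out-parameter, to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

struct token {
    int tag;                       /* what kind of primitive this is */
    const unsigned char *data;     /* points into the lexer's buffer */
    size_t len;
};

struct lexer {
    const unsigned char *buf;
    size_t len;
    size_t pos;
};

void lexer_init(struct lexer *lex, const unsigned char *buf, size_t len)
{
    assert(lex != NULL && buf != NULL);
    lex->buf = buf;
    lex->len = len;
    lex->pos = 0;
}

/* Nibble off the next token. Returns 1 and fills tok on success, 0 at end
 * of input, -1 if the input ends in the middle of a token. */
int gettoken(struct lexer *lex, struct token *tok)
{
    size_t n;

    if (lex->pos == lex->len)
        return 0;
    if (lex->len - lex->pos < 2)
        return -1;
    n = lex->buf[lex->pos + 1];
    if (lex->len - lex->pos - 2 < n)
        return -1;
    tok->tag = lex->buf[lex->pos];
    tok->data = lex->buf + lex->pos + 2;
    tok->len = n;
    lex->pos += 2 + n;
    return 1;
}
```

Note that gettoken knows nothing about which token may follow which. That question belongs to the parser, which just keeps calling gettoken until it returns 0.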
Now you have to write a parser. This is what bison/yacc is for. This is non trivial but there's a great book called _lex & yacc_ by John R. Levine that can describe how to write a yacc grammar in 200 lines that in conventional C would take several thousand lines, take twice as long, and still not work. Ahh, yacc grammars to me are like doughnuts to Homer Simpson.
Once you have a working lexer and parser (probably 1000 lines of code), you can start to build a tree. You need a tree structure. The W3C has written a specification for representing documents as a tree of nodes in memory called the Document Object Model (DOM). Mozilla uses the DOM. It's XML and HTML centric but it's really totally arbitrary. A DOM tree could easily be constructed by adding createNode, appendChild, etc. calls to the yacc parser. It just so happens that I have written a DOM implementation in ANSI C. It's called DOMC and it would be perfect for this task.
If you do this much you are sitting pretty. You can just traverse the tree and spit out whatever the analogous elements are for, say, ps, html, sgml, xml, etc.
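One cheap way to do the "spit out the analogous elements" step is a lookup table per output format, so the traversal is written once. In this sketch the node kinds, the two tables, and the function names are all invented, and the tree is a plain struct and not a real DOM. It also does no escaping of the text, which a real HTML writer would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum kind { K_DOC, K_PARA, K_BOLD, K_TEXT, K_COUNT };  /* hypothetical */

struct node {
    enum kind kind;
    const char *text;              /* used by K_TEXT only */
    struct node *first_child;
    struct node *next_sibling;
};

/* What to write before and after a node's children, per node kind. */
struct markup { const char *open, *close; };

static const struct markup HTML[K_COUNT] = {
    { "<html><body>", "</body></html>" },   /* K_DOC  */
    { "<p>", "</p>" },                      /* K_PARA */
    { "<b>", "</b>" },                      /* K_BOLD */
    { "", "" }                              /* K_TEXT */
};

static const struct markup PLAIN[K_COUNT] = {
    { "", "" }, { "", "\n" }, { "*", "*" }, { "", "" }
};

static void put(char *out, size_t cap, const char *s)
{
    assert(strlen(out) + strlen(s) < cap);  /* sketch: abort on overflow */
    strcat(out, s);
}

/* One traversal for every format: only the table changes. */
void emit(const struct node *n, const struct markup *fmt,
          char *out, size_t cap)
{
    const struct node *c;
    put(out, cap, fmt[n->kind].open);
    if (n->kind == K_TEXT)
        put(out, cap, n->text);
    for (c = n->first_child; c; c = c->next_sibling)
        emit(c, fmt, out, cap);
    put(out, cap, fmt[n->kind].close);
}
```

Supporting another target then means writing another table. Formats that need more than an open/close pair per node (ps would) need a function per kind instead of two strings, but the shape of the traversal stays the same.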
A crystal is a single molecule (but not all single molecules are crystals). These facts are old. Very old.
If this (and the comment before it) suggests that crystals cannot be formed by ionic bonds then you are totally incorrect. Many crystals are formed by ionic bonds. Diamond and graphite are exceptional in this respect. The result is not "one molecule". If it is placed back into a suitable solvent it will dissolve (e.g. NaCl).
Would you be surprised if Intel's compiler produced faster code than GCC? I believe Linus has stated that GCC is a bit "bloated". I wonder if you can compile the Linux kernel with it (minus assembly of course). That might be interesting, particularly for P4. Linux could get an instant speed boost. And such a radical switch in compiler might expose flaws in the code. Definitely a worthwhile exercise if nothing else. And even though the average user isn't going to buy it to compile their kernel, the distros might for their precompiled kernels (err, wonder how that would work ;-/).
Re:Congratulations on displaying a lack of clue on Netscape 6.1 · Score: 3, Interesting
about the meaning of the words "Open Source".
Answer me two questions.
1) How did Netscape benefit from Open-Sourcing their code?
2) How did the Open-Source community benefit from the Open-Sourcing of Netscape?
[Note: Before you mention Galeon, remember that it was born in response to the poor performance of earlier Mozilla builds.]
I think carrying the burden of the Open-Source initiative was why the development process has dragged on as it has. Do you remember the first couple of builds? Is it possible that they would have made more progress without this burden?
Don't get me wrong (again), I am quite pro-Open-Source and manage two 100+ dl/month OSS codebases myself. I'm simply stating the fact that in the case of Netscape, it turned out to be a poor example of why companies should Open-Source and share their code. Companies should share this code in the name of progress but they should be more sophisticated about how.
Re:Mozilla ... Netscape ... what's the difference? on Netscape 6.1 · Score: 2
No, you're wrong. Mozilla is CERTAINLY an open-source project.
Well, in the obvious sense of the word, yes, most of the code can actually be downloaded and shared (although there are a lot of modules that are not; e.g. the e-mail spell checker). But the point I'm trying to make is that the benefits of the community process did not bear fruit in the slightest. And yet this was supposed to be one of the primary motivations for companies to share their code. The fact is, a web browser is too sophisticated and the existing codebase was accordingly insurmountable for even the most dedicated weekend code warrior. Had Netscape been constructed of a highly polymorphic and modular design, the code might have been partitioned cleanly enough for other individuals to participate. But this was simply not the case. Sure, you can download the code, but I don't see people outside of Netscape participating in the development of Mozilla to the point where it would be considered a community process (although I do see a lot of useful technical discussion). Fortunately the W3C does have a thriving community process and much of their work has been implemented in Mozilla. But this is largely due to the fact that many people from Netscape make up the governing bodies of working groups within the W3C.
And about Cathedrals vs. Bazaars. I don't think either is good. I prefer a dictatorship. Put one person that truly has vision at the helm and follow them unconditionally. Implement their vision. This is something the bureaucracy will not allow in a Cathedral. The Bazaar does not work because no one person is influential enough or has the power to make changes that transcend a codebase, and these changes are invariably necessary.
Mozilla ... Netscape ... what's the difference? on Netscape 6.1 · Score: 2, Flamebait
... based on the Mozilla.org open-source development effort,..
I have submitted bug reports for Mozilla and besides the obvious hangers-on it's very clear that all of Mozilla's developers work for Netscape. Mozilla is not an Open-Source project like everyone's been preaching. Sure, people have submitted their own little gizmo to add, but thankfully they've abandoned all that crap and are getting down to the metal now. The Open-Sourcing of Netscape was a failure and it's time we fessed up and wrote it off as a necessary experiment.
Don't bash Netscape because you'll be bashing Mozilla in the process. They're one and the same.
just before failing to boot again was:
screech, screech, screech, clickidy, chickidy, clickidy
pause
screech, screech, screech, clickidy, chickidy, clickidy
pause
...
And of course, it would be scalable and secure.
Next.
Does anyone know if the 6 stories of mall and subway below ground level caved in? If so, that's a lot of space for rubble.
The Boston Globe is reporting that a car was found at Boston airport (name?) with Arabic flight manuals. Apparently they got into an altercation with someone in the parking lot. That person notified authorities about the incedent after hearing of the tragedy when he landed at his destination leading them to this car. They are fairly certain that passports, the flight training manuals, and possibly other information in the car link these people to Bin Laden's "base". One of the suspects was a trained pilot and a member of the Arabic something Leage (?).
Anyway, there's DOM-based XML parsers already in C, like gno...
Well, keep in mind my only point about the DOM is that it can be used without XML. It's really just a tree of nodes with operations to build/modify it. It can be used to represent a tree of nodes for a MS Word document as easily as it can an XML one. And once you have it as a DOM tree you can get to XML, ps, html, word, rtf, ...
And you know how Word does it? Recusive Composition. Meaning Word doesn't do it at all! It delegates the resposibilty of playing that sound to another component. Within the document that sound is probably represented as some arbitrary chunk of bytes flagged as 0x52{media-unknown/joebobssoundformat[TGS%%@Y@*(SJ ESIEW&*EY...]} which Word plucks out and creates a node in the tree for, and passes it to some subsystem function to return a OLE component to satisfy the blob. Do you think they completely refactor the .doc format to accomodate an anamation? No! This is well known information guys, comon someone back me up here!
Well this is a different issue. They render stuff differently because the specifications for stuff like CSS where just getting started when NS4 was released. It's been a while. I believe Mozilla and IE should render things exactly the same way minus font metrics. That is if they both conform to the standards established by the W3C and friends. And I think they do. But this is incedental and I'm not talking about rendering (see other response to this thread).
First, if you want to know what the AbiWord and KWord folks are up to, look at:
Well, I have not checked lately but when I looked into this problem the last time I didn't see a lot of interest on the various mailing lists and I tried what I believe was considered to be the best working code and it didn't work to hot for me.
the lexing problem is being solved by *generating* the lexer from the specifications themselves
Ok. Good to hear. But the documentation I saw on MS website didn't look like much of a "specification". Do you have a link?
you talk as though it would all be trivial if we had used compiler-compiler tools.
You obviously didn't read my anaysis too carefully or you would have seen that I specifically stated; "This is what bison/yacc is for. This is non trivial but theres a great book ...".
Word syntax requires a huge amount of semantic knowledge to drive the parse,
Well, I don't know for sure because I have yet to find any really good information on the actual format but I find it extreemly difficult to believe that it is not based on Recursive Composition. It may seem obfuscated because of backwards compatibility issues but MS's language support is very good. So to dismiss using a yacc grammer shows me you are either clueless about the topic or you wrote the filter for Abiword or KOffice and this is just hand-waving.
Figuring out how to render it on the screen so that it looks and acts like it looked and acted on Word is the problem.
Well, I'm not talking about rendering or actually editing (implied by the "acted" word). At the very least you could convert it fairly well to just about any format (e.g. postscript). But presumably the Office suite using the "filter" would have rendering capability that is flexible enough to render a word document as it would appear in word. If this is indeed true then the real problem is generating a suitable document tree for the viewer in question. This is simply a matter of traversing the tree generated by the filter and translating it into the tree the viewer uses. You don't need to do any rendering at all. You just have to get your node-for-node translation routines to tweek it's attributes in the translation. If the viewer doesn't do a perfect job it should still be quite functional. If the veiwer doesn't support some OLE mumbo jumbo you can quitely skip those nodes in the tree and you still have a functional document. You could then edit it and reverse the translation.
Someone pleeeaasse set up a site dedicated to writing really _good_ MS Word 97+ serialization routines in ANSI C. I would but I'm already sidetracked on a tangent of a subproject and the stack is just too high right now. This is not hard folks. I know it sounds like a boring project but it's not!
Are you familiar with the principle of Recursive Composition (a.k.a. the Composite Pattern)? This is without a doubt my favorite programming construct. The key here is that you define an object that can be a child as well as potentially contain children itself. If you can uniformly parameterize the properties common to a set of these objects you can use the principle of Recursive Composition to build a tree of these objects and then serialize it back using preorder depth-first tree traversal.
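Here is a minimal sketch of that idea in C, just to make the shape concrete. The struct node type, node_add, node_serialize and the parenthesized output format are all made up for illustration; a real implementation would carry per-node properties and allocate children dynamically.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CHILDREN 8

/* Hypothetical node type: every node can be a child and can hold children. */
struct node {
    const char *name;
    struct node *children[MAX_CHILDREN];
    int nchildren;
};

/* Attach child to parent; returns 0 on success, -1 if the parent is full. */
static int node_add(struct node *parent, struct node *child)
{
    assert(parent != NULL && child != NULL);
    if (parent->nchildren >= MAX_CHILDREN)
        return -1;
    parent->children[parent->nchildren++] = child;
    return 0;
}

/* Preorder depth-first serialization: write the node itself first, then
 * recurse into its children in order. Returns the new offset into dst,
 * or -1 if dst (of len bytes) is too small. */
static int node_serialize(const struct node *n, char *dst, size_t off, size_t len)
{
    int i, w;

    if (off >= len)
        return -1;
    w = snprintf(dst + off, len - off, "(%s", n->name);
    if (w < 0 || (size_t)w >= len - off)
        return -1;
    off += (size_t)w;

    for (i = 0; i < n->nchildren; i++) {
        int r = node_serialize(n->children[i], dst, off, len);
        if (r < 0)
            return -1;
        off = (size_t)r;
    }

    if (off + 1 >= len)
        return -1;
    dst[off++] = ')';
    dst[off] = '\0';
    return (int)off;
}
```

Notice that node_serialize never needs to know how deep the tree is or what is in it. Each node handles itself and hands the rest to its children.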
For example, a binary networking protocol might have a header, some parameters, and a data payload area. The header has an arbitrary block of security information, which in turn might have a DES encrypted key and an integer describing the length of the payload. So to encode this message using Recursive Composition, define a packet_t type that has the three sub components such as the arbitrary security block, which in turn has an encrypted DES block as a child component. See the tree? Now, if you can parameterize the temporal properties of these objects you can delegate the responsibility of encoding certain areas of the network message to functions like enc_security_block(struct security_block *sb, char *dst, size_t off, size_t len), which would then call enc_des_key(struct des_key *dk, char *dst, size_t off, si ....
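To show what that delegation might look like, here is a rough sketch. The function names and the (dst, off, len) parameters come from the description above; everything else is my assumption: the struct layouts, the 8-byte key, the 4-byte big-endian length, the const qualifiers, the "return the new offset" convention, and the ENC_FAIL marker. The header and parameter areas are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; the field sizes are assumptions for illustration. */
struct des_key { unsigned char bytes[8]; };
struct security_block {
    struct des_key key;
    unsigned long payload_len;   /* encoded as 4 bytes, big-endian */
};
struct packet {
    struct security_block sec;
    const char *payload;
};

/* Each encoder writes its own region at dst+off and returns the new
 * offset, or ENC_FAIL if the region does not fit in len bytes. */
#define ENC_FAIL ((size_t)-1)

static size_t enc_des_key(const struct des_key *dk, char *dst, size_t off, size_t len)
{
    if (off > len || len - off < sizeof dk->bytes)
        return ENC_FAIL;
    memcpy(dst + off, dk->bytes, sizeof dk->bytes);
    return off + sizeof dk->bytes;
}

static size_t enc_security_block(const struct security_block *sb, char *dst, size_t off, size_t len)
{
    int i;

    off = enc_des_key(&sb->key, dst, off, len);   /* delegate to the child */
    if (off == ENC_FAIL || len - off < 4)
        return ENC_FAIL;
    for (i = 3; i >= 0; i--)
        dst[off++] = (char)((sb->payload_len >> (8 * i)) & 0xff);
    return off;
}

static size_t enc_packet(const struct packet *p, char *dst, size_t off, size_t len)
{
    size_t n = (size_t)p->sec.payload_len;

    off = enc_security_block(&p->sec, dst, off, len);   /* delegate again */
    if (off == ENC_FAIL || len - off < n)
        return ENC_FAIL;
    memcpy(dst + off, p->payload, n);
    return off + n;
}
```

The point is that enc_packet knows nothing about how a DES key is laid out. It only knows where its own children start and trusts each one to report where it stopped.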
The classic example of Recursive Composition is that of GUI components. You have an abstract object called, say, Component. Components can contain other components. Sub types would be ButtonComponent, TextComponent, TableComponent, etc. These components might contain subcomponents as well (e.g. ButtonComponent might have a TextComponent for its label). See the tree again? Now, when it comes time to draw these components you don't have one big block of spaghetti code that considers all of the different component types but rather delegate that responsibility to a method of the component itself. This greatly reduces the complexity of the problem (actually making it feasible whereas it was not before). Again, we just have to parameterize *where* these components are to draw themselves such as FrameComponent_draw(Window *win, int x, int y ... etc.
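In plain C the "method of the component itself" can be a function pointer stored in the component. The following is only a sketch under my own assumptions: Window here is a stand-in that records draw calls in a string, and struct component, text_draw and button_draw are invented names, not any toolkit's API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a real window: drawing just appends to a log. */
typedef struct { char log[256]; } Window;

struct component {
    const char *label;
    void (*draw)(const struct component *self, Window *win, int x, int y);
    const struct component *children[4];
    int nchildren;
};

/* Append one "kind(label)@x,y;" record to the window's log. */
static void log_draw(Window *win, const char *kind, const char *label, int x, int y)
{
    size_t used = strlen(win->log);
    snprintf(win->log + used, sizeof win->log - used, "%s(%s)@%d,%d;", kind, label, x, y);
}

static void text_draw(const struct component *self, Window *win, int x, int y)
{
    log_draw(win, "text", self->label, x, y);
}

/* A button draws its own frame, then asks each child to draw itself at a
 * position relative to the button. It never inspects the child's type. */
static void button_draw(const struct component *self, Window *win, int x, int y)
{
    int i;

    log_draw(win, "button", self->label, x, y);
    for (i = 0; i < self->nchildren; i++)
        self->children[i]->draw(self->children[i], win, x + 2, y + 1);
}
```

Adding a TableComponent later means writing one more draw function. Nothing that already exists has to change.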
So what does this have to do with writing serialization and deserialization routines for Word documents? Microsoft Word's format (and the format of just about every other sophisticated document format out there) is flattened by serializing an internal tree of nodes (like the GUI Components and more so the network packet encoding described above). The tree of nodes is no different from the trees used above to describe Recursive Composition. So by recursively delegating the responsibility of encoding/decoding a region of an MS Word document you can parse it into a tree and then do a preorder DFS tree traversal to serialize it into any format including .doc.
The hardest problem here by far is determining what the primitive types of the document are (e.g. like the security_block and the payload length integer in the network packet). If you don't know what the leaves of the tree look like you cannot start to write a lexer. Find out everything you can about the format of each of Word's elements. There are several projects that claim to have decoded the format to a certain degree. These would be a great start. However I have spoken to these guys and the problem is they are only interested in supporting their own product (Abiword and the KOffice guys talked about a collaborative effort but got hung up on choosing libraries and language and other trite crap). A group independent from these organizations should be established so that the library is not tied to one product.
Once you have a good idea of the bits and bytes behind the layout of nodes in the format you can write a (at first crude) lexer or lexical analyser. This is simply a piece of C that will break the format into tokens. It's simple in the respect that it doesn't have to worry about the logical layout of elements at all. It's only concerned with nibbling off the primitive elements (tokens) themselves. The interface might be as simple as init(char *filename), gettoken(struct lexer *lex).
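A crude lexer along those lines might look like the sketch below. To be clear, the record layout (one type byte, one length byte, then the payload) is something I made up so the example is complete; it is NOT the Word format. I also have it read from a memory buffer rather than a filename so it is easy to try out, and lexer_init and struct token are my own names.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up record layout for illustration, NOT the real Word format:
 * one type byte, one length byte, then that many payload bytes. */
struct token {
    int type;                    /* -1 signals end of input or a malformed record */
    const unsigned char *data;   /* points into the lexer's buffer */
    size_t len;
};

struct lexer {
    const unsigned char *buf;
    size_t size;
    size_t pos;
};

static void lexer_init(struct lexer *lex, const unsigned char *buf, size_t size)
{
    lex->buf = buf;
    lex->size = size;
    lex->pos = 0;
}

/* Nibble off the next token. The lexer knows nothing about how tokens
 * nest or relate to each other; that is the parser's job. */
static struct token gettoken(struct lexer *lex)
{
    struct token t = { -1, NULL, 0 };
    size_t n;

    if (lex->size - lex->pos < 2)
        return t;                           /* no room for type + length */
    n = lex->buf[lex->pos + 1];
    if (lex->size - lex->pos - 2 < n)
        return t;                           /* payload runs past the buffer */
    t.type = lex->buf[lex->pos];
    t.data = lex->buf + lex->pos + 2;
    t.len = n;
    lex->pos += 2 + n;
    return t;
}
```

The caller just loops on gettoken until it sees a type of -1 and feeds each token to the parser.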
Now you have to write a parser. This is what bison/yacc is for. This is non-trivial but there's a great book called _lex & yacc_ by John R. Levine that can describe how to write a yacc grammar in 200 lines that in conventional C would take several thousand lines, take twice as long, and still not work. Ahh, yacc grammars to me are like doughnuts to Homer Simpson.
Once you have a working lexer and parser (probably 1000 lines of code), you can start to build a tree. You need a tree structure. The W3C has written a specification for representing documents as a tree of nodes in memory called the Document Object Model (DOM). Mozilla uses the DOM. It's XML and HTML centric but it's really totally arbitrary. A DOM tree could easily be constructed by adding createNode, appendChild, etc. calls to the yacc parser. It just so happens that I have written a DOM implementation in ANSI C. It's called DOMC and it would be perfect for this task.
If you do this much you are sitting pretty. You can just traverse the tree and spit out whatever the analogous elements are for, say, PS, HTML, SGML, XML, etc.
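As a sketch of that last step, here is a traversal that spits out HTML. The node kinds, struct dnode, html_tag and emit_html are all hypothetical (this is not the DOM API and not DOMC); it does no HTML escaping, and it assumes out starts as an empty NUL-terminated string with room for len bytes. An unsupported node such as an embedded OLE object is simply skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical node kinds for a parsed document tree. */
enum kind { K_DOC, K_PARA, K_BOLD, K_TEXT, K_OLE };

struct dnode {
    enum kind kind;
    const char *text;            /* only used by K_TEXT */
    const struct dnode *child[4];
    int nchild;
};

/* Map a node kind to the analogous HTML element; NULL means no wrapper. */
static const char *html_tag(enum kind k)
{
    switch (k) {
    case K_DOC:  return "body";
    case K_PARA: return "p";
    case K_BOLD: return "b";
    default:     return NULL;
    }
}

/* Preorder traversal: open the element, emit the children, close it.
 * Output that does not fit in len bytes is silently truncated. */
static void emit_html(const struct dnode *n, char *out, size_t len)
{
    const char *tag = html_tag(n->kind);
    size_t used;
    int i;

    if (n->kind == K_OLE)
        return;                  /* unsupported: quietly skip the subtree */

    used = strlen(out);
    if (n->kind == K_TEXT)
        snprintf(out + used, len - used, "%s", n->text);
    else if (tag != NULL)
        snprintf(out + used, len - used, "<%s>", tag);

    for (i = 0; i < n->nchild; i++)
        emit_html(n->child[i], out, len);

    if (tag != NULL) {
        used = strlen(out);
        snprintf(out + used, len - used, "</%s>", tag);
    }
}
```

Swapping html_tag for a different mapping is most of what a second output format would need, as long as that format nests the same way.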
Take the GPL'd beta and write your own forked version.
A crystal is a single molecule (but not all single molecules are crystals). These facts are old. Very old.
If this (and the comment before it) suggests that crystals cannot be formed by ionic bonds then you are totally incorrect. Most crystals are formed by ionic bonds. Diamond and graphite are exceptional in this respect. The result is not "one molecule". If it is placed back into a suitable solvent it will dissolve (e.g. NaCl).
Mmmm ...
Understood. Actually, I just realized the Makefiles would present an insurmountable problem in themselves.
Compare this cost to what it would cost you to pay an engineer to optimize his code.
The optimizations that an engineer would make would have a much more dramatic effect than tickling some opcodes.
Would you be surprised if Intel's compiler produced faster code than GCC? I believe Linus has stated that GCC is a bit "bloated". I wonder if you can compile the Linux kernel with it (minus assembly of course). That might be interesting, particularly for P4. Linux could get an instant speed boost. And such a radical switch in compiler might expose flaws in the code. Definitely a worthwhile exercise if nothing else. And even though the average user isn't going to buy it to compile their kernel, the distros might for their precompiled kernels (err, wonder how that would work
about the meaning of the words "Open Source".
Answer me two questions.
1) How did Netscape benefit from Open-Sourcing their code?
2) How did the Open-Source community benefit from the Open-Sourcing of Netscape?
[Note: Before you mention Galeon, remember that it was born in response to the poor performance of earlier Mozilla builds.]
I think carrying the burden of the Open-Source initiative was why the development process has dragged on as it has. Do you remember the first couple of builds? Is it possible that they would have made more progress without this burden?
Don't get me wrong (again), I am quite pro-Open-Source and manage two 100+ dl/month OSS codebases myself. I'm simply stating the fact that in the case of Netscape, it turned out to be a poor example of why companies should Open-Source and share their code. Companies should share this code in the name of progress but they should be more sophisticated about how.
No, you're wrong. Mozilla is CERTAINLY an open-source project.
Well, in the obvious sense of the word, yes, most of the code can actually be downloaded and shared (although there are a lot of modules that are not; e.g. the e-mail spell checker). But the point I'm trying to make is that the benefits of the community process did not bear fruit in the slightest. And yet this was supposed to be one of the primary motivations for companies to share their code. The fact is, a web browser is too sophisticated and the existing codebase was accordingly insurmountable for even the most dedicated weekend code warrior. Had Netscape been constructed of a highly polymorphic and modular design, the code might have been partitioned cleanly enough for other individuals to participate. But this was simply not the case. Sure, you can download the code, but I don't see people outside of Netscape participating in the development of Mozilla to the point where it would be considered a community process (although I do see a lot of useful technical discussion). Fortunately the W3C does have a thriving community process and much of their work has been implemented in Mozilla. But this is largely due to the fact that many people from Netscape make up the governing bodies of working groups within the W3C.
And about Cathedrals vs. Bazaars. I don't think either is good. I prefer a dictatorship. Put one person that truly has vision at the helm and follow them unconditionally. Implement their vision. This is something the bureaucracy will not allow in a Cathedral. The Bazaar does not work because no one person is influential enough or has the power to make changes that transcend a codebase, and these changes are invariably necessary.
I have submitted bug reports for Mozilla and besides the obvious hangers-on it's very clear that all of Mozilla's developers work for Netscape. Mozilla is not an Open-Source project like everyone's been preaching. Sure, people have submitted their own little gizmos to add, but thankfully they've abandoned all that crap and are getting down to the metal now. The Open-Sourcing of Netscape was a failure and it's time we fessed up and wrote it off as a necessary experiment.
Don't bash Netscape because you'll be bashing Mozilla in the process. They're one and the same.