Slashdot Mirror


Fulfilling the Promise of XML-based Office Suites?

brentlaminack asks: "Almost a year ago Tim Bray of XML fame said 'when the huge universe of MS Office documents becomes available for processing by any programmer with a Perl script and a bit of intelligence, all sorts of wonderful new things can be invented that you and I can't imagine.' Now that MS has dropped the ball on the XML Office front, and StarOffice has fulfilled its XML promise, where are all those 'wonderful new things?' Is anybody out there writing Perl/Java/whatever programs to take advantage of StarOffice XML? Could this be an opportunity for Free/Open/Libre software to leapfrog MS Office in real productivity as XML proponents have promised all along?" What kinds of new and wonderful things can you come up with?

87 of 432 comments

  1. XML... by ewombatnet · · Score: 5, Insightful

    I think one of the main problems with the embedding of XML architecture into office productivity software is unfortunately the end user. I mean, how long have programmes like MS Word had "document properties" contained in them, and how many people are actually using them? I'm currently working on a project to retrieve documents across a company's backed-up data from the past 10 years, and there is very very little metadata available for us to do any searching on. Unless the embedded XML contained within office suites is brought more "to the fore" and in the face of users, instead of being a behind the scenes 'option', people just are not going to use it.

    1. Re:XML... by Trolling4Dollars · · Score: 5, Insightful

      There are two ways to look at this. One way is to make the assumption that the problem lies with the user and the other is that the problem lies within the computer. Even though computers have gotten easier to use, they aren't really easy at all for the average user. The barriers to ease of use are plenty:

      -Feature overload (many features that users will never use)
      -PCs are incredibly complex because they are so flexible and can do so many things.
      -User interfaces are pretty poorly designed and don't seem to be getting any better.
      -Humans don't "interface" well

      If the mode of interacting with computers was like interacting with another person, they would be considerably easier to use. I often joke with my wife that *I* am the ultimate user interface. If you think about it, the best interface for the average user would be a very human-like avatar. Yes, this interface would suck for someone like me (a real computer user), but that's not who it would be targeted at.

      Getting back to the XML subject, these same problems are what keep it from gaining any ground with the average user. The average user still doesn't "get" electronic documents. That's why they always resort to printing them out on paper. To be sure, there are times when a document SHOULD be printed on paper, but that's only really about 20% of the time. The other 80% a document is much better to keep in electronic format. With XML, so much the better. But if the average user has trouble understanding even a basic text file, the ultra-documents that XML can lead to will be completely bewildering. How do we solve this? I've argued this before over and over again: we need new input devices and now I will extend that to new output devices. If we had more variety with the output device, XML documents would be the next "great thing". The XML document has arrived too soon. If we had electronic paper that XML docs could be loaded into, there would be a revolution. It will happen, not just yet. And when it does happen, look for some big corporation to be backing something that looks a lot like XML, but it will have a different more friendly name and will be claimed as innovative.

    2. Re:XML... by chiasmus1 · · Score: 5, Insightful
      The important thing about XML is not the end users. As an end user I could care less about the formation of the document as long as I knew I would always have an application that could read the document.

      With XML documents, if the file format is well known, there will be filters for it. Major Office Suites will support well known file formats. If the file format is not as well known, but it is simple XML, there are high chances that smaller applications will also have filters for it.

      I like to write web software and I was discouraged when I discovered that I could not find a Perl library to create OpenOffice.org files, so I created one of my own. Granted it is not the best library, and is probably full of bugs, but it was easy to create and the research was painless. It does the job I made it for and I use it.

      Compare that to the time when at work my boss asked me to take a Pick Basic binary database file and extract the data from it. I had to play around a while to figure out which bytes meant what and how to get the information out.

      XML not only makes creation easy, but makes reverse engineering trivial. XML is not for the end users, it is for the developers who do not have the time to sit and read the 500 pages of the file format spec.

    3. Re:XML... by rgigger · · Score: 3, Insightful

      I just had a thought. What I really want to do is generate some sort of office documents on the web. That way I can make word processing documents, spreadsheets, charts, graphs etc that my clients can download. Now I would love to just generate Open Office XML files and have them use those. The problem with that is that none of my clients use Open Office and they are not going to for the foreseeable future.

      Here however is my super cool idea that I just came up with:

      An open office server. If open office can export to MS Office formats, what's to stop me from doing the following (other than time)?

      1) create my templates in open office XML format
      2) extract the parts of open office that import from the OO XML format to its internal format, and export to MS Office format.
      3) Create a PHP extension (or maybe apache module) to expose this functionality to my web apps.
      4) Insert dynamic database driven content into my OO XML templates, convert them to MS Office format and stream them out to a client.

      Maybe not the product of an ideal world but given the fact that MS Office is both closed and ubiquitous this seems to be a great way to leverage the capabilities of open office in handling XML and MS Office import/export.

    4. Re:XML... by ReelOddeeo · · Score: 3, Informative

      Once you learn how to do it, it is definitely possible from, say, a Java program to connect to a running OOo (OpenOffice.org==OOo), make it open a document and re-save it in Word format. You can even make OOo do this without flashing anything on the screen.

      There is a definite learning curve. You need to learn Uno.

      IMHO, despite the learning, this would be way easier than trying to extract the parts of code you need from OOo and building a "converter" program. Maybe I say this because I have spent the time learning Uno and can now program OOo functionality from multiple languages, and how to integrate it into a web server like this seems obvious to me.

      I have personally programmed OOo to do things from: OOo-Basic, Java, Python and MS Visual FoxPro. I know from postings from others that it is most definitely possible to use Delphi and VB.

      Just as an example of what can be done, I built a Maze generator in java. You can run the maze generator on a different computer. Even a different OS. It connects to a running OOo, and then creates a multi page drawing of complex mazes. (You can get it at www.OOoMacros.org or at www.OOoExtras.org.)

      --

      Those who would give up liberty in exchange for security and DRM should switch to Microsoft Palladium!
  2. standardization by Unregistered · · Score: 4, Insightful

    one missing thing is standardization across OSS. When abiword (and koffice?) support oo files, then we might see more of this. Also, i personally can't think of a use offhand that oo.org can't already do. Once people begin to find uses for this, then more people will actually try to write scripts to take advantage of XML.

    1. Re:standardization by chill · · Score: 5, Informative

      The next major release of KOffice is supposed to adopt the OO file formats as its own standard.

      --
      Learning HOW to think is more important than learning WHAT to think.
  3. anything that will translate manager speak? by hattig · · Score: 5, Funny

    Maybe a script to de-buzzword meaningless missives from above?

    E.g., "We wish to engender a positive business atmosphere" => "Free beer at lunchtime"

    1. Re:anything that will translate manager speak? by cptgrudge · · Score: 3, Funny
      Maybe a script to de-buzzword meaningless missives from above?

      Not a script, but perhaps a free (as in beer) Word plugin? Bullfighter

      --
      Qualitas edurus commercium, nullus penitus net rimor, nullus deus beneficium
    2. Re:anything that will translate manager speak? by bigdavex · · Score: 4, Funny

      #!/usr/bin/perl
      print "We're doing more layoffs and getting more bonuses.";

      --
      -Dave
  4. Well... by Otter · · Score: 4, Informative
    ...when the huge universe of MS Office documents becomes available for processing by any programmer with a Perl script and a bit of intelligence, all sorts of wonderful new things can be invented that you and I can't imagine.

    Well, I'm taking a break right now from generating new Excel graphs by copying old ones and changing the source data, which isn't so bad, and those fucking error bars, which is. Oh, and the scatter plot points are superimposed so you can't click on the back ones.

    So if I could do a find&replace on a flat file, I'd have been done an hour ago.

    Other than that, no, I can't imagine either. VBA exists now and it's not like we're all flying around with wings and harps.

    1. Re:Well... by BlueGecko · · Score: 4, Funny
      VBA exists now and it's not like we're all flying around with wings and harps.
      True, but after extensive work with VBA, I grew these sharp red horns and a big red tail with a spike on the end...
    2. Re:Well... by croddy · · Score: 4, Insightful
      MS won't stand for an XML file format -- it's human-readable. the last thing MS wants is for their file format to be easily convertible and transformable. it's a pity, because switching Office files to XML would quickly make them insanely useful.

      imagine you write an outline in word. file -> export as -> presentation... or in access you select some rows and export to a spreadsheet. this is where staroffice stands to beat them.

      but MS Office derives its profitability from incompatibility -- you have to use their products to get full use of their file format. so using MS Office will necessarily sacrifice this functionality.

    3. Re:Well... by Anonymous Coward · · Score: 2, Funny

      I have not heard of BSD for a long time ... is it dead or what?

    4. Re:Well... by YrWrstNtmr · · Score: 2, Insightful

      imagine you write an outline in word. file -> export as -> presentation... or in access you select some rows and export to a spreadsheet. this is where staroffice stands to beat them.

      This is what Office does (rather) well. Use an xls as a data source for an MDB, a word doc, and a presentation, all at the same time. Or link database info to a remote presentation.

      And while Office prefers Office, you CAN link to and from bare text files. Whether delimited or fixed length.

      Way back with Office95 we were pulling backend data off a UNIX box into a VB/Access frontend. Seamless to the user.

    5. Re:Well... by perlchild · · Score: 2, Interesting

      The fact that the format is XML doesn't say anything about the format being "open". That's why Microsoft was proposing XML to standard bodies, and trademarking DTDs and Schemas...
      What other people in the thread want is for Microsoft to give us 100% of the schema, and so far Microsoft has shown zero will to do so without legislation compelling it to. 100% of the schema would allow Corel and/or IBM to feature-copy 100% of Microsoft's Office features, and they certainly will of course say that legislation to force them to give away their competitive advantage would be anti-american.
      Someone with a different agenda would probably say such a thing would have provided a better, more balanced punishment to Microsoft's monopoly than the minimal slap on the wrist they had.
      I personally think they should have been made to refund 50% of the purchase price of all Windows licenses, as half of the value was created by "Everyone else is using it" and that advantage was gained through illegal monopoly, and very creative enforcement of copyright laws. But that's neither here nor there.

    6. Re:Well... by Korgan · · Score: 2, Funny

      Ahhh... so you got MSOffice to run on WINE in a BSD environment then? ;-)

  5. Not a big innovation by Doug+Merritt · · Score: 5, Interesting
    documents becomes available for processing by any programmer with a Perl script and a bit of intelligence, all sorts of wonderful new things can be invented

    This is just a return to part of what made Unix so powerful in the first place: text formats that can be manipulated by the whole suite of command line tools. "Those who don't understand Unix are doomed to re-invent it, poorly" (Henry Spencer).

    Back in the 70s we used nroff/troff for document formatting, producing in some cases professional-quality camera-ready books...but the source code was easily fed to spell checkers, formatting-command-strippers, sort, wc, etc etc etc.

    XML is ok...not bad as a meta-format...but it's not some kind of new magic; it's just more of the same as what we always used to do.

    The great step forward is moving away from the crud that happened in the middle: proprietary underdocumented binary formats that couldn't be fed to filter pipelines.

    In this case, moving backwards is progress. But expecting something amazing to be invented is a bit much; it was already invented a long time ago.

    P.S. pet peeve...people credit Knuth (admittedly an amazing guy for the Art of Computer Programming) for reinventing typesetting with TeX. Now, TeX is nicer than nroff/troff in multiple ways, but it's worse in some others (TeX is not set up for command line filters!), and in any case is only an incremental improvement, not a revolution over the older Unix tools. Credit is not properly being given.

    --
    Professional Wild-Eyed Visionary
    1. Re:Not a big innovation by Anonymous Coward · · Score: 3, Insightful

      Now, TeX is nicer than nroff/troff in multiple ways, but it's worse in some others (TeX is not set up for command line filters!), and in any case is only an incremental improvement, not a revolution over the older Unix tools. Credit is not properly being given.

      I see your point. But have you tried doing mathematical formulas in groff? In (La)TeX they're a breeze (relative to just about everything else out there). Right tools for the right job I guess.

    2. Re:Not a big innovation by brentlaminack · · Score: 2, Interesting

      I'll agree on TeX. I remember the day I gave up on it. I attended a lecture by Knuth himself on abstract graph theory. Guess what he used to generate his overhead transparencies with? Colored felt-tipped markers. Here is the great Knuth himself, the creator of TeX with near-infinite computing resources available, and he hand-draws equations with felt-tipped markers!! At that moment, I knew TeX was dead.

    3. Re:Not a big innovation by pigscanfly.ca · · Score: 2, Insightful

      What do you have against TeX?
      TeX is god [ok maybe not $DEITY god, but fairly high up there].
      TeX, along with latex, allows me to do wonderful things with documents, generating into multiple formats. Although I have had some eps integration problems (who knew plot utils used some funky ass default font that no one has ever heard of before) it was my fault for not checking to make sure that I had the right fonts installed. TeX is wonderful for typesetting, it puts the control back in the user.

    4. Re:Not a big innovation by kfg · · Score: 5, Insightful

      The great man himself gave you a clue to great wisdom. Not everyone has that chance.

      And you blew it, Grasshopper.

      The lesson was, "The right tool for the job."

      Sometimes the right tool, despite all the modern technological advances, is still a rock.

      KFG

    5. Re:Not a big innovation by sharkey · · Score: 4, Funny
      Sometimes the right tool, despite all the modern technological advances, is still a rock.

      When all you have is a rock, everything looks like Bill Gates' head.

      --
      "Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next.
  6. MS Office is required by generic-man · · Score: 3, Insightful

    XML is not a selling point for an office suite. Users expect a good user interface and an easy migration. OpenOffice is not there yet. Its help assistant spawns 1024x768 help windows to say as little as "I have automatically capitalized the first letter of your sentence." It has no integrated PIM software to unseat Microsoft Outlook. It has no easy migration path for the millions of users who open documents with useful macros and scripts. OpenOffice has no drop-in replacement for Microsoft Access-driven applications; primitive as Access is, many companies use it to develop simple database applications that would need to be recreated from scratch in another suite.

    At this point in time, there's no reason to switch from Microsoft Office to another office suite simply because this new suite uses XML. XML is best suited as a tool for the back-end developer, not an excuse to migrate to a product that has so many rough edges in its current form.

    --
    For more information, click here.
    1. Re:MS Office is required by An+Onerous+Coward · · Score: 3, Funny

      So that's what OpenOffice has been missing all this time. I knew there was something a bit off about it, but I could never put my finger on it.

      The answer, my friends, is an integrated E-mail/Calendar suite. Integrated right into OpenOffice. This is what will finally drive a stake through Microsoft's undead heart.

      Integrated E-mail. Integrated Calendaring. Right in the office suite. All integrated and everything. You all know you want it. Now go, my toiling minions! Build! Build, I say!

      --

      You want the truthiness? You can't handle the truthiness!

    2. Re:MS Office is required by Malcontent · · Score: 2, Insightful

      Reading your post I get the impression that you are unaware of some important facts. For example you seem to think that open office is not scriptable. I can assure you that it is. It can be scripted in java or basic (not VBA) or even python. Both the Open office variant of BASIC and Python are open source languages and Java is available from many vendors. What open office can't do is to implement VBA which could get them arrested or sued. BTW most people who have knowledge of multiple programming languages seem to agree that python and Java are vastly superior to VBA.

      Now getting back to the point which you seem intent on ignoring.

      No matter how much you spend, no matter how complex your application if as a result of buying something you become a slave to your vendor then you lose. Vendor lock is bad for business. It's doubly bad if your competition is not also locked in.

      In business you have to control your vendors, you have to play them against each other, you have to constantly keep them on their toes by threatening to drop them and move on to somebody else. A business has to kowtow to clients and has to bully their vendors, not the other way around.

      Your CIO is locked by Microsoft. He can't leave, if he threatens to leave MS will laugh in his face, audit him and then charge him double just for fun.

      I shudder to think what CIO thought that a multi million dollar application built on office macros was a good deal though. When I hear of stuff like this I wish there was a law that forced you to tell me what company he works for. I would like to know so that I don't have any stock in a company that is being managed so badly. A CIO who is that clueless is probably making bad decisions left and right and obviously the CEO or the board of directors are too stupid to call him on it.

      --

      War is necrophilia.

  7. Apache module by codepunk · · Score: 5, Interesting

    I sure would like an apache module that can style (with CSS) and display native open and star office documents.

    --


    Got Code?
  8. PHP Script that generated reports by brandonp · · Score: 5, Interesting

    I created a PHP script a few months ago that allowed a client to upload StarOffice templates for company documents. Then the script automatically generates documents by pulling data from a database and inserting it into the StarOffice document.

    Was really easy, StarOffice documents are zipped files that contain the XML files. I just unzip'ed the file, inserted the appropriate data into the content.xml file and zipped it back up.
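The unzip, edit, re-zip approach described above can be sketched roughly as follows. This is only an illustration in Python rather than the PHP the poster used, and it makes assumptions that are not in the post: the `{{name}}` marker convention, the `fill_template` and `make_demo_template` helpers, and the tiny stand-in "document" are all hypothetical. A real template would also need each marker to sit in a single text run, since a word processor may split typed text across several XML elements.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_bytes, values):
    """Copy a zipped document, replacing {{name}} markers in content.xml.

    The {{name}} marker syntax is an assumption made for this sketch.
    """
    src = zipfile.ZipFile(io.BytesIO(template_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "content.xml":
                text = data.decode("utf-8")
                for name, value in values.items():
                    # Escape the value so &, < and > cannot break the XML.
                    text = text.replace("{{" + name + "}}", escape(str(value)))
                data = text.encode("utf-8")
            # Every other member is copied through unchanged.
            dst.writestr(item, data)
    return out_buf.getvalue()


def make_demo_template():
    """Build a tiny stand-in archive; a real document has more members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("content.xml", "<doc><p>Dear {{customer}}</p></doc>")
        z.writestr("styles.xml", "<styles/>")
    return buf.getvalue()


filled = fill_template(make_demo_template(), {"customer": "Smith & Sons"})
content = zipfile.ZipFile(io.BytesIO(filled)).read("content.xml").decode("utf-8")
print(content)  # <doc><p>Dear Smith &amp; Sons</p></doc>
```

Plain string replacement keeps the sketch short; parsing content.xml with an XML library would be the more careful choice if the inserted data needs structure of its own, such as table rows.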

    I was absolutely amazed by how easy the StarOffice files were to work with. I'm really excited about the possibilities that are in store for us, especially ones that are better than my little hack.

    Brandon Petersen

    1. Re:PHP Script that generated reports by hattig · · Score: 2, Interesting

      Sounds cool. Now is there a command line tool that can take said resultant XML file and create a PDF from it?

      (would be great for certain automated server applications where there is no display, etc, and running StarOffice isn't an option because you want it automated)

    2. Re:PHP Script that generated reports by awtbfb · · Score: 2, Funny


      It would be nice to not be constantly pestered about TPS Reports. Now where's my red stapler...

  9. Yes, Standardised Financial Reports by jechonias · · Score: 5, Interesting

    The biggest dream that the financial world has ever had with an XML concept has been the concept of standardised financial reports.

    Imagine a world where any financial (excel based or otherwise) report from any public company can be compared with any other company report and we can all be sure of how the figures were calculated and what they mean.

    AND they are fully comparable. And fully importable into any financial package. No longer is any one company dependent on one financial package. Come to think of it there is no way the vendors of such products will ever allow this to happen!!!

    http://www.xbrl.org/

    jech

  10. Command line rendering by pirodude · · Score: 4, Interesting

    If there was a way to render out the open office/star office documents on the command line it would explode in the reporting area. Being able to have the end user making a really nice template and have a perl script fill it then pass it off to a pdf or printer is key.

  11. Reporting is a great use of OOo's XML format. by Gravatite · · Score: 5, Interesting

    My team & I just got done building some billing software for one of our customers, and OpenOffice.org's XML based documents turned out to be perfect for generating reports. Our customer is able to open up the document and change the formatting of any report at will, and then we have some Ruby code on the backend that parses the XML document, fills in all the real data from the database and then uses the CLI interface to OpenOffice to render the document as postscript. It was a quick easy way to get powerful report generation with a format that non-technical people could edit that required just a little bit of glue code on the backend, and it's the XML format that made it all possible.

  12. Difficult by iMMo · · Score: 2, Interesting

    I did take some time and decompress a StarOffice document -- I was attempting to write a couple of modules for manipulating StarDraw images to create dynamic flowcharts.

    It took some time to get up to speed, as the compressed XML is split across four different files (content, meta data, settings and styles). Mostly, I was concerned with modifying the content document.

    Each of the documents is written with space in mind, and for the document I was dealing with, the content was 20K on a single line. I had to process the XML just so I could understand the physical structure. Once that was done, it really wasn't that difficult to manipulate the doc by hand, re-zip the content and open in StarOffice.
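The "process the XML just so I could understand the physical structure" step can be as small as a pretty-printer. A minimal sketch in Python, using only the standard library; the sample markup and the `pretty` helper are made up for illustration and are not the real StarDraw schema:

```python
import xml.dom.minidom


def pretty(xml_text):
    """Re-indent single-line XML so its nesting is readable."""
    return xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")


# Stand-in for a content file that arrives as one very long line.
sample = '<doc><page name="p1"><shape kind="rect"/><shape kind="line"/></page></doc>'
print(pretty(sample))
```

This is meant for reading only. Re-indenting adds whitespace nodes, so it is probably safer to make edits against the original single-line file than to write the pretty-printed version back into the archive.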

    (Unfortunately, I didn't have the time to even start, much less complete, the modules. Damn day job).

  13. True WYSIWYG HTML editor by Delirium+Tremens · · Score: 2, Interesting

    XML developers and Web designers are now able to work on some XML-to-HTML transformer that matches closely what the average office user is spending his time creating with the WYSIWYG Writer program. This could be a nice alternative to Frontpage, for example.
    Of course, OpenOffice 1.1 already comes with a nice HTML tool, but that doesn't stop anyone from trying to do better.

    1. Re:True WYSIWYG HTML editor by delta407 · · Score: 2, Interesting
      are now able to work on some XML-to-HTML transformer that matches closely what the average office user is spending his time creating
      The guys at Typo3 have done exactly this. They wrote an extension that takes a normal Office 2003 XML document (like this one) and displays it as normal HTML (like this). The resulting HTML is subject to the same rules as all of the other HTML produced by Typo3, which means the appearance of everything can still be changed by modifying a template.

      Typo3 has always been feature-rich (though terribly complex), and an XML-based document interchange system that can handle documents made in common word processors is a very useful feature indeed.
  14. Automatic Generation of Pretty Reports by pjack76 · · Score: 5, Interesting
    You know, with charts and graphs and your corporate logo on them. The charts and graphs are populated from a database somewhere. Suitable for your board report.

    I bring it up because my organization paid Crystal reports $10,000 to be able to do this. If I could have written a little perl script that connects to the database and emits an OpenOffice doc, then I could have saved the organization ten thousand dollars, and saved myself a world of pain. (The only thing more evil than Crystal Reports is crystal meth.)

    You might be wondering why I wouldn't just use HTML and some library that automatically creates chart PNG images -- the reason is we have to email the report to our board members because they're demanding like that. So we use Crystal to generate pretty PDFs with all the charts. We also let the board members log into our system to generate their own reports via the web, which they can then email to the group.

    So having an XML-based document format for this would be wonderful, especially if OpenOffice would provide a command-line utility for converting from OO format to PDF.

    --

    Wow, a lucrative publishing contract! I don't have to be evil anymore. --Meteor

    1. Re:Automatic Generation of Pretty Reports by mabhatter654 · · Score: 2, Insightful

      Only problem is that it doesn't import any metadata: hyperlinks, bookmarks, etc. It's just a cold rip of the pages. That limits its usefulness because you can't do anything with the resultant PDF [i.e. HR manual, reports, manuals], just look at it. That's severely limiting for corporate use.

    2. Re:Automatic Generation of Pretty Reports by merlin_jim · · Score: 4, Funny

      The only thing more evil than Crystal Reports is crystal meth

      Funny you should mention that... I'm at work right now (10:00 PM local time; been here since 9:00 AM) for that very reason! And I'll give you a hint, I've never touched crystal meth

      --
      I am disrespectful to dirt! Can you see that I am serious?!
    3. Re:Automatic Generation of Pretty Reports by Ugmo · · Score: 2, Interesting

      I used to make PDFs with Perl scripts from database reports. I made HTML documents from the database queries and then used HTML2PS to make Postscript files. I could make PDFs from the Postscript files; see GSView, it comes with a script, ps2pdf. The results were mailed to interested parties.

      I made use of "Programming Web Graphics with Perl and GNU Software" O'Reilly Book and some extra research on the Web. It was mostly a pretty print of lots of HTML tables as PDF's + text.

      Some customers demanded Word docs.
      I tried using RTF to produce Word doc files and found it was easier to output HTML and put a .doc extension on the file. I found MS-Word will automatically open it up and it will look nice.
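The ".doc extension on an HTML file" trick is simple enough to sketch. The Python below is an illustration only: the `rows_to_html` helper, the sample rows and the file name are hypothetical, and whether a given version of Word opens the result without prompting is the poster's observation, not something this sketch checks.

```python
import html
import os
import tempfile


def rows_to_html(headers, rows):
    """Build a minimal HTML page containing one table."""
    head = "".join("<th>%s</th>" % html.escape(str(h)) for h in headers)
    body = "".join(
        "<tr>%s</tr>" % "".join("<td>%s</td>" % html.escape(str(c)) for c in row)
        for row in rows
    )
    return ('<html><body><table border="1"><tr>%s</tr>%s</table></body></html>'
            % (head, body))


# The rows would normally come from a database query.
page = rows_to_html(["Dept", "Total"], [["R&D", 1200], ["Sales", 950]])

# The content is plain HTML; only the file name says ".doc".
path = os.path.join(tempfile.mkdtemp(), "report.doc")
with open(path, "w", encoding="utf-8") as f:
    f.write(page)
print(path)
```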

      I did not output graphs. You could try using Gnuplot to output graphs in Postscript. With a little cutting and pasting of the Postscript files (tables and text from HTML2PS in one file, graphs from gnuplot in another), you could paste them together with perl and turn the whole thing into a PDF (html2ps, then ps2pdf). That should produce something, though I do not know if it would duplicate your Crystal Reports.

  15. Word to RTF to XML to HTML by PeterHammer · · Score: 5, Interesting

    At my company, once a failed startup with new life under the wings of a huge corporate parent, we have been using a homebrewed Web publishing system that takes Word 2000 or XP documents, saves them in RTF format, then uses a utility created by Majix to transform the document to XML. From there we use perl, and some XSL to get the document into XHTML combined with some JSP to produce documents that we deploy on our production env. The good part: the system was entirely free of license fees (other than office and Windows of course). The bad: it was a pain in the behind to get all the parts together.

    The steps to produce valid XML from Word are the biggest hack I have ever been a part of as an engineer. We had to write a custom VB DLL we run inside (what else) an IIS server which takes the documents uploaded by authors, then saves the documents as RTF. Control is then handed over to Tomcat, which takes the RTF and uses some custom classes that make Majix a server to transform the documents into XML. All in all we had to use VB, VBA, Java, JSP; two separate server configurations (IIS and Tomcat) and a bunch of really ugly glue to stitch all the parts together.

    I for one, and I am sure I speak for my entire team, would love a solution which saves us this ugly kludge.

  16. Two Things... by Serapth · · Score: 2, Interesting

    First Off
    Microsoft did not drop the ball with XML. Microsoft disappointed the slashdot crowd by not going completely open... geee...... big shock there. Microsoft maintains dominance of their office suite by controlling the file formats behind it. Opening that up without reason would be absolutely stupid from a business point of view. Granted, it's an unpopular stance, but that doesn't make it any less true. MS played along with the XML game to be able to use XML as a buzz word... and in some ways, they truly have embraced XML... just not in their holy cash cow called Office. Take a look at Visual Studio (dot) Net, and you will see how strongly MS has in fact embraced XML.

    Secondly...
    XML is perhaps one of the most over hyped technologies ever. Self describing datatypes are nothing new. The only really remarkable thing about XML is how embraced by the industry it was. In all honesty... the difference between XML and CSV files really isnt that signifigant. Granted... XML is far beyond anything a CSV ever did, but they all present the same result. In the current work environment I am in, all our enterprise systesm support input/output now via CSV. In addition, im in the auto industry, so the whole hype of Webservices+XML really isnt that special either. RIght now, they have ANX and EDI... granted... XML + Web Services would be much more straight forward... but in 20+ years of evolution... has it really come that far?

    Sorry for the anti-status-quo opinion, but I can't help but believe that XML is way overhyped. Useful... sure... but definitely overhyped!

    1. Re:Two Things... by TummyX · · Score: 4, Insightful

      What are you talking about?

      CSV? LOL.

      Does CSV have a transformation language (XSLT)?
      Does CSV have an easy to use parser & object model (SAX, DOM)?
      Does CSV have an in document addressing language (XPATH)?
      Does CSV have a standard way of supporting hierarchical data?

      Just cause you think it's overhyped doesn't mean it isn't worth every bit of that hype. I've been using XML since 1998. I shudder when I think about the pre-XML days.
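
To make the contrast with CSV concrete, here is a minimal Python sketch of two of the points above: hierarchical data and in-document addressing. It assumes a made-up order document (the element and attribute names are hypothetical) and uses only the limited XPath subset that the standard library's ElementTree supports, not a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: each order nests a variable number of items,
# which a single flat CSV row cannot express without extra conventions.
SAMPLE = """<orders>
  <order id="1001">
    <customer>Acme</customer>
    <items>
      <item sku="A1" qty="2"/>
      <item sku="B7" qty="1"/>
    </items>
  </order>
  <order id="1002">
    <customer>Globex</customer>
    <items>
      <item sku="A1" qty="5"/>
    </items>
  </order>
</orders>"""


def skus_for_order(xml_text, order_id):
    """Address the items of one order with an XPath-style expression."""
    root = ET.fromstring(xml_text)
    path = f"./order[@id='{order_id}']/items/item"
    return [item.get("sku") for item in root.findall(path)]


def total_quantity(xml_text, sku):
    """Sum the quantity of one SKU across every order in the document."""
    root = ET.fromstring(xml_text)
    return sum(int(item.get("qty")) for item in root.iter("item")
               if item.get("sku") == sku)
```

Neither function contains any parsing logic of its own; the generic parser and the path expression do all the work, which is the part that had to be hand-written for each ad hoc flat-file layout.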

    2. Re:Two Things... by Serapth · · Score: 2, Interesting

      I think you misunderstand me here. You say you shudder to think of the pre-XML days... well, the pre-XML days, well... they were CSV.

      Now... the thing is, many of the things you have mentioned are already expressed by relational databases, which is generally what the CSV file is generated from in a batch-based system. In a lot of ways, that stuff already existed... just not in the file format, but in the process of creating said file format!

      Don't get me wrong, I'm not saying that XML is shitty... I'm just saying that XML is way overhyped. For a replacement for a system that has been around for 20 years... is XML really that special? Does XML really grant us that much beyond what CSV and good databases behind the scenes really help that much??? The proprietary nature of XML schemas really doesn't make the standard all that open, now does it? It has the capacity of being an open standard, but on the whole, you often need to know the format in advance... how is that much different from what CSVs and standard batch outputs have already presented?

      XML is not much of a step forward really... it is a step forward no doubt... but perhaps the best solution would be to make the data self-enacting. Namely, couple the logic to the data... so that code and data can exist as one.

    3. Re:Two Things... by TummyX · · Score: 2, Insightful


      Does XML really grant us that much beyond what CSV and good databases behind the scenes really help that much???


      Yes because XML fits in places where databases aren't even worth considering. If you think XML is a replacement for relational databases then you're a bit lost IMO.

      How many generic CSV parsers are there? Are the fields (tabs?) self describing?

      Think of an OS and applications today and the various files they use. Think of configuration files, shortcut files, bookmark files, document files, project files etc. Think of all those files that have until recently all been stored in proprietary, hard to interpret and sometimes buggy binary files.

      Yeesh.

      XML is a huge step forward.

  17. Ease of XML Document Formats by DJ+Rubbie · · Score: 4, Interesting

    XML does make it extremely easy to create documents on the fly, whether a plain old document or a slideshow presentation, all it needs is some template XML, original text, and some programming language to put it together.

    I wrote a song lyric storage system using PHP and MySQL, and I had the idea to have it be able to be put onto a slideshow to teach it to a group of people (or whatever). With the XML format provided by OpenOffice.org, I was able to quickly put it together and show it off, impressing quite a few people in the process. Of course, those people think Word/PowerPoint run the world, and the file format is all but a mystery to them. Hence having something generated on the fly via a webpage has its cool factor, and not to mention it was a good chance to introduce this free word processing suite to them. Also a good chance to tell them that if I were to rely on ASP/PowerPoint it would have costed much, much more.
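
The "template XML plus original text plus some programming language" recipe can be sketched in a few lines. The original system used PHP and the real OpenOffice.org presentation format; the Python below is only an illustration, and the `presentation`, `slide` and `line` element names are hypothetical stand-ins, not the actual OpenOffice.org schema.

```python
import re
import xml.etree.ElementTree as ET


def lyrics_to_slides(title, lyrics):
    """Build one <slide> per verse; verses are separated by blank lines.

    The element names are invented for illustration. A real generator
    would emit the office suite's own vocabulary instead.
    """
    root = ET.Element("presentation", {"title": title})
    for verse in re.split(r"\n\s*\n", lyrics.strip()):
        slide = ET.SubElement(root, "slide")
        for text in verse.splitlines():
            ET.SubElement(slide, "line").text = text.strip()
    return ET.tostring(root, encoding="unicode")
```

Calling `lyrics_to_slides("Song", "first line\nsecond line\n\nchorus")` yields a document with two slides. Because the library builds the tree, escaping characters like `&` in the lyrics is handled for you.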

    Open document format is the way to go in the future, because it definitely allows interoperability.

    --
    Please direct all bug reports to /dev/null
  18. Web-Document Templates; Charts; Presentations... by Anonymous Coward · · Score: 2, Insightful

    There are many uses, besides simply having a format that multiple programs can open. Besides, when new features are added to the format, the older software could ignore those tags, somewhat like HTML has been doing. Then you get the ability to still open newer variations on the format. Not to mention make it easier to convert between them, and add an XSLT to an older app to "update" it to support the newer format better.

    A few off the top of my head:
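
The "older software ignores the newer tags" idea might look roughly like this in Python. The tag vocabulary is hypothetical; the point is only that a reader written against version 1 of a format can keep what it understands and skip the rest instead of failing.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary understood by the "old" version of the reader.
KNOWN_TAGS = {"title", "para"}


def read_with_old_reader(xml_text):
    """Return (understood, skipped): content the old reader can use,
    and the names of newer tags it silently passed over."""
    root = ET.fromstring(xml_text)
    understood, skipped = [], []
    for child in root:
        if child.tag in KNOWN_TAGS:
            understood.append((child.tag, (child.text or "").strip()))
        else:
            skipped.append(child.tag)
    return understood, skipped
```

Given a document that adds, say, a `chart3d` element the old reader has never heard of, the title and paragraphs still load and only the chart is lost.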
    online services generating template documents; such as online resume creating websites.

    Draw charts in a GOOD charting program instead of the crap these office programs have.

    Generate presentations from outlines or databases, create videos from presentation files

    For the small-time database software, the database could be imported into other database software, or converted to SQL or be translated into just about anything.

  19. Is it just me... by cca93014 · · Score: 3, Funny
    or is XML good for the following things:
    • moderately useful at providing a very basic cross platform information transport.
    • very useful when being mentioned by PHBs in meetings with CEO/Investors in an attempt to look knowledgeable, bleeding edge and worthy of their job/salary
    • exceedingly useful when being mentioned by stock analysts to pump a company

    I mean, come on. It's just a standardised file format. That's all it is, OK?
  20. Agreed.. by msimm · · Score: 4, Insightful

    And before anyone tries to point out the cost/open source issue: In business that doesn't mean squat. Trying to sell something for free is the wrong attitude, businesses don't want to rely on good will. Kudos to all the dual-licensed projects out there that have learned how to play both sides of the fence.

    --
    Quack, quack.
  21. OMFG someone with sense by DrSkwid · · Score: 4, Funny

    Ron Minnich at lanl described this one also (though we weren't talking about XML)

    -----
    You want to make your way in the CS field? Simple. Calculate rough time of
    amnesia (hell, 10 years is plenty, probably 10 months is plenty), go to
    the dusty archives, dig out something fun, and go for it.

    It's worked for many people, and it can work for you.
    ----
    if you must

    So get ready for all the gee whizzery now the new kids have "found" plain text.

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  22. You must manage, force use of limited metadata by g8orade · · Score: 4, Interesting

    I helped spec out a document management metadata database 18 months ago for an engineering firm that wanted to catalog its files. They started out wanting just to categorize their CAD drawings, then decided to include all types of project files.

    Our solution was a tcl front end that forced the entry of a minimal amount of metadata *during file creation,* to be picked from preset categories and subcategories. We also provided for free text entry but that was to be used only after the other fields.

    The points are
    a) The general metadata categories were known; the engineering tasks weren't new.
    b) No one is going to go back after the fact and enter the metadata. You have to integrate its entry into the new file work procedure.
    c) It's got to be as easy as file/new in a GUI.
    d) Its utility has got to be very very apparent when juxtaposed with a subdirectory / filename scheme.
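
The original front end was written in Tcl and is not shown here. As a rough Python sketch of the same rule (no file gets created until a category and subcategory have been picked from preset lists), with invented category names:

```python
# Hypothetical preset categories; a real deployment would load these
# from whatever list the firm's administrators maintain.
PRESET_CATEGORIES = {
    "drawing": {"structural", "electrical", "site plan"},
    "report": {"inspection", "estimate"},
}


def new_file_record(filename, category, subcategory, notes=""):
    """Refuse to create a record unless the required metadata is valid.

    Free-text notes are optional and come after the preset fields.
    """
    subcategories = PRESET_CATEGORIES.get(category)
    if subcategories is None:
        raise ValueError(f"unknown category: {category!r}")
    if subcategory not in subcategories:
        raise ValueError(f"unknown subcategory for {category}: {subcategory!r}")
    return {
        "filename": filename,
        "category": category,
        "subcategory": subcategory,
        "notes": notes.strip(),
    }
```

Hooking a check like this into the file/new step is what makes point (b) work: the metadata exists from the first save, so nobody has to go back and add it.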

  23. It's called troff by DrSkwid · · Score: 2, Informative

    and we've had it since most /.ers were born

    then there was postscript

    now XML

    whee, I have candyfloss in my hair

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  24. Microsoft Dropped the Ball? by Carnage4Life · · Score: 5, Interesting
    Now that MS has dropped the ball on the XML Office front,
    I'm curious, how did Microsoft drop the ball with respect to other XML-based Office suites? The linked article points to a report that the ability to import user-defined XML formats into a form that can be understood by the primary Office products is an Enterprise feature. However loading or saving documents using a default XML format is in the base versions of Office and in fact was in the last version of Office given that Excel had a documented XML Spreadsheet Format.

    Is anybody out there writing Perl/Java/whatever programs to take advantage of StarOffice XML?
    Not me but I am writing C# apps that make use of Excel's XML format. I wrote about using XSLT on the Excel XMLSS format in my blog a few months ago when I had to update date values in certain columns. I also posted the XSLT stylesheet.

    Disclaimer: I work on the XML team at Microsoft but not directly with Microsoft Office.
    1. Re:Microsoft Dropped the Ball? by YouAreATool · · Score: 4, Informative

      At this point, people should realize /. articles are mostly fretards talking out their ass. I too read this article, thinking: wft? As I am writing this comment, I'm looking at my (beta) Word 2003 file save dialog and an example XML doc I just made. It round-trips all formatting and junk in the XML format. It has a "save data only" checkbox in the saveas dialog, and can support xsl transforms (you supply the xsl) on export. If I cared, I think I could make it export OpenOffice format pretty easily. The high-fidelity XML file has a lot of junk, but it's all XML.

  25. Yup, people are by amblin · · Score: 4, Informative

    Take a look at AxKit's OpenOffice filter.

  26. Re:The more things change... by NumLk · · Score: 2, Insightful

    I fondly do remember WordPerfect's Reveal Codes feature. While this is more a reflection on the simplistic nature of WordPerfect (and other word processors of the day), being able to see all of the formatting codes as they appeared in a document was great help when trying to format a document to look a certain way, but have it turn out completely different. Also, if I remember correctly, you could even type in the codes exactly where you wanted them to appear.

    --
    Children in the backseats don't cause accidents. Accidents in the back seats cause children.
  27. XML and MS Office by Mr.+Ophidian+Jones · · Score: 3, Informative

    I guess there's XML and there's XML and getting between them is not necessarily easy.

    Microsoft made a big deal about the most recent versions of Office writing out XML, but that was because XML was a buzzword, sounded as if it might be more open than ".doc", and was essentially a selling point.

    From what I've read, people have been underwhelmed with the XML coming out.

    If only a similar set of transformations could be developed for OpenOffice to import and export the XML of the latest version of Microsoft Office. From what I understand, the schema is not documented and the formatting and rendering rules for documents are still kept a private affair, just as it has been for .doc files.

    You're still locked-in, dude!

  28. Docbook XML OOo Filters by Evangelion · · Score: 2, Interesting

    I've been using these XSLT OOo <-> Docbook-XML filters for a little while.

    They work pretty well (if you can manage to get them installed with the broken install instructions) but only for a limited subset of Docbook. There's no support for the programlisting tag, and lists are currently broken.

    If anyone out there has superior XSLT kung fu, getting those two things working would be most appreciated : )

    (I know the basics, but I don't yet have time at work to justify it. Maybe if this project gets done on time...)

  29. Office Automation by merlin_jim · · Score: 3, Interesting

    Well I don't know about Free/Open/Libre or XML development for Office... but I do know about the proprietary APIs Microsoft distributes for Office.

    If you wanna give them a try sometime, assuming you got Windows, VB5+, and Office installed... just add Office to your references (try Microsoft Office in the Project References menu) and give it a whirl. It's fairly easy to program in if you've used Office... most of the concepts that make for a good Office user translate directly into programming concepts for the Office object model.

    And yet Office Automation programmers are in scarce supply.

    Microsoft even offers a cert specifically for Office Automation programmers!

    But I haven't seen too many well written Office applications. My speculation is that it's not for lack of tools, but that it's for lack of concepts. Other than the obvious reporting needs that any large organization has, are there any compelling reasons to spend an afternoon coding an office application?

    I think it is this lack of compelling reasons, and not a lack of easy-to-use programming tools that causes the lack of good free open add-ins...

    --
    I am disrespectful to dirt! Can you see that I am serious?!
  30. Actually, WordPerfect has supported XML for years by Karl_D_Schroeder · · Score: 2, Interesting

    ...Of course, not very well--but it's pretty easy to compile, say, the Docbook 4.1 DTD in WordPerfect and edit moderately complicated documents. Or import... The limitations are that it uses its own formatting system, rather than XSLT; and it uses DTDs instead of schemas, because the technology derives from SGML (which WordPerfect also supports). Arguably, WordPerfect has better support than any of the alternatives within the word processing space (i.e. discounting pure editors such as EMACS).

    --
    Author of Permanence and Ventus, co-author of The Claus Effect and The Complete Idiot's Guide to Publishing SF.
  31. Putting the cart before the horse... by EricTheGreen · · Score: 4, Insightful

    Bemoaning the lack of XML-based magic goodness in corporate document processing assumes that a corporate document base exists which a) follows predictable content and structural patterns to allow automated processing, and b) is structured and rigorous enough to do meaningful processing against, an assumption which frankly doesn't hold water in too many places.

    For most of the office document world (at least the world I work with regularly), most documents are unique in both structure and content and I as a programmer can make only the most basic of assumptions regarding what a program can expect to find within the content bundle. Sure the XML gives me a nice set of rules to rely on for breaking the document into parts and reading it in. But it doesn't do a whole lot to ensure that, say, two spreadsheets follow similar content assignment conventions. Most places can't get two managers to agree on the form and structure of a basic memo, or even get the same individual to repeatedly use a consistent structure in all his/her business communications.

    Most organizations need to work on a few things before this type of processing will be useful in the large. Two particular areas would be: a) consistent use of metadata within document definitions to facilitate querying and filtering, and b) more sophisticated use of template functionality beyond just ensuring every page has the same graphic in its header.

  32. Assumption by m00nun1t · · Score: 2, Interesting

    I still don't get this thing about MS dropping the ball. I've played with Office 2003, and the XML features in particular (mostly Word & Infopath, not the other programs) and I think they are quite well done.

    Word has two different modes. One is where you can save an ordinary Word document in an XML format. This is the one /. goes on about mostly. Yes, it's pretty ugly XML, but you are trying to represent non-structured data in a structured format - of course it's going to be ugly. But it is documented & there is a publicly available XSLT from Microsoft to work with it. The other mode is to import an XSD and tag up the document as you like. You can save this in "rich" mode (with all the Office formatting - unstructured again) or "clean" mode in which the XML is as pure as your XSD is.

    InfoPath simply rocks. Where else can you create an end-user-friendly UI that outputs clean XML (with XHTML islands if you choose) and will submit directly into a web service & make the whole thing start to end in a few minutes (for a simple form, of course)?

    I just don't get it. Seems like mindless MS bashing to me.

  33. Re:"cost", not "costed" by mabhatter654 · · Score: 2, Insightful

    Hey SlashLords! I humblely request We need a "-2 GrammerNazi" to get rid of these!

  34. The two stages we haven't reached yet by Anonymous+Brave+Guy · · Score: 4, Insightful

    The parent post is right on the money here.

    Right now, I don't want flashy, XML-driven power apps. I'd settle for a word processor where I can produce my document with minimal fuss and good quality results. Apparently the vast majority of other word processor users agree with me, because I don't see any big uptake of ueber-powerful macro systems, manipulation tools based on super-flexible file formats, or any of the other much-promised stuff.

    The simple truth is that usability is nowhere near the point where these facilities add value yet. Before you can develop powerful extra tools, you have to get the basics right:

    • a clean but powerful UI (no, this is not impossible)
    • good basic navigation and editing capabilities
    • good basic structure and formatting controls
    • good basic tools (spell check, word count and mail merge would probably do for a very large subset of WP users).

    These are essential for a serious document preparation system, yet no currently popular WP, commercial or free, even comes close to doing them all well. The serious people universally use either DTP packages or typesetting systems, and there's a reason for that.

    When we reach the stage where a word processor can do these things well, without the user ignoring stylesheets because they're too awkward, having to look up the help every time they do a mail merge or finding that limitations in the document structure support prevent you doing what you want to at all in a non-trivial document, then we'll be getting to the stage where more powerful "workflow" tools might be of real benefit.

    The second stage, of course, is developing the tools to create those workflow tools, and making them sufficiently usable themselves that people actually take advantage of the advanced capabilities. Right now, we have some awesome-sounding automation tools available, but who really uses them? Not many people, IME. Much of the problem is that the automation tools themselves are, like the applications within which they live, simply too much effort to bother with.

    Give me a usable basic WP and usable tools to automate it (XML-based or otherwise) and I will move the document creation world. Until then, don't call us...

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    1. Re:The two stages we haven't reached yet by Pfhreakaz0id · · Score: 4, Interesting

      I agree with you SOOO much. Often times, it seems applications are written by programmers/computer geeks FOR computer geeks. I work on a workflow-based web application (It uses oracle workflow). We recently completely redid the app to do away with the Oracle-generated web pages for "notifications" (stages in the workflow) to do our own and send messages to the engine via API. Why? Our users just didn't "get" the workflow concepts and we had to design vastly more complicated UI that had pictures, etc.

      And yet we met with massive resistance from the other IT groups... "Why are you doing that, workflow does that", "that's a training issue" (code phrase for 'the users are stupid'), "don't you know how to say no?" and (getting to your central point) "you've dumbed it down. Your application doesn't have any of the powerful search, etc, features the workflow web interface has" (never mind NO ONE used these things).

      I think it was a piece from Douglas Adams who told a story of someone he knew using Word who wanted all the junk removed from Word's menus that he didn't use. He showed him how to remove menu items thru customization and he ended up with just Open, Save, Bold, Italic, Print and Spell check.

    2. Re:The two stages we haven't reached yet by mcdesign · · Score: 2, Insightful

      Word processing programs are still far too stuck on the typewriter way of doing things. They will never improve until they ditch that metaphor. Page layout programs have a much better approach. If you want to put that text box 10.123mm from the top of the page that is just fine in a page layout program. If you want to overlap your text boxes, fine as well. Many Word users seem to be spending far too much time wrestling with the Word way of doing things rather than getting on and producing the document.

    3. Re:The two stages we haven't reached yet by WoTG · · Score: 2, Insightful

      I generally agree. However, I am one of those users who at one time or another uses a lot of those weird features that "nobody ever uses". Macros, comparing documents, embedding stuff, mail merges, etc. I just did a quick browse of my Word 2000 menu bar, and the only things that I don't recall using are various wizards like auto-summarize, auto-format, and letter-wizard. The thing is, that I don't think I'm that unique in using a wide swath of features. True, most of the time 90% of the features are not used in a particular document, but over the course of 20, 30, or 50 documents, a whole lot of features are used.

      One idea that I've been thinking about lately, is having 2 or 3 basic modes of operation; something like Novice, Intermediate, and Expert. And make it VERY obvious how to switch between modes. In Novice mode, lock down all the toolbars, don't auto-hide menu options (not that I care for that feature at all!), maybe make the help features come up quicker(?). For the other modes, let varying amounts of the features get displayed.

      Eventually everyone would end up in Expert mode, but it would be a nice and gradual transition. This doesn't have to be too hard to setup... theoretically, someone could probably create an add-on to MS-Office or other suites to customize the appearance...

    4. Re:The two stages we haven't reached yet by Sri+Lumpa · · Score: 2, Funny

      "Open, Save, Bold, Italic, Print and Spell check"

      "I suppose the funny part was that he forgot "Close" ? :-)"

      Nah, it's Word, it's got the automatic shutdown feature (otherwise known as crashing).

      --
      "The obvious mathematical breakthrough would be development of an easy way to factor large prime numbers." Bill Gates,
  35. xml - pdf by jefu · · Score: 2, Interesting
    XML to PDF can be done with XSLT outputting XSL-FO and then an FO-to-PDF translator such as FOP.

    That probably sounds icky and scary, but should not be all that hard.

    I don't know what the formats are, but there's a whole pile of flexibility in XSL and XSL-FO, so building a very accurate version could take some fiddling. But producing a close approximation is probably very straightforward.
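
For a feel of the first half of that pipeline, here is a hedged Python sketch that turns a tiny source document into XSL-FO formatting objects. Two caveats: Python's standard library has no XSLT processor, so the tree is built directly rather than by a stylesheet, and the source vocabulary (`doc`, `para`) is invented. The output would still need a separate FO formatter to become a PDF.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Qualify a tag name with the XSL-FO namespace."""
    return f"{{{FO_NS}}}{tag}"


def to_xsl_fo(xml_text):
    """Map every <para> of a hypothetical source document to an fo:block
    on a single A4 page layout."""
    source = ET.fromstring(xml_text)
    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-height": "297mm",
        "page-width": "210mm",
    })
    ET.SubElement(page, fo("region-body"))
    sequence = ET.SubElement(root, fo("page-sequence"),
                             {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"),
                         {"flow-name": "xsl-region-body"})
    for para in source.iter("para"):
        block = ET.SubElement(flow, fo("block"))
        block.text = "".join(para.itertext()).strip()
    return ET.tostring(root, encoding="unicode")
```

This is the "close approximation" case: plain paragraphs only. Fonts, tables, headers and page breaks are where the fiddling would start.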

  36. Missing some of the points by evil_roy · · Score: 2, Informative

    Formatting can be handled by whatever.

    The strength is in the meta-data. By using XML the doc can be formatted by anything that can understand it. But formatting is not the point.

    The docs can then be referenced in a relational database - searched, indexed & importantly shared and migrated to other indexing systems or stripped.

    The XML 'magic' is very simple. The use of the data is whatever you want it to be. Do you want to restrict access, provide access, record access, implement version control and X-referencing - then using this technology is for you.

    It has sfa to do with troff/groff/cat/echo/print and everything to do with document collaboration and sharing.

    1. Re:Missing some of the points by Doug+Merritt · · Score: 2, Interesting
      Formatting can be handled by whatever. The strength is in the meta-data.

      True! But it is widely under-appreciated that this can and was done even with troff, and still is today in an important way: the "apropos" command that scans for relevant man pages works by looking at a DB built by searching for semantic tags in the man pages.

      This very handy feature would not be possible if troff just did presentation style.
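
As an illustration of the idea only (this is not how any particular apropos or whatis implementation works internally), here is a small Python sketch that builds a keyword index from the NAME section of man page sources and searches it. The sample pages are made up.

```python
def name_entry(man_source):
    """Pull (name, description) out of the NAME section of a man page.

    Relies on the semantic ".SH NAME" heading macro, not on how the
    page happens to look once it has been formatted.
    """
    in_name = False
    collected = []
    for line in man_source.splitlines():
        if line.startswith(".SH"):
            if in_name:
                break  # the next section begins, NAME is finished
            heading = line[3:].strip().strip('"').upper()
            in_name = heading == "NAME"
            continue
        if in_name and line.strip() and not line.startswith("."):
            collected.append(line.strip())
    text = " ".join(collected).replace("\\-", "-")
    name, sep, description = text.partition(" - ")
    return (name.strip(), description.strip()) if sep else None


def build_index(pages):
    """pages: iterable of man page source strings -> {name: description}."""
    entries = (name_entry(source) for source in pages)
    return dict(entry for entry in entries if entry)


def apropos(index, keyword):
    """Names whose name or description mentions the keyword."""
    needle = keyword.lower()
    return sorted(name for name, description in index.items()
                  if needle in name.lower() or needle in description.lower())
```

The search never looks at fonts or indentation; it works because the author marked up what the section is, which is the same property people praise in XML.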

      It's true that this is not the main emphasis of troff, and that one is at the mercy of whoever wrote the macro package, etc, but that's true of XML sublanguages too.

      I'm playing devil's advocate. I realize (and posted elsewhere here) that there's a difference in that XML, when used as intended, is supposed to be primarily about semantics, with style as a secondary transformation, whereas it's the other way around with troff...it's intended for presentation, but people have nonetheless done handy things with it at a higher level of abstraction.

      But still, the point is that nothing XML does is brand new. It just represents new industry awareness of some old good ideas.

      Similar to how Java has popularized the 40 year old notion of doing garbage collection, so now people say GC is "new technology". Not at all. Just being more widely used.

      --
      Professional Wild-Eyed Visionary
  37. You missed the point by spotteddog · · Score: 2, Interesting

    I don't want an "Office Suite" shoved down my throat. I want to use the graphing tool I think is best, I want my favorite email app, I want to use the word processor I like, and the spreadsheet I like, etc. I want to be free to try the newest software without converting everything I might need in the future. If the "office productivity programs" all used XML file formats, I could interchange files from one app to the next easily. I would NOT be locked into a single vendor's "suite" or programming HELL.

    If the apps were using XML, easy migration would be a given, and programmers could spend time "enhancing" the user interface.

    --
    . there used to be a sig here.....
  38. Structure Structure Structure by unfortunateson · · Score: 2, Interesting

    Having an XML representation of a Word (MS, Open, whatever) document as a stream is really no more useful to me than RTF: I can parse them both.

    The better part is when you can structure your document. Not just a heading surrounding a bunch o' paragraphs, but a (to use the stuff I have to work with) Research Report contains a Title Page, a Synopsis, an Introduction, Materials Section, etc. You can't put tables and figures on the title page or Introduction, you can in the Synopsis and Materials Section. TOCs and things like that are created as part of rendition, between the Synopsis and Introduction, without the user messing with it.

    Now even more than storing those sections (which would, in the HTML world, be DIVs and SPANs), I want control over the UI: disable that table button in the title page, even down to where bold and italics can be used.

    Office 2003 has some facility to implement this, but it's kind of awkward -- it's an extension of how their SmartTags work. Generally pretty ugly, to control everything.

    I don't want to use an XML editor, my users know Word, are used to Word Processors, and they cost 1/5 of XML editors, less in bulk licenses.

    I'd be implementing this now, if it weren't for two things: a) I work for a big corporation that never buys into new releases for a couple of years, and b) they're laying me off -- closing all the facilities in Chicago (sigh).

    --
    Design for Use, not Construction!
  39. It's all about the parsers. by SuperKendall · · Score: 3, Insightful

    XML can more easily represent complex data structures than CSV, but that's not the main benefit.

    Nope, the real revolution was in creating standardized parsers. I spent many an hour with Lex and Yacc churning out parsers for many custom file formats. Even though XML may not seem the most efficient way to represent things, it's great not to have to write a new parser every time we have a new bit of information to represent in a file. It frees you to think about what data you want in a file instead of directing your file contents to things that will be easy to parse.

    That's why XML is every bit as valuable as it is made out to be, just not for the reasons usually given...

    --
    "There is more worth loving than we have strength to love." - Brian Jay Stanley
  40. Document Properties Mandatory by AShocka · · Score: 2, Interesting

    You can configure most office suites to display the document properties dialog on save. I'm sure you could also build templates with macros that would check and update these. Yes, it's a real problem and most businesses do not have strategies to address it. It's a document management issue very few address.

    It's a similar problem with web publishing; there is little or no metadata to identify documents. I've always thought that the Dublin Core set would serve as a very good repository for a kind of CVS on the status of documents. Have wanted to build a back end to something like Apache/Cocoon using this model, which would also serve as the data repository for populating both the metadata in the web documents and also all the other data for semantics and accessibility, all done on the fly out of a DC metadata repository.
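
A small Python sketch of the "populate web documents from a Dublin Core repository" idea. It assumes each repository entry is a plain dictionary keyed by Dublin Core element name, and it emits the common `DC.`-prefixed `<meta>` convention for HTML. Only a subset of the fifteen core elements is listed, and the repository itself (storage, Cocoon integration) is out of scope here.

```python
from html import escape

# Subset of the Dublin Core element set, in the order they will be emitted.
DC_ELEMENTS = ("title", "creator", "subject", "description",
               "date", "format", "identifier", "language")


def dublin_core_meta(record):
    """Render a metadata record as HTML <meta> tags for a page head.

    Elements missing from the record are simply left out.
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element in DC_ELEMENTS:
        value = record.get(element)
        if value:
            content = escape(str(value), quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)
```

Generating the tags from one record at publish time means the page and the repository cannot drift apart, which is the main attraction over hand-edited metadata.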

  41. Open standards for XML forms by mdubinko · · Score: 2, Interesting

    One thing the Open Source office suites don't (yet) have much of an answer for is an XML data collection/management system along the lines of Microsoft Office InfoPath. A natural standard for such applications is W3C XForms.

    Read all about it--fully GFDL and online now--from the O'Reilly book at my site.

    .micah

    --
    --- Learn XForms today: http://xformsinstitute.com
  42. Re:Users Expect ... by yuri+benjamin · · Score: 2, Interesting

    I've been thinking about a document management system that has an integrated word processor.
    To create a new document users would first be presented with a DE screen asking for some meta-data (perhaps with some mandatory fields) before being dropped into the more familiar wordprocessor gui.

    Someone with admin rights to the document management system might define the fields that go into the initial DE screen.
    Users might have to choose beforehand whether the document will be emailed, faxed or printed (eg for snail-mail), and the document would be "attached" to a client record, along with any replies (eg by email).

    The "save" feature would be replaced by "save draft" and "save final", because once the document is sent to an external party you need to "freeze" the document as a record of what's been sent.
    Maybe some kind of versioning & rollback would be useful too (something more powerful than undo/redo).
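
The "save draft" / "save final" split with rollback could be modelled along these lines. This is a bare Python sketch with invented class and method names; it ignores storage, users and the metadata entry screen entirely.

```python
class ManagedDocument:
    """Keeps every saved draft and freezes the document once finalised."""

    def __init__(self, metadata):
        self.metadata = dict(metadata)
        self.versions = []
        self.final = False

    def save_draft(self, text):
        if self.final:
            raise PermissionError("document is frozen; it has been sent")
        self.versions.append(text)

    def save_final(self, text):
        """Record the text that goes to the external party, then freeze."""
        self.save_draft(text)
        self.final = True

    def rollback(self, steps=1):
        """Discard the newest drafts and return the version now current."""
        if self.final:
            raise PermissionError("document is frozen; it has been sent")
        if steps < 1 or steps >= len(self.versions):
            raise ValueError("cannot roll back that far")
        del self.versions[-steps:]
        return self.versions[-1]

    @property
    def current(self):
        return self.versions[-1] if self.versions else ""
```

Once `save_final` has run, both further saves and rollbacks are refused, so what is stored is exactly what was sent.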

    I'd do it myself, of course, but I don't have the time or the skills.
    If I ever see a patent application for this idea, I'll point to the /. archive of this post as prior art (although I'm sure this kind of system is already in use in some organisation somewhere).

    --
    You make the mistake of thinking you can educate the fundamental stupidity out of people. You can't.
  43. XSLT by wwi · · Score: 3, Insightful

    In about .5 hrs, I was able to
    extract the content from an
    OpenOffice text document, as
    well as a presentation, and feed them
    into other tools. This without
    trying to read any DTD's. Applying
    more effort would have yielded more
    functionality, but I was in a hurry,
    just trying to get some information
    out with some hierarchy to it.

    Now, extracting the style is a different
    challenge, and of course style
    means different things to different
    people. But it is simply madness to try
    to extract content from Word
    and Powerpoint files for use elsewhere.

    Oh yes, I used Saxon. Nice product.
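
    For readers who want to try the same kind of quick extraction without XSLT, here is a rough Python sketch (a stand-in for the Saxon approach above, not a reconstruction of it). It assumes the document is a zip package containing a content.xml whose headings and paragraphs are elements with the local names h and p; matching on local names sidesteps hard-coding a namespace URI. The sample package and its urn:example namespaces are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified names."""
    return tag.rsplit("}", 1)[-1]


def extract_outline(package):
    """Return (kind, text) pairs for headings and paragraphs in content.xml.

    `package` is a path or file-like object for a zipped office document.
    Assumption: headings use local name 'h' and paragraphs use 'p'.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    outline = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name in ("h", "p"):
            # itertext() also collects text inside nested spans.
            text = "".join(elem.itertext()).strip()
            if text:
                outline.append(("heading" if name == "h" else "para", text))
    return outline


# Hypothetical sample content; the namespace URIs are placeholders.
SAMPLE = b"""<?xml version="1.0"?>
<office:document-content xmlns:office="urn:example:office"
                         xmlns:text="urn:example:text">
  <office:body>
    <text:h text:level="1">Plan</text:h>
    <text:p>First <text:span>point</text:span>.</text:p>
  </office:body>
</office:document-content>"""


def make_sample_package():
    """Build an in-memory zip that mimics the assumed package layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", SAMPLE)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for kind, text in extract_outline(make_sample_package()):
        print(kind, text)
```

    This only recovers a flat outline. Paragraphs nested inside other paragraphs (footnotes, for example) would show up twice, and styles are ignored entirely.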

  44. "any programmer with a Perl script..." by eclecticIO · · Score: 2, Interesting

    "and a bit of intelligence"

    Using a MS Word template, ActiveState Perl, and a number of modules including Win32::OLE I created a documentation generation system that pulled information from a database and created a Word document with dynamic headers, footers, formatting, content, etc. I used it to create 1000+ password-protected, pre-formatted Word documents that we provided to the client. Any time the format needed updating or any data needed to be changed all I had to do was rerun the Perl script rather than update all of those docs.

    I'm not going to say that this was easy by any means; it took quite a bit of research and tweaking to finally get right. XML would, no doubt, make this task easier but I don't necessarily think it is the panacea that will FINALLY permit us to automate docs and reports that need to be generated and shared. My point is that with "a Perl script and a bit of intelligence" document automation is something that can be done now.

  45. Migrating file formats by SgtChaireBourne · · Score: 3, Interesting
    I'm currently working on a project to retrieve documents accross a company's backed-up data from the past 10 years, and there is very very little metadata available for us to do any searching on.
    Yes, but you can't claim that an absence of metadata is due to a failure to write metadata: I myself used to keep a lot of metadata in my text processing documents and found that if you migrate periodically to new versions of the MS-Word format suite, you will periodically lose the metadata. No errors, no warnings, it's just gone. XML in the MS-Office suites is not going to come to the forefront. Microsoft, an Oasis member, backing out of the Oasis standard shows where they are heading. The misdirection about the schema should remove any doubt.

    On the other hand, OO.o's XML format + schema will be available even to competitors and theoretically beyond the life span of OO.o. One way for OO.o to encourage users to think in a structured way is through style sheets. Style sheets and document templates can save a lot of wasted time and effort. But again, what would people do with the spare productivity if formatting were done in 5 minutes, instead of spending 2 days manually formatting and reformatting various reports and presentations?

    --
    Beta is broken and the link to classic doesn't work. Stop wasting our time or there won't be anybody left here.
  46. I'm doing it right now by bertilow · · Score: 2, Interesting

    > Is anybody out there writing Perl/Java/whatever programs to
    > take advantage of StarOffice XML?

    Yes, actually I started doing that yesterday: I'm using Perl and XSLT to build documents in StarOffice XML (or actually OpenOffice.org XML), converting some 500 XHTML pages into one huge OpenOffice.org document. It's amazingly easy!
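
    For anyone curious what the core of such a conversion looks like, here is a small Python sketch (standing in for the Perl and XSLT mentioned above, which are not shown here). It maps XHTML headings and paragraphs from several pages onto text:h and text:p elements. The namespace URI and the text:level attribute are my assumptions about the OpenOffice.org 1.x format, and the result is only a list of body elements, not a complete document package.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumption: text namespace used by OpenOffice.org 1.x content.xml.
TEXT_NS = "http://openoffice.org/2000/text"
ET.register_namespace("text", TEXT_NS)

# XHTML heading element -> assumed outline level.
HEADINGS = {"h1": "1", "h2": "2", "h3": "3"}


def xhtml_to_text_elements(xhtml_pages):
    """Flatten headings and paragraphs from XHTML pages into one element list."""
    out = []
    prefix = "{" + XHTML_NS + "}"
    for page in xhtml_pages:
        root = ET.fromstring(page)
        for elem in root.iter():
            if not elem.tag.startswith(prefix):
                continue
            name = elem.tag[len(prefix):]
            if name in HEADINGS:
                new = ET.Element("{%s}h" % TEXT_NS)
                # 'level' is an assumed attribute name for the heading depth.
                new.set("{%s}level" % TEXT_NS, HEADINGS[name])
            elif name == "p":
                new = ET.Element("{%s}p" % TEXT_NS)
            else:
                continue
            # Inline markup (em, strong, ...) is dropped; only text is kept.
            new.text = "".join(elem.itertext()).strip()
            out.append(new)
    return out


# Hypothetical input page for illustration.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Intro</h1><p>Hello <em>world</em></p>"
    "</body></html>"
)

if __name__ == "__main__":
    for element in xhtml_to_text_elements([PAGE]):
        print(ET.tostring(element, encoding="unicode"))
```

    Real XHTML pages that use named entities such as &nbsp; would need a DTD-aware parser or some preprocessing before ElementTree accepts them.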

  47. Goldfarb's Conjecture by RobotWisdom · · Score: 2, Interesting
    People need to wake up to a simple fact-- XML is for databases, not for documents. (I first pointed this out in 1998.)

    The gigantic propaganda campaign about the "wonderful new things" that semantic markup would make possible was always just a masturbatory fantasy by people who'd never implemented anything, encouraged by SGML contractors who saw an opportunity to broaden their target market.

    At the root of this delusion is what I call "Goldfarb's conjecture"-- the claim that document styles are superficial representations of underlying semantics. If Goldfarb were right, then tagging document semantics would be no harder than tagging styles, so this sort-of-works for titles and highlighting.

    But hardly any other semantics have associated styles, so tagging them becomes sheer drudgework for almost no payoff. It's absurd to have to tag every name as a name, every place as a place, etc. This metadata belongs in headers, not as embedded tags.

    So the real outcome of the XML-scam is that the effort to add metadata to webpages has been set back at least five years. What should have been emphasized was META headers for: Yahoo topic-category, DMoz topic-category, list of persons, list of places, list of companies, list of things, dates discussed, document type (eg timeline, image gallery, biography, etc).

    1. Re:Goldfarb's Conjecture by Isofarro · · Score: 2, Insightful
      The gigantic propaganda campaign about the "wonderful new things" that semantic markup would make possible was always just a masturbatory fantasy by people who'd never implemented anything,

      So, what have you implemented that's being used by thousands of businesses across the world? Pot. Kettle. Black, Mr failed AI expert.

      So the real outcome of the XML-scam is that the effort to add metadata to webpages has been set back at least five years.

      Adding metadata to webpages is deceased. It has been for over half a decade (Yes it is 2003 this year). It's a dead donkey, no need to flog it any more.

      What should have been emphasized was META headers for: Yahoo topic-category, DMoz topic-category, list of persons, list of places, list of companies, list of things, dates discussed, document type (eg timeline, image gallery, biography, etc).

      Utterly useless. Listing a series of dates does nothing that a simple Perl script can't extract. Now linking a date to an actual place - now that's something useful. And your above example fails that simple relationship. Screenscraping ain't gonna save you - it's far too brittle for practical real world use.

  48. Re:MS isn't the only one with a proprietary format by nagora · · Score: 3, Interesting
    Why is no one complaining this much about Adobe Acrobat?

    Maybe because it's not a closed format, hence all the open-source PDF generation programs.

    Frankly, I'd rather see more PDF generation than XML. If I sit down and spend hours designing a book or report it's more important to know that it will appear as designed than that it can be converted into a mass of raw data and presented in any half-arsed way by someone so primitive that they still think PowerPoint is a pretty good idea.

    TWW

    --
    "Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
  49. Re:Word to HTML to XML to HTML by bWareiWare.co.uk · · Score: 3, Informative

    I find the easiest way of getting usable XML out of Word is to use Word's Save as HTML function and then run W3C TidyLib to get rid of all (most) of the M$ crap.

    This leaves you with an HTML-esque document that you can feed to an XSLT stylesheet to get whatever XML you need.

    I did consider using OO to open the Word documents and save them as XML; however, I had trouble with its API (I also had trouble with automating Word, but there I had plenty of bitter experience to draw on).

  50. Actually, I have written a perl script by davidbailey · · Score: 2, Informative

    I recently wrote a Perl script to download multiple congregation church membership directories from our church's website and manipulate them into comma-delimited, tab-delimited, and nicely formatted OpenOffice Calc (spreadsheet) and Writer (word-processor) formats directly from the Perl script. Because the Microsoft formats are closed, I could not output into those formats directly from the script, nor do I feel like reverse engineering the formats to figure out how.

    I then used OpenOffice to save the files as Word and Excel formats for those who don't have access to OpenOffice, but I included a reminder that OpenOffice is free and included a link to the website.

    This would have been impossible without OpenOffice, and I thank them for their work. The final output has headers, footers, special formatting and prints out like a professional document, not roughly formatted text output in courier.
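
    The comma-delimited and tab-delimited part of a job like this is the easy bit in most languages. A minimal Python sketch follows (not the Perl script described above); the field names and records are made up, and real data would come from the downloaded directories.

```python
import csv
import io


def write_delimited(rows, header, delimiter):
    """Serialize rows as delimiter-separated text, quoting fields where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical records for illustration only.
HEADER = ["name", "congregation", "phone"]
ROWS = [
    ["Smith, Ann", "North", "555-0100"],
    ["Lee, Bo", "South", "555-0101"],
]

comma_text = write_delimited(ROWS, HEADER, ",")
tab_text = write_delimited(ROWS, HEADER, "\t")

if __name__ == "__main__":
    print(comma_text)
    print(tab_text)
```

    Letting the csv module handle quoting matters here because names written as "Last, First" contain the comma delimiter themselves.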

  51. Here's why you're wrong by Overly+Critical+Guy · · Score: 2, Informative

    "MS won't stand for an XML file format -- it's human-readable. the last thing MS wants is for their file format to be easily convertible and transformable. it's a pity, because switching Office files to XML would quickly make them insanely useful."

    You people are so biased. Now Office has suddenly "dropped the ball." Of course, that meme will permeate through all Slashbots' thinking, whether or not they've even tried Office 2003.

    Here is a sample XML file. The original message said "This is a <b>test</b> of <b><i><font face="verdana" size="24">XML</font></i></b>."

    NOTE: Slashcode adds random semicolons and other garbage for some reason.

    <?mso-application progid="Word.Document"?>
    <w:wordDocument w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no" xml:space="preserve">
    <o:DocumentProperties>
    <o:Title>This is a test of XML</o:Title>
    <o:Author>Preston Sumner</o:Author>
    <o:LastAuthor>Preston Sumner</o:LastAuthor>
    <o:Revision>1</o:Revision>
    <o:TotalTime>1</o:TotalTime>
    <o:Created>2003-09-18T15:29:00Z</o:Created>
    <o:LastSaved>2003-09-18T15:30:00Z</o:LastSaved>
    <o:Pages>1</o:Pages>
    <o:Words>3</o:Words>
    <o:Characters>20</o:Characters>
    <o:Company>White Goat Studios</o:Company>
    <o:Lines>1</o:Lines>
    <o:Paragraphs>1</o:Paragraphs>
    <o:CharactersWithSpaces>22</o:CharactersWithSpaces>
    <o:Version>11.5604</o:Version>
    </o:DocumentProperties>
    <w:fonts>
    <w:defaultFonts w:ascii="Times New Roman" w:fareast="Times New Roman" w:h-ansi="Times New Roman" w:cs="Times New Roman"/>
    <w:font w:name="Verdana">
    <w:panose-1 w:val="020B0604030504040204"/>
    <w:charset w:val="00"/>
    <w:family w:val="Swiss"/>
    <w:pitch w:val="variable"/>
    <w:sig w:usb-0="20000287" w:usb-1="00000000" w:usb-2="00000000" w:usb-3="00000000" w:csb-0="0000019F" w:csb-1="00000000"/>
    </w:font>
    </w:fonts>
    <w:styles>
    <w:versionOfBuiltInStylenames w:val="4"/>
    <w:latentStyles w:defLockedState="off" w:latentStyleCount="156"/>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
    <w:name w:val="Normal"/>
    <w:rPr>
    <wx:font wx:val="Times New Roman"/>
    <w:sz w:val="24"/>
    <w:sz-cs w:val="24"/>
    <w:lang w:val="EN-US" w:fareast="EN-US" w:bidi="AR-SA"/>
    </w:rPr>
    </w:style>
    <w:style w:type="character" w:default="on" w:styleId="DefaultParagraphFont">
    <w:name w:val="Default Paragraph Font"/>
    <w:semiHidden/>
    </w:style>
    </w:styles>
    <w:docPr>
    <w:view w:val="normal"/>
    <w:zoom w:percent="100"/>
    <w:doNotEmbedSystemFonts/>
    <w:proofState w:spelling="clean" w:grammar="clean"/>
    <w:attachedTemplate w:val=""/>
    <w:defaultTabStop w:val="720"/>
    <w:characterSpacingControl w:val="DontCompress"/>
    <w:optimizeForBrowser/>
    <w:validateAgainstSchema/>
    <w:saveInvalidXML w:val="on"/>
    <w:ignoreMixedContent w:val="off"/>
    <w:alwaysShowPlaceholderText w:val="off"/>
    <w:compat>
    <w:breakWrappedTables/>
    <w:snapToGridInCell/>
    <w:wrapTextWithPunct/>
    <w:useAsianBreakRules/>
    <w:useWord2002TableStyleRules/>
    </w:compat>
    </w:docPr>
    <w:body>
    <wx:sect>
    <w:p>
    <w:r>
    <w:t>This is a </w:t>
    </w:r>
    <w:r>
    <w:rPr>
    <w:b/>
    </w:rPr>

    --
    "Sufferin' succotash."