Migration from PDF, MS Word and FrontPage?
l337hx0r asks: "I've just begun work at a government department which, with 24 branches and five years at its disposal, has managed to create 85,000 proprietary documents. The current IT manager doesn't see a problem with this, and the government recommendations of open formats came as quite a surprise. I want to move this intranet away from WYSIWYG to a logical document structure (DocBook, TeX, XHTML), though the migration tools have been quite disappointing. Can it be done?"
Yes, it can be done.
The easy part is converting and indexing all the docs. Not that it is actually easy, just easier than the rest. What I would do is write scripts to convert them to HTML and put them in a database with a web-browsable front end, building indexes of keywords, accompanied by A LOT of manual labour inserting meta-information about each document.
Almost any document editor these days can import and export HTML.
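To give an idea of the "database plus keyword index" part, here is a minimal sketch in Python. It assumes the documents have already been exported to HTML by whatever converter you end up using; the table layout, the function names (open_store, add_document, search) and the "three or more alphanumeric characters" keyword rule are all made up for illustration, not a recommendation of a particular schema.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def open_store(path=":memory:"):
    # Hypothetical two-table layout: the documents and a word -> document index.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id INTEGER PRIMARY KEY, name TEXT UNIQUE, html TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS keywords "
        "(word TEXT, doc_id INTEGER, UNIQUE(word, doc_id))"
    )
    return db


def add_document(db, name, html):
    cur = db.execute("INSERT INTO docs (name, html) VALUES (?, ?)", (name, html))
    doc_id = cur.lastrowid
    # Assumed keyword rule: lowercase runs of 3+ letters or digits.
    words = set(re.findall(r"[a-z0-9]{3,}", html_to_text(html).lower()))
    db.executemany(
        "INSERT OR IGNORE INTO keywords (word, doc_id) VALUES (?, ?)",
        [(w, doc_id) for w in words],
    )
    db.commit()
    return doc_id


def search(db, word):
    rows = db.execute(
        "SELECT d.name FROM keywords k JOIN docs d ON d.id = k.doc_id "
        "WHERE k.word = ? ORDER BY d.name",
        (word.lower(),),
    )
    return [r[0] for r in rows]
```

The manually entered meta-information would go into extra columns or another table next to docs. This only covers the part a script can do; the manual labour stays manual.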
The hard part is getting people to start using it. They won't insert their new documents, they won't use it to look stuff up efficiently instead of poking around in their hard drives and email archives, they will just keep doing what got them in trouble in the first place.
And the only thing you can do about it is get a new job.
Hope that helps.
As a previous poster said, it can be done. Unfortunately, the biggest challenges aren't technical. Getting conversion filters from all those formats to a new one isn't a problem; in the worst case it's just a matter of money.
What becomes the real stumbling block in such initiatives is the user. As long as the users aren't begged, tempted and forced to produce all new documents in the new format, all the work you do will be in vain.
To get acceptance by all those civil servants, you'll have to
- either accept MS Word documents with Excel objects and PowerPoint presentations embedded in them,
- or transparently convert those kinds of documents into a format you like, all while trying to figure out what relevant data is in there,
- or force the user to use a system that produces properly formatted documents.
The first solution will have you deal with atrocious documents, sometimes with every line terminated by a paragraph break and tables formatted with spaces. To add insult to injury, you'll find PowerPoint objects in them, which include Excel graphs, just because the user likes to add a little arrow pointing to a column. But at least the users will not complain about the new guidelines, because they can happily go on producing their horrors.
With the second solution, converting the documents to a storage format, you'll end up with something like PDF, which just renders the document into its own format and gets rid of all the crap. PDF is well documented and it's quite easy to produce a document reader. This solution is usually acceptable, although you lose the document structure (for the few documents that had one in the first place).
Forcing the user to produce well-formed documents is the most challenging solution. You will basically alienate most of your customers and, if you don't have decades of experience in interdepartmental infighting and enough power to enforce the new guidelines, you'll lose. On the other hand, if you had that experience and power, you wouldn't think about such a project.
What works best is to provide a reasonable framework for the documents and strip everything that falls outside it. That way you take most of the creative potential away from the clients and lock them into a most rigid frame. If you have the luck to get that framework sanctioned by a very high authority (which means tons of unproductive meetings with uninterested and clueless people who have their own axes to grind), the users will bitch and moan, but they won't be able to escape. If that official approval is missing, the framework will be ignored.
You have a very stony road ahead, and depending on your capacity for frustration, you will give up sooner or later. And, when you give up, taking PDF as the standard format isn't such a terrible choice. Been there, got burned. Now I'm older.
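For the curious, "strip everything outside the framework" can be as blunt as a tag whitelist. The following is only a sketch in Python, assuming the documents arrive as HTML; the ALLOWED_TAGS set, the DROP_WITH_CONTENT set and the strip_to_framework name are invented for the example, and a real framework would be whatever your sanctioned guidelines say it is.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical framework: the only structure users are allowed to keep.
ALLOWED_TAGS = {"h1", "h2", "p", "ul", "ol", "li",
                "table", "tr", "th", "td", "em", "strong"}
# Elements whose entire content is thrown away, not just the tags.
DROP_WITH_CONTENT = {"script", "style", "object"}


class FrameworkFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._dropping = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._dropping += 1
        elif not self._dropping and tag in ALLOWED_TAGS:
            # Attributes (fonts, colours, inline styles) are discarded on purpose.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self._dropping:
                self._dropping -= 1
        elif not self._dropping and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._dropping:
            # Text is kept, re-escaped so the output stays valid markup.
            self.out.append(escape(data, quote=False))


def strip_to_framework(html):
    f = FrameworkFilter()
    f.feed(html)
    f.close()
    return "".join(f.out)
```

Tags outside the whitelist lose their markup but keep their text, so nobody can claim you deleted their words. Note that an embedded object that is never closed would swallow the rest of the document in this sketch, which is the kind of thing you would find out about on real input.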
Confused (failed crusader for structured documents)
PS: I love LaTeX and TeX, but trying to introduce them in a Microsoft-oriented place is asking for disaster. You will just exchange unusable Word documents for unusable LaTeX documents. At least the Word documents don't have syntax errors.