Ask Slashdot: What Is the Best Open Document Format?
kramer2718 writes: I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?
Since "best" can be highly driven by circumstances, please explain your reasoning, too.
if you use the APIs supplied by their creators?
http://www.pdfa.org/2011/08/pd...
Let's make like a bird... and get the flock outta here.
I would suggest, unless you have a pressing need to convert them, that you should store the documents in the formats they are uploaded in.
Whenever you convert a document you run the risk of completely messing up the layout, style, etc.
.txt. If you need pretty formatting, fill it with LaTeX tags.
1) Forget the Universal Format approach - your users will kill you for messing up their formatting, and you'll never get complete feature parity
2) Store the docs in their original format
3) Get Apache Solr to search your content
4) You'll be spending a lot of time on #3, so leave time to tinker
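For what it's worth, points 2 and 3 can be sketched in a few lines. This is only a rough illustration, not a Solr setup: `store_original` and `index_text` are made-up names, the toy dictionary index stands in for a real search engine, and the text extraction step (getting words out of a .docx or .pdf) is assumed to happen elsewhere.

```python
import hashlib
from pathlib import Path


def store_original(data: bytes, filename: str, root: Path) -> Path:
    """Save the uploaded bytes unchanged, named by their SHA-256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    suffix = Path(filename).suffix.lower()  # keep .docx, .pdf, ... for later tools
    target = root / digest[:2] / (digest + suffix)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical uploads are stored only once
        target.write_bytes(data)
    return target


def index_text(index: dict, doc_path: Path, extracted_text: str) -> None:
    """Toy inverted index (word -> set of stored paths); a stand-in for a real search engine."""
    for word in set(extracted_text.lower().split()):
        index.setdefault(word, set()).add(str(doc_path))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        index = {}
        # The bytes and the extracted text here are placeholders, not a real document.
        path = store_original(b"fake docx bytes", "Report.DOCX", Path(tmp))
        index_text(index, path, "Quarterly report draft")
        print(index["report"])
```

The original file is never rewritten, so formatting cannot be damaged; only the extracted text feeds the search side.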
WordPerfect document, because it's been consistent for nearly 20 years, it has a simple underlying format, it's more finely granular than HTML, and because I just like obsolete things.
As an IT person, I hate questions like this. There's not enough information to give a solid answer. For example:
* What kinds of documents are you talking about? Text? Photos? Spreadsheets?
* What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?
* What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?
* How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?
* When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?
There may be many other relevant questions; my point is that there just isn't enough detail here. In general, if the most important thing is that you have a printable document that you want to be able to print out from any machine, maintaining the formatting as much as possible, then PDF is a pretty good choice (be sure to embed the fonts and include searchable text!). If you already have a bunch of Word documents and you want the formatting unchanged, and would like the capability to edit the document after it's retrieved, then I'd typically just recommend keeping it as a .docx. It keeps things simple, will be widely supported, and avoids the risk of something going wrong while you're converting to another format. If you like the idea of using .docx because of what I just said, but want something more "open", then ODF is probably worth looking into.
Really, there are only so many choices, and each has advantages depending on your specific needs.
...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...
Development is programmable; Discovery is not programmable. (Fuller)
Let's see ... you want to allow uploading in a large number of formats ... you want to magically turn it into a universal format ... while retaining all aspects of the original ... that will be easily manipulated ... and you want it in an open and documented format? And all for free?
I want one of those too. And a Red Ryder BB gun with a compass in the stock and this thing which tells time. And a new skateboard. And a pony.
Honestly, you're asking for the holy grail of document management systems ... the universal, lossless document format.
I'm not sure it exists. And I'm not sure companies like Microsoft or Adobe would allow it to exist.
Lost at C:>. Found at C.
DocBook allows you to separate out the content from the presentation. You write in XML and define paragraphs, chapters, images, etc. and then leave it to the various stylesheets to drive how it looks when it comes out the other end: PDF, HTML, Word, whatever. The stylesheet makes sure that if some feature is supported (such as hyperlinking from the table of contents to the chapter), it'll be included. Since the content is in plain ol' XML, you can use any kind of XML processor to go through it.
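To make the "any kind of XML processor" point concrete, here is a small sketch using Python's standard library. The sample document is invented and uses un-namespaced DocBook-style element names (`book`, `chapter`, `title`, `para`); real DocBook 5 files carry an XML namespace and are normally rendered with XSLT stylesheets, so treat `outline` and `to_html` as hypothetical stand-ins rather than how the DocBook toolchain works.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample content, for illustration only.
SAMPLE = """\
<book>
  <title>Storage Notes</title>
  <chapter>
    <title>Formats</title>
    <para>Keep the originals.</para>
    <para>Index the text separately.</para>
  </chapter>
  <chapter>
    <title>Search</title>
    <para>Content is plain XML.</para>
  </chapter>
</book>
"""


def outline(xml_text):
    """Return [(chapter title, [paragraph text, ...]), ...] from the XML content."""
    root = ET.fromstring(xml_text)
    result = []
    for chapter in root.findall("chapter"):
        title = chapter.findtext("title", default="")
        paras = ["".join(p.itertext()).strip() for p in chapter.findall("para")]
        result.append((title, paras))
    return result


def to_html(xml_text):
    """One possible presentation of the same content: minimal HTML."""
    parts = []
    for title, paras in outline(xml_text):
        parts.append(f"<h2>{escape(title)}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in paras)
    return "\n".join(parts)


if __name__ == "__main__":
    print(to_html(SAMPLE))
```

The same `outline` result could just as easily feed a search index or a plain-text dump, which is the appeal of keeping content and presentation apart.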
LaTeX is the CoNDoM that protects you from Microsoft Office ViRuSeS.
...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...
Chisel it into stone tablets, then find an ignorant local. Set up a natural gas line to a nearby bush and hide behind a rock. Cup your hands to add a slight reverb effect and tell him to preach the chiselled word, then break the tablets and hide them in a box and trick Nazis into looking at them.
I've written several books. Because ODT files use standard compression, they are usually much smaller. For a 109,683 word book, with styles and formatting:
ODT: 271,090 bytes
Docx: 300,057 bytes
Word 97: 1,379,328 bytes
PDF: 1,050,788 bytes
If bytes cost money to store, ODT rules.
Imagine yourself sticking with Word 97 because it's a reliable standard: imagine buying roughly five times as much storage, as well as the backup for the storage.
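The arithmetic is easy to check from the figures above. A quick sketch (the `relative_sizes` helper is just an illustrative name) that expresses each file size as a multiple of the ODT size; by these numbers the Word 97 file comes out at roughly five times the ODT, and the PDF at a little under four times:

```python
# Byte counts quoted above for the same 109,683 word book.
SIZES = {
    "ODT": 271_090,
    "Docx": 300_057,
    "Word 97": 1_379_328,
    "PDF": 1_050_788,
}


def relative_sizes(sizes, baseline="ODT"):
    """Return each size divided by the baseline format's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


if __name__ == "__main__":
    for name, ratio in relative_sizes(SIZES).items():
        print(f"{name:8s} {ratio:.2f}x")
```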
https://www.youtube.com/c/BrendaEM
No, just no....
Store the documents in their original format.
There are many possible reasons why you shouldn't mess with the originals, such as formatting, legal implications, loss of content because one format supports stuff that the other doesn't, etc.
The only way that I could see this working is if you converted everything to an open format but kept copies of the originals and linked to them. But if the plan is to dump the original documents, then it just isn't worth it....