Slashdot Mirror


Ask Slashdot: What Is the Best Open Document Format?

kramer2718 writes: I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best? Since "best" can be highly driven by circumstances, please explain your reasoning, too.

10 of 200 comments

  1. Don't convert needlessly by PSVMOrnot · · Score: 5, Insightful

    I would suggest, unless you have a pressing need to convert them, that you should store the documents in the formats they are uploaded in.

    Whenever you convert a document you run the risk of completely messing up the layout, style, etc.

  2. Re:.txt by jythie · · Score: 5, Insightful

    And here I am without mod points...

    Generally, when I have to worry about integration or longevity, it is still hard to compete with ASCII & LaTeX. While they do not have the everyday visibility of the various office document types or PDFs, renderers and search tools always know exactly what to do with them. They can even interact with version control systems cleanly, since the underlying tools do not need to know anything about the formatting to manipulate it.
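
    To illustrate the version-control point, here is a minimal sketch using Python's standard difflib module. The two LaTeX snippets and the file names old.tex and new.tex are made up for the example; the diff works line by line and knows nothing about LaTeX.

```python
import difflib

# Two hypothetical revisions of a small LaTeX source, as lists of lines.
old = [
    "\\section{Intro}\n",
    "Plain text is easy to diff.\n",
]
new = [
    "\\section{Introduction}\n",
    "Plain text is easy to diff.\n",
]


def line_diff(a, b):
    """Return a unified diff of two lists of lines, as a list of strings."""
    return list(difflib.unified_diff(a, b, fromfile="old.tex", tofile="new.tex"))


if __name__ == "__main__":
    # Prints the changed heading as one removed line and one added line.
    print("".join(line_diff(old, new)))
```

    The same comparison on a binary office format would first need a tool that understands that format.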

  3. Need more information by nine-times · · Score: 4, Insightful

    As an IT person, I hate questions like this. There's not enough information to give a solid answer. For example:

    * What kinds of documents are you talking about? Text? Photos? Spreadsheets?
    * What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?
    * What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?
    * How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?
    * When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?

    There may be many other relevant questions; my point is that there just isn't enough detail here. In general, if the most important thing is that you have a printable document that you want to be able to print out from any machine, maintaining the formatting as much as possible, then PDF is a pretty good choice (be sure to embed the fonts and include searchable text!). If you already have a bunch of Word documents and you want the formatting unchanged, and would like the capability to edit the document after it's retrieved, then I'd typically just recommend keeping it as a .docx. It keeps things simple, will be widely supported, and avoids the risk of something going wrong while you're converting to another format. If you like the idea of using .docx because of what I just said, but want something more "open", then ODF is probably worth looking into.

    Really, there are only so many choices, and each has advantages depending on your specific needs.

  4. Re:.txt by TechyImmigrant · · Score: 3, Insightful

    How is it impractical?

    --
    I should use this sig to advertise my book ISBN-13 : 978-1501515132.
  5. For Two-Millennia Durability... by nightcats · · Score: 4, Insightful

    ...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...

    --
    Development is programmable; Discovery is not programmable. (Fuller)
  6. And a pony too? by gstoddart · · Score: 3, Insightful

    Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?

    Let's see ... you want to allow uploading in a large number of formats ... you want to magically turn it into a universal format ... while retaining all aspects of the original ... and have it be easily manipulated ... and you want it in an open and documented format? And all for free?

    I want one of those too. And a Red Ryder BB gun with a compass in the stock and this thing which tells time. And a new skateboard. And a pony.

    Honestly, you're asking for the holy grail of document management systems ... the universal, lossless document format.

    I'm not sure it exists. And I'm not sure companies like Microsoft or Adobe would allow it to exist.

    --
    Lost at C:>. Found at C.
  7. Re:.txt by Anonymous Coward · · Score: 0, Insightful

    Does LaTeX support MS Office? If not, then it's very impractical for 95%+ of users.

  8. Re:.txt by ClickOnThis · · Score: 3, Insightful

    If you are in publishing - like in writing or editing books - you need MS Word.

    Well, use whatever you want to write the book. But if you are printing it, I'd definitely use something other than MS Word. It just doesn't produce publication-quality documents.

    And as far as Latex is concerned, it would be even more work.

    That depends on what you are writing. If your document contains lots of equations and you're using MS Word, then God help you.

    Latex and other formats are great if you are in complete control from start to finish of the publishing process. But working with other people that are scattered all over the World? Nope.

    Again, that depends on who the other people are. Many academics, particularly scientists, use LaTeX.

    --
    If it weren't for deadlines, nothing would be late.
  9. Re:.txt by Darinbob · · Score: 4, Insightful

    MS Office is also impractical for 95%+ of its users.

  10. Re:.txt by Yaztromo · · Score: 4, Insightful

    It's not a garbage character. It's a BOM and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.

    This is one of those sticky situations. For UTF-8, the Unicode standard discourages the use of a BOM, unless you're converting from a different Unicode format that requires a BOM. The whole purpose of a BOM is to describe the byte order used to generate the file data; however, UTF-8 data is broken up into 8-bit code units, and thus endianness doesn't play a role. You simply read the stream one byte at a time.

    Indeed, using a BOM is discouraged (by both the Unicode standard and the IETF) precisely because it breaks backward compatibility with ASCII text processors. Unfortunately, Microsoft seems intent on adding an unnecessary (and, in the case of UTF-8, badly named) BOM to virtually every UTF-8 file created on their platform. This is done to make it easier for them to detect the encoding; however there are reliable, published heuristics which do the same job without the need for the BOM. That's what every other platform in existence does to detect UTF-8 streams. Microsoft's BOM use is purely to make their processing easier, even if it means that it breaks backward compatibility with older tools.

    Thus, you are technically both correct. It's technically not a garbage character at the beginning of the stream; however, it is unnecessary, and contrary to the way every other OS on the planet handles the situation.
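
    For anyone who has to deal with both kinds of file, a minimal sketch in Python (standard library only) of spotting and tolerating the UTF-8 BOM. The helper names has_utf8_bom and decode_utf8 are just names picked for this example.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM if there is one.

    The 'utf-8-sig' codec behaves like 'utf-8' on input without a BOM.
    """
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
without_bom = "hello".encode("utf-8")

if __name__ == "__main__":
    print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
    # A plain 'utf-8' decode keeps the BOM as a leading U+FEFF character,
    # which is what BOM-unaware tools end up seeing.
    print(repr(with_bom.decode("utf-8")))
    print(repr(decode_utf8(with_bom)), repr(decode_utf8(without_bom)))
```

    Reading with utf-8-sig and writing with plain utf-8 is one way to accept BOM-prefixed input without producing more of it.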

    (I've run into this more than once in my professional life, dealing with supposedly technically minded people who use Windows Notepad to try to figure out what encoding a file is using. I've had them come back claiming my files weren't UTF-8 because Notepad claimed they were 'ANSI' (never mind that there is no character encoding standard called 'ANSI' in the first place). I've had to explain to more than one person that standard ASCII is valid UTF-8, even going so far as to provide them with chapter and verse of the Unicode specs to prove that what Notepad says shouldn't be treated as gospel.)
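
    The "ASCII is valid UTF-8" point is easy to check for yourself. A minimal sketch in Python; is_valid_utf8 is a helper name made up for this example, and the Windows-1252 bytes are only there as a contrast case of a legacy single-byte encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII text: the ASCII and UTF-8 encodings are byte-for-byte identical.
ascii_bytes = "plain ASCII text".encode("ascii")
same_as_utf8 = ascii_bytes == "plain ASCII text".encode("utf-8")

# Contrast: a non-ASCII character in a legacy single-byte encoding
# (Windows-1252 here) is generally not valid UTF-8.
legacy_bytes = "caf\u00e9".encode("cp1252")

if __name__ == "__main__":
    print(same_as_utf8, is_valid_utf8(ascii_bytes), is_valid_utf8(legacy_bytes))
```

    So a file containing only ASCII characters gives an editor nothing to distinguish it from UTF-8; whatever label the editor shows is a guess.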

    Yaz