Slashdot Mirror


Ask Slashdot: What Is the Best Open Document Format?

kramer2718 writes: I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best? Since "best" can be highly driven by circumstances, please explain your reasoning, too.

15 of 200 comments

  1. PDF/A by thechemic · · Score: 5, Informative
    --
    Let's make like a bird... and get the flock outta here.
  2. Don't convert needlessly by PSVMOrnot · · Score: 5, Insightful

    I would suggest, unless you have a pressing need to convert them, that you should store the documents in the formats they are uploaded in.

    Whenever you convert a document you run the risk of completely messing up the layout, style, etc.

    1. Re:Don't convert needlessly by Anonymous Coward · · Score: 5, Interesting

      Or store both the original, and a standardized format. The place I work stores everything from engineering drawings, meeting minutes, purchase records, to manuals of old equipment in a central document library. It retains the original file, and makes a pdf of every file, and a link to both is listed in each entry. We've already had some older CAD formats no longer supported by current software we have easy access to, but the old pdfs are still readable and it is cheap enough to find some intern to re-create the document from the pdf if need be.

    2. Re:Don't convert needlessly by AthanasiusKircher · · Score: 4, Interesting

      Or store both the original, and a standardized format. The place I work stores everything from engineering drawings, meeting minutes, purchase records, to manuals of old equipment in a central document library. It retains the original file, and makes a pdf of every file, and a link to both is listed in each entry.

      THIS.

      PDFs (or some similar standard) will ensure that the original documents can be read by everyone and viewed with the original formatting intended by the person creating them. Any differences in the version of Word or whatever is going to tweak the formatting in unpredictable ways.

      But the originals should always be retained, since it may make future editing easier. And people also won't be stuck trying to undo whatever unpredictable reformatting or editing (e.g., loss of certain features moving between formats) might go on in your conversion process.
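
      A minimal sketch of the "keep the original, add a standardized copy" layout, assuming a content-addressed folder per upload. The function name, the file names, and the `make_pdf` callback are all hypothetical; the actual conversion is left to whatever tool you trust.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def store_with_rendition(
    upload: Path,
    library: Path,
    make_pdf: Callable[[Path, Path], None],
) -> dict:
    """Store the upload byte-for-byte and add a PDF rendition beside it.

    `make_pdf(source, target)` is a placeholder for a real converter;
    this sketch does not perform any conversion itself.
    """
    data = upload.read_bytes()
    digest = hashlib.sha256(data).hexdigest()

    # One folder per document, named after the hash of the original bytes.
    entry_dir = library / digest
    entry_dir.mkdir(parents=True, exist_ok=True)

    # The original is never modified, only copied.
    original = entry_dir / ("original" + upload.suffix)
    original.write_bytes(data)

    # Fixed rendition name, so an uploaded .pdf cannot collide with it.
    rendition = entry_dir / "rendition.pdf"
    make_pdf(original, rendition)

    # The entry links to both files, as described above.
    entry = {
        "sha256": digest,
        "uploaded_as": upload.name,
        "original": original.name,
        "rendition": rendition.name,
    }
    (entry_dir / "entry.json").write_text(json.dumps(entry, indent=2))
    return entry
```

      If the conversion mangles something, the untouched original is still sitting next to the PDF.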

  3. .txt by Anonymous Coward · · Score: 5, Interesting

    .txt. If you need pretty formatting, fill it with LaTeX tags.

    1. Re:.txt by jythie · · Score: 5, Insightful

      And here I am without mod points...

      Generally when I have to worry about integration or longevity, it is still hard to compete with ASCII & LaTeX. While they do not have the everyday visibility of various office document types or PDFs, renderers and search tools always know exactly what to do with them. They can even interact with version control systems cleanly, since the underlying tools do not need to know anything about the formatting to manipulate it.
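
      To illustrate the version control point, here is a small sketch using Python's standard `difflib`. The two LaTeX snippets and the file names are made up for the example; the diff works line by line and never parses the markup.

```python
import difflib
from typing import List

# Two hypothetical revisions of the same LaTeX source.
old_version = "\\section{Minutes}\nThe budget was \\emph{approved}.\n"
new_version = "\\section{Minutes}\nThe budget was \\emph{rejected}.\n"


def line_diff(old: str, new: str) -> List[str]:
    """Return a unified diff of two plain-text documents."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old.tex",
            tofile="new.tex",
            lineterm="",
        )
    )


for line in line_diff(old_version, new_version):
    print(line)
```

      The same call works on any text file, which is roughly why generic tools cope so well with this kind of format.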

    2. Re:.txt by Desler · · Score: 4, Informative

      Then you end up with Microsoft inserting garbage characters at the start of each text file to make their job easier, breaking scripts and confusing both users and other editors alike.

      It's not a garbage character. It's a BOM and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.

    3. Re:.txt by Darinbob · · Score: 4, Insightful

      MS Office is also impractical for 95%+ of its users.

    4. Re:.txt by Yaztromo · · Score: 4, Insightful

      It's not a garbage character. It's a BOM and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.

      This is one of those sticky situations. For UTF-8, the Unicode standard discourages the use of a BOM, unless you're converting from a different Unicode format that requires a BOM. The whole purpose of a BOM is to describe the byte order used to generate the file data, however UTF-8 data is broken up into 8-bit code units, and thus endianness doesn't play a role. You simply read the stream one byte at a time.
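
      For anyone whose scripts trip over it, a short sketch of dealing with the UTF-8 BOM in Python. The helper name `strip_utf8_bom` is my own; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (bytes EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file as a Windows tool might write it: BOM first, then the text.
with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# Option 1: remove the bytes yourself, then decode as plain UTF-8.
print(strip_utf8_bom(with_bom).decode("utf-8"))

# Option 2: let the "utf-8-sig" codec skip the BOM while decoding.
print(with_bom.decode("utf-8-sig"))
```

      Decoding the same bytes with plain "utf-8" keeps U+FEFF as the first character, which is the "garbage" older tools end up showing.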

      Indeed, using a BOM is discouraged (by both the Unicode standard and the IETF) precisely because it breaks backward compatibility with ASCII text processors. Unfortunately, Microsoft seems intent on adding an unnecessary (and, in the case of UTF-8, badly named) BOM to virtually every UTF-8 file created on their platform. This is done to make it easier for them to detect the encoding; however there are reliable, published heuristics which do the same job without the need for the BOM. That's what every other platform in existence does to detect UTF-8 streams. Microsoft's BOM use is purely to make their processing easier, even if it means that it breaks backward compatibility with older tools.

      Thus, you are technically both correct. It's technically not a garbage character at the beginning of the stream; however, it is unnecessary, and contrary to the way every other OS on the planet handles the situation.

      (I've run into this more than once in my professional life, dealing with people who are supposed to be technically minded who use Windows Notepad to try to figure out what encoding a file is using. I've had them come back claiming my files weren't UTF-8 because Notepad claimed they were 'ANSI' (never mind that there is no character encoding standard called 'ANSI' in the first place). I've had to explain to more than one person that standard ASCII is valid UTF-8, even going so far as to provide them chapter and verse of the Unicode specs to prove that what Notepad says shouldn't be treated as gospel.)
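
      A rough sketch of BOM-free detection, assuming the simplest check I know of: try a strict UTF-8 decode and see whether it fails. This is a stand-in, not necessarily the published heuristics mentioned above, and `looks_like_utf8` is a made-up name. It also shows the ASCII point directly.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8, with or without a BOM."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII bytes: every one of them is also a valid UTF-8 sequence.
ascii_only = "plain old text".encode("ascii")

# Latin-1 bytes: the lone 0xE9 for "é" is not a valid UTF-8 sequence.
latin1 = "café".encode("latin-1")

print(looks_like_utf8(ascii_only))
print(looks_like_utf8(latin1))
```

      An ASCII-only file passes this check, so a tool labelling it 'ANSI' tells you nothing about whether it is valid UTF-8.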

      Yaz

  4. Forget the Universal Format crap by xxxJonBoyxxx · · Score: 5, Informative

    1) Forget the Universal Format approach - your users will kill you for messing up their formatting, and you'll never get complete feature parity
    2) Store the docs in their original format
    3) Get Apache Solr to search your content
    4) You'll be spending a lot of time on #3, so leave time to tinker
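
    A hedged sketch of step 3, assuming a Solr instance with the extraction handler (Solr Cell) enabled at `/update/extract`. The base URL, the core name "documents", the document id, and the helper name are all placeholders; the code only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(
    solr_base: str,
    core: str,
    doc_id: str,
    payload: bytes,
    content_type: str,
) -> Request:
    """Build a POST that hands an original file to Solr for text extraction."""
    # literal.id sets the document's id field; commit=true makes it searchable at once.
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base.rstrip('/')}/{core}/update/extract?{query}"
    return Request(
        url,
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for a real uploaded file.
req = build_extract_request(
    "http://localhost:8983/solr",
    "documents",
    "report-2015",
    b"placeholder file bytes",
    "application/pdf",
)
print(req.get_method(), req.full_url)
```

    Passing `req` to `urllib.request.urlopen` would send it. The originals stay on disk in their own format and Solr only keeps the extracted text for searching.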

  5. Oldes are the bestes by Anonymous Coward · · Score: 5, Funny

    WordPerfect Document, because it's been consistent for nearly 20 years. It has a simple underlying format, it's more finely granular than HTML, and because I just like obsolete things.

  6. Need more information by nine-times · · Score: 4, Insightful

    As an IT person, I hate questions like this. There's not enough information to give a solid answer. For example:

    * What kinds of documents are you talking about? Text? Photos? Spreadsheets?
    * What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?
    * What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?
    * How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?
    * When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?

    There may be many other relevant questions; my point is that there just isn't enough detail here. In general, if the most important thing is that you have a printable document that you want to be able to print out from any machine, maintaining the formatting as much as possible, then PDF is a pretty good choice (be sure to embed the fonts and include searchable text!). If you already have a bunch of Word documents and you want the formatting unchanged, and would like the capability to edit the document after it's retrieved, then I'd typically just recommend keeping it as a .docx. It keeps things simple, will be widely supported, and prevents the risk of something going wrong while you're converting to another format. If you like the idea of using .docx because of what I just said, but want something more "open", then ODF is probably worth looking into.

    Really, there are only so many choices, and each has advantages depending on your specific needs.
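
    The rough rules of thumb above can be written down as a tiny helper. This is only a sketch of that reasoning: the function name and both flags are invented, and real requirements will need more questions than two booleans.

```python
def suggest_format(must_stay_editable: bool, prefer_open_format: bool) -> str:
    """Map two of the questions above to a storage suggestion."""
    if not must_stay_editable:
        # Print-faithful output matters most: PDF, with fonts and text included.
        return "PDF (embed fonts, include searchable text)"
    if prefer_open_format:
        # Editable, but you want something more "open" than .docx.
        return "ODF"
    # Editable Word documents: avoid conversion risk and keep them as they are.
    return "keep the original .docx"


print(suggest_format(must_stay_editable=False, prefer_open_format=False))
print(suggest_format(must_stay_editable=True, prefer_open_format=True))
print(suggest_format(must_stay_editable=True, prefer_open_format=False))
```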

  7. For Two-Millennia Durability... by nightcats · · Score: 4, Insightful

    ...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...

    --
    Development is programmable; Discovery is not programmable. (Fuller)
    1. Re:For Two-Millennia Durability... by gstoddart · · Score: 5, Informative

      Nonsense, bamboo can't touch papyrus for longevity, and you don't need to worry about pandas.

      Damned bamboo shills.

      And don't anybody go suggesting cave paintings, it's a completely dead platform.

      --
      Lost at C:>. Found at C.
  8. Burning Bush by Etherwalk · · Score: 4, Funny

    ...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...

    Chisel it into stone tablets, then find an ignorant local. Set up a natural gas line to a nearby bush and hide behind a rock. Cup your hands to add a slight reverb effect and tell him to preach the chiselled word, then break the tablets and hide them in a box and trick Nazis into looking at them.