Slashdot Mirror


Ask Slashdot: What Is the Best Open Document Format?

kramer2718 writes: I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best? Since "best" can be highly driven by circumstances, please explain your reasoning, too.

200 comments

  1. can't you search the current doc types? by alen · · Score: 3, Informative

    if you use the APIs supplied by their creators?

    1. Re:can't you search the current doc types? by Anonymous Coward · · Score: 0

      Does Microsoft provide a way of searching docx documents on a Linux server? I don't know. I'm curious.

    2. Re:can't you search the current doc types? by Vitus+Wagner · · Score: 1

      docx is just a zip archive with XML files. And as far as I remember, the schemas are published somewhere (although the format description is several thousand pages long).

    3. Re:can't you search the current doc types? by Anonymous Coward · · Score: 0

      Also, there is no reference implementation for docx. Microsoft Office is the de-facto implementation, but it does not always follow the reference, and contains behaviour not defined in the reference.

    4. Re:can't you search the current doc types? by sabbede · · Score: 1
  2. Depends by Anonymous Coward · · Score: 0

    on what your data stored in those documents is.

  3. Maybe...? by Anonymous Coward · · Score: 0

    I'd store them as docx, doc, .pdf, etc.

  4. Why not HTML? by Anonymous Coward · · Score: 0

    ...or even XHTML

    1. Re:Why not HTML? by Anonymous Coward · · Score: 0

      XHTML was deprecated last time I checked. Use HTML if you're into that, or even just XML if the content isn't really HTML.

    2. Re:Why not HTML? by Anonymous Coward · · Score: 1

      How about nroff? It may not be as popular, but it has stood the test of time.

    3. Re:Why not HTML? by brausch · · Score: 1

      Vote this up. It was my first thought too. Basically, plain ASCII text with formatting instructions that are human readable.

      --
      "Almost every wise saying has an opposite one, no less wise, to balance it." - George Santayana
  5. PDF/A by thechemic · · Score: 5, Informative
    --
    Let's make like a bird... and get the flock outta here.
    1. Re:PDF/A by Anonymous Coward · · Score: 0

      +1

    2. Re:PDF/A by ray-auch · · Score: 1

      +2

      Hundreds of people who do this for a living (they're called records managers), and have done for many years, have worked long and hard to come up with a standard format for exactly this. It doesn't do everything (but what does?), but it does ensure that it will still do what it does in 50 years, if not longer.

      Caveat 1: OP doesn't mention editing, if he needs it editable then don't convert, or store original and PDF rendition for preservation

      Caveat 2: There is a trade-off between doc size (OP mentions compression) and digital preservation - PDF/A mandates embedding of fonts, which ensures readability in 50 years at the expense of larger documents. If the OP doesn't need things to be readable beyond 10 years (say) then PDF/A may be overkill. On the other hand, storage is cheap and getting cheaper; it is managing it that is expensive.

    3. Re:PDF/A by Anonymous Coward · · Score: 0

      Despite Adobe, I actually like the PDF format. But one of the problems I see here is document scanners: just about every document scanner I've come across spits out PDF documents (to email, thumb sticks, network shares, etc.), but in actuality they're often just a JPEG or TIFF image per page (often with some proprietary encoding) embedded in a PDF document. Obviously the authors of PDF/A aren't smart enough to have figured this out yet, as they're still advocating PDFs with TIFFs - and TIFFs themselves are just another container format for multitudes of image encodings that will get lost in time.

  6. Don't convert needlessly by PSVMOrnot · · Score: 5, Insightful

    I would suggest, unless you have a pressing need to convert them, that you should store the documents in the formats they are uploaded in.

    Whenever you convert a document you run the risk of completely messing up the layout, style, etc.

    1. Re:Don't convert needlessly by Anonymous Coward · · Score: 1

      There's storing for later download, and then there's storing for ongoing analysis, indexing, previews, etc. For the latter, it would help a lot to have one standard format. Probably plain text.

      To properly analyze .doc / .docx, for instance, you'll probably need a Windows machine with Word installed. It will likely be significantly cheaper to have Word installed on only one or two machines, convert to text (capturing any necessary metadata on the way), and then do further processing on other machines that don't need Word installed. It's probably more computationally efficient, too.

    2. Re:Don't convert needlessly by Anonymous Coward · · Score: 5, Interesting

      Or store both the original, and a standardized format. The place I work stores everything from engineering drawings, meeting minutes, purchase records, to manuals of old equipment in a central document library. It retains the original file, and makes a pdf of every file, and a link to both is listed in each entry. We've already had some older CAD formats no longer supported by current software we have easy access to, but the old pdfs are still readable and it is cheap enough to find some intern to re-create the document from the pdf if need be.

    3. Re:Don't convert needlessly by darkain · · Score: 2, Interesting

      All of the "X" variants of MS Office documents stand for "XML" - that is, the documents are stored in a series of XML files inside of a ZIP file that is renamed to formatX (docx, xlsx, etc). There is no real need to even have Windows or Office installed to index these documents. Just write up a basic script to extract the ZIP file and parse out the related XML documents. Note: this isn't as trivial as it sounds at first, though. This would assume that Microsoft's XML structures (yes, plural), had an easy to comprehend standard that was logical to work with. It'll take a little digging but totally doable.

      TLDR: not by choice, my company heavily relies on Excel documents, and this is how I ended up managing them, importing their contents into a SQL database for indexing and other purposes.
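
      A minimal sketch of the "unzip and parse the XML" approach described above, assuming Python's standard library only. The helper name `docx_paragraphs` is hypothetical. It reads only `word/document.xml` and collects the text runs of each paragraph, so headers, footnotes, and any formatting are ignored; real documents will need more digging, as the comment warns.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each paragraph in a .docx file.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each <w:t> element holds one run of text inside the paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

      The resulting list of strings can then be fed to whatever indexer or database is already in use.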

    4. Re:Don't convert needlessly by AthanasiusKircher · · Score: 4, Interesting

      Or store both the original, and a standardized format. The place I work stores everything from engineering drawings, meeting minutes, purchase records, to manuals of old equipment in a central document library. It retains the original file, and makes a pdf of every file, and a link to both is listed in each entry.

      THIS.

      PDFs (or some similar standard) will ensure that the original documents can be read by everyone and viewed with the original formatting intended by the person creating them. Any differences in the version of Word or whatever is going to tweak the formatting in unpredictable ways.

      But the originals should always be retained, since it may make future editing easier. And people also won't be stuck trying to undo whatever unpredictable reformatting or editing (e.g., loss of certain features moving between formats) might go on in your conversion process.

    5. Re:Don't convert needlessly by mlts · · Score: 1

      Even programs that can import Word/Excel/etc. documents only get about 99% of a document right. However, that one percent that is missed can do quite a number on a document.

      The answer for a document format... depends.

      For a document format that keeps formatting exactly, and isn't intended to be edited, PDF/A is the best thing going, since barring a major world-ending disaster, we will still have utilities that can read PDFs, and PDF/A ensures that the fonts and such are present and readable.

      For a document that is edited... there are a number of different standards. As stated elsewhere, it might be best to have a tarball or ZIP file that has multiple document formats in it, where there is a .txt and .PDF file available for quick viewing, then SGML/HTML/XML/nroff/TeX/LaTeX version included for editing.

    6. Re:Don't convert needlessly by Anonymous Coward · · Score: 0

      > It'll take a little digging but totally doable.

      Is it? Older OOXML files often won't open in newer versions of Office without subtle formatting errors. Many, many sections of the OOXML spec literally say "[Do this thing {kern text, render bullets, whatever}] like MSFT Office 97 does it.".

      There's no doubt, you can extract most of the meaning from an OOXML document, but accurate reproduction is impossible for mere mortals and *very* hard for MSFT employees.

    7. Re:Don't convert needlessly by Anonymous Coward · · Score: 1

      Surprised that no one has mentioned ODF. It literally stands for Open Document Format, and it is supported by most modern readers. It is the native format for LibreOffice and OpenOffice; even Microsoft's programs can read it now.

    8. Re:Don't convert needlessly by Anonymous Coward · · Score: 0

      but don't use PDFs if you want something usefully searchable, open, etc.

    9. Re:Don't convert needlessly by ray-auch · · Score: 1

      Really? You should tell AIIM, Library of Congress, etc. - they've all been doing it wrong for years with PDF/A.

  7. .txt by Anonymous Coward · · Score: 5, Interesting

    .txt. If you need pretty formatting, fill it with LaTeX tags.

    1. Re:.txt by jythie · · Score: 5, Insightful

      And here I am without mod points...

      Generally when I have to worry about integration or longevity, it is still hard to compete with ASCII & LaTeX. While they do not have the everyday visibility of various office document types or PDFs, renderers and search tools always know exactly what to do with them. They can even interact with version control systems cleanly, since the underlying tools do not need to know anything about the formatting to manipulate it.

    2. Re:.txt by TechyImmigrant · · Score: 3, Insightful

      How is it impractical?

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
    3. Re:.txt by Anonymous Coward · · Score: 0

      Agreed.

      Or, for those that prefer a real-world alternative to the "ivory tower" of LaTeX, HTML markup is a good-enough solution that everyone can use.

      And for datasets, CSV (character-not-comma) with proper 0x1F separators and 0x1E terminators (or if it must be on-keyboard text, then \t and \n, respectively).
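
      A minimal sketch of the delimiter scheme mentioned above, assuming Python. 0x1F is the ASCII unit separator and 0x1E the ASCII record separator; the function names `dump_records` and `load_records` are hypothetical. Because those control characters essentially never occur in ordinary text, fields can contain commas, quotes, and newlines without any escaping.

```python
UNIT_SEP = "\x1f"    # ASCII unit separator: placed between fields
RECORD_SEP = "\x1e"  # ASCII record separator: placed after each record


def dump_records(rows):
    """Serialize rows (lists of strings) using 0x1F/0x1E delimiters."""
    for row in rows:
        for field in row:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a delimiter character")
    return "".join(UNIT_SEP.join(row) + RECORD_SEP for row in rows)


def load_records(data):
    """Parse text produced by dump_records back into rows."""
    # Every record ends with RECORD_SEP, so the last split piece is empty.
    records = data.split(RECORD_SEP)
    if records[-1] != "":
        raise ValueError("data does not end with a record terminator")
    return [record.split(UNIT_SEP) for record in records[:-1]]
```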

    4. Re:.txt by Anonymous Coward · · Score: 0

      I completely agree, LaTeX is superior, and works perfectly with version control systems. I don't use office programs anymore unless someone forces me to work with one.

    5. Re:.txt by Anonymous Coward · · Score: 0, Insightful

      Does latex support MS Office? If not then it's very impractical for 95%+ of users.

    6. Re:.txt by Anonymous Coward · · Score: 0

      txt is a horrible format, every text editor has to guess the file encoding and unless it is ASCII they get it wrong. Then you end up with Microsoft inserting garbage characters at the start of each text file to make their job easier, breaking scripts and confusing both users and other editors alike. In fact having encoding hints in "plain text" files is rather common, python files store it in a comment, html specifies a list of valid encodings and the exact placement and contents of the encoding tag. Unless you limit yourself to ASCII a txt file is just not sufficient to store text.

    7. Re:.txt by ShanghaiBill · · Score: 2

      How is it impractical?

      It is impractical because the average end user will have no idea what to do with a .txt file containing Latex markup. It will look like gibberish. Txt files also have no clickable table of contents, or index, or hyperlinks to other documents.

    8. Re:.txt by Anonymous Coward · · Score: 1

      If you are in publishing - like in writing or editing books - you need MS Word. I tried Libre/Open Office, but with change tracking, cites, bibliographies, etc. getting screwed up, I ended up just forking over $115 for MS Office.

      And as far as Latex is concerned, it would be even more work.

      Latex and other formats are great if you are in complete control from start to finish of the publishing process. But working with other people that are scattered all over the World? Nope.

    9. Re:.txt by Anonymous Coward · · Score: 0

      That is not a format, because there are so many different character encodings.
      The answer is ASCII, and it is quite practical.

    10. Re:.txt by Anonymous Coward · · Score: 0

      Hard to compete, except when you need to reproduce the original document's formatting and layout. LaTeX also doesn't do images very well without 3rd party tools.

    11. Re:.txt by Anonymous Coward · · Score: 0

      Also newlines across platforms are not standardized in plain text files.
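
      A minimal sketch of one way to deal with this, assuming Python and a hypothetical helper `normalize_newlines`: convert Windows (CR LF) and old Mac (CR) line endings to a single LF before storing or indexing the text.

```python
def normalize_newlines(text):
    """Convert CR LF and bare CR line endings to LF."""
    # Replace CR LF first so it does not become two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```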

    12. Re:.txt by Anonymous Coward · · Score: 0

      How is it impractical?

      Good luck getting nice and clean text files from all of the formats that the submitter listed.

    13. Re:.txt by Desler · · Score: 4, Informative

      Then you end up with Microsoft inserting garbage characters at the start of each text file to make their job easier, breaking scripts and confusing both users and other editors alike.

      It's not a garbage character. It's a BOM and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.

    14. Re:.txt by Anonymous Coward · · Score: 1

      It is impractical because the average end user will have no idea what to do with a .txt file containing Latex markup. It will look like gibberish. Txt files also have no clickable table of contents, or index, or hyperlinks to other documents.

      Store it as LaTeX, render to PDF on demand. Solved.

    15. Re:.txt by captnjohnny1618 · · Score: 1

      I love this answer... but sadly people aren't willing to learn things like latex. Even in academia (medical physics) many smart people refuse to learn technologies like latex.

      And, despite my agreement (I was actually going to post a similar answer if someone hadn't), there are times when I don't want to bother with the latex overhead for quick documents. Am I doing it wrong? ;-)

    16. Re:.txt by Anonymous Coward · · Score: 1

      I love this answer... but sadly people aren't willing to learn things like latex. Even in academia (medical physics) many smart people refuse to learn technologies like latex.

      That's because Latex is horrendous to use.

    17. Re:.txt by Desler · · Score: 1

      read *and handle* a BOM, that is.

    18. Re:.txt by TechyImmigrant · · Score: 2

      Does latex support MS Office? If not then it's very impractical for 95%+ of users.

      If I had this question, I would google before asking Slashdot and exposing my ignorance. There are lots of tools.

    19. Re:.txt by Anonymous Coward · · Score: 0

      The only way to reproduce the original document format and layout is to take a photo of the document from the original writers' computer.

      Unless you mean only sane and rational formats that don't have undead rendering problems and the like. But even then, with, for example, PS or PDF, two pretty portable document format systems, the font, paper, page settings and so forth won't be part of the document (fonts are copyrighted in some places, and therefore unsuitable for use anywhere, since corporations and copyright will hunt your ass down whether or not it's legal where you are, because they can download it where they are).

      So, really, forget all that shit about fidelity with original formatting and layout.

      Print the paper out on a printer with A4 when you used US letter? Either you get the wrong format or you don't get a printout at all.

      "Original format and layout" is a shibboleth of print media people and those who have no clue what print media is, but know a few of the words and can hum along to the tune a bit.

      LaTeX will hold the metadata with fine precision of how the format SHOULD appear, but the final render will be how the one displaying it wants it to be.

      Large print, colour change (for the colour blind), braille, text-to-speech. ALL are required by the actual users of the document, but the maker of the document, being none of those problem cases, demands that THEIR special view be the only one allowed, and is fucking those users over for their own egotistical purposes.

    20. Re:.txt by ClickOnThis · · Score: 3, Insightful

      If you are in publishing - like in writing or editing books - you need MS Word.

      Well, use whatever you want to write the book. But if you are printing it, I'd definitely use something other than MS Word. It just doesn't produce publication-quality documents.

      And as far as Latex is concerned, it would be even more work.

      That depends on what you are writing. If your document contains lots of equations and you're using MS Word, then God help you.

      Latex and other formats are great if you are in complete control from start to finish of the publishing process. But working with other people that are scattered all over the World? Nope.

      Again, that depends on who the other people are. Many academics, particularly scientists, use LaTeX.

      --
      If it weren't for deadlines, nothing would be late.
    21. Re:.txt by Anonymous Coward · · Score: 0

      I have three books published (400 pages a piece with figures etc), all three written in LaTeX. For such large projects I wouldn't even consider MS Word.

    22. Re:.txt by Anonymous Coward · · Score: 0

      HTML vs LaTeX is not a matter of 'working class' vs. 'ivory tower', it is two completely different formats with two completely different intents.

      Although they are both abused pretty regularly by today's worldwide web.

    23. Re:.txt by Darinbob · · Score: 1

      I agree. Text works. Now we get to argue over ASCII, Latin-1, ISO, Unicode, tabs vs spaces, emojis versus emoticons, CR/LF vs CR vs LF, byte stream or record format, CamelCase or pythonCase or unixcase, but thankfully we don't have to decide on vim vs emacs as those are external tools.

    24. Re:.txt by Darinbob · · Score: 4, Insightful

      MS Office is also impractical for 95%+ of its users.

    25. Re:.txt by Yaztromo · · Score: 4, Insightful

      It's not a garbage character. It's a BOM and it's part of the Unicode standard. If your scripts and text editors can't read the BOM in 2015 then they are the things that are horribly broken.

      This is one of those sticky situations. For UTF-8, the Unicode standard discourages the use of a BOM, unless you're converting from a different Unicode format that requires a BOM. The whole purpose of a BOM is to describe the byte order used to generate the file data, however UTF-8 data is broken up into 8-bit code units, and thus endianness doesn't play a role. You simply read the stream one byte at a time.

      Indeed, using a BOM is discouraged (by both the Unicode standard and the IETF) precisely because it breaks backward compatibility with ASCII text processors. Unfortunately, Microsoft seems intent on adding an unnecessary (and, in the case of UTF-8, badly named) BOM to virtually every UTF-8 file created on their platform. This is done to make it easier for them to detect the encoding; however there are reliable, published heuristics which do the same job without the need for the BOM. That's what every other platform in existence does to detect UTF-8 streams. Microsoft's BOM use is purely to make their processing easier, even if it means that it breaks backward compatibility with older tools.

      Thus, you are technically both correct. It's technically not a garbage character at the beginning of the stream, however it is unnecessary, and contrary to the way every other OS on the planet handles the situation.

      (I've run into this more than once in my professional life, dealing with people who are supposed to be technically minded who use Windows Notepad to try to figure out what encoding a file is using. I've had them come back claiming my files weren't UTF-8 because Notepad claimed they were 'ANSI' (never mind that there is no character encoding standard called 'ANSI' in the first place). I've had to explain to more than one person that standard ASCII is valid UTF-8, even going so far as to providing them chapter and verse of the Unicode specs to prove that what Notepad says shouldn't be treated as gospel.)

      Yaz
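
      A minimal sketch of tolerant BOM handling, assuming Python; the helper name `decode_utf8` is hypothetical. Python's standard `utf-8-sig` codec skips a leading UTF-8 BOM (the bytes EF BB BF) if one is present and otherwise behaves like plain `utf-8`, so the same code reads files from either camp.

```python
import codecs


def decode_utf8(raw):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    Returns a (text, had_bom) tuple.
    """
    had_bom = raw.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" strips a leading BOM; without one it decodes as "utf-8".
    return raw.decode("utf-8-sig"), had_bom
```

      Pure ASCII input decodes unchanged, which is the point made above about ASCII being valid UTF-8.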

    26. Re:.txt by Darinbob · · Score: 1

      And yet we used to have tons of secretaries and administrative assistants in universities who did just fine with LaTeX, or even TeX. There was a time when everyone had to learn new things and it was considered a normal part of the day to day job. It was not considered a human rights issue back then to not use MS Office.

    27. Re:.txt by Darinbob · · Score: 1

      Can MS Word handle books now? Back in the 90s we had a major revolt of the doc writers at our company when the dictate came down that Word must be used instead of Framemaker, because Word was incapable of working with documents that large (ie, multiple full binders, with a table of contents and index that covers all of them).

      In the same way that you can't treat "1" as a lowest common denominator, you should not treat MS Office as the lowest common denominator either.

    28. Re:.txt by TechyImmigrant · · Score: 1

      If you want it nice and clean, pay a curator.

      People will upload crap. Your converters will introduce crap.

    29. Re:.txt by Lunix+Nutcase · · Score: 1

      Thus, you are technically both correct. It's technically not a garbage character at the beginning of the stream, however it is unnecessary, and contrary to the way every other OS on the planet handles the situation.

      And yet I use a multitude of text editors and have scripts that can handle UTF-8 text files with a BOM just fine. Your programs and scripts are broken if they can't.

    30. Re:.txt by vtcodger · · Score: 1

      Pretty much my thought. Use the simplest format that will do the job. If it's just prose, use txt. Does anyone seriously believe that One Day in the Life of Ivan Denisovich is somehow enhanced by saving it as .doc or .pdf or .htm or god knows what else? If the text needs some bold and italics, use .txt with markdown. If it needs lots of markup, then something more elaborate -- preferably something with standards and a DTD or equivalent indicating what standard applies. If there are flat tables, use csv. Spreadsheets? Best use their native format (.ods, .xls, etc.), I should think. Images and music? Not my area of expertise. I use jpeg and mp3 respectively for myself, but I wouldn't be at all surprised if there are better choices.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
    31. Re:.txt by Anonymous Coward · · Score: 1

      It's worse than that. While I admire the output, the input is mind numbing torture. Not to mention trying to compile your own from sources. Go ahead, try it. This is something Systemd should take on.
      If Systemd could automatically load latex anytime I hit the print thingy in microsoft word it would be really good. If you put Systemd and latex on a supercomputer, it just might be able to create the entire works of shakespeare AND a JJ Abrams version of STAR TREK that doesn't suck.

    32. Re:.txt by TechyImmigrant · · Score: 1

      I only see a goat.

    33. Re:.txt by TechyImmigrant · · Score: 1

      The Springer publications via the IACR prefer submissions to be in Latex.

    34. Re:.txt by TechyImmigrant · · Score: 1

      MS Word is still a complete mess when it comes to numbering sections and lists. This makes it unusable for writing technical books.
      Framemaker had a simple and powerful format for describing numbering sequences. It worked well. I haven't used it for a few years.
      LaTeX obviously gets it right. Why wouldn't it?

       

    35. Re:.txt by Yaztromo · · Score: 2

      And yet I use a multitude of text editors and have scripts that can handle UTF-8 text files with a BOM just fine. Your programs and scripts are broken if they can't.

      Or they're legacy tools. There are a large number of such tools out there that do various jobs, where having an unnecessary BOM is a liability.

      If you're compiling for some legacy embedded hardware, for example, I have little doubt that its compiler would choke on BOM characters, and you may not have access to the source to fix it. And just because YOU don't need or use such tools hardly means that nobody out there does.

      Yaz

    36. Re: .txt by billDCat · · Score: 2

      Perhaps fine for Roman characters, not so fine if the document contains Kanji, Hiragana, Katakana, Hebrew, or any of the other character sets that don't play nice with "plain text" formats. For something that you would think would be pretty straight forward, plain text character handling is surprisingly maddening to work with.

    37. Re: .txt by vtcodger · · Score: 1

      Yes, text handling for non-ASCII characters can be surprisingly maddening to work with. (Wasn't UTF-8 supposed to fix that?) Problem is that wrapping txt in some more elaborate format like HTML often doesn't make the problem go away. With apologies to Jamie Zawinski, it just means that now you have two problems.

    38. Re:.txt by Anonymous Coward · · Score: 0

      LaTeX is for more than just math. Isn't it used for creating textbooks and whatnot?

      I would love to see math students introduced to LaTeX over a 3 day plus 4 weekly classes in high school.
      Example...
      Wed, Thur, Fri one week to really introduce it. Then four follow-up Fridays with brief assignments.

  8. The only open one - ODF by Anonymous Coward · · Score: 1

    Either you truly wish to use something "open", at which point the only choice is ODF, or you simply want something that can be widely used. If the latter is the case, ODF is still good, perhaps PDF.

    1. Re: The only open one - ODF by LostMyBeaver · · Score: 1

      Uh... Did you miss the question altogether?

    2. Re: The only open one - ODF by Anonymous Coward · · Score: 0

      Yes, there is only one answer for an open format, the Open Document Format...unless you include ASCII, too.

      For docs with heavy formatting, it's a big problem if the format is proprietary and the "standard" changes with almost every version of the software.

      With little formatting, ASCII is best for your solution as EVERYTHING (I think) can read it.

  9. PDF retains the layout by jones_supa · · Score: 1

    I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?

    PDF allows accurate rendering so it's the best choice. It will be a hot mess if you use anything else. Conversion of such complex formats is very error-prone for layout problems.

    1. Re:PDF retains the layout by ChunderDownunder · · Score: 1
      PDF is a print format, which is fine if your audience is going to print it out on a piece of A4 paper - though I think yanks have their own standard. :)

      But PDFs don't generally reflow. For example, viewing a document formatted for portrait on a landscape monitor, journal articles with multiple columns, and reading on a 4" smartphone are all challenges for reading onscreen.

    2. Re:PDF retains the layout by ray-auch · · Score: 1

      PDF/A is the ISO standard format for document archiving, as well as printing.

  10. Forget the Universal Format crap by xxxJonBoyxxx · · Score: 5, Informative

    1) Forget the Universal Format approach - your users will kill you for messing up their formatting, and you'll never get complete feature parity
    2) Store the docs in their original format
    3) Get Apache Solr to search your content
    4) You'll be spending a lot of time on #3, so leave time to tinker

    1. Re:Forget the Universal Format crap by omnichad · · Score: 1

      I'll second everything but step 3. I would get some standard libraries set up to extract the plain text from each format and make that searchable. Probably much simpler.
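
      A minimal sketch of the "extract plain text and make that searchable" idea, assuming Python and assuming the text has already been extracted from each document by some per-format library. The names `build_index` and `search` are hypothetical; this is a toy inverted index, not a replacement for a real search engine.

```python
import re
from collections import defaultdict


def build_index(texts):
    """Map each lowercase word to the set of document ids containing it.

    `texts` maps a document id to its already-extracted plain text.
    """
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

      The originals stay untouched on disk; only the extracted text goes into the index.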

    2. Re:Forget the Universal Format crap by Nadir · · Score: 1

      Why, Apache Solr understands various types of office documents via Apache POI. No need to get "standard" libraries, whatever you mean by that.

      --
      --
      The world is divided in two categories:
      those with a loaded gun and those who dig. You dig.
    3. Re:Forget the Universal Format crap by Anonymous Coward · · Score: 2, Informative

      I work at a typography, and I get a lot of documents from a lot of different people. Those "documents" come as MSWord files with missing fonts, pdfs made with some shoddy software, strange ODTs, many more different types of doc files, the mysterious lnk files that work perfectly fine for them, but not for anyone else and my personal favorites, jpg files (not png or some other lossless format, because that would imply actual thinking).
      Strangely enough, I've yet to receive any plain text files.

      To index everything, I use calibre, just something simple like changing the file name to "Project name - tag 1, tag 2, tag 3 - Client name.pdf" and it imports it automatically from a folder I drop it in, adding that info to the metadata.
      It doesn't actually index or search inside the files, but they are easier to find and handle.
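
      A minimal sketch of how a file name in that layout could be split into metadata, assuming Python. This is not calibre's own import logic; `parse_name` is a hypothetical helper that only illustrates the "Project name - tag 1, tag 2, tag 3 - Client name.pdf" convention described above.

```python
from pathlib import PurePath


def parse_name(filename):
    """Split 'Project - tag 1, tag 2 - Client.ext' into its parts."""
    stem = PurePath(filename).stem  # drop the directory and extension
    parts = stem.split(" - ")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    project, tags, client = (part.strip() for part in parts)
    return {
        "project": project,
        "tags": [tag.strip() for tag in tags.split(",") if tag.strip()],
        "client": client,
    }
```

      Names that do not have exactly three " - " separated parts raise an error, so a project or client name containing " - " would need a different separator.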

    4. Re:Forget the Universal Format crap by omnichad · · Score: 1

      Because it's most likely overkill if all you want is indexing.

    5. Re:Forget the Universal Format crap by Anonymous Coward · · Score: 0

      It's called Apache Tika and it already integrates nicely with Solr.

    6. Re:Forget the Universal Format crap by Anonymous Coward · · Score: 0

      How is using something specifically designed for exactly this problem overkill?

  11. Practical, but unhelpful, answer by Anonymous Coward · · Score: 0

    The one that everyone involved in the project can read.

  12. Oldes are the bestes by Anonymous Coward · · Score: 5, Funny

    Word Perfect Document, because it's been consistent for nearly 20 years. It has a simple underlying format, it's more finely granular than HTML, and because I just like obsolete things.

    1. Re:Oldes are the bestes by jfdavis668 · · Score: 1

      Or try WordStar. That will never go out of support.

    2. Re:Oldes are the bestes by Whiteox · · Score: 1
      --
      Don't be apathetic. Procrastinate!
    3. Re:Oldes are the bestes by Megane · · Score: 3, Funny

      I think WordStar has problems with Unicode support. But then again, so does Slashdot.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    4. Re:Oldes are the bestes by Anonymous Coward · · Score: 0

      The US Gov't used WordPerfect format as a standard for at least several years. Then MSO dropped the import from .wpd, and people complained.

    5. Re:Oldes are the bestes by Anonymous Coward · · Score: 0

      Dead languages are the best, they never change again.

    6. Re:Oldes are the bestes by Anonymous Coward · · Score: 0

      We've found George R.R. Martin's Slashdot account.

    7. Re:Oldes are the bestes by jfdavis668 · · Score: 1

      I'll have to invite you to the next wedding.

    8. Re:Oldes are the bestes by tmjva · · Score: 1

      I too vote for Wordstar.

      --
      Tracy Johnson
      Old fashioned text games hosted below:
      http://empire.openmpe.com/
      BT
  13. Why? by rodrigoandrade · · Score: 0, Troll

    Is there a technical reason why you need open source formats, or you just want to look cool and hip to others (friends, customers, etc.)?

    MS Office formats are widely used and accepted all over the world and offer the best integration with most cloud providers (quick viewing, for instance).

    My .02

    1. Re:Why? by Anonymous Coward · · Score: 1

      And those .02 weren't worth much, considering that the most effective way to FUBAR a doc-file is to pass it around to a few users who use different versions or even patches of Word. Nobody is 100% compatible with MS-Word. Not even Microsoft, so that "format" goes right into the shitter, along with your lame attempt at being cool and suitably hipster "anti-opensource".

      Captcha: "Vanity". Indeed.

  14. Coding approach by Sigma+7 · · Score: 1

    I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?

    Are you writing the search/compression/render capability from scratch, or are you using a library to handle that job for you?

    If you're handling more than one document type, then go for a library. I don't have a recommendation myself, but I'm sure you can find them on a search.

    Also, don't worry about compression, as modern .odf/.docx is already compressed with something compatible with PKZIP.
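
    To illustrate the point about compression: a .docx is a zip archive whose main text lives in word/document.xml, so the standard library is enough to peek inside. A minimal sketch, assuming all you want is the paragraph text. The helper names are mine, and `make_toy_docx` builds a stripped-down archive for demonstration only (Word itself would not open it, since it lacks the other required parts).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data: bytes) -> str:
    """Return the paragraph text of a .docx given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is spread over one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_toy_docx(paragraphs) -> bytes:
    """Build a toy archive with only word/document.xml (illustration only)."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(docx_text(make_toy_docx(["Hello", "World"])))
```

    Since the container is already deflated, zipping stored .docx or .odt files a second time gains you very little.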

  15. HTML, people by Anonymous Coward · · Score: 0

    HTML is standard enough, has the maximum reusability and readability, while being exceptionally lighter than other formats.

    1. Re:HTML, people by vtcodger · · Score: 1

      So long as you remember that the M in HTML is "Markup", not "Layout". If it is important that page layout be "perfectly" preserved in the presentation, something else like pdf (Yechhh) might be a better choice.

      --
      You can't see ANYTHING from a car, You've got to get out of the goddamned contraption and walk...Edward Abbey
  16. Need more information by nine-times · · Score: 4, Insightful

    As an IT person, I hate questions like this. There's not enough information to give a solid answer. For example:

    * What kinds of documents are you talking about? Text? Photos? Spreadsheets?
    * What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?
    * What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?
    * How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?
    * When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?

    There may be many other relevant questions, my point is that there just isn't enough detail here. In general, if the most important thing is that you have a printable document that you want to be able to print out from any machine, maintaining the formatting as much as possible, then PDF is a pretty good choice (be sure to embed the fonts and include searchable text!). If you already have a bunch of Word documents and you want the formatting unchanged, and would like the capability to edit the document after it's retrieved, then I'd typically just recommend keeping it as a .docx. It keeps things simple, will be widely supported, and prevents the risk of something going wrong while you're converting to another format. If you like the idea of using .docx because of what I just said, but want something more "open", then ODF is probably worth looking into.

    Really, there are only so many choices, and each have advantages depending on your specific needs.

    1. Re:Need more information by Archangel+Michael · · Score: 1

      * What kinds of documents are you talking about? Text? Photos? Spreadsheets?

      Photos aren't documents. Spreadsheets tend to be proprietary.

      * What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?

      This! I tend to classify documents as "Primary Data" (Structured) and "freeform Data" (Human Readable)

      * What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?

      Data needs to be organized by purpose (Record keeping = Primary / structured data) and Executive Summary Type data (human readable).

      * How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?

      Again, this is a primary data vs freeform data. Primary data should be structured, while preserving free form data is best done with PDF style rendering which preserves it historically.

      * When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?

      Long Term Editable Documents need to be in the original format. At some point that format will go away, and if you want to maintain edit-ability you'll need to upgrade to newer formats. All others can be saved as PDF before the old format is extinct.

      --
      Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
    2. Re:Need more information by nine-times · · Score: 1

      Photos aren't documents. Spreadsheets tend to be proprietary.

      Nonsense.

      Data needs to be organized by purpose (Record keeping = Primary / structured data) and Executive Summary Type data (human readable).

      It depends on what the data is, and what and how it's being used. There is no "correct" organization, and no "one true way" to deal with data. I would not recommend going around cramming documents into some set organization without understanding where the data is coming from and what people hope to do with it.

      Your organization may work for your purposes, within the constraints of the company or organization you work in. I've supported a lot of different types of companies over the years, and personally, I've never found a one-size-fits-all solution. In each case, it really pays off to start off with no assumptions, and figure out what will work for that specific situation.

    3. Re:Need more information by Whiteox · · Score: 1

      You are so right. In the end, you will lose the ability of retaining the original document after processing, so converting into the most powerful format is probably the best. But as we don't know anything more, it's pretty much impossible to give further advice.
      Some pdf uploads may cause concern as they may not be searchable due to character encoding from non-US keyboards. I've struck that many times in the past and there is no easy way out of that.
      I was thinking that indexing each document before storage might be the best way to go. Creating, exporting, storing and using the index is another matter entirely and would take up more space.

      --
      Don't be apathetic. Procrastinate!
    4. Re:Need more information by MattGWU · · Score: 1

      "As an IT person.."

      Well in that case...

      Please advise the best way to format all of our documents in an standard open format. Need this ASAP. Please advise. Thanks in advance.

      Please Advise,
      A. User

      --
      "These people look deep within my soul and assign me a number based on the order in which I joined" --Homer re:
    5. Re:Need more information by Anonymous Coward · · Score: 0

      Oh, for fucks sake! Can't a professional IT guy give a simple answer to a simple question? What the fuck do you guys do all day? You sound more like a call center drone - just keep asking stupid questions till the user hangs up!
      The correct answer is to use Microsoft office. See how easy that is?

    6. Re:Need more information by Anonymous Coward · · Score: 0

      What legal needs are there? There are times when the formats we use don't really matter. The CONTENT DOES!

      I worked for a time in a large city building department. The time frames for document storage and retrieval are not a measly five or ten years. It is important for vital records to be kept in a readily retrievable condition for literally centuries. There is no 'proprietary' format that can claim any time frame for readability longer than 20 years, and that is only on documents with a currently available 'translator' (that is Word Perfect, for your information, version 4.5). No Microsoft product has reliable retrievability for more than eight years currently, and that requires second party support. The same is true for AutoCAD and its competitors. The same is also true for most video and sound codecs. Can you reliably view the sprite files used on the old Commodore systems? What about a Wang?

      The only format that can be reliably read that is older than 20 years is plain ASCII Text. There are seventy year old files that are still readable for ASCII Text. True, those files are on wire spools, and are actually teletype files, but, they are the only ones that can maintain the long term legibility.

      The 'new' formats are mostly not backwards compatible, and are almost never fully documented, or where documented, are not implemented as described in the documentation. Just try building a DocX file using the Microsoft written standard and a text editor.

      Even PDF files are somewhat printer dependent. What do you think the odds are of having a working HP Laserjet printer in 200 years? Do any of you still have a working 8 1/2" Floppy drive? For that matter, how many people still have the equipment to play 78's?

      Even keeping important documents for a mere seven years in a digital format is proving hard for most corporations.

  17. What Suits Your Needs? by mckellar75238 · · Score: 1

    "Best" in this case depends on your needs and resources, not on standards or common practices. Flexibility and "getting the job done" are more important than what everyone else prefers. Do what works for you.

  18. Re: Starlight Glimmer 2016 by RavenLrD20k · · Score: 0

    Ever since the OMG Ponies! incident... Slashdot just hasn't been the same...

  19. Depends... by EmeraldBot · · Score: 1

    I'd highly recommend leaving them in their original format, or if anything, converting them all to .pdf. Conversion is always fraught with danger, and you will be spending an awful lot of time getting to know the intricacies of Microsoft Word if you go this route. PDFs display equally nicely on every operating system, they archive very well, and almost every tool out there can read them - but while converting documents to this format usually works better than others, I'd still be very careful to watch for mistakes in conversions.

    --
    "Set a man a fire, he'll be warm for the rest of the night. Set a man afire, he'll be warm for the rest of his life."
    1. Re:Depends... by Anonymous Coward · · Score: 0

      I highly recommend you face the music here http://science.slashdot.org/co...

  20. Everything as everything by Anonymous Coward · · Score: 0

    All of them. No, really. Transform everything into every format you can and save them all. That way, even if /. is collectively incapable of predicting format longevity, you're most likely to have a copy of everything in a format that's still understandable.

    Or include the source code (as text files) for any interpreters you assume will never die. That should be vaguely future proof, right?

  21. Use a document manager by LostMyBeaver · · Score: 1

    There are many premade document management systems. They generally will store their indices in a database format for quick searching. Why not store them in their native formats and leave it up to a document management system to handle it for you?

    1. Re:Use a document manager by Anonymous Coward · · Score: 0

      There are many premade document management systems. They generally will store their indices in a database format for quick searching. Why not store them in their native formats and leave it up to a document management system to handle it for you?

      Can you recommend such a document manager?

    2. Re:Use a document manager by Whiteox · · Score: 1
      --
      Don't be apathetic. Procrastinate!
    3. Re:Use a document manager by ls671 · · Score: 1

      worth a look too:

      https://tika.apache.org/

      --
      Everything I write is lies, read between the lines.
  22. For Two-Millennia Durability... by nightcats · · Score: 4, Insightful

    ...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...

    --
    Development is programmable; Discovery is not programmable. (Fuller)
    1. Re:For Two-Millennia Durability... by jfdavis668 · · Score: 2

      We carve 16 bit Unicode into stone slabs for long term backup storage.

    2. Re:For Two-Millennia Durability... by gstoddart · · Score: 5, Informative

      Nonsense, bamboo can't touch papyrus for longevity, and you don't need to worry about pandas.

      Damned bamboo shills.

      And don't anybody go suggesting cave paintings, it's a completely dead platform.

      --
      Lost at C:>. Found at C.
    3. Re:For Two-Millennia Durability... by Anonymous Coward · · Score: 0

      Funny, bamboo strips and The Art of War never changes. I use it all the time and so does my Panda.

    4. Re:For Two-Millennia Durability... by cyberchondriac · · Score: 1

      And don't anybody go suggesting cave paintings, it's a completely dead platform.

      But.. but.. they've lasted the longest! Granted, they're not very mobile.

      --

      Look back up at my post, now look back down, you're on the Internet. Now look back up. I'm a signature.
    5. Re:For Two-Millennia Durability... by Megane · · Score: 1

      Cave paintings certainly have some migration issues.

      --
      #naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }
    6. Re:For Two-Millennia Durability... by Em+Adespoton · · Score: 1

      Cave paintings certainly have some migration issues.

      They also have the propensity to be overwritten from time to time, and many of them suffer from infrastructure collapse.

      And many never see the light of day.

      Petroglyphs are great, except that you'll still want to store them somewhere safe, as the weather has this tendency to destroy them over time.

    7. Re:For Two-Millennia Durability... by Anonymous Coward · · Score: 0

      Not sure how they scan electronically

      To upload your bamboo strips to the cloud, just use a lighter.

    8. Re:For Two-Millennia Durability... by Anonymous Coward · · Score: 0

      No, the longest recording media for legibility is the mud tablets with cuneiform writing left by the Sumerians. Those are still quite legible after 5,000 years. The technology remained in use for over 2800 years too!

    9. Re:For Two-Millennia Durability... by ChunderDownunder · · Score: 1

      Petroglyphs are more susceptible to treehuggers from Greenpeace than the weather.

    10. Re:For Two-Millennia Durability... by Anonymous Coward · · Score: 0

      And don't anybody go suggesting cave paintings, it's a completely dead platform.

      If anything, I think we'd prefer dead caves to the living ones.

  23. And a pony too? by gstoddart · · Score: 3, Insightful

    Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?

    Let's see ... you want to allow uploading in a large number of formats .. you want to magically turn it into a universal format ... while retaining all aspects of the original ... and will be easily manipulated ... and you want it in an open, and documented format? And all for free?

    I want one of those too. And a Red Ryder BB gun with a compass in the stock and this thing which tells time. And a new skateboard. And a pony.

    Honestly, you're asking for the holy grail of document management systems ... the universal, lossless document format.

    I'm not sure it exists. And I'm not sure companies like Microsoft or Adobe would allow it to exist.

    --
    Lost at C:>. Found at C.
    1. Re:And a pony too? by Anonymous Coward · · Score: 0

      lol, you nailed it straight on.

      We will just keep ADDing more complexity not simplifying.
      So asking for something better at this point is just dreamy.

    2. Re:And a pony too? by angelbar · · Score: 1

      You lost me with the Pony...

      --
      -no sig today-
    3. Re:And a pony too? by gstoddart · · Score: 3, Informative

      English idiom connoting yet another impossible thing in a child's unrealistic wishlist ... typically placed at the end of a series of outrageous demands: " ... and a pony".

      Now, please, don't make me pedantic you again to explain the cromulency of phrases. ;-)

      --
      Lost at C:>. Found at C.
  24. "Best" depends on intent by Phoenix+Rising · · Score: 2

    On the conversion side... If you're taking in PDFs created by a layout/page design program, then you're not likely to get good satisfaction converting them and storing them as something other than PDFs. OTOH, if you're taking in a lot of documents created in an office suite, and they have collaborative notes, and you need to retain the documents for legal purposes, then converting them to PDF is going to lose data.

    On the future use side: PDFs are slower to render and search than most formats; they're harder to alter, but they're more reliably rendered than any other format. Office documents offer richer content and easy editing; their layout may vary depending on the output device (good and bad), and office document formats seem to change a bit more than other document types. HTML with CSS is good, and probably now stable enough that future clients will render something similar - but it's not PDF for reliable formatting, nor office docs for feature richness; editing tools for HTML aren't all that intent on preserving what came before. LaTeX is a reliable formatter wrapped around text-centric documents, but it's not something most people will be able to use and edit.

    Each document type has its reasons for being - you'll need to decide why you need to store your documents and what you need them for in the future. Retaining the original document along with a text conversion stored and indexed in a search engine may be your best bet - or not.

    --
    Let us live so that when we come to die, even the undertaker will be sorry -- Mark Twain
    1. Re:"Best" depends on intent by Vitriol+Angst · · Score: 1

      Sometimes you people make things WAY too complicated.

      In our 'best judgement' -- what's a very open standard for documents? Now, we can ask "what type of document" -- and we can also try and answer for whatever documents we know.

      So here goes;

      Documents; Try RTFD. Rich Text Formatted Document. It might not be perfect in layout -- but it's open, and accessible to a lot of apps and cross platform. If you get bad results, you might just need to switch to some other "open" app. OpenOffice on all platforms will likely have consistent results but I haven't tried this. I use "Bean" on the Mac for a lightweight text editor and have no trouble.

      PDF is good if you need to preserve the look and feel and for the most part -- it's accessible even without paying Adobe. Higher end features require an editor -- but you can have text, images and basic hyper links without cost. There are open source tools available. Adobe of course is a for profit company, but you can get 90% of everything you need with the free and "accessible" standard it has become. It isn't open -- but the PDF format won't change for anything it is compatible with right now.

      SVG is a vector based image format. PNG is an image format. JPEG is a lossy compression format. All highly available.

      Not so sure for 3D but Collada may be the best. Obj and DXF are old as dirt and don't transfer a lot of information like vertex normals correctly -- at least from discussions I've read. Someone with more experience should weigh in on this topic.

      --
      >>"ad space available -- low rates!!!"
  25. Docbook? by Enry · · Score: 3, Funny

    Docbook allows you to separate out the content from the presentation. You write in XML and define paragraphs, chapters, images, etc. and then leave it to the various stylesheets to drive how it looks when it comes out the other end - PDF, HTML, Word, whatever, and the stylesheet makes sure that if some features are supported (hyperlinking from the table of contents to the chapter) it'll be included in there. Since the content is in plain ol' XML you can use any kind of XML processor to go through it.

    1. Re:Docbook? by gstoddart · · Score: 1

      OK, grandpa, it's time for your meds again ... look, Matlock is about to start ... no, they're not on your lawn. ;-) [ Wow, an actual 3-digit id ]

      Honestly, for those of us old enough to still have a copy of Goldfarb's book, this has been the holy grail for a very long time.

      But in practice, there's still no tools to convert all those formats to it, and most anything you do is going to be custom code.

      As a system which takes other formats as input, docbook falls into the category of wishful thinking.

      Even us old SGML geeks don't see it as really being a viable solution. It presupposes that all of your content starts in that format, and that's really unrealistic.

      --
      Lost at C:>. Found at C.
    2. Re:Docbook? by Anonymous Coward · · Score: 0

      "plain ol' XML" doesn't exist.

    3. Re:Docbook? by Zontar+The+Mindless · · Score: 1

      But in practice, there's still no tools to convert all those formats to it, and most anything you do is going to be custom code.

      I love DocBook. But I've also spent a fair portion of my lifespan persuading other formats to turn into it. Not the most fun I've ever had, to say the least.

      --
      Il n'y a pas de Planet B.
    4. Re:Docbook? by Anonymous Coward · · Score: 0

      (Open|Libre) Office uses XSL to transform some XML into some formats.
      Perhaps XSL tools are not very popular, but I bet that there are more of them than what we imagine.

  26. Re: Starlight Glimmer 2016 by gstoddart · · Score: 2

    Dude ... can I point out to you that you got the reference and that many of us wouldn't know WTF it was?

    So maybe your question is how many other Slashdotters are Bronies besides you? ;-)

    And, for the record, I included that link because I had to google it to find out what it meant.

    Now if you will excuse me I need to go apply brain bleach. The images which came up in that google search are terrifying.

    --
    Lost at C:>. Found at C.
  27. Re: Starlight Glimmer 2016 by jfdavis668 · · Score: 0

    Shh... Mods are asleep, post Ponies!

  28. Impossible question to answer by Saanvik · · Score: 1

    There is no "best" document format, open or otherwise, for "easy search, compression, rendering, etc." because those words are too fuzzy.

    • What does rendering mean (print, screen, mobile, or ...)?
    • What is the search scope ("this" document or multiple documents)?
    • How important is compression?

    If your use case is a typical one, then you actually want, for maximum search functionality, text (perhaps with some form of markup so you can assign weights to segments, like higher weight for titles), a HTML5 based website for screen rendering (including mobile), and PDFs for print. Depending on what the user wants, they can pick.

    If you want to pick one format, then you need to weigh every factor and decide what matters most to you, but don't ask us to guess.
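
    The "higher weight for titles" idea is simple to sketch. This is a toy scorer under my own assumptions, not how any particular search engine ranks: the field names and the weights in `FIELD_WEIGHTS` are arbitrary, and a document is just a dict of field name to text.

```python
import re
from collections import Counter

# Hypothetical weights: a hit in the title counts three times a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokens(text: str) -> list:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(doc: dict, query: str) -> float:
    """Sum weighted term frequencies of the query words over each field."""
    query_terms = tokens(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(tokens(doc.get(field, "")))
        total += weight * sum(counts[term] for term in query_terms)
    return total


if __name__ == "__main__":
    doc = {"title": "Open formats", "body": "PDF and ODF are formats"}
    print(score(doc, "formats"))
```

    In practice you would hand this job to a real search engine and just tell it which fields to boost; the point is only that the stored text needs enough markup to tell a title from a paragraph.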

  29. LaTeX by Anonymous Coward · · Score: 0

    Convert it to LaTeX and make everyone hate you.

  30. EDI? by ArhcAngel · · Score: 1

    Will this be for sharing information with 3rd parties? If your documents consist of a set of data you will need access to, EDI (electronic data interchange) was designed to store the data in a standard format and be able to inject that data into any document format. It's not so helpful if you are just archiving Word documents or emails. There are a number of companies that assist in converting your documents to EDI.

    --
    "A person is smart. People are dumb, panicky dangerous animals and you know it." - K
  31. There Is No Substitute by Anonymous Coward · · Score: 0

    .docx

  32. Obligatory XKCD reference... by Anonymous Coward · · Score: 0

    There's an XKCD comic for everything.

  33. Re: Starlight Glimmer 2016 by Anonymous Coward · · Score: 0

    To be fair, doing a google image search for pretty much anything is dangerous.

  34. Re: Starlight Glimmer 2016 by gstoddart · · Score: 1

    To be fair, doing a google image search for pretty much anything is dangerous.

    I didn't. But google was "helpful" enough to throw up related images along with the search results.

    And now I shall ever be traumatized that 'adults' are dressing up like that.

    I can't simply unsee that. I'm going to make it the new Rick rolling ... just randomly stick in links to bronies. Spread around the pain.

    --
    Lost at C:>. Found at C.
  35. Don't change the documents by Anonymous Coward · · Score: 0

    Don't change the documents to a different format.
    Because...
    1. You're altering the format and might lose the original.
    2. Some formats, depending on how they are created, don't convert very well to other formats.

    For example, some PDF files don't use top spacing but page coordinates, and when converted everything is placed at the top of each page.

  36. Use pandoc on incoming files by WillAdams · · Score: 1

    Convert to your choice of XML and store that

    Use pandoc to convert to whatever format is requested. If a document is requested and edited, use pandoc to read in the edited version and store that.

    Once you've trained everyone to accept the lowest common denominator, it'll work.

    For bonus points you could go straight to MediaWiki markup and put everything into a wiki.
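
    A minimal sketch of driving pandoc from a script, assuming pandoc is installed and on the PATH. The wrapper names `pandoc_command` and `convert` are mine; the only pandoc options used are `-t` (output format) and `-o` (output file), and the input format is left for pandoc to infer from the file extension. Keep in mind that pandoc can write PDF but cannot read it, and it does not read legacy .doc, so those uploads need another route.

```python
import subprocess


def pandoc_command(src: str, dst: str, to_format: str) -> list:
    """Build the argument list for one pandoc conversion."""
    # pandoc guesses the input format from the extension of src.
    return ["pandoc", src, "-t", to_format, "-o", dst]


def convert(src: str, dst: str, to_format: str = "docbook") -> None:
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(src, dst, to_format), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without pandoc.
    print(" ".join(pandoc_command("upload.docx", "stored.xml", "docbook")))
```

    Serving a stored document in another format is the same call in the other direction, for example `convert("stored.xml", "download.docx", "docx")`.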

    --
    Sphinx of black quartz, judge my vow.
  37. odf by JohnVanVliet · · Score: 0

    You do know that "ODF" stands for "open document format",
    and it is the DEFAULT format for LibreOffice and OpenOffice,
    and is the STANDARD for a few countries and at least one state (California)?

    --
    "I don't pitch OpenSUSE Linux to my friends, i let Microsoft do it for me
  38. How about by pahles · · Score: 2

    Markdown? It's easy to write, read, render, compress, search.

    --
    Sig?
    1. Re:How about by Anonymous Coward · · Score: 0

      Plain paper?

  39. Markdown by Anonymous Coward · · Score: 0

    Like others have said, converting will not end well. If you want to encourage a format for new documents though consider markdown. Markdown is stored as text and can be converted to html, allowing for hyperlinks, headers, code blocks etc. Since it is stored as plain text, searching, version control, longevity and compression are easy. There are a ton of implementations but the one looking to be the "standard" is CommonMark. This implementation is supported by Reddit, GitHub, and StackExchange. Try it out!

  40. plain text by Anonymous Coward · · Score: 0

    nuff said.

  41. Re: Starlight Glimmer 2016 by ArcadeMan · · Score: 1

    http://img3.wikia.nocookie.net...

    I do share your feelings about the cosplaying though... unless it's a cute girl dressed like Fluttershy. If that's okay with her, I mean.

  42. Bad idea by Daniel+Hoffmann · · Score: 1

    Your documents will lose formatting when the files are converted. If you want users to be able to download the files in any format, you should just store the files the way the user uploaded them and convert directly from that. Create a plain text metadata version for search, and maybe a visualization version so that the user can see the files inside your application; for this visualization version you should just use the easiest method.

    Of course this depends heavily on your requirements.
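
    Roughly what I mean, as a toy in-memory sketch: keep the uploaded bytes untouched and build the search index from a separate plain text extraction. Everything here is hypothetical (the `DocumentStore` class, using a SHA-256 of the upload as the ID, the one-word search), and the text extraction itself is assumed to happen elsewhere.

```python
import hashlib
import re
from collections import defaultdict


class DocumentStore:
    """Keeps originals byte-for-byte and indexes a text rendition for search."""

    def __init__(self):
        self.originals = {}            # doc_id -> original uploaded bytes
        self.index = defaultdict(set)  # word -> set of doc_ids containing it

    def add(self, original: bytes, extracted_text: str) -> str:
        """Store the upload and index its extracted text; return the doc ID."""
        doc_id = hashlib.sha256(original).hexdigest()
        self.originals[doc_id] = original
        for word in set(re.findall(r"\w+", extracted_text.lower())):
            self.index[word].add(doc_id)
        return doc_id

    def search(self, word: str) -> set:
        """Return the IDs of documents whose text contains the word."""
        return set(self.index.get(word.lower(), set()))


if __name__ == "__main__":
    store = DocumentStore()
    doc_id = store.add(b"%PDF-fake bytes", "Quarterly report for Acme")
    print(store.search("report") == {doc_id})
```

    If the extraction turns out to be wrong or lossy, you rerun it against the untouched original instead of being stuck with a damaged conversion.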

  43. That's easy by ArcadeMan · · Score: 1

    Just convert all the documents into 1200 DPI, 32-bit PNG images.

  44. XHTML5 exists by tepples · · Score: 1

    The HTML5 introduction states that XHTML is one of the two syntax forms of HTML5.

    1. Re:XHTML5 exists by Dracos · · Score: 1

      But HTML5 treats XHTML syntax as a resented stepchild because WHATWG hates XML so much. All the sloppy markup that the rest of the spec advocates should be treated that way instead.

  45. LaTeX CoNDoM by tepples · · Score: 3, Funny

    LaTeX is the CoNDoM that protects you from Microsoft Office ViRuSeS.

    1. Re:LaTeX CoNDoM by Anonymous Coward · · Score: 0

      Who's letting the teenagers in here?

  46. Stupid file extension tricks by tepples · · Score: 1

    Heck, you could save your LaTeX files with a .tex extension and associate that with a script that invokes a TeX to PDF renderer followed by your preferred PDF reader.

    1. Re:Stupid file extension tricks by TechyImmigrant · · Score: 1

      With latex stored, you can render to anything when the user requests it. odf, pdf, docx, bitmap, tex, GIF. Take your pick.

      Push the techy stuff on the developer to make the user tasks no brainers.

      --
      I should use this sig to advertise my book ISBN-13 : 978-1501515132.
  47. Burning Bush by Etherwalk · · Score: 4, Funny

    ...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...

    Chisel it into stone tablets, then find an ignorant local. Set up a natural gas line to a nearby bush and hide behind a rock. Cub your hands to add a slight reverb effect and tell him to preach the chiselled word, then break the tablets and hide them in a box and trick nazis into looking at them.

    1. Re:Burning Bush by Anonymous Coward · · Score: 0

      Plot twist: The Ark of the Covenant was full of punch cards.

    2. Re:Burning Bush by Anonymous Coward · · Score: 0

      Why do you need your hands to be bitten by juvenile bears?

    3. Re:Burning Bush by Anonymous Coward · · Score: 0

      There's a serious problem with transcription errors. The original said "celebrate", not "celibate", for priests.

  48. Paper by Anonymous Coward · · Score: 0

    Try paper. It is a universal format open to all and you can convert any document type (.txt, odf, doc, docx, pdf, etc.) to this format. It never goes obsolete and is easily searchable by laying it out on a flat surface. This technique can also be used for rendering the document. You can easily compress it by stacking the documents. Users can upload it by mailing it to a PO box. You will need to set up a cron job to actually have someone go retrieve the document from the box periodically though. No specialized equipment or technology is required for this method, so it can be deployed within your current infrastructure.

  49. ODF, duh. by Anonymous Coward · · Score: 0

    Seriously, the committee may be the longest-windiest way of getting a standard, but the needs of everyone there get answered when it does get to an answer, and the ODF standards committee includes just about everyone whose very life is based on documents. In short, if your needs aren't solved by what THEY needed, then your needs are either manufactured wants, rather than needs, or so very specific and uncommon that NO standard is possible, in which case roll your own. Nobody else will be able to use it, but then they won't have any needs that are compatible with yours, in the latter case.

  50. Dear Slashdot. by Anonymous Coward · · Score: 0

    I understand that "best" is a relative concept and relative to what you're trying to accomplish. Despite that, I'm going to give you a 2-sentence description of what I'm doing, and expect you to choose what you think is "best".

    Feel free to make wild assumptions about what I'm doing that may be completely inaccurate, while at the same time not even realizing you're making these assumptions.

  51. Three letters by plcurechax · · Score: 1

    DNA

    Millions of years of field testing, and it still mostly works, and DNA itself is not patented.

  52. HTML - this is sort of the point of HTML.. by JacobA.Munoz · · Score: 1
    I would think HTML is the obvious choice, as it is the most widely implemented document format. "Document Object Model" isn't just a name.
    • open - just about as open as any format can be!
    • searchable - yes, very, especially with Microdata or RDFa attributes to describe data content.
    • portable - yes, very. Virtually all modern operating systems support at least one decent web browser, and there are countless other HTML/XML tools out there.
    • compressible - yes, depending on how much "compression" really means to you. Assuming you're simply storing the actual text content and not images of text, HTML is obviously superior to all other formats. If you need graphics and diagrams (not photos), you can use embedded SVG to store near-perfect resolution images and shapes, and embedded data URIs to hold the photos/images right there in the document with "data:image/png;base64,...".
    • editable - yes, surprised? It may not always be pretty or easy, but you can use MS Office, OpenOffice, or LibreOffice to load/edit/save HTML documents. Don't let your head explode... I know HTML documents generated by these editors are ugly as sin, but you can always edit the tags yourself (which I personally prefer) if you don't want to fight with a fickle GUI. You can't really do anything like that in a .doc or .pdf. To me, nothing beats a good clean text editor.

    I'm just not sure why it's a question.. it's the format . . for .. um.. documents.
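
    The data URI embedding mentioned in the list can be sketched in a few lines. This is only an illustration, assuming the image bytes are already in memory; the helper names `to_data_uri` and `inline_image_tag` are hypothetical:

```python
import base64
import html


def to_data_uri(data, mime_type="image/png"):
    """Encode raw bytes as a data: URI that can sit directly in HTML."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_image_tag(data, alt="", mime_type="image/png"):
    """Build an <img> element whose pixels travel inside the document."""
    safe_alt = html.escape(alt, quote=True)
    return f'<img alt="{safe_alt}" src="{to_data_uri(data, mime_type)}">'
```

    Base64 inflates the payload by about a third before any compression, so this trades size for having a single self-contained file.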

    1. Re:HTML - this is sort of the point of HTML.. by Zontar+The+Mindless · · Score: 1

      HTML is primarily a layout language. It does diddley-squat for semantics. You need DocBook XML, or something like it, if you're going to go that route... and good luck with converting to it from something like Word.

      --
      Il n'y a pas de Planet B.
  53. Fonts are copyrightable in some cases by Anonymous Coward · · Score: 0

    Therefore you cannot store PDFs with embedded fonts. If you can't store embedded fonts, then you can't confirm "original rendering" with any format, including PDF. Therefore FORGET the idiotic meme of "original format".

    What if the original format is being read by someone blind? Text to speech? That changes the rendering format. Colour blind? Change colours? Changes the rendering format. Poor eyesight? Increase font size, but that again changes the rendering format. Reading on a TV screen? Changed. 4:3 monitor? Changed. Tablet? Changed.

    DO NOT GIVE A FLYING FUCK over keeping the original formatting and layout.

    You need 100.000000000000000% of the information. Fuck the formatting. It's pointless. Flow is necessary because it's really acting like punctuation and signposting. But a contents or index doesn't demand you read every page in order. And formatting shouldn't demand you view the words in exactly the same place either.

    NOBODY should get to claim the format you need for archive, index, retrieval and display of their documents MUST replicate the format exactly. If they do, tell them to fuck off and get their own system, or write a document that doesn't demand precise placement of every word on the screen to be understood by the reader. Because that can only be guaranteed by displaying on the original machine with the setup that it had at the time.

    If someone demands that fidelity, then they should be told to keep that machine locked down to the settings that were made and kept there available for people to sit at and watch at their expense.

    Because your job is to get the information to people. Not the placement of words.

    1. Re:Fonts are copyrightable in some cases by Em+Adespoton · · Score: 1

      Fonts are copyrightable in some cases

      Therefore you cannot store PDFs with embedded fonts. If you can't store embedded fonts, then you can't confirm "original rendering" with any format, including PDF.

      How to get copyright wrong on multiple levels....

      1) ALL fonts are copyrighted; they haven't been around long enough (even the old bitmapped fonts) to fall out of copyright. The difference is in how they're licensed (most are freely shareable, but some have specific restrictive licenses, such as Apple Garamond).

      2) Copyright doesn't prevent you from possessing a PDF document with fonts in it that have restrictive copyright. What it prevents is you SHARING that document with others. Even if you didn't buy the license to use the font yourself, it's not illegal to possess a PDF that contains it.

      The annoying thing is that there's PDF and there's PDF... if you're dealing with a fully encapsulated 1.3 or 1.4 document, you're set. If you're dealing with a 1.7 document whose fonts are external on some font server, so much for WYSIWYG. You're better off using PNG or JPEG2000 (which is now out of patent protection IIRC) to keep an exact representation, linked to a Unicode document containing the text -- in CSV format for all tables.

      But if you're creating the format yourself, PDF should be fine, as you can set the encoding yourself. If you create a PDF where each page is actually a PNG image, with cleartext encoding linked to it, then it's both exactly as the original, and fully searchable. This doesn't work as well for tables, but Acrobat Reader at least can interpret chunked table data in a PDF and recombine it into tab separated values (TSV). Many other PDF parsers get it wrong and just read the embedded objects serially, totally garbling up the order.

      Just remember to keep the originals; you might not be able to read MacWrite documents or ClarisWorks 1.0 documents currently, but hey -- there's always emulation of the original system, multi-step conversion, or someone reverse-engineering the format years down the line. And if you have a "modern" format that shows how it's actually supposed to look, then you can re-create that look in a modern format.

  54. Use cases? by PPH · · Score: 1

    Who is going to use these stored documents? How will they be used (read-only, revise and check in, etc.)? What tools are authors generating these documents with? Answers to these questions will help determine the best storage format.

    For documents intended to be downloaded and read or string searched, PDF is a good choice. There are a lot of PDF readers for different O/Ss available.

    --
    Have gnu, will travel.
  55. Save Space, Switch to ODT by BrendaEM · · Score: 3

    I've written several books. Because ODTs have standard compression, they are usually much smaller. For a 109,683 word book, with styles and formatting:
    ODT: 271,090 bytes
    Docx: 300,057 bytes
    Word 97: 1,379,328 bytes
    PDF: 1,050,788 bytes

    If bytes cost money to store, ODT rules.

    Imagine yourself sticking with Word 97 because it's a reliable standard: imagine buying three times as much storage, as well as the backup for the storage.
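
    For anyone who wants the ratios spelled out, here is a small sketch using only the four byte counts listed above (the function name `ratios_to` is just for illustration). By these figures the Word 97 file comes out at roughly five times the ODT:

```python
# Byte counts copied from the list above.
sizes = {
    "ODT": 271_090,
    "Docx": 300_057,
    "Word 97": 1_379_328,
    "PDF": 1_050_788,
}


def ratios_to(baseline, sizes):
    """Express every size as a multiple of the chosen baseline format."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


for name, ratio in ratios_to("ODT", sizes).items():
    print(f"{name}: {ratio:.2f}x the ODT size")
```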

    --
    https://www.youtube.com/c/BrendaEM
  56. What Is the Best Open Document Format? by koan · · Score: 1

    One that others can easily use.

    --
    "If any question why we died, Tell them because our fathers lied."
  57. latex by Anonymous Coward · · Score: 0

    because the developer got tired of shit quality...http://www.latex-project.org/

  58. Easy Answer by Anonymous Coward · · Score: 0

    From the standpoint of longevity and compatibility the answer is simple: the more, the better.

    1. Re:Easy answer by Anonymous Coward · · Score: 0

      I think I'll pay a little extra and store it with just 1's

  59. Easy answer by Anonymous Coward · · Score: 0

    You store it as 1's and 0's and if you're cheap make do with just the 0's.

  60. Get an account on Box. Was Re: Forget the Univer by Anonymous Coward · · Score: 0

    Agree with all that. Box provides free storage, sharing, security, and free Solr indexing.

  61. Silly Rabbit, Trix are for kids.... by David_Hart · · Score: 3, Interesting

    No, just no....

    Store the documents in their original format.

    There are many possible reasons why you shouldn't mess with the originals such as formatting, legal implications, loss of content because one format supports stuff that the other doesn't, etc.

    The only way that I could see this working is if you converted everything to an open format but kept copies of the originals and linked to them. But if the plan is to dump the original documents, then it just isn't worth it....

    1. Re:Silly Rabbit, Trix are for kids.... by cerkit · · Score: 1

      I agree with this sentiment. However, .docx is already a "standard" that is also compressed. In fact, if you rename it to .zip and open it up, you can view the XML formatted documents it contains. Office OpenXML: http://www.ecma-international....
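
      The same peek inside works without renaming anything. A minimal sketch with Python's standard zipfile module follows; `make_stand_in` builds a tiny fake archive so the example runs without a real Word file, and its XML content is a placeholder rather than a valid document. The main body of a real .docx lives in the part named word/document.xml:

```python
import io
import zipfile


def list_parts(docx_bytes):
    """List the files packed inside a .docx (it is an ordinary zip archive)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def read_part(docx_bytes, name="word/document.xml"):
    """Return one packed XML part as text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(name).decode("utf-8")


def make_stand_in():
    """Build a tiny zip that mimics the layout; NOT a valid Word document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", "<w:document>hello</w:document>")
    return buffer.getvalue()
```

      With a real upload, passing the file's bytes to `read_part` yields XML that can be fed to whatever search indexer you use.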

      --
      Michael Earls http://cerkit.com/
  62. txt by Anonymous Coward · · Score: 0

    txt txt txt or plain old ascii.

    They'll be readable on all systems and the easiest to search with grep.

  63. Stone tablets by Anonymous Coward · · Score: 0

    For longevity.

  64. Hybrid pdf by Anonymous Coward · · Score: 0

    Never heard of hybrid pdf? It's a "normal" pdf with its OpenDocument source embedded. LibreOffice will open the embedded OpenDocument that will remain completely editable (may it be a text, spreadsheet, scalable drawing, presentation...) and the pdf counterpart will complete the file. Remember, a pdf file is also a container, so the pdf version you will see with any pdf viewer will also integrate, hidden, its OpenDocument source.

    To create a hybrid pdf, simply use "Export as a pdf..." in the menu of LibreOffice and check the hybrid pdf box in the first options tab. Don't use "Direct export to pdf" unless you have checked the hybrid box before as the diverse options retain their state.

    OpenDocument is an ISO format for any Office Suite file.

  65. Re:.txt NSFW by Anonymous Coward · · Score: 0

    Don't click. It's that goat thing... I'm starting to think these people have problems.

    Slashdot should have a button to flag a post as inappropriate.

  66. Have you considered XML by Anonymous Coward · · Score: 0

    You could create an XML document for each then place the original document in a CDATA.
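
    A rough sketch of that wrapper, with one assumption added here: the original bytes are base64-encoded first, because a CDATA section cannot hold arbitrary binary data or the sequence "]]>". The element name `document` and the attribute names are invented for illustration:

```python
import base64
from xml.dom import minidom


def wrap_original(filename, payload):
    """Store the original file, base64-encoded, inside a CDATA section."""
    doc = minidom.Document()
    root = doc.createElement("document")  # hypothetical element name
    root.setAttribute("filename", filename)
    root.setAttribute("encoding", "base64")
    encoded = base64.b64encode(payload).decode("ascii")
    root.appendChild(doc.createCDATASection(encoded))
    doc.appendChild(root)
    return doc.toxml()


def unwrap_original(xml_text):
    """Recover the filename and the original bytes from the wrapper."""
    root = minidom.parseString(xml_text).documentElement
    encoded = "".join(node.data for node in root.childNodes)
    return root.getAttribute("filename"), base64.b64decode(encoded)
```

    Note that the payload is opaque to search in this form, so any searchable text would have to be extracted into separate elements.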

  67. PDF/A with embedded ODF by Anonymous Coward · · Score: 1

    You can do it with LibreOffice. So you have a faithful representation (PDF/A was designed for long-term archiving) and it's fully editable.

  68. LANG=POSIX, 7-bit ASCII text by Antique+Geekmeister · · Score: 1

    If you can't read it in flat text, it's not long-term reliable documentation.

  69. History has some proof of this. by MartinTem · · Score: 1

    just use pen and paper!

  70. my 1st and 2nd choice for document format by Skapare · · Score: 1

    i usually just go with the .TXT format but have been considering the compatible .RST format.

    --
    now we need to go OSS in diesel cars
  71. Frequently not "doable" by dbIII · · Score: 1

    Some of those files are just an XML wrapper around a binary format for which the documentation is not available outside of Microsoft. The wrapper meets the legal obligations but the file format in such cases is ultimately useless in the long term.
    Meanwhile I can import seismic data from the early 1970s into current software without any conversion - simply because the file format is documented instead of Microsoft's later step backwards.

    1. Re:Frequently not "doable" by RockDoctor · · Score: 1

      Meanwhile I can import seismic data from the early 1970s into current software without any conversion -

      Strange, but that is exactly what I am doing at the moment. Or at least, I was doing until a few moments ago, when the task finished. Hi ho! back to the grindstone!

      --
      Birds are not dinosaur descendants; birds are dinosaurs, for all useful meanings of "birds", "are" and "dinosaurs"
  72. You don't want it to by dbIII · · Score: 1

    The desktop publishing software I used on an Atari ST back in the day is far better suited to the task than even the current MS Word. While Word has gone halfway to being DTP software, the real thing has a few differences in the way things are done that avoid the massive time sink you get if you try to treat MS Word like DTP software.

  73. Self-contained HTML document with data URIs by Anonymous Coward · · Score: 0

    I would love to see the annoying to produce, even more annoying to read PDF format replaced by HTML documents with all resources embedded as data URIs. Anyone tried creating self-contained HTML for this purpose?

  74. Self-contained HTML document with data URIs by rajder · · Score: 1

    I would love to see the annoying to produce, even more annoying to read PDF format replaced by HTML documents with all resources embedded as data URIs. Anyone tried creating self-contained HTML for this purpose?

  75. The best Open Document Format? by Trogre · · Score: 1

    The answer's right there in the question...

    --
    "Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife