Ask Slashdot: What Is the Best Open Document Format?
kramer2718 writes: I am working on a project that requires uploading and storing of documents. Although the application will need to allow uploading of .docx, doc, .pdf, etc, I'd like to store the documents in a standard open format that will allow easy search, compression, rendering, etc. Which open document format is the best?
Since "best" can be highly driven by circumstances, please explain your reasoning, too.
if you use the APIs supplied by their creators?
on what your data stored in those documents is.
I'd store them as docx, doc, .pdf, etc.
...or even XHTML
http://www.pdfa.org/2011/08/pd...
Let's make like a bird... and get the flock outta here.
I would suggest, unless you have a pressing need to convert them, that you should store the documents in the formats they are uploaded in.
Whenever you convert a document you run the risk of completely messing up the layout, style, etc.
.txt. If you need pretty formatting, fill it with LaTeX tags.
Either you truly wish to use something "open", at which point the only choice is ODF, or you simply want something that can be widely used. If the latter is the case, ODF is still good, perhaps PDF.
PDF allows accurate rendering so it's the best choice. It will be a hot mess if you use anything else. Conversion of such complex formats is very error-prone for layout problems.
1) Forget the Universal Format approach - your users will kill you for messing up their formatting, and you'll never get complete feature parity
2) Store the docs in their original format
3) Get Apache Solr to search your content
4) You'll be spending a lot of time on #3, so leave time to tinker
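For what it's worth, points 2 and 3 can be sketched in a few lines. This is only a toy stand-in for Solr, not Solr itself: the `DocumentStore` class and its method names are made up for illustration, and it assumes the plain text has already been extracted from each upload by some other tool.

```python
import re
from collections import defaultdict


class DocumentStore:
    """Keeps uploaded bytes untouched and indexes separately extracted text."""

    def __init__(self):
        self.originals = {}            # doc_id -> (filename, original bytes)
        self.index = defaultdict(set)  # lowercase word -> set of doc_ids

    def add(self, doc_id, filename, original, extracted_text):
        # The original is stored exactly as uploaded; only the text copy is tokenized.
        self.originals[doc_id] = (filename, original)
        for word in re.findall(r"\w+", extracted_text.lower()):
            self.index[word].add(doc_id)

    def search(self, query):
        # Return the ids of documents that contain every word in the query.
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        hits = [self.index.get(w, set()) for w in words]
        return set.intersection(*hits)
```

The point of the design is that the search side can be thrown away and rebuilt at any time, because the uploaded bytes are never altered.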
The one that everyone involved in the project can read.
Word Perfect Document, because it's been consistent for nearly 20 years. It has a simple underlying format, it's more finely granular than HTML, and because I just like obsolete things.
Is there a technical reason why you need open source formats, or you just want to look cool and hip to others (friends, customers, etc.)?
MS Office formats are widely used and accepted all over the world and offer the best integration with most cloud providers (quick viewing, for instance).
My .02
Are you writing the search/compression/render capability from scratch, or are you using a library to handle that job for you?
If you're handling more than one document type, then go for a library. I don't have a recommendation myself, but I'm sure you can find them on a search.
Also, don't worry about compression, as modern .odf/.docx is already compressed with something compatible with PKZIP.
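If you want to check that claim on your own files, something like the following should do. It is a minimal sketch using only Python's standard `zipfile` module; `describe_container` and `make_fake_odt` are hypothetical helper names, and the fake file is just a ZIP with an ODF-style `mimetype` entry, not a real document.

```python
import io
import zipfile


def describe_container(data: bytes) -> dict:
    """Report whether the bytes are a ZIP container and how much they shrink."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return {"is_zip": False, "stored": len(data), "uncompressed": len(data)}
    with zipfile.ZipFile(buf) as zf:
        infos = zf.infolist()
        return {
            "is_zip": True,
            "stored": sum(i.compress_size for i in infos),
            "uncompressed": sum(i.file_size for i in infos),
        }


def make_fake_odt() -> bytes:
    """Build a stand-in for an .odt upload (illustrative content only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<text>" + "hello world " * 500 + "</text>")
    return buf.getvalue()
```

Feeding a real .docx or .odt through `describe_container` shows how little there is left to gain from compressing it a second time.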
HTML is standard enough, has the maximum reusability and readability, while being exceptionally lighter than other formats.
As an IT person, I hate questions like this. There's not enough information to give a solid answer. For example:
* What kinds of documents are you talking about? Text? Photos? Spreadsheets?
* When you say that the application will need to allow ".docx, doc, .pdf, etc", what formats are in "etc"?
* What is the source of the documents? Are these currently printed out documents that need to be scanned back in? Are they currently digital, and in a particular file format?
* What will people need to do with them when these documents are retrieved? Do they need to be able to edit the documents?
* How much does formatting matter? If someone retrieves the document in 5 years, will it be important that all the line breaks and page breaks are in the same place? Does it need to have all of the correct fonts? Or are you more interested in being able to have access to the information itself?
There may be many other relevant questions, my point is that there just isn't enough detail here. In general, if the most important thing is that you have a printable document that you want to be able to print out from any machine, maintaining the formatting as much as possible, then PDF is a pretty good choice (be sure to embed the fonts and include searchable text!). If you already have a bunch of Word documents and you want the formatting unchanged, and would like the capability to edit the document after it's retrieved, then I'd typically just recommend keeping it as a .docx. It keeps things simple, will be widely supported, and prevents the risk of something going wrong while you're converting to another format. If you like the idea of using .docx because of what I just said, but want something more "open", then ODF is probably worth looking into.
Really, there are only so many choices, and each has advantages depending on your specific needs.
"Best" in this case depends on your needs and resources, not on standards or common practices. Flexibility and "getting the job done" are more important than what everyone else prefers. Do what works for you.
Ever since the OMG Ponies! incident... Slashdot just hasn't been the same...
I'd highly recommend leaving them in their original format, or if anything, converting them all to .pdf. Conversion is always fraught with danger, and you will be spending an awful lot of time getting to know the intricacies of Microsoft Word if you go this route. PDFs display equally nicely on every operating system, they archive very well, and almost every tool out there can read them - but while converting documents to this format usually works better than others, I'd still be very careful to watch for mistakes in conversions.
"Set a man a fire, he'll be warm for the rest of the night. Set a man afire, he'll be warm for the rest of his life."
All of them. No, really. Transform everything into every format you can and save them all. That way, even if /. is collectively incapable of predicting format longevity, you're most likely to have a copy of everything in a format that's still understandable.
Or include the source code (as text files) for any interpreters you assume will never die. That should be vaguely future proof, right?
There are many premade document management systems. They generally will store their indices in a database format for quick searching. Why not store them in their native formats and leave it up to a document management system to handle it for you?
...you can't beat bamboo strips. The oldest original versions of Lao Tzu's Tao Te Ching are written on rolls of bamboo strips. Not sure how they scan electronically, and you will have to keep your pet pandas away from them, but for document durability, you can't beat that format...
Development is programmable; Discovery is not programmable. (Fuller)
Let's see ... you want to allow uploading in a large number of formats ... you want to magically turn it into a universal format ... while retaining all aspects of the original ... and it will be easily manipulated ... and you want it in an open and documented format? And all for free?
I want one of those too. And a Red Rider BB gun with a compass in the stock and this thing which tells time. And a new skateboard. And a pony.
Honestly, you're asking for the holy grail of document management systems ... the universal, lossless document format.
I'm not sure it exists. And I'm not sure companies like Microsoft or Adobe would allow it to exist.
Lost at C:>. Found at C.
On the conversion side... If you're taking in PDFs created by a layout/page design program, then you're not likely to get good satisfaction converting them and storing them as something other than PDFs. OTOH, if you're taking in a lot of documents created in an office suite, and they have collaborative notes, and you need to retain the documents for legal purposes, then converting them to PDF is going to lose data.
On the future use side: PDFs are slower to render and search than most formats; they're harder to alter, but they're more reliably rendered than any other format. Office documents offer richer content and easy editing; their layout may vary depending on the output device (good and bad), and office document formats seem to change a bit more than other document types. HTML with CSS is good, and probably now stable enough that future clients will render something similar - but it's not PDF for reliable formatting, nor office docs for feature richness; editing tools for HTML aren't all that intent on preserving what came before. LaTeX is a reliable formatter wrapped around text-centric documents, but it's not something most people will be able to use and edit.
Each document type has its reasons for being - you'll need to decide why you need to store your documents and what you need them for in the future. Retaining the original document along with a text conversion stored and indexed in a search engine may be your best bet - or not.
Let us live so that when we come to die, even the undertaker will be sorry -- Mark Twain
DocBook allows you to separate out the content from the presentation. You write in XML and define paragraphs, chapters, images, etc. and then leave it to the various stylesheets to drive how it looks when it comes out the other end - PDF, HTML, Word, whatever, and the stylesheet makes sure that if some features are supported (hyperlinking from the table of contents to the chapter) they'll be included in there. Since the content is in plain ol' XML you can use any kind of XML processor to go through it.
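To illustrate the "any kind of XML processor" part, here is a small sketch with Python's built-in `xml.etree.ElementTree`. The `SAMPLE` fragment is a simplified, namespace-free DocBook-style document invented for the example (real DocBook 5 files carry a namespace, which this sketch does not handle), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified DocBook-like sample, made up for illustration.
SAMPLE = """<book>
  <title>Storage Handbook</title>
  <chapter><title>Formats</title><para>ODF, PDF and friends.</para></chapter>
  <chapter><title>Search</title><para>Index the text.</para></chapter>
</book>"""


def chapter_titles(xml_text):
    """Return the title of every <chapter> element, in document order."""
    root = ET.fromstring(xml_text)
    return [ch.findtext("title", default="") for ch in root.iter("chapter")]


def plain_text(xml_text):
    """Flatten all text content into one whitespace-normalized string."""
    root = ET.fromstring(xml_text)
    return " ".join(" ".join(root.itertext()).split())
```

The flattened text from `plain_text` is the sort of thing you would hand to a search index, while the XML itself stays the stored source.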
Dude ... can I point out to you that you got the reference and that many of us wouldn't know WTF it was?
So maybe your question is how many other Slashdotters are Bronies besides you? ;-)
And, for the record, I included that link because I had to google it to find out what it meant.
Now if you will excuse me I need to go apply brain bleach. The images which came up in that google search are terrifying.
Shh... Mods are asleep, post Ponies!
There is no "best" document format, open or otherwise, for "easy search, compression, rendering, etc." because those words are too fuzzy.
If your use case is a typical one, then you actually want, for maximum search functionality, text (perhaps with some form of markup so you can assign weights to segments, like higher weight for titles), an HTML5-based website for screen rendering (including mobile), and PDFs for print. Depending on what the user wants, they can pick.
If you want to pick one format, then you need to weigh every factor and decide what matters most to you, but don't ask us to guess.
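The "higher weight for titles" idea can be sketched roughly like this. The weights, field names and scoring rule are all assumptions picked for illustration; a real search engine would use something more refined than raw word counts.

```python
import re

# Assumed weights: a hit in the title counts three times a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def score(doc, query):
    """Sum the weighted count of each query word in each field."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(doc.get(field, ""))
        for term in tokenize(query):
            total += weight * words.count(term)
    return total


def rank(docs, query):
    """Return ids of matching documents, best score first."""
    scored = [(score(d, query), d["id"]) for d in docs]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in ordered if s > 0]
```

With this scheme a document that mentions the search term only in passing ranks below one that has it in the title.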
Convert it to LaTeX and make everyone hate you.
Will this be for sharing information with 3rd parties? If your documents consist of a set of data you will need access to, EDI (electronic data interchange) was designed to store the data in a standard format and be able to inject that data into any document format. It's not so helpful if you are just archiving Word documents or emails. There are a number of companies that assist in converting your documents to EDI.
"A person is smart. People are dumb, panicky dangerous animals and you know it." - K
.docx
There's an XKCD comic for everything.
To be fair, doing a google image search for pretty much anything is dangerous.
I didn't. But google was "helpful" enough to throw up related images along with the search results.
And now I shall ever be traumatized that 'adults' are dressing up like that.
I can't simply unsee that. I'm going to make it the new Rick rolling ... just randomly stick in links to bronies. Spread around the pain.
Don't change the documents to a different format.
Because...
1. You're altering the format and might lose the original.
2. Some formats, depending on how they are created, don't convert very well to other formats.
For example, some PDFs don't use top spacing but page coordinates, and when converted everything is placed at the top of each page.
Convert to your choice of XML and store that
Use pandoc to convert to whatever format is requested. If a document is requested and edited, use pandoc to read in the edited version and store that.
Once you've trained everyone to accept the lowest common denominator, it'll work.
For bonus points you could go straight to MediaWiki markup and put everything into a wiki.
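A minimal sketch of the pandoc round trip, assuming the `pandoc` command-line tool is installed and on the PATH. The two function names are hypothetical; `-t` (output format) and `-o` (output file) are standard pandoc options, and the input format is left for pandoc to infer from the file extension.

```python
import shutil
import subprocess


def pandoc_command(src, dst, to_format):
    # -t selects the output format, -o names the output file; pandoc
    # guesses the input format from the source file's extension.
    return ["pandoc", src, "-t", to_format, "-o", dst]


def convert(src, dst, to_format):
    """Run pandoc if it is installed; return True only on success."""
    if shutil.which("pandoc") is None:
        return False
    result = subprocess.run(pandoc_command(src, dst, to_format))
    return result.returncode == 0
```

For example, `convert("stored.xml", "out.docx", "docx")` would produce a Word file on request, with the caveat raised elsewhere in the thread that anything pandoc cannot represent is lost on the way.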
Sphinx of black quartz, judge my vow.
You do know that "ODF" stands for "Open Document Format",
and it is the DEFAULT format for LibreOffice and OpenOffice,
and is the STANDARD for a few countries and at least one state (California)?
"I don't pitch OpenSUSE Linux to my friends, i let Microsoft do it for me
Markdown? It's easy to write, read, render, compress, search.
Sig?
Like others have said, converting will not end well. If you want to encourage a format for new documents though consider markdown. Markdown is stored as text and can be converted to html, allowing for hyperlinks, headers, code blocks etc. Since it is stored as plain text, searching, version control, longevity and compression are easy. There are a ton of implementations but the one looking to be the "standard" is CommonMark. This implementation is supported by Reddit, GitHub, and StackExchange. Try it out!
nuff said.
http://img3.wikia.nocookie.net...
I do share your feelings about the cosplaying though... unless it's a cute girl dressed like Fluttershy. If that's okay with her, I mean.
Get free satoshi (Bitcoin) and Dogecoins
Is a brand of adult diapers.
Your documents will lose formatting when the files are converted. If you want users to be able to download the files in any format, you should just store the files the way the user uploaded them and convert directly from those. Create a plain-text metadata version for search, and maybe a visualization version so that the user is able to see the files inside your application; for this visualization version you should just use the easiest method.
Of course this depends heavily on your requirements.
Just convert all the documents into 1200 DPI, 32-bit PNG images.
The HTML5 introduction states that XHTML is one of the two syntax forms of HTML5.
LaTeX is the CoNDoM that protects you from Microsoft Office ViRuSeS.
Heck, you could save your LaTeX files with a .tex extension and associate that with a script that invokes a TeX to PDF renderer followed by your preferred PDF reader.
Chisel it into stone tablets, then find an ignorant local. Set up a natural gas line to a nearby bush and hide behind a rock. Cup your hands to add a slight reverb effect and tell him to preach the chiselled word, then break the tablets and hide them in a box and trick Nazis into looking at them.
Try paper. It is a universal format open to all and you can convert any document type (.txt, odf, doc, docx, pdf, etc.) to this format. It never goes obsolete and is easily searchable by laying it out on a flat surface. This technique can also be used for rendering the document. You can easily compress it by stacking the documents. Users can upload it by mailing it to a PO box. You will need to set up a cron job to actually have someone go retrieve the document from the box periodically though. No specialized equipment or technology is required for this method, so it can be deployed within your current infrastructure.
Seriously, the committee may be the most long-winded way of getting a standard, but the needs of everyone there get answered when it does get to an answer, and the ODF standards committee includes just about everyone whose very life is based on documents. In short, if your needs aren't solved by what THEY needed, then your needs are either manufactured wants rather than needs, or so very specific and uncommon that NO standard is possible, in which case roll your own. Nobody else will be able to use it, but then in the latter case they won't have any needs that are compatible with yours.
I understand that "best" is a relative concept and relative to what you're trying to accomplish. Despite that, I'm going to give you a two-sentence description of what I'm doing, and expect you to choose what you think is "best".
Feel free to make wild assumptions about what I'm doing that may be completely inaccurate, while at the same time not even realizing you're making these assumptions.
DNA
Millions of years of field testing, and it still mostly works, and DNA itself is not patented.
I'm just not sure why it's a question.. it's the format . . for .. um.. documents.
Therefore you cannot store PDFs with embedded fonts. If you can't store embedded fonts, then you can't confirm "original rendering" with any format, including PDF. Therefore FORGET the idiotic meme of "original format".
What if the original format is being read by someone blind? Text to speech? That changes the rendering format. Colour blind? Change colours? Changes the rendering format. Poor eyesight? Increase font size, but that again changes the rendering format. Reading on a TV screen? Changed. 4:3 monitor? Changed. Tablet? Changed.
DO NOT GIVE A FLYING FUCK over keeping the original formatting and layout.
You need 100.000000000000000% of the information. Fuck the formatting. It's pointless. Flow is necessary because it's really acting like punctuation and signposted. But a contents or index doesn't demand you read every page in order. And formatting shouldn't demand you view the words in exactly the same place either.
NOBODY should get to claim the format you need for archive, index, retrieval and display of their documents MUST replicate the format exactly. If they do, tell them to fuck off and get their own system, or write a document that doesn't demand precise placement of every word on the screen to be understood by the reader. Because that can only be guaranteed by displaying on the original machine with the setup that it had at the time.
if someone demands that fidelity, then they should be told to keep that machine locked down to the settings that were made and kept there available for people to sit at and watch at their expense.
Because your job is to get the information to people. Not the placement of words.
Who is going to use these stored documents? How will they be used (read-only, revise and check in, etc.)? What tools are authors generating these documents with? Answers to these questions will help determine the best storage format.
For documents intended to be downloaded and read or string searched, PDF is a good choice. There are a lot of PDF readers for different O/Ss available.
Have gnu, will travel.
I've written several books. Because ODTs have standard compression, they are usually much smaller. For a 109,683 word book, with styles and formatting:
ODT: 271,090 bytes
Docx: 300,057 bytes
Word 97: 1,379,328 bytes
PDF: 1,050,788 bytes
If bytes cost money to store, ODT rules.
Imagine yourself sticking with Word 97 because it's a reliable standard: imagine buying three times as much storage, as well as the backup for the storage.
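If anyone wants to redo the arithmetic on the figures quoted above, this is a quick sketch. The byte counts are copied from the list; the function name is made up.

```python
# File sizes in bytes, as quoted above for the same 109,683 word book.
SIZES = {
    "ODT": 271_090,
    "Docx": 300_057,
    "Word 97": 1_379_328,
    "PDF": 1_050_788,
}


def ratios_to_smallest(sizes):
    """Express every size as a multiple of the smallest one."""
    smallest = min(sizes.values())
    return {name: size / smallest for name, size in sizes.items()}


for name, ratio in ratios_to_smallest(SIZES).items():
    print(f"{name}: {ratio:.2f}x the smallest file")
```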
https://www.youtube.com/c/BrendaEM
One that others can easily use.
"If any question why we died, Tell them because our fathers lied."
because the developer got tired of shit quality...http://www.latex-project.org/
From the standpoint of longevity and compatibility the answer is simple: the more, the better.
You store it as 1's and 0's and if you're cheap make do with just the 0's.
Agree with all that. Box provides free storage, sharing, security, and free Solr indexing.
No, just no....
Store the documents in their original format.
There are many possible reasons why you shouldn't mess with the originals such as formatting, legal implications, loss of content because one format supports stuff that the other doesn't, etc.
The only way that I could see this working is if you converted everything to an open format but kept copies of the originals and linked to them. But if the plan is to dump the original documents, then it just isn't worth it....
txt txt txt or plain old ascii.
They'll be readable on all systems and the easiest to search with grep.
For longevity.
Never heard of hybrid pdf? It's a "normal" pdf with its OpenDocument source embedded. LibreOffice will open the embedded OpenDocument, which will remain completely editable (be it a text, spreadsheet, scalable drawing, presentation...), and the pdf counterpart will complete the file. Remember, a pdf file is also a container, so the pdf version you will see with any pdf viewer will also integrate, hidden, its OpenDocument source.
To create a hybrid pdf, simply use "Export as a pdf..." in the menu of LibreOffice and check the hybrid pdf box in the first options tab. Don't use "Direct export to pdf" unless you have checked the hybrid box before, as the various options retain their state.
OpenDocument is an ISO format for any Office Suite file.
Don't click. It's that goat thing... I'm starting to think these people have problems.
Slashdot should have a button to flag a post as inappropriate.
You could create an XML document for each then place the original document in a CDATA.
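A rough sketch of that wrapper idea. One deliberate swap: instead of a raw CDATA section, the payload is base64-encoded, because binary uploads can contain bytes that are not legal XML characters, or the sequence `]]>` that would end a CDATA section early. The element and attribute names are invented for the example.

```python
import base64
import xml.etree.ElementTree as ET


def wrap(filename, mime_type, data):
    """Return an XML string carrying the original bytes as base64 text."""
    root = ET.Element("document", {"filename": filename, "type": mime_type})
    payload = ET.SubElement(root, "payload", {"encoding": "base64"})
    payload.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap(xml_text):
    """Recover (filename, original bytes) from a wrapper built by wrap()."""
    root = ET.fromstring(xml_text)
    payload = root.find("payload")
    return root.get("filename"), base64.b64decode(payload.text)
```

The trade-off is size: base64 inflates the payload by about a third, so this only makes sense if having one uniform XML envelope matters more than storage.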
You can do it with LibreOffice. So you have a faithful representation (PDF/A was designed for long-term archiving) and it's fully editable.
If you can't read it in flat text, it's not long-term reliable documentation.
just use pen and paper!
I usually just go with the .TXT format but have been considering the compatible .RST format.
now we need to go OSS in diesel cars
Some of those files are just an XML wrapper around a binary format for which the documentation is not available outside of Microsoft. The wrapper meets the legal obligations but the file format in such cases is ultimately useless in the long term.
Meanwhile I can import seismic data from the early 1970s into current software without any conversion - simply because the file format is documented instead of Microsoft's later step backwards.
The desktop publishing software I used on an Atari ST back in the day is far better suited to the task than even the current MS Word. While it's gone halfway to being DTP software the real thing has a few differences in the way things are done that avoids the massive time sink you get if you try to treat MS Word like DTP software.
I would love to see the annoying to produce, even more annoying to read PDF format replaced by HTML documents with all resources embedded as data URIs. Anyone tried creating self-contained HTML for this purpose?
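For anyone curious what that would look like, here is a bare-bones sketch using only Python's standard library. The function names and page layout are made up; it embeds a single image and does nothing about fonts, stylesheets or large files.

```python
import base64
import html


def data_uri(data, mime_type):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def self_contained_page(title, body_text, image_bytes, image_type="image/png"):
    """Build one HTML file with its image inlined instead of linked."""
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><p>{html.escape(body_text)}</p>\n"
        f'<img alt="" src="{data_uri(image_bytes, image_type)}">\n'
        "</body></html>\n"
    )
```

The result is a single file with no external requests, at the cost of the usual base64 size overhead on every embedded resource.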
The answer's right there in the question...
"Nine times out of ten, starting a fire is not the best way to solve the problem." - my wife