Slashdot Mirror


Online Document Search Reveals Secrets

An anonymous reader writes "New Scientist is reporting that many documents published online may unintentionally reveal sensitive corporate or personal information, according to a US computer researcher. Simon Byers, at AT&T's research laboratory in the US, was able to unearth hidden information from many thousands of Microsoft Word documents posted online using a few freely available software tools and some basic programming techniques." Update: 08/16 19:06 GMT by H : The story is originally from Crypto-gram, not New Scientist.

12 of 271 comments

  1. WHAT?!?? by zedmelon · · Score: 5, Funny

    From the article:

    • "He says hidden information can be "incredibly useful" in improving the functionality of the software. "But if some of that data is sensitive, there have to be ways of ensuring that it isn't distributed where it shouldn't be," he says."

    I just created a Word document, blah.doc and put some text into it. I made sure I had a couple of undo points. I closed it and opened it back up, I couldn't undo SHIT. So where the hell am I being granted this mysterious "convenience?"

    I know that the guy stressed the fact that Microsoft isn't alone in this distinction, but this is just another example of why Microsoft SUCKS.

    I put the doc in a samba share and viewed it with vi. I found the path to the doc, the original name, my userid on my laptop, and the company name. All were hidden from the simple searches like this:

    s.l.a.s.h.d.o.t...o.r.g

    WTF?!?

    Oh, WAIT a minute! This is also from the article:

    • "The next edition of Office 2003 will include tools that will allow users to remove personal information from a document. It will also include new "information rights management" that will let an author specify who can read or forward a document."

    WHEW! I feel so much better. Please disregard the first six paragraphs. Thanks.

    --
    Mom says my .sig can beat up your .sig.
    1. Re:WHAT?!?? by zedmelon · · Score: 5, Insightful

      "You only have the convenience while the file is open. If you could undo after you re-opened a file, these "hidden secrets" wouldn't be hidden at all!"

      Exactly. I knew that to begin with, but I did it and then vi'd the file to confirm. If I delete text from a document, that means I don't want that text in the document. Neil Laver says "...hidden information can be "incredibly useful" in improving the functionality of the software."

      So my main point is, if I am being supposedly CONVENIENCED by this "feature," HOW is the software helping me by storing these things in my document?

      --
      Mom says my .sig can beat up your .sig.
    2. Re:WHAT?!?? by wortelslaai3434 · · Score: 5, Funny

      As a sidenote...

      I. .t.h.i.n.k. .y.o.u.r. .s.e.e.i.n.g. .u.n.i.c.o.d.e. .t.e.x.t.
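
      A rough sketch of why the plain search comes up empty, assuming the hidden strings are stored as UTF-16LE (which is what the dotted output in vi suggests). The sample bytes and function names below are made up for illustration; nothing here parses the real Word format.

```python
import re

def ascii_hits(data: bytes, needle: str) -> int:
    """Count occurrences of needle stored as plain one-byte-per-character text."""
    return data.count(needle.encode("ascii"))

def utf16_hits(data: bytes, needle: str) -> int:
    """Count occurrences of needle stored as UTF-16LE (each character followed by a NUL)."""
    return data.count(needle.encode("utf-16-le"))

def utf16_strings(data: bytes, min_chars: int = 4) -> list:
    """Return runs of printable ASCII characters that are stored as UTF-16LE."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]

# Hypothetical stand-in for a slice of a .doc file: binary junk around two
# UTF-16LE strings. A real file would be read with open(path, "rb").read().
blob = (
    b"\x00\x01junk"
    + "slashdot.org".encode("utf-16-le")
    + b"\xff\xfe"
    + "C:\\docs\\blah.doc".encode("utf-16-le")
    + b"\x00\x00"
)

print(ascii_hits(blob, "slashdot.org"))   # the simple search finds nothing
print(utf16_hits(blob, "slashdot.org"))   # the UTF-16LE search finds it
print(utf16_strings(blob))
```

      The NUL byte after every character is what vi shows as the dots, and it is also why a byte-for-byte search for the plain string misses.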

  2. It's been said hundreds if not thousands of times: by NightSpots · · Score: 5, Insightful

    It doesn't matter how good your corporate security is if you don't train your users (including managers) in basic security practices.

    Lots of people put sensitive documents in public webspace, primarily because they don't know any better. Eventually the cost-benefit analysis will be done, and corporations will pay to have their users trained. Until then, this type of thing will continue to happen.

  3. Re:It's been said hundreds if not thousands of tim by TMB · · Score: 5, Insightful

    Sure, but the point they're making is that it's not intuitively obvious to most people that there could be text in a Word document other than what appears.

    So a relatively security-conscious person who just doesn't know anything about Word file formats could easily publish something online on purpose without knowing that there is (invisible) sensitive information in it, even if they'd never put that information in a public place on purpose.

    [TMB]

  4. Job Recruiters by Anonymous Coward · · Score: 5, Interesting

    I have received two such Word documents from two separate job recruiters. The actual companies looking for the employee were hidden in the document, as well as contact information for the person at the company. Screw the middle man.

  5. Clippy did it by sbillard · · Score: 5, Funny

    It looks like you're trying to post a document on the web.
    Would you like to...
    1. Divulge corporate secrets?
    2. List your passwords?
    3. Remove KB823980 and open port 135?


    It looks like you're trying to close Clippy.
    Would you like to...
    1. Shit in your hat?
    2. Put fist through bling bling flat panel?
    3. Go home for teh weekend?

  6. Re:LaTeX by GarvMaster · · Score: 5, Funny

    Because 99.9% of the world would go back to pen and paper

  7. Re:Nothing New by gblues · · Score: 5, Informative

    That is because the people who published the PDF were idiots.

    Acrobat has a number of commenting tools. What the Washington Post staff did in that case was use the Highlight tool, set the color to black, and use it to draw over the names.

    Only problem? The highlighter is an object that is drawn on top of the text object it is attached to. The underlying text is not modified at all. In fact, if you watch closely, you can see the name for a split second before the renderer draws the highlights.

    If the Washington Post had used the TouchUp Text tool to delete the names, the information would not have been leaked.

    Nathan

  8. MS Word got Tony Blair busted in the WMD case by leoaugust · · Score: 5, Informative

    Tony Blair got busted in the WMD case because the names of the people who revised the WMD documents were still in the Word file. Now, it seems, Downing Street only puts PDF files on the web, and has removed all the MS Word documents that were already there ....

    Tools reveal secret life of documents - Documents like in Word save too much Info - Blair Episode

    By Mark Ward

    July 03, 2003

    The UK Government was just the latest in a long line of organisations that has learned to its cost just how much information can be gleaned from innocent looking files. Earlier this year it issued a document called the "dodgy dossier" about Iraq's concealment of weapons of mass destruction that was written using Microsoft Word. Every Word document remembers who made the last few revisions to it. The log reveals the names of four of the people who prepared the Iraq document for publication and the government Communications Information Centre that some of them work for. It was this log that Number 10 press chief Alastair Campbell had to explain to the House of Commons Foreign Affairs Select Committee in late June as part of its investigation into the Iraq dossier's history. Some of this information can be seen simply by right-clicking to view the properties of the downloaded document in a file listing. Utility programs can get even more information from Word revision logs.

    The life stories of the documents we create are becoming increasingly important as the scrutiny of industries and governments gathers pace. Every time you write or edit these files you leave a trail of information revealing what you did and when you did it. With the right tools it is possible to extract this data and work out the trail of authors and workers who created a document. That is why we should all use opensource and open data formats - so that we can humanly read what all we are "putting" into the document. The Word version of this document has now been removed from government websites but copies of it are still available elsewhere on the net.

    Unabridged and unedited article at

    http://news.bbc.co.uk/2/hi/technology/3037760.stm

    --
    To see a world in a grain of sand, and then to step back and see the beach where the sand lies ...
  9. Microsoft's article on reducing MS Word metadata by 200_success · · Score: 5, Informative

    It has been known for a long time that metadata are hidden within Microsoft Word documents. Microsoft even has Knowledge Base article 237361 explaining how to reduce the amount of metadata appearing in MS Word 2000 documents. Here's an excerpt:

    This step-by-step article explains various methods that you can use to minimize the amount of metadata in your Word documents.

    Whenever you create, open, or save a document in Microsoft Word 2000, the document may contain content that you may not want to share with others when you distribute the document electronically. This information is known as "metadata". Metadata is used for a variety of purposes to enhance the editing, viewing, filing, and retrieval of Office documents.

    Some metadata is easily accessible through the Microsoft Word user interface; other metadata is only accessible through extraordinary means, such as opening a document in a low-level binary file editor. Here are some examples of metadata that may be stored in your documents:

    • Your name
    • Your initials
    • Your company or organization name
    • The name of your computer
    • The name of the network server or hard disk where you saved the document
    • Other file properties and summary information
    • Non-visible portions of embedded OLE objects
    • The names of previous document authors
    • Document revisions
    • Document versions
    • Template information
    • Hidden text
    • Comments

    Metadata is created in a variety of ways in Word documents. As a result, there is no single method to remove all such content from your documents. The following sections describe areas where metadata may be saved in Word documents.

    I'll bet there are more, but they won't disclose them.

    It's a pity that more people don't just save as RTF. It's just as good for most uses, and it's a less obscure format.

  10. Why Word Does This by spectecjr · · Score: 5, Informative
    I just created a Word document, blah.doc and put some text into it. I made sure I had a couple of undo points. I closed it and opened it back up, I couldn't undo SHIT. So where the hell am I being granted this mysterious "convenience?"

    You're not.

    There are two ways of saving a Word document:

    • Fast Save
    • Full Save


    Fast Save dumps the binary from memory into the file. Full Save compacts the binary image, and reorders it. This takes time.

    Word's text stream is stored using a piece table. One of the benefits of a piece table is that if you keep the meta information about the text, you can get nearly infinite undo. The way it does this is by having an original data stream, and an appended data stream. Whenever you add data to the file, it gets added as a chunk to the end of the appended data stream. Whenever you delete, the meta table is updated to remove the text from the stream, but otherwise the text itself is left unaffected.

    As a result, text is never removed from the document. A Fast Save (which is the default) under Word dumps the Piece Table as-is (there is probably some compaction over time to remove the no-longer-used data, but it probably only occurs above a given threshold of used to unused text). A full save deconstructs the piece table's meta information, and turns it back into one contiguous stream of data.

    It's all just a function of the way the text is stored while it's being edited. Different editors have different mechanisms; some store data based on lines, and some store it using a gap buffer. But ultimately, the problem exists because Word uses a piece table, and it dumps the entire table to a file by default.

    It's actually a sensible way of handling the text data. However, whoever designed the Fast Save algorithm probably didn't consider the ramifications of the text still being stored in the document. The best workaround? Wipe the unused sections of the piece table. But then you might as well return to using a Full Save, as you'll be ditching the performance benefits anyway.
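
    For anyone who hasn't seen a piece table, here is a toy sketch of the idea described above. It is an illustration only: the class and method names are invented, and the two save functions just mimic the difference between dumping the buffers as-is and rebuilding one contiguous stream. It is not Word's actual file format or save code.

```python
from dataclasses import dataclass

@dataclass
class Piece:
    source: str   # "original" or "added"
    start: int    # offset into that buffer
    length: int

class PieceTable:
    def __init__(self, text=""):
        self.original = text   # original data stream, never modified
        self.added = ""        # appended data stream, only ever grows
        self.pieces = [Piece("original", 0, len(text))] if text else []

    def _buffer(self, piece):
        return self.original if piece.source == "original" else self.added

    def text(self):
        """The visible document: walk the pieces in order."""
        return "".join(
            self._buffer(p)[p.start:p.start + p.length] for p in self.pieces
        )

    def _split_at(self, pos):
        """Make sure a piece boundary exists at pos; return the index after it."""
        offset = 0
        for i, p in enumerate(self.pieces):
            if pos == offset:
                return i
            if pos < offset + p.length:
                cut = pos - offset
                self.pieces[i:i + 1] = [
                    Piece(p.source, p.start, cut),
                    Piece(p.source, p.start + cut, p.length - cut),
                ]
                return i + 1
            offset += p.length
        if pos == offset:
            return len(self.pieces)
        raise IndexError("position past end of document")

    def insert(self, pos, s):
        i = self._split_at(pos)
        start = len(self.added)
        self.added += s                      # new text goes on the end
        self.pieces.insert(i, Piece("added", start, len(s)))

    def delete(self, pos, length):
        i = self._split_at(pos)
        j = self._split_at(pos + length)
        del self.pieces[i:j]                 # only the table changes

    def fast_save(self):
        """Dump both buffers untouched, deleted text included."""
        return self.original + self.added

    def full_save(self):
        """Rebuild one contiguous stream holding only the visible text."""
        return self.text()

doc = PieceTable("Dear client, ")
doc.insert(13, "our margin is 40%. ")
doc.insert(32, "See attached.")
doc.delete(13, 19)                           # remove the margin sentence

print(doc.text())        # what the author sees on screen
print(doc.fast_save())   # still contains the deleted sentence
print(doc.full_save())   # only the visible text
```

    The delete only touches the table, so the sentence stays in the appended buffer until something rebuilds the stream. That is the whole leak in miniature.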

    Simon
    --
    Coming soon - pyrogyra