Encrypted But Searchable Online Storage?
An anonymous reader asks "Is there a solution for online storage of encrypted data providing encrypted search and similar functions over the encrypted data? Is there an API/software/solution or even some online storage company providing this? I don't like Google understanding all my unencrypted data, but I like that Google can search them when they are unencrypted. So I would like to have both: the online storage provider does not understand my data, but he can still help me with searching in them, and doing other useful stuff. I mean: I send to the remote server encrypted data and later an encrypted query (the server cannot decipher them), and the server sends me back a chunk of my encrypted data stored there — the result of my encrypted query. Or I ask for the directory structure of my encrypted data (somehow stored in my data too — like in a tar archive), and the server sends it back, without knowing that this encrypted chunk is the directory structure. I googled for this and found some papers, however no software and no online service providing this yet." Can anyone point to an available implementation?
It's not possible to do this even in theory, unless you're relying on very weak encryption. The point of encryption is that you can't infer anything about the contents. If Google was able to infer enough to give you meaningful search results (if for example each word was encrypted by itself, and you searched for the encrypted version of the word), they would therefore necessarily be able to know enough to perform a frequency analysis attack on your data and compromise it in no time flat unless it was a very small amount of data (thus meaning search isn't really of value anyway).
You'll find a similar problem plagues any attempt at searching. Searching requires a certain knowledge or meta knowledge of the material being searched; and that knowledge necessarily dramatically weakens your encryption.
Slay a dragon... over lunch!
For the n00bs, the above post is in ROT13. Here is a link for a converter.
Search works by tokenizing a document, and creating an inverse index from tokens to documents. The tokens does not need to mean anything to the search engine. If you generate the tokens on the client, and don't transmit the dictionary that maps from word to token id, you can have "encrypted search".
The problem with doing that directly is that if you want to do proximity based search you need information on the token order, and they could do frequency analysis to come up with plain text guesses if they guess the language right. You can counteract that by mapping the same word to multiple tokens to even out the frequency of each token id, but it means you would need to search for multiple tokens to find all occurrences of a word.
If you don't are about word proximity it's much safer, as the index would only contain each token once per document at most.