Slashdot Mirror


Encrypted But Searchable Online Storage?

An anonymous reader asks "Is there a solution for online storage of encrypted data providing encrypted search and similar functions over the encrypted data? Is there an API/software/solution or even some online storage company providing this? I don't like Google understanding all my unencrypted data, but I like that Google can search them when they are unencrypted. So I would like to have both: the online storage provider does not understand my data, but he can still help me with searching in them, and doing other useful stuff. I mean: I send to the remote server encrypted data and later an encrypted query (the server cannot decipher them), and the server sends me back a chunk of my encrypted data stored there — the result of my encrypted query. Or I ask for the directory structure of my encrypted data (somehow stored in my data too — like in a tar archive), and the server sends it back, without knowing that this encrypted chunk is the directory structure. I googled for this and found some papers, however no software and no online service providing this yet." Can anyone point to an available implementation?

5 of 266 comments (clear)

  1. It's not possible even in theory by nahdude812 · · Score: 4, Informative

    It's not possible to do this even in theory, unless you're relying on very weak encryption. The point of encryption is that you can't infer anything about the contents. If Google was able to infer enough to give you meaningful search results (if for example each word was encrypted by itself, and you searched for the encrypted version of the word), they would therefore necessarily be able to know enough to perform a frequency analysis attack on your data and compromise it in no time flat unless it was a very small amount of data (thus meaning search isn't really of value anyway).

    You'll find a similar problem plagues any attempt at searching. Searching requires a certain knowledge or meta knowledge of the material being searched; and that knowledge necessarily dramatically weakens your encryption.

    1. Re:It's not possible even in theory by TheRaven64 · · Score: 5, Informative

      Replying to myself: the scheme in the linked paper is not feasible. It performs O(n) searches, but this means that the amount of data you need to upload for the query is equal to the total amount stored. Since most consumer Internet links are asymmetric, it would be cheaper and easier to simply download the entire data search locally. The paper proposes having a server-side cache. This means that, for a typical block cypher, you would have a cache of every search term encrypted for each block. The server could then compare this to each block, but would not know what the plaintext is. This is not useful in any real-world scenario. The cache would be orders of magnitude bigger than the stored data and the search would sill be O(n), which is painfully slow. As I suggested above, uploading an encrypted index with the data makes more sense. Look at Apache Lucene or Apple's SearchKit for how to do this.

      --
      I am TheRaven on Soylent News
    2. Re:It's not possible even in theory by cakeninja · · Score: 3, Informative

      Mozy does not encrypt your file names. Someone without your private key could still view your file names if they had your Mozy login information.

  2. Re:huh? by needs2bfree · · Score: 4, Informative

    For the n00bs, the above post is in ROT13. Here is a link for a converter.

  3. Re:Am I missing something? by vidarh · · Score: 4, Informative
    No, it's not impossible. It's not even particularly hard. You do have some limitations though:

    Search works by tokenizing a document, and creating an inverse index from tokens to documents. The tokens does not need to mean anything to the search engine. If you generate the tokens on the client, and don't transmit the dictionary that maps from word to token id, you can have "encrypted search".

    The problem with doing that directly is that if you want to do proximity based search you need information on the token order, and they could do frequency analysis to come up with plain text guesses if they guess the language right. You can counteract that by mapping the same word to multiple tokens to even out the frequency of each token id, but it means you would need to search for multiple tokens to find all occurrences of a word.

    If you don't are about word proximity it's much safer, as the index would only contain each token once per document at most.