Slashdot Mirror


Encrypted But Searchable Online Storage?

An anonymous reader asks "Is there a solution for online storage of encrypted data providing encrypted search and similar functions over the encrypted data? Is there an API/software/solution or even some online storage company providing this? I don't like Google understanding all my unencrypted data, but I like that Google can search them when they are unencrypted. So I would like to have both: the online storage provider does not understand my data, but he can still help me with searching in them, and doing other useful stuff. I mean: I send to the remote server encrypted data and later an encrypted query (the server cannot decipher them), and the server sends me back a chunk of my encrypted data stored there — the result of my encrypted query. Or I ask for the directory structure of my encrypted data (somehow stored in my data too — like in a tar archive), and the server sends it back, without knowing that this encrypted chunk is the directory structure. I googled for this and found some papers, however no software and no online service providing this yet." Can anyone point to an available implementation?

9 of 266 comments (clear)

  1. It's not possible even in theory by nahdude812 · · Score: 4, Informative

    It's not possible to do this even in theory, unless you're relying on very weak encryption. The point of encryption is that you can't infer anything about the contents. If Google was able to infer enough to give you meaningful search results (if for example each word was encrypted by itself, and you searched for the encrypted version of the word), they would therefore necessarily be able to know enough to perform a frequency analysis attack on your data and compromise it in no time flat unless it was a very small amount of data (thus meaning search isn't really of value anyway).

    You'll find a similar problem plagues any attempt at searching. Searching requires a certain knowledge or meta knowledge of the material being searched; and that knowledge necessarily dramatically weakens your encryption.

    1. Re:It's not possible even in theory by TheRaven64 · · Score: 5, Informative

      Replying to myself: the scheme in the linked paper is not feasible. It performs O(n) searches, but this means that the amount of data you need to upload for the query is equal to the total amount stored. Since most consumer Internet links are asymmetric, it would be cheaper and easier to simply download the entire data search locally. The paper proposes having a server-side cache. This means that, for a typical block cypher, you would have a cache of every search term encrypted for each block. The server could then compare this to each block, but would not know what the plaintext is. This is not useful in any real-world scenario. The cache would be orders of magnitude bigger than the stored data and the search would sill be O(n), which is painfully slow. As I suggested above, uploading an encrypted index with the data makes more sense. Look at Apache Lucene or Apple's SearchKit for how to do this.

      --
      I am TheRaven on Soylent News
    2. Re:It's not possible even in theory by cakeninja · · Score: 3, Informative

      Mozy does not encrypt your file names. Someone without your private key could still view your file names if they had your Mozy login information.

    3. Re:It's not possible even in theory by Anonymous Coward · · Score: 2, Informative

      "why would you post a comment claiming that this can't even be done in theory, when the submitter included links in the summary to a paper that shows that it can?"

                Because it can't. The one paper proposes (unless I'm missing something!) giving the server the word to search for AND the keys! The security is by frequently rotating the key, and if you KNOW you only wanted to search, say, chapter 1 of a longer document, only give the key for chapter 1. Not very secure!

                If the encrypted data has ANY types of patterns that can be used to infer the contents, the encryption system is weak. The only way to do this is to generate some kind of metadata (search indexes basically) locally, BEFORE you send up the encrypted files, send the metadata up *unecrypted*, and hope the metadata doesn't have sensitive data.

    4. Re:It's not possible even in theory by BitZtream · · Score: 2, Informative

      And that would practically defeat the purpose of the encryption.

      For the index to be useful it has to provide too much information about the encrypted data. The point of encryption is to ensure that nothing can be inferred about the contents of the encrypted data. If you give them a nice big bunch of information about whats encrypted, why bother encrypting it in the first place?

      Given enough information in the index they could actually derive your encryption key as well with some simply brute forcing.

      --
      Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
  2. Re:huh? by needs2bfree · · Score: 4, Informative

    For the n00bs, the above post is in ROT13. Here is a link for a converter.

  3. There is a way, kind of: PIR by Naerbnic · · Score: 2, Informative

    There is a cryptography technique called Public Information Retrieval which allows you to do just that: Send an encrypted query to a server, let it perform some operations on your behalf, and send you an encrypted query result. The server neither knows the contents of the encrypted data, nor the content of the query, but you have your result nonetheless.

    The intuition is that there exists a sort of "black-box" operation which some cryptographic techniques can use. For example, if I have two encrypted bits a and b (where I can't tell what a and b actually are), I can still perform the operation a xor b. The result is encrypted, and I don't know the actual operands or the result, but I know that what came out is indeed the encryption of the xor of the encrypted bits. Such cryptosystems are forms of "Homomorphic Encryption".

    Using this, we can then give the server a search term thus encrypted and, using the black-box opertaion, have it do some set of operations which will reveal the result. The server will execute the exact same set of operations independent of the search term, so it knows nothing (and needs to know nothing) of the search term contents. Of course, this implies that the server has to operate on every element of the encrypted data to do its job, but that's the fundamental tradeoff. If you're willing to accept that, and the additional computational overhead, you can design such a system.

    --


    So there I was, juggling apples and small animals, when I accidentally bit into the wrong one...
  4. Re:You want to... by Thad+Zurich · · Score: 2, Informative

    ROT13 is encoding, not encryption. You transform the information, but you don't conceal any of it. Ziad El Bizri (OP cit.) apparently observes that if you encrypt the keywords individually, then you can submit encrypted keyword queries, and the server can search for them for you. This is great, but why would you want to? The object of a search server is for other people to be able to search the data (otherwise why index it on the server?) With the suggested scheme, only the data owner (or shared key holders) will be able to search the data. It would seem to be just as easy to construct a trustworthy server and then encipher the query traffic, as has already been observed.

  5. Re:Am I missing something? by vidarh · · Score: 4, Informative
    No, it's not impossible. It's not even particularly hard. You do have some limitations though:

    Search works by tokenizing a document, and creating an inverse index from tokens to documents. The tokens does not need to mean anything to the search engine. If you generate the tokens on the client, and don't transmit the dictionary that maps from word to token id, you can have "encrypted search".

    The problem with doing that directly is that if you want to do proximity based search you need information on the token order, and they could do frequency analysis to come up with plain text guesses if they guess the language right. You can counteract that by mapping the same word to multiple tokens to even out the frequency of each token id, but it means you would need to search for multiple tokens to find all occurrences of a word.

    If you don't are about word proximity it's much safer, as the index would only contain each token once per document at most.