>
1. A computer-readable storage medium storing computer-executable instructions for performing a method comprising:re-writing a query to contain data parallel operations that include partitioning and merging, wherein the query identifies at least one input data source;partitioning the at least one input data source into a plurality of initial partitions;performing a parallel repartitioning operation on the initial partitions, thereby generating a plurality of secondary partitions; andperforming a parallel execution of the query using the secondary partitions, thereby generating a plurality of output sets.
>
or indeed, any other obvious way to do parallel processing.
I presume they are talking about HBase. We used this a while back and it was pretty flakey and slow. There are numerous alternatives out there, for example
http://hypertable.org/
Which has the added advantage of being written in C++
Google has the benefit of having a lot of employees, a lot of goodwill, and a lot of money, so when it takes the "throw shit at the wall and see what sticks" business strategy, things have a way of working out for them.
But would this work for anyone else? Maybe Apple.
One problem with grading projects by how popular they are internally is that flashy projects get chosen over boring, less obvious but quite possibly equally important ones.
An example: who would want to work on infrastructure which is never directly visible to the outside world?
There is nothing "fuzzy" about this line of reasoning at all. Take a look at the book by Motowani and Raghavan ("Randomized Algorithms"). There you will see various kinds of randomisation: for example, you can have a randomised approach which is always deterministic in the answer, but may take a variable amount of time. (Randomised versions of Quicksort do this). Or, you can have another type which may make incorrect answers, at a given error rate. Probabilistic counting is of this form.
Now, you can turn a program which makes errors into one which does not make errors by repeatedly running it. In the case of probabilistic counting, you might do the whole thing many times and report the average. So, you can trust it: you just need to know what the probability of error is.
Verification in this case means probabilistically proving that the processor will make a certain number of errors, given certain conditions. Since you can make the error rate as low as you want (it could be lower than actual quantum events, for example) this isn't as bad as you might think.
Randomised approaches can make proofs simpler than deterministic versioins.
This has been done already. Look at "Bloom Filters" which you can think of as a randomised way to store information. You might make errors (for example, you think you stored an item when you never did) but the space savings can be vast compared with traditional exact representations.
(Probabilistic counting is another randomised algorithm and has been used in cases when you want to count in log-space directly, but at some given error rate).
http://www.theonion.com/content/node/29130
I presume they are talking about HBase. We used this a while back and it was pretty flakey and slow. There are numerous alternatives out there, for example http://hypertable.org/ Which has the added advantage of being written in C++
True, but this doesn't sound like it needs that much skill to do it. Also can't this sort of thing be automated?
Google has the benefit of having a lot of employees, a lot of goodwill, and a lot of money, so when it takes the "throw shit at the wall and see what sticks" business strategy, things have a way of working out for them.
But would this work for anyone else? Maybe Apple.
One problem with grading projects by how popular they are internally is that flashy projects get chosen over boring, less obvious but quite possibly equally important ones.
An example: who would want to work on infrastructure which is never directly visible to the outside world?
There is nothing "fuzzy" about this line of reasoning at all. Take a look at the book by Motowani and Raghavan ("Randomized Algorithms"). There you will see various kinds of randomisation: for example, you can have a randomised approach which is always deterministic in the answer, but may take a variable amount of time. (Randomised versions of Quicksort do this). Or, you can have another type which may make incorrect answers, at a given error rate. Probabilistic counting is of this form.
Now, you can turn a program which makes errors into one which does not make errors by repeatedly running it. In the case of probabilistic counting, you might do the whole thing many times and report the average. So, you can trust it: you just need to know what the probability of error is.
Verification in this case means probabilistically proving that the processor will make a certain number of errors, given certain conditions. Since you can make the error rate as low as you want (it could be lower than actual quantum events, for example) this isn't as bad as you might think. Randomised approaches can make proofs simpler than deterministic versioins.
This has been done already. Look at "Bloom Filters" which you can think of as a randomised way to store information. You might make errors (for example, you think you stored an item when you never did) but the space savings can be vast compared with traditional exact representations. (Probabilistic counting is another randomised algorithm and has been used in cases when you want to count in log-space directly, but at some given error rate).