Too Much Data? Then 'Good Enough' Is Good Enough
ChelleChelle writes "While classic systems could offer crisp answers due to the relatively small amount of data they contained, today's systems hold humongous amounts of data content — thus, the data quality and meaning is often fuzzy. In this article, Microsoft's Pat Helland examines the ways in which today's answers differ from what we used to expect, before moving on to state the criteria for a new theory and taxonomy of data."
640K is the best memory size after all, and more than that is too much...
Really, can't Slashdot select better articles? Like the declaration that SpaceX will be creating a moon rocket that will land on the moon in 2020.
I visited their site today and that hit me as a surprise...
Obviously 640k was "Good Enough"
Seriously though, he makes a good point about having so much information that it isn't stored consistently, follows varying standards, or comes from open fields populated by an individual's perceptions. The list of all the different names for green (green, emerald, asparagus, chartreuse, olive, pear, shamrock) shows how one piece of information can be expressed in multiple ways. While you can define the color by its hex code, that isn't exactly an elegant or user-friendly method of input or output.
He talks about various ways to handle these types of information from limiting input options to finding patterns and using those to "correct" the data.
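A minimal sketch of the "correct the data" approach for the green example, assuming a hand-built synonym table. The names CANONICAL_COLORS and normalize_color are made up for illustration and are not from the article.

```python
# Hypothetical synonym table: every free-text shade maps to one canonical value.
CANONICAL_COLORS = {
    "green": "green",
    "emerald": "green",
    "asparagus": "green",
    "chartreuse": "green",
    "olive": "green",
    "pear": "green",
    "shamrock": "green",
}

def normalize_color(raw):
    """Return the canonical color for a free-text entry, or None if unknown."""
    key = raw.strip().lower()  # ignore case and stray whitespace
    return CANONICAL_COLORS.get(key)

print(normalize_color("  Emerald "))
print(normalize_color("teal"))
```

Returning None for anything outside the table keeps the "I don't recognize this" case visible instead of silently guessing.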
I'll meet you at the intersection of "Should be" and "Reality"
GOATSE ALERT
The data quality and meaning of this summary are rather fuzzy. I have no clue what exactly they're talking about. No, I haven't RTFA yet, but the summary isn't making it very clear whether TFA is something I'd be interested in or not.
This Space Intentionally Left Blank
SQL DBs are not appropriate for storing, processing, querying, and browsing unstructured documents.
Of course he complains, he probably doesn't have ADHD. Today's data sets and the information-overload age are adjusted, I'd guess, for people with ADHD. Such people can actually make sense of this humongous data. Because of their lack of attention they have probably developed some sort of native algorithm to "innovate" the missing data, and with their attention deficit they lose a lot of information on intake => reasonably sized relevant data!
Conclusion
NoSQL systems are emerging because the world of data is changing. The size and heterogeneity of data means that the old guarantees simply cannot be met. Fortunately, we are learning how to meet the needs of business in ways outside of the old and classic database.
Which was apparent to everyone, and missed the real point: We have lots of data, and we're too impatient to wait for it to be aggregated, synchronized and processed. There goes 10 minutes of my life I'll never get back.
Here's a hint: People working on the solutions to this problem work in the financial sector and in quantum physics.
The researcher is just throwing together a bunch of problems that have existed, in some fashion, for a very long time, and concludes with open questions rather than even vague proposals for solutions. So I would say this article is both too detailed, and not detailed enough.
Uh...so you spent decades working on systems which are not needed for many problems (many problems don't need transactions, especially mostly read web publishing problems, which is a strength of no-sql), and now you are upset that people are not using your systems?
Should be 'Too Many Data'. Morans.
Tubal-Cain smokes the white owl.
The article makes an assumption that all data in the world consists of marketing surveys and transcripts of phone wiretaps.
Contrary to the popular belief, there indeed is no God.
This article is confusing because most of the verbiage is made up by the author (such as "inside" or "locked" data). It is also misleading because it seems to indicate that structured and unstructured data usage is the same. Well it's not - a very large proportion of unstructured data is blog posts and emails but the amount of search and aggregation that is performed on this type of information outside of a few major companies (such as Google) is very low, which makes this usage a niche and not a trend maker.
The reality is that there are three categories of data that are relevant for databases: numbers, text and spatial. Everything else, which falls under the umbrella of "binary", is very unlikely to benefit from a database engine; only the metadata can be manipulated, and this metadata falls under one of the other categories and is a very good target for ETL. And so far nobody has come up with a reliable way to search binary, such as video or audio, without relying on heavy indexing, metadata or some kind of transformation that takes binary and makes it text data.
If a piece of data cannot be searched or aggregated, it does not belong in a database, it belongs on a filesystem. Anything can be done with blob columns but performance is usually not very good because the database engine cache is not designed for large objects. NoSql or not.
Also, there is so much happening with storage infrastructure, such as sub-volume tiering or block-level replication, that any analysis of data that does not take a look at storage is flawed.
lucm, indeed.
We don't read articles, just skim the headline, maybe the submittal, and then a few top ranked posts.
That's Good Enough! (tm)
This is why Statistics will become more and more important over time: it allows you to make inferences about populations that you couldn't possibly count. If you already know Comp Sci or learned how to program on your own, go for a couple of Stats degrees. Along with your programming skills, Stats will do you very well as the information age unfolds.
So-called scientists saying it's OK to just take a guess only shows what scientists have become in this modern world. Once you get to that point, you may as well throw out the data and base your guess on whatever floats your boat. It wouldn't be any less valid - and no less "scientific" according to this bozo.
Boolean logic doesn't adequately describe our perception of reality, and trying to force reality into a true-false description is simplistic and doomed to fail. There's another valid state - "I don't know". And if the dataset is impure or inconsistent, then that's the only valid conclusion.
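For what it's worth, the "I don't know" state can be written down directly. A minimal sketch, assuming Python's None stands in for "unknown"; the function names and3, or3 and not3 are made up for illustration. This follows Kleene's three-valued logic, which is also the idea behind NULL comparisons in SQL.

```python
def and3(a, b):
    """Three-valued AND: a definite False wins, otherwise unknown is contagious."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    """Three-valued OR: a definite True wins, otherwise unknown is contagious."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    """Three-valued NOT: the negation of unknown is still unknown."""
    return None if a is None else (not a)

print(and3(True, None))   # unknown: depends on the missing value
print(and3(False, None))  # False no matter what the missing value is
```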
"---- Teach Peace. It's Cheaper Than War."
Nice sig. Now the Republicans who spent too much are trying to blame it on Obama.
My first Journal Entry ever, in 8 years! http://slashdot.org/journal/365947/aphelion-scifi-fantasy-horror-poetry-webzine
It's the problem of Significant Figures for verbal data sets.
Last I recall, you can only keep as many significant figures as the fuzziest of the inputs has. So 45.236 + 12.877 + "one million" means your answer can only keep the one significant figure: "a little over a million".
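A quick sketch of that arithmetic, assuming "one million" is good to one significant figure. The helper round_to_sig is a made-up name, and the strict textbook rule for addition tracks decimal places rather than a count of significant figures, so treat this as the simplified version described above.

```python
import math

def round_to_sig(value, sig_figs):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    magnitude = math.floor(math.log10(abs(value)))  # position of the leading digit
    return round(value, sig_figs - 1 - magnitude)

# Two precise inputs plus one fuzzy one (assumed: 1 significant figure).
total = 45.236 + 12.877 + 1_000_000
print(round_to_sig(total, 1))  # 1000000.0
```

The extra 58.113 disappears entirely, which is the point: the precise inputs add nothing once the fuzzy one dominates.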
So for these non-verbal data sets, you get too many data fields, and misc people forget to put the stuff in ref1, someone puts a date instead of an invoice number in ref2, the vendor code is wrong in ID1, someone puts an employee instead of the lumber yard in ID2 etc. So then when the boss wants "gimme the total set of cases I need to go manage", you get bad searches.
If the people that write these stories would familiarize themselves with Information Theory (Claude Shannon, in the 1940's) then they'd understand that you still can't make silk purses from sow's ears.
Yes, it's a lot of records. Yes, the data entry people made mistakes. All this really means is that there's more noise in the data. As the signal to noise ratio declines, the value of the results also declines. Making decisions based on noisy data isn't science, it's only guesswork. That's fine for weather forecasting (a similar problem) but expecting the results from the described data to be more accurate than weather forecasts is foolish. Remember: garbage in, garbage out.
Since when has MS ever had any OTHER opinion on ANYTHING!
The problem being encountered is one I've faced often in 30 years of weather forecasting: Ambiguity Management.
The weather business deals with reams of data from thousands of sources and all the complexity of trying to follow a single swirl within a flowing river to figure out where it will be tomorrow. Decades of research and modeling have evolved into dozens of primary rule-based tools available to forecasters which are applicable to most situations. Objectively, you should be able to follow the rules, weed out the conflicting or contradictory ones, and get a reliable answer. Realistically, you don't. Why? Two reasons:
1. The dataset is incomplete.
2. The tools are imperfect.
You simply can't have perfect knowledge of all the relevant details in the atmosphere to feed a completely objective tool (computerized model or whatever) to get your perfect prediction. Like Roseanne Roseannadanna's mother said, "It's Always Something!"
The trick in being a good (aka reliable) weather forecaster, then, is how you manage the ambiguity of incomplete data filtered through inherently biased tools. Some weather stations run hot or cold, have local effects enhancing or reducing pressure or winds, etc, etc, etc. Good models account for this, but that's a static adjustment, not a dynamic one. Models run hot or cold, fast or slow, depending on their structure and assumptions, and they reveal their strengths and weaknesses over time compared to other models and reality at verification time.
The basic forecasting questions are: where is it, where is it going, and what will happen when it gets there? Because the models are perfect (100% replication of output from identical starting states), but are always wrong (inherent model and data limitations), you make your money examining the consistency. The model(s) are running slow and cold recently due to the whatever event going on? Ok -- warm it up a few degrees and expect things a few hours earlier than it forecasts tomorrow. Some models handle well in winter but get klutzy with large thunderstorm events. One model I worked with covered the world in clouds if you waited long enough. Solution? Don't trust it past X number of hours. And so on for the family of models through the decades and to today. Some models have high skill up to a certain point, then it drops off quickly. Others show less skill, but are decent for the long haul. You get the idea. You can make a forecast using only one tool, but you can make a better one using several and sorting out their differences by using ambiguity management.
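The "warm it up a few degrees" adjustment can be sketched in a few lines. This is only an illustration under stated assumptions: the temperatures are invented, the model names are placeholders, and mean_bias and blended_forecast are made-up helpers, not anything taken from a real forecasting system.

```python
from statistics import mean

def mean_bias(past_forecasts, past_observed):
    """Average of (forecast - observed); negative means the model runs cold."""
    return mean(f - o for f, o in zip(past_forecasts, past_observed))

def blended_forecast(models, past_observed):
    """Remove each model's recent bias from today's value, then average them."""
    corrected = []
    for model in models:
        bias = mean_bias(model["past"], past_observed)
        corrected.append(model["today"] - bias)
    return mean(corrected)

# Invented numbers: what actually happened on the last three days.
observed = [10.0, 12.0, 11.0]
models = [
    {"name": "model_a", "past": [8.0, 10.0, 9.0], "today": 13.0},    # runs 2 degrees cold
    {"name": "model_b", "past": [11.0, 13.0, 12.0], "today": 17.0},  # runs 1 degree hot
]
print(blended_forecast(models, observed))  # 15.5
```

A real forecaster weighs far more than a running average (season, event type, how far out the forecast goes), so this only captures the simplest form of the idea.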
Needless to say, you needed a solid understanding of the physics and dynamics of the atmosphere to help make good decisions to do all this effectively. The modelers and users now data mining these huge collections of information likewise need a solid understanding of Statistics and the event mechanics they're examining to make any good sense of it all. At the very minimum, a large poster announcing "Coincidence is not Causation" needs to be in every office, otherwise you start getting breathless announcements about how underarm deodorant "causes" cancer because people eating hamburgers had a lower incidence rate by comparison.
Your Mileage May Vary -- a lot. That's the point.
Pacifist paratroopers yell, "Ghandi!" when they jump.
Doesn't "how much is too much" depend more on what sort of data you are talking about than on the systems used to record and analyse it? Aircraft risk analysts would surely argue that they need all the data they can get to help prevent every instance of catastrophic failure. Biologists, on the other hand, are used to working with extraordinarily fuzzy data and still drawing valid conclusions.
Too Much Data?? Give me a break. Certainly, the less you have, the more refined or distilled the information should be. But that amount of data is only a subset. The problem with most current databases is twofold: improper storage, and not understanding how to get the information back out. Current and former software systems were and are great data vacuum cleaners, sucking up every bit of information that came near. Data was stored and stored and stored. Most never saw the light of day again. Why? Many systems do not store the data correctly or in a manner that allows it to be retrieved. Storage for the sake of storage does a business little to no good. Also, the talent to retrieve stored data in a manner that is usable and understandable is not there either. Working with small databases with small, well-defined tables with few fields is easy. Working with larger amounts of data, some seemingly unrelated, is another matter altogether.