Why Is Data Mining Still A Frontier?

Shot in the dark: by Spazntwich · 2006-04-10 08:50 · Score: 5, Insightful

Either
a) There's not enough money in it to make it worthwhile

or

b) It doesn't work.

Re:Shot in the dark: by Disavian · 2006-04-10 08:58 · Score: 5, Insightful

How about

c) our ability to produce data far outstrips our ability and/or willingness to analyze it
Re:Shot in the dark: by Chrispy1000000+the+2 · 2006-04-10 08:59 · Score: 1

I'm a proponent of b, as it's ruddy hard to express equations with a limited set of characters, none of which are repeated for clarity.

--
Sig
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 09:02 · Score: 0

b) It doesn't work.
or (c) human-provided context/interpretation is important and scarce. I work for (government scientific body) and there are 500 of us in the department and we do research. We have about 10 quite complex DBs, I would gain a lot of insight if I could link them all, but too often the pattern I would find turns into (when I walk down the hall to the original collector) "yes, the data was biased that year because the gear wasn't set up correctly" etc. Data gathering is an imperfect process. Data processing can be automated. But data interpretation still needs human thought.
Otherwise, I'd correlate everything against everything, and at least 5% of my results could be published!
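To put a rough number on the 5% quip, here is a minimal sketch, assuming pure random noise and a Pearson cutoff of about 0.361 (the two-sided p < 0.05 critical value for 30 observations). The function names, sizes, and seed are illustrative choices, not anything from a real dataset.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spurious_hits(n_vars=40, n_obs=30, r_crit=0.361, seed=0):
    """Correlate every noise column against every other one.

    Returns (pairs that look 'significant', total pairs tested).
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > r_crit
    )
    return hits, len(pairs)


hits, total = spurious_hits()
print(f"{hits} of {total} noise pairs pass the cutoff")
```

With 40 noise columns there are 780 pairs, so something in the neighborhood of 5% of them should clear the cutoff by chance alone, with no real relationship anywhere in the data.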
Re:Shot in the dark: by Daniel+Dvorkin · 2006-04-10 09:03 · Score: 4, Informative

Neither of those is quite true -- a lot of entities public and private are throwing a lot of money at data mining research, reasonably expecting a big payoff, and sometimes it gets very good results indeed. The basic problem is that, as with any worthwhile CS question, doing it well is hard. It is very easy to come up with false connections between data. Sorting the wheat from the chaff in any kind of automated or even semi-automated fashion, OTOH, is an enormous challenge.

Analogies like this are always dangerous, but I'd say data mining now is about where language development was in the mid-1950's, when FORTRAN was first being developed. IOW, we have a set of tools that kind of work, most of the time, for certain applications -- but we can pretty much guarantee that they're not the best possible tools, and that we will build better ones. Consider how much work is still going on in language development half a century later, and you can see how much room there is for further development.

--
The correlation between ignorance of statistics and using "correlation is not causation" as an argument is close to 1.
Re:Shot in the dark: by delete · 2006-04-10 09:05 · Score: 2, Insightful

Or

c) The title of this submission is inaccurate, as data mining tools are both useful and financially lucrative in a wide variety of domains today, particularly bioinformatics, image analysis and text mining.

Of course, the title of this article is quite ambiguous and misleading: the article itself is concerned with RDBMS, rather than the statistical analysis of data.
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 09:14 · Score: 0

This comment is both too informative and insightful for this topic, and Slashdot. Be prepared to be ignored while long threads erupt over the superiority of MySQL vs Oracle vs etc, as well as a range of other points that completely miss the point.
Kind of makes you want to mine the Slashdot posts for relevant entries ...
Re:Shot in the dark: by flynt · 2006-04-10 09:31 · Score: 4, Insightful

Also, blindly "mining" data for trends can be very misleading. Hypothesis generation is usually better done some other way. There will always be trends in data we already have that are there by chance, and this is what data mining finds in many cases. Then models are fit to that data and don't validate on future samples taken, and everyone wonders why.
Re:Shot in the dark: by mizhi · 2006-04-10 09:42 · Score: 1

Maybe I'm missing something. The article title suggests that datamining is not a frontier of research, the summary insinuates that there are no more uses for RDBMS systems since we have google, and the actual article talks about the use of MS SQL server to discover patterns in a set of data more efficiently and seemed to insinuate that many researchers overlook these technologies to analyze their datasets.

If anything, the article is support for the use and continued development of datamining technologies.

My question is, was the article submitter smoking something? For that matter, was the /. editor who approved it also smoking something?

--
Humorless sig goes here.
Re:Shot in the dark: by rainman_bc · 2006-04-10 09:43 · Score: 2, Insightful

I'm a big fan of c. As a reporting and data analyst, I see the same crap all the time.

People design systems for what they want to put into it, without considering what they want to get back out of it. That usually results in crappy query performance and all that crap because of a lack of due care. When designing a system, engineers need to be aware of: 1) What do we want to store and how do we want to store it, 2) how do we want to put it in there, 3) What do we want to get back out of it.

Many people in designing systems pass over 3.

I've seen it in my last job. I had no input in database design, and had to deal with insanely stupid queries resulting from thoughtless and careless design.

--
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Re:Shot in the dark: by guitaristx · 2006-04-10 09:43 · Score: 1
My "Shot in the dark" goes like this:
1. Manager-type person wants to start collecting data from which data mining should occur.
2. Manager-type person finds publicly available, easy-to-process data, and assumes that all data has the same attributes.
3. Manager-type person fails to make the distinction between qualitative and quantitative data.
4. Manager-type person fails to make the distinction between real data and derived data (i.e. data that can be calculated from other data).
5. Manager-type person fails to understand why multiple authoritative sources for data is a hard problem to solve.
6. Manager-type person fails to understand the chaos and destruction that ensues from changing a data model without thoroughly scouring dependent apps to alleviate dependence on the now-obsolete data model.
7. Manager-type person fails to understand that data reliability is only as strong as its weakest source.
8. ...
9. No Profit!!!! The data is useless for any type of business activity.
--
I pity the foo that isn't metasyntactic
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 09:44 · Score: 0

Or how about people need to ask the right questions.

Data mining is more about reporting and trend following. Yet what trend do you want to follow? Just having data does not necessitate a trend.

You *MUST* have questions to ask. I can make the data dance just about any way you like. However, you must ask me what you are looking for. Just having the data of how long it takes to serve a box of french fries is not enough. You may ask 'how long does it take my customer to get their food?', or 'How long until a box of fries goes cold and I have to throw it out?'. Both of these questions are 'interesting' in and of themselves; however, they are really asking about a process. There are other questions like 'what kinds of things do people buy with the medium fries vs the large fries?'. That is a trend type question. Which lets you change your process to fit different markets you may cover if there are different ones. Or perhaps you WANT to have different markets? How do you do that?
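
The medium-vs-large-fries question is simple enough to sketch. This is only an illustration, assuming each order is a plain list of item names; the function name and the sample items are made up.

```python
from collections import Counter, defaultdict


def co_purchases(orders, anchors):
    """For each anchor item, count what else appears in the same order."""
    result = defaultdict(Counter)
    for order in orders:
        items = set(order)
        for anchor in anchors & items:
            result[anchor].update(items - {anchor})
    return result


# Hypothetical orders, purely for illustration.
orders = [
    ["medium fries", "cola"],
    ["large fries", "burger", "cola"],
    ["large fries", "burger"],
]
bought_with = co_purchases(orders, {"medium fries", "large fries"})
for anchor, counts in sorted(bought_with.items()):
    print(anchor, counts.most_common())
```

The counting is the easy part. Deciding that fries size is the thing worth splitting on is the question somebody still has to ask.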

Once you have the data and trends that is not enough. You *MUST* be willing to change your process. If you do not, all you have is some interesting trivia. You can also then test to see if your new process is working. This is where data mining shines. However you must have people who know what questions to ask. These people tend to be marketing people and middle management types. The 'grunts' out mining the data have 0 idea how to put the big picture together.

I have seen data where by shifting the start of a group of people by a half hour to start later they had a 100% improvement in performance. This was because people were either waiting on the first group to finish something or a small resource was overwhelmed. Data mining works, but you must ask the right questions.

This sort of change does not come easy for some companies. They have been sometimes doing things for a hundred years. Why change? You must have a management chain willing to change or as I said before it is just trivia. I have seen companies buy this stuff, use it to death, then change nothing. All they did was measure how good or bad they are doing. Data mining is more like a ruler. You can measure what you are doing. But if you want something to be shorter you need to get out a saw.
Re:Shot in the dark: by AuMatar · 2006-04-10 09:51 · Score: 2, Insightful

Occurrences of polio go up in summer.
People eat more ice cream in summer.

Conclusion: ice cream causes polio.

This was actually something people believed for a brief time before the Salk vaccine. It's also a great example of the kind of facts data mining most frequently dredges up: accidents or correlation with no real common cause.

--
I still have more fans than freaks. WTF is wrong with you people?
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 09:55 · Score: 1, Funny

I was thinking the very same thing about my pr0n collection.
Re:Shot in the dark: by IdleTime · 2006-04-10 10:00 · Score: 1

Not to mention that when the data is not normalized (as is the case with most customers I deal with), it's just a messy spaghetti of data that cannot be related outside its initial scope.

--
If you mod me down, I *will* introduce you to my sister!
Re:Shot in the dark: by Coryoth · 2006-04-10 10:15 · Score: 5, Informative

a lot of entities public and private are throwing a lot of money at data mining research, reasonably expecting a big payoff, and sometimes it gets very good results indeed. The basic problem is that, as with any worthwhile CS question, doing it well is hard. It is very easy to come up with false connections between data. Sorting the wheat from the chaff in any kind of automated or even semi-automated fashion, OTOH, is an enormous challenge.

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie. It's a very, very old field which people have been working on for a very long time, to the point where the problems that remain to be solved are incredibly difficult. What is new is someone other than pure mathematicians taking much interest in these problems. Do a search for "non linear manifold learning" on Google and you'll see what I mean.

Jedidiah.

--
Craft Beer Programming T-shirts
Re:Shot in the dark: by OpticalPaul · 2006-04-10 10:25 · Score: 1

It doesn't work. But lots of folks will claim that it does, or that it can, because lots of folks want to make money. And you can't make money claiming it doesn't work.
Plenty of folks have already suggested there are simply technical barriers to its success. Others have suggested legal barriers, or social barriers.
But the simple fact is that once you have enough data available, you can "mine" any result you want! Datamining is not about letting the data lead you to certain conclusions. It's all about trying to find things in the data that are "hidden" - things that really aren't there when the data are properly analyzed.
It's akin to proving that lotteries aren't random, because some numbers come up more frequently than others. Or that a coin flip isn't random because 100 flips doesn't result in exactly 50 "heads".
Datamining is, generally, bunk science. (It should not, however, be confused with proper data analysis techniques, which are extremely useful and popular even today.)
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 10:27 · Score: 0

I'd like to see ANY of those 3 requirements be looked at in systems. IMHO, it's not the engineers that are at fault, it's the management types that cause those problems.
At my last job, the design meetings we had went something like this:

Me: So what data are you planning to store?
Management: Everything.
Me: What do you mean, everything?
Management: All the data we collect. We need you to put it in a database.
Me: Ok, what kind of reports and queries will you be making on the data?
Management: I don't know. We need a database that we can run reports on.

These sorts of design meetings were the norm. It wasn't so much that we were passing over "What do you want to get out of it" as a question, but management didn't really know what they wanted.
Re:Shot in the dark: by arlow · 2006-04-10 10:32 · Score: 3, Insightful

It does work, but it requires judgement. A lot of people seem to think that you just shove the data into a statistical test, out comes a p-value, and if it's small enough you win. Interpreting and validating the initial hit is where 90% of the real work is, and it requires the careful application of prior knowledge and subsequent experiments. I work with a guy who's probably one of the best statisticians in the world, and he often asks me, "well, does the result make sense?" His judgement was developed over decades of looking at real data. If you just shove your data into an algorithm and take the top-scoring hits, you'll probably spend most of your time chasing bogus predictions. Algorithms are good for automating specific tasks that are essentially repeatable. Data mining requires an in-depth understanding of the specific problem you're trying to solve; you usually need to tailor your statistics so that they make sense for the problem. That's why the idea of selling someone a suite of fancy data mining software is probably useless; you need to sell them the statistician too.
probably :/

--
my other lambda is a Y
Re:Shot in the dark: by plover · 2006-04-10 10:33 · Score: 4, Insightful

I have to wonder if data mining isn't the problem -- the real problem seems to be that there are few obvious problems data mining will solve.
Consider WalM*rt. When the 2005 hurricanes were predicted, they mined their sales data for previous hurricanes. They found that in the last hurricane people stocked up on beer, pop tarts and peanut butter, so they sent trucks full of that stuff to the stores in the path of the hurricanes. They made lots of sales, and provided a valuable service to the communities. Capitalism at its finest.
Data mining worked very well in this case. The issue was "here's an obvious problem, and a clever solution involving data mining."
The big problem is that people expect the same golden results from non-obvious situations. "Hey, sales are down in the Wisconsin stores, let's do some data mining to figure out what they'll buy" makes no sense. Data mining worked well in the case of an obvious trigger event, but data mining by itself didn't reveal the trigger. You can't predict hurricanes based on the sales of pop tarts and beer, for example.
But, can you ever correlate pop tart and beer sales to an external event? You might be able to go back and say "here's a strange case where pop tarts and beer sold out quickly, why did this happen?" If you can tie this to external events, you'd think you'd be better prepared to react to the same events in the future.
Maybe correlating sales to Google News is the next step? Republican scandal == lower white bread sales; French riots + Senate bickering over immigration control reform == higher 'Peeps' sales; etc.
Or maybe it's always been a bad idea to equate correlation with causality.

--
John
Re:Shot in the dark: by KefabiMe · 2006-04-10 10:41 · Score: 1

c) our ability to produce data far outstrips our ability and/or willingness to analyze it

Wouldn't that be the same as b) it doesn't work?
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 10:44 · Score: 0

So your managers and mine went to the same school.
Re:Shot in the dark: by Tim+C · 2006-04-10 10:51 · Score: 1

No, his c) is more like "I could do it, but I really can't be bothered".

--
It's official. Most of you are morons.
Re:Shot in the dark: by budgenator · 2006-04-10 10:57 · Score: 1

not all data should be normalized (accounting data jumps to mind), but most data that should be normalized isn't

--
Apocalypse Cancelled, Sorry, No Ticket Refunds
Re:Shot in the dark: by drinkypoo · 2006-04-10 10:59 · Score: 1

C is the "all of the above" choice. Willingness is tied to monetary reward. Ability falls under "it doesn't work". Congratulations on your +5 score on a totally redundant comment.

--
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Re:Shot in the dark: by rainman_bc · 2006-04-10 11:22 · Score: 1

Of course when data is normalized poorly sometimes you end up with impossible joins, only solved by unions.

A specific case comes to mind. A guy I worked with wanted to design a billing system. He had six tables representing detail lines on the invoice. Each table had identical fields except for a few items. The data should not have been normalized because a report on invoicing would have required a six table union. Unacceptable IMO.

--
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Re:Shot in the dark: by Shimmer · 2006-04-10 11:38 · Score: 2, Funny

Doesn't sound very normalized to me. Those "identical fields" should have been moved into their own table.
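
Roughly what I mean, as a sketch only: the table and column names below are invented, and I'm guessing at what the six tables held. The shared fields live in one line-item table with a type column, the few type-specific fields go in side tables, and the invoicing report needs no union at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fields common to every kind of detail line (hypothetical schema).
    CREATE TABLE invoice_line (
        id           INTEGER PRIMARY KEY,
        invoice_id   INTEGER NOT NULL,
        line_type    TEXT    NOT NULL,   -- e.g. 'product', 'shipping', 'service'
        description  TEXT    NOT NULL,
        amount_cents INTEGER NOT NULL
    );
    -- The few fields that only one kind of line needs get a side table.
    CREATE TABLE shipping_detail (
        line_id INTEGER PRIMARY KEY REFERENCES invoice_line(id),
        carrier TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO invoice_line (invoice_id, line_type, description, amount_cents) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "product", "Widget", 2500),
        (1, "shipping", "Ground", 1000),
        (2, "service", "Setup", 1200),
    ],
)

# The invoicing report is a single GROUP BY over one table.
totals = dict(conn.execute(
    "SELECT invoice_id, SUM(amount_cents) FROM invoice_line GROUP BY invoice_id"
))
print(totals)
```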

--
The most rabid believers in American Exceptionalism are the exact same people whose policies are destroying it.
Re:Shot in the dark: by hobbit · 2006-04-10 11:42 · Score: 1

It's akin to proving that lotteries aren't random, because some numbers come up more frequently than others. Or that a coin flip isn't random because 100 flips doesn't result in exactly 50 "heads".
The proper study of probability provides for the debunking of such nonsense. Bad statistics is bad statistics, whether applied to coin tossing or data mining.
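
For the coin example the arithmetic is short. A quick sketch using the binomial formula for a fair coin (the function name is mine):

```python
from math import comb


def prob_heads_between(flips, low, high):
    """P(low <= number of heads <= high) for a fair coin."""
    return sum(comb(flips, k) for k in range(low, high + 1)) / 2 ** flips


exactly_50 = prob_heads_between(100, 50, 50)
within_10 = prob_heads_between(100, 40, 60)
print(f"exactly 50 heads: {exactly_50:.4f}")
print(f"40 to 60 heads:   {within_10:.4f}")
```

A fair coin lands on exactly 50 heads in 100 flips only about 8% of the time, so missing 50 is the expected outcome and says nothing against randomness.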

--
"Wise men talk because they have something to say; fools, because they have to say something" - Plato
Re:Shot in the dark: by cafeman · 2006-04-10 12:10 · Score: 1

But the simple fact is that once you have enough data available, you can "mine" any result you want! Datamining is not about letting the data lead you to certain conclusions. It's all about trying to find things in the data that are "hidden" - things that really aren't there when the data are properly analyzed.

Depends what you mean by "data mining". As the other reply has already said, bad statistics is bad statistics, regardless of the name. There's plenty of techniques in use to prevent spurious or misleading results through data mining - the use of hold-out samples, test cases, and random sampling checks are all ways of keeping the systems or results "honest" and accurate.

Sure, you can get any results you want through data mining. But, if it doesn't hold up in the real world, it's useless. So, you test it before you implement it, just as with any other rigorous field.
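
A minimal sketch of the hold-out idea, assuming nothing but random noise: pick whichever of 200 noise columns best "predicts" a noise target on the training rows, then score that same column on rows it has never seen. All names, sizes, and the seed are made up for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_feature_holdout(n_features=200, n_train=40, n_test=40, seed=1):
    """Return (|r| on training rows, |r| on hold-out rows) for the
    noise feature that looked best on the training rows."""
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]

    def train_r(col):
        return abs(pearson(col[:n_train], target[:n_train]))

    best = max(features, key=train_r)
    holdout_r = abs(pearson(best[n_train:], target[n_train:]))
    return train_r(best), holdout_r


train_r, holdout_r = best_feature_holdout()
print(f"training |r| = {train_r:.2f}, hold-out |r| = {holdout_r:.2f}")
```

The winner on the training rows was selected for being lucky, so its training correlation is inflated; the hold-out rows give an honest estimate, which for pure noise should usually sit much closer to zero.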

Datamining is, generally, bunk science. (It should not, however, be confused with proper data analysis techniques, which are extremely useful and popular even today.)

You say potaeto, I say potahto. Good data mining uses good data analysis techniques. Bad data analysis is bad data analysis, no matter which way you slice it, and it isn't limited to data mining.

--
This is your life, and it's ending one minute at a time.
Re:Shot in the dark: by TobiasS · 2006-04-10 12:11 · Score: 1

The common misnomer these days is that people call databases containing a decent amount of data a warehouse.
Mixing OLTP and warehouse environments on the same infrastructure never works out too well.

OLTP and warehousing really live at opposite ends of the optimization spectrum. That's true for hardware sizing, schema design and database configuration.

Queries get too complex.
Confused optimizers
reports run too long
while the reports run OLTP performance suffers
Re:Shot in the dark: by cafeman · 2006-04-10 12:24 · Score: 1

The basic problem is that, as with any worthwhile CS question, doing it well is hard. It is very easy to come up with false connections between data. Sorting the wheat from the chaff in any kind of automated or even semi-automated fashion, OTOH, is an enormous challenge.

I'll respectfully disagree. There's a very large number of organisations that are using predictive modelling through data mining to conduct various forms of customer scoring and analytical CRM activities. These are being used in a production sense, where they run totally hands-off and are used as inputs into the sales or customer maintenance process. This stuff has been going on in the financial services industry for at least a decade, and is very mature.

What I think you may be talking about is machine-driven data mining, where little to no human interaction occurs during the model formulation stage. This is still very much a frontier. Conceptually (and technically, in some cases), it's possible to automate in a certain set of well understood circumstances (such as within forecasting, credit scoring, and other pretty well understood fields), but there's a limited set of products out there that provide the flexibility and automation to do so. They do exist, however.

Analogies like this are always dangerous, but I'd say data mining now is about where language development was in the mid-1950's, when FORTRAN was first being developed. IOW, we have a set of tools that kind of work, most of the time, for certain applications -- but we can pretty much guarantee that they're not the best possible tools, and that we will build better ones.

Within best of breed applications (and I'm not including anything from the originally pure-play DB vendors here), data mining tools have matured to the point where the incremental returns (measured through accuracy improvements) from version releases are trailing off. They're still there, and each additional model provides greater breadth within a tournament, but in a business sense, the incremental returns are falling. The next big steps will involve improved background model validation, automation, model lifecycle management, and seamless integration with business support systems. That's why everyone is pushing back towards the "platform" architecture - it's not only how you build your models that's important, it's what you do with them once they've been built.

--
This is your life, and it's ending one minute at a time.
Re:Shot in the dark: by polv0 · 2006-04-10 12:34 · Score: 2, Insightful

I'm a statistician and data mining consultant, and I've implemented models based on millions of records generating consulting fees in the high hundreds of thousands of dollars. I thus have a strong understanding of the data, modeling and project management aspects of data mining ventures.

I believe there are several fundamental factors required to make a data-mining project successful:

1) A mathematically precise definition of what it is to be modeled (the response) as in the probability of purchasing product x rather than "profit"
2) Multiple sources of data linked together: demographic, financial, transactional, etc...
3) A set of "variables" from 2) that have a strong, intuitive relationship to the response in 1)
4) A reasonably sophisticated statistical algorithm designed to weed out the significant relationships between 1) and 3)
5) An organizational culture willing and capable of institutionalizing the model into a decision making process

In my experience, these projects end up failing because of problems with one of the above critical steps. Roughly 50% of the time 5) will hold back a project. This isn't what a consultant, statistician or DBA will tell you, because more often than not they don't stick around long enough to see it through to the bottom line. Step 1) and 4) are basically table stakes. If you can't very precisely define your objective (and thus that which you would like to predict, the response) you aren't going anywhere. Similarly, if there is a significant flaw in your analytical method (e.g. overfitting with neural networks) then you'll produce rubbish.

Steps 2) and 3) really determine the upper bound for how successful the project can be. Often, there is a fundamental driver of the value in question that is not accounted for in the analysis. For example, in modeling auto-insurance claim frequency, it is very difficult to obtain number of miles driven, and yet it is the most significant factor impacting the occurrence of claims. In the end, the quality of the model hinges on linking together high-quality data sources that are often transactional and deriving from them variables that would dramatically surprise you if they weren't predictive of the response in question.

That's my 2 cents...
Re:Shot in the dark: by TubeSteak · 2006-04-10 12:36 · Score: 1

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics.
Don't forget that if you ask the wrong questions you get either:
A. Wrong Answers
or
B. Garbage

Having computers crunch data to look for relationships is all well and good, but you're almost always going to need someone to interpret the results to make sure they aren't A or B.

--
[Fuck Beta]
o0t!
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 12:37 · Score: 0

In other words, (a). Given a market based on voluntary association, that is.
Re:Shot in the dark: by mycall · 2006-04-10 12:59 · Score: 0

How about the NSA datamining your X,Y position down to the second for all US citizens for the last 10 years?
Re:Shot in the dark: by asuffield · 2006-04-10 13:25 · Score: 3, Interesting

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics.

That's part of the problem.

Another part is computational complexity. No, I'm not kidding. These things are often in like the second and third powers of the data set size. The data sets are often terabytes in size. We don't have computers that big, and by the time we do, we'll probably have bigger data sets. Contemporary data mining is an exercise in finding a fast enough approximation that is accurate enough to look convincing. We're not really sure how accurate they actually are - most of the time, there's no way to find out for certain. "Probably good enough" is the best you normally get. Some researchers can put a number on that 'probably' for you, eventually. Mostly they just compare the available approximations and tell you which one works the best.
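
Some back-of-the-envelope numbers for the second and third powers, as a sketch. The one-billion-operations-per-second rate is purely an assumption for scale, and the function names are mine.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def pair_count(n):
    """Number of distinct record pairs: the 'second power' case."""
    return n * (n - 1) // 2


def triple_count(n):
    """Number of distinct record triples: the 'third power' case."""
    return n * (n - 1) * (n - 2) // 6


def years_needed(operations, ops_per_second=1e9):
    """Wall-clock years at an ASSUMED rate of one simple comparison per ns."""
    return operations / ops_per_second / SECONDS_PER_YEAR


for n in (10**6, 10**9):
    pairs = pair_count(n)
    print(f"n={n:.0e}: {pairs:.2e} pairs, {years_needed(pairs):.2e} years")
```

At that assumed rate, a million records means about 5e11 pairs and a few minutes of work; a billion records means about 5e17 pairs and roughly 16 years. The cubic case is hopeless long before that, which is why everything ends up being an approximation.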

The biggest problem is the inability to figure out intelligent things to do with it. Computers aren't smart. You can't just hand them a heap of data and say "find me the things I want to know". You have to work out what the patterns in the data are for yourself, then do pure math research to turn those patterns into a mathematical model. Then you have to come up with useful questions to ask that model. That's two major insights plus several years of work - and most researchers only have one major insight in their entire career. Just to figure out what question to ask. Data mining is then the process of repeatedly answering that question for all possible values of the parameters. And the answers you get out will only be as good as the model you invented. The current method for discovering usable patterns in data is trial and error.

I think that 'data mining' is more or less a frontier by definition. It's all the things we don't yet know about the data we currently have which would take a huge amount of effort to discover. Most unsolved problems in mathematics could probably be called 'data mining problems': if an answer exists, it can be derived from the existing body of theory. Most decisions that people make, from deciding whether to eat now or later, to deciding whether to invade a foreign nation, can also qualify. The sheer range of things it could cover means that there will probably always be vastly more unsolved problems than solved ones.
Re:Shot in the dark: by lrichardson · 2006-04-10 13:43 · Score: 1

One of the databases I play with (and they pay me, too!) is Essbase ... in one sense, the most highly normalized form possible. And yes, accounting data goes in it.
Think star schema, with the central table containing just numerical 'facts'. Each record's key links to every other table, and, for query optimization, we've got just one 'fact' per record. Payments, APR, Balances, they all get slapped in.
It's one of the best OLAP tools I've seen. A hell of a lot of work to do it 'right', like ten hours processing to put in a month's worth of data, said month loading into SQL in about twenty minutes. After that, a response time measured in seconds. And, coming from a pure DB2 background, the thought that A->B->C->D may not give the same D as A->C->B->D initially freaked me out ... but then one starts comparing that to the funky results on a standard database with left/right/inner/outer joins, and it doesn't look so bad. My department has 'Analysis' in its name. So we put a lot of thought into the design. Won't stop false mining, but the users are generally thrilled to pieces with it.
Re:Shot in the dark: by tacocat · 2006-04-10 13:48 · Score: 1

I think it's more a problem of access to the data for purpose of mining. In order to do any meaningful data mining you have to have a few barriers removed. Namely:

It has to be cheap to access. This is in terms of network costs, labor costs, and most importantly everyone believes that they can make a profit if they sell the access to their data. For data mining purposes, this becomes cost prohibitive. You have to Free the Data.

It has to be legal to access. As time goes on, the amount of data, or the types of data that I have access to is collapsing into a smaller circle every year as lawyers get paranoid about the data privacy. It is now at the point where vital engineering information is being removed. This allows the lawyers to sleep well, but the engineers have NFC what's going on.

It has to be understood before it can be accessed. Most people wouldn't understand the implications of the data presented to them. They would miss the subtleties and flounder in mis-assumptions. For instance, there is no such thing as unique SSN's in America. There are no unique keys for cars and doors.

The data has to be structured before it can be gathered. You can't just put everything on the planet onto a spreadsheet and think it will have any value.
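
On the point about keys that aren't really unique: it's cheap to check before anyone links tables on them. A small sketch, with a made-up record layout and obviously fake identifiers:

```python
from collections import Counter


def duplicate_keys(records, key):
    """Return {key value: count} for values that occur more than once."""
    counts = Counter(record[key] for record in records)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical records with placeholder identifiers.
people = [
    {"ssn": "000-00-0001", "name": "A. Example"},
    {"ssn": "000-00-0002", "name": "B. Example"},
    {"ssn": "000-00-0001", "name": "C. Example"},
]
print(duplicate_keys(people, "ssn"))
```

If that comes back non-empty, any join on the column will quietly multiply rows, and whatever gets mined afterwards inherits the error.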
Re:Shot in the dark: by audi100quattro · 2006-04-10 14:31 · Score: 1

Or, if that DB actually had more information like how people weren't being vaccinated enough, or data about the real cause of the rise, or the data mining technique was smart enough to cancel ice cream eating based on some other fact, or... I do agree with c, teaching analysis takes time. If that makes any sense. Back to analyzing stock market patterns...
Re:Shot in the dark: by Doctor+Faustus · 2006-04-10 14:54 · Score: 1

3) What do we want to get back out of it.

Many people in designing systems pass over 3.

Good. The desired results change too often to put them in the data model.

It's been my experience that the best database designs come from focusing on a layout that makes sense. Aside from recursive hierarchies (which are a special case because they don't fit into the relational model very well), you should only need to look much at the actual queries you expect to run when you're deciding on indexes.
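
A sketch of what I mean, using SQLite only because it ships with Python; the table, column, and index names are invented. The layout stays the same, and the expected report query only decides which index gets added.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of each row from SQLite's EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, total_cents INTEGER)"
)

# A report we expect to run often (hypothetical).
report_sql = "SELECT total_cents FROM orders WHERE customer_id = 42"

plan_before = query_plan(conn, report_sql)   # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, report_sql)    # lookup through the new index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text differs between SQLite versions, but the second plan should name the index while the first one cannot.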
Re:Shot in the dark: by glitch23 · 2006-04-10 15:36 · Score: 0

Data mining worked well in the case of an obvious trigger event, but data mining by itself didn't reveal the trigger. You can't predict hurricanes based on the sales of pop tarts and beer, for example.

And therein lies the difference between a causal and a correlational relationship. Walmart used data mining to figure out that when hurricanes hit, Pop-Tarts sell, but not vice versa. We have to get the cause and effect right before data mining can help. A pure correlation doesn't help in that case.

--
this nation, under God, shall have a new birth of freedom. -- Lincoln, Gettysburg Address
Re:Shot in the dark: by drachenstern · 2006-04-10 16:12 · Score: 1

Of course when data is normalized poorly sometimes you end up with impossible joins, only solved by unions.

A specific case comes to mind. A guy I worked with wanted to design a billing system. He had six tables representing detail lines on the invoice. Each table had identical fields except for a few items. The data should not have been normalized because a report on invoicing would have required a six table union. Unacceptable IMO.
Doesn't sound very normalized to me. Those "identical fields" should have been moved into their own table.
I agree with your sentiments. The first thing that springs to mind is "You keep using that word. I do not think it means what you think it means." (but of course)

Perhaps a review of what a normalized table is would refresh the memory of the gp? But perhaps we are mistaken, maybe other /.ers have a really good reason of why normalization would have been bad?

--
2^3 * 31 * 647
Re:Shot in the dark: by RussP · 2006-04-10 18:27 · Score: 1

I think the fundamental problem is the lack of structure of most of the information on the Internet. It's mostly just one gigantic blob of amorphous text. Google may have a great search engine, but I am tired of getting results on condoms when I want information about LaTeX typesetting. XML was supposed to help solve this problem, but I'm still waiting for it to happen.

--
I watch Brit Hume on Fox News
Re:Shot in the dark: by RollingThunder · 2006-04-10 18:54 · Score: 1

We're about to roll out the "new" data warehouse at work. It's gonna start at 60 TB. I pray to god we never have to restore the bloody thing.
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 19:01 · Score: 0

Yes but anybody who knows his stuff uses causal analysis methods instead of simple correlations.
Of course, this can only work 100% when all related variables are observable (humans are very good at picking the proper set of variables needed but even our internal world models are far from perfect).
It is also much more difficult to reliably estimate higher order statistics (conditional independence being the most important).
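To make the conditional-independence point concrete, here is a minimal sketch using partial correlation, which is only a linear stand-in for a real conditional independence test. All the numbers are invented, and "temperature" is an assumed observable standing in for the hidden common cause (summer) behind the ice cream and polio example from elsewhere in this thread.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Invented toy data: temperature drives both of the other series.
temperature = [1, 2, 3, 4, 5, 6, 7, 8]
ice_cream = [2, 1, 4, 3, 6, 5, 8, 7]
polio = [2, 3, 2, 3, 6, 7, 6, 7]

raw = pearson(ice_cream, polio)
controlled = partial_corr(ice_cream, polio, temperature)
print(f"raw correlation:        {raw:.2f}")
print(f"controlling for summer: {controlled:.2f}")
```

On this toy data the raw correlation is strong and the controlled one falls close to zero. That only works because the common cause was observed, which is exactly the caveat above.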
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 20:29 · Score: 0

Terabytes of data nowadays is nothing. I work in IT at a datacentre that has over 100 TB of data and we are just ordering another 60 TB of space to house the expected growth for the next 6 months, and yes this is ALL housed in RDBMS. So to say today's computers can't handle it is crap. The problem is purely around not knowing how to process it or what mathematical algorithms to apply to these datasets.
Re:Shot in the dark: by jdh41 · 2006-04-10 20:52 · Score: 1

But, can you ever correlate pop tart and beer sales to an external event? You might be able to go back and say "here's a strange case where pop tarts and beer sold out quickly, why did this happen?" If you can tie this to external events, you'd think you'd be better prepared to react to the same events in the future.

This is the beauty of this sort of datamining - you're basically just scoring factors likely to push up sales - it doesn't matter if there's actually a causation, because you're only interested in correlation, if it's going to hold true in this case.
Re:Shot in the dark: by MancunianMaskMan · 2006-04-10 21:15 · Score: 1

Most decisions that people make, from deciding whether to eat now or later, to deciding whether to invade a foreign nation, can also qualify.
Well that one's not hard: Eat now.
And later.
And don't invade at all.
Re:Shot in the dark: by Anonymous Coward · 2006-04-10 22:55 · Score: 0

>Consider WalM*rt.

Consider that you don't appear to have a source for that.

Where did you get that information about beer and pop tarts? Not only doesn't it make sense, (people buy toaster food when they're expecting the electricity to go out?) it doesn't sound like the kind of thing that WalMart does, or needs to do.

To put it another way, I call bullshit on your story.
Re:Shot in the dark: by VolciMaster · 2006-04-11 00:46 · Score: 1

Conclusion: ice cream causes polio.

Amazing how many people still don't understand the difference between correlation and causation. I think everyone should take an intro economics or statistics class just to realize the difference.

--
antipaucity
Re:Shot in the dark: by braun · 2006-04-11 02:28 · Score: 1

I dunno if this is a reply to the previous post, and it's not a reply to the question "why is data mining...". But as a philosophical question I find it interesting, because it's not easy to define "cause and consequence". Like, if a moving ball hits a stationary ball, and after the collision both balls move, was the first ball's motion the cause of the other's? This is answered by the logic of the system (the system that defines motion, ball, etc.). But if you don't have a fixed system of logic, then cause etc. is not defined. Hence you can approach it by data mining. If successful, you can construct a system from the trends you find, make hypotheses, etc. So, I don't have anything special to say... But it's interesting. Do we need to define e.g. cause beforehand?
Re:Shot in the dark: by plover · 2006-04-11 02:39 · Score: 1

First of all, Walmart's response to last year's hurricanes was noble. They donated over a thousand trucks full of relief supplies, and I don't want to take anything away from that.
But that's not what I was talking about. I'm in the retail industry, and keep one eye facing Walmart (everyone in retail does.) The "beer and poptarts" story was one of those stories that circulated about the same time as the hurricane, so I can't quote exactly which source I got it from first (could have been at a departmental meeting or something.) However, a bit of googling turned up Hurricanes, Pop-Tarts, And Beer which is roughly the same information I heard earlier.
Now, unfortunately the linked story has no corroborating links on it, either. At this point it may still be a total fabrication, or it may not. Snopes has nothing on it, one way or the other.
But the story has plenty of credibility. WalMart always wants to deliver merchandise that people want to buy, and we know people in the path of disaster definitely go out and stock up. I suspect they were originally looking at their data trying to figure out how many flashlight batteries, bottles of water and cans of soup they should put on the trucks when they encountered the "beer and pop-tarts" data.
And by the way, while I love pop-tarts (mmmm...cinnamon), I haven't put one in a toaster in at least 20 years.

--
John
Re:Shot in the dark: by Miraba · 2006-04-11 02:55 · Score: 1

You get cookies from me. It's especially true when scientists do field work, since the emphasis is to take as much data as possible.

Real World Example: This past summer, I went to Cyprus for a field survey (surface examination and collection, no digging involved). In three weeks of 15 people working 4 hours a day, we grabbed over 10,000 pieces of worked stone. A proper excavation will yield enough data for an academic lifetime, but only a small percentage will ever be thoroughly analyzed and published.
Re:Shot in the dark: by Anonymous Coward · 2006-04-11 03:41 · Score: 0

People would buy Pop-Tarts to stock up for hurricanes for the same reason that I carry them backpacking: lots of energy, small package, and it's less than half sugar (provides a mix of carbs for energy).
Just because you only use things how you're told doesn't mean everyone does. Some people know how to think.
Re:Shot in the dark: by plover · 2006-04-11 04:13 · Score: 1

you're only interested in correlation, if it's going to hold true in this case.
There's the problem. "If". Let's say the data mining came up with a correlation between French riots and immigration legislation with the sale of Peeps. Next Easter you're going to disappoint a lot of shoppers when you don't have Peeps available; and next fall when rioting and legislation happen to hit the news at the same time, you're going to have a lot of wasted Peeps on your store shelves.
You may say "Of course riots have nothing to do with sales of Peeps -- they're sold at Easter time." But that's only obvious to you, a human with a cultural frame of reference (knowledge of both Peeps and the American traditions surrounding Easter.) But it's not so obvious to the computer that has to cross reference increased sales of UPC "74189577234" with the 40 days prior to the first Sunday following the Paschal full moon. That's the only correlation that really counts here. The wrong correlation would lead you down the wrong path, and you might not ever find out until it's too late.
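For what it's worth, the "first Sunday following the Paschal full moon" part is the easy bit once a human has told the machine to look for it. A rough sketch, assuming Western (Gregorian) Easter and using the well-known anonymous Gregorian computus; the function names and the 40-day default are just illustrative choices:

```python
from datetime import date

def easter_sunday(year):
    """Western Easter via the anonymous Gregorian (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def in_pre_easter_window(day, window_days=40):
    """True if `day` falls within `window_days` days before (or on) Easter Sunday."""
    days_to_go = (easter_sunday(day.year) - day).days
    return 0 <= days_to_go <= window_days

print(easter_sunday(2006))
print(in_pre_easter_window(date(2006, 4, 10)))
```

A flag like this could be fed to the mining step as one more column next to the sales figures. The hard part is still knowing that Easter is the feature worth computing.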

--
John
Re:Shot in the dark: by asuffield · 2006-04-11 06:46 · Score: 1

So to say today's computers can't handle it is crap. The problem is purely around not knowing how to process it

That's what I said. Nobody knows a way to process it that today's computers can handle. We *do* know several ways to process it that those computers *can't* handle.

As to your claim that "we're storing hundreds of terabytes of data, obviously we can handle it" - you're just storing data. The problem is computational complexity, not storage. The well-known 'right' answers to most data mining problems are high polynomial time or worse; they would take centuries to run on a data set of that size. So data mining is often an exercise in finding faster approximations.
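Some back-of-the-envelope arithmetic for the above, as a sketch. The record count (a billion) and the machine speed (a billion basic operations per second) are round numbers assumed for illustration, not measurements of anyone's datacentre.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_run(n, cost, ops_per_second=1e9):
    """Years needed to perform cost(n) basic operations at the assumed speed."""
    return cost(n) / ops_per_second / SECONDS_PER_YEAR

n = 10 ** 9  # assumed number of records
for label, cost in [
    ("n log2 n", lambda n: n * math.log2(n)),
    ("n^2", lambda n: n ** 2),
    ("n^3", lambda n: n ** 3),
]:
    print(f"{label:>9}: {years_to_run(n, cost):.3g} years")
```

Under those assumptions the n log n pass finishes in well under a minute, the quadratic one takes about three decades, and the cubic one is on the order of 10^10 years. Storage size never enters into it.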
Re:Shot in the dark: by Disavian · 2006-04-11 06:52 · Score: 1

Hooray, cookies! :D
Re:Shot in the dark: by Disavian · 2006-04-11 06:59 · Score: 1

That sounds about right. ^_^
Re:Shot in the dark: by functor0 · 2006-04-11 15:21 · Score: 1

You should check out this sometime:
http://www.arunasoftware.com/
Re:Shot in the dark: by mcmonkey · 2006-04-12 05:58 · Score: 1

Conclusion: ice cream causes polio.
Amazing how many people still don't understand the difference between correlation and causation. I think everyone should take an intro economics or statistics class just to realize the difference.

Who has time for an economics class? Summer is just around the corner. We've got to do something about all that ice cream out there!!
Please, think of the children.
(Maybe we should start adding vaccine to the sprinkles. (Am I the only one who calls them jimmies?))
Re:Shot in the dark: by jbolden · 2006-04-12 07:36 · Score: 1

If you are using a real database look up "materialized views". If you aren't then add this to the list of reasons you should be.

Don't worry... by NoMoreNicksLeft · 2006-04-10 08:51 · Score: 1, Troll

I'm on it.

Searching still has a ways to go by Anonymous Coward · 2006-04-10 08:53 · Score: 0

--
Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?
--

I hope not. I spent a day searching Slashdot, Digg and Google for something I saw a couple of weeks ago. I couldn't remember if it was here or there and I knew the story linked out, but neither the "search" capabilities nor Google was able to find what I was looking for.

Really, how hard is it to find the story about the whiny bitch who couldn't install ubuntu and wouldn't listen to any of the suggestions given to him?

RDBMS's by Anonymous Coward · 2006-04-10 08:54 · Score: 0

So, just to clear things up, people do know that relational DB's aren't about "relationships" between data, but are in fact about storing data in mathematical relations. Seriously, go look it up.
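For anyone who would rather see it than look it up, here is a toy sketch of the idea: a relation is a header plus a set of tuples, and the usual operators are just set manipulations. The helper names and the sample data are made up for illustration; this is not how any real RDBMS is implemented.

```python
def relation(header, rows):
    """A relation: attribute names plus a *set* of tuples (duplicates vanish)."""
    return tuple(header), {tuple(r) for r in rows}

def select(rel, predicate):
    """Keep only the tuples for which predicate(row_as_dict) is true."""
    header, body = rel
    return header, {t for t in body if predicate(dict(zip(header, t)))}

def project(rel, attrs):
    """Keep only the named attributes; the result is again a set."""
    header, body = rel
    idx = [header.index(a) for a in attrs]
    return tuple(attrs), {tuple(t[i] for i in idx) for t in body}

def natural_join(r, s):
    """Combine tuples that agree on every attribute the two headers share."""
    (hr, br), (hs, bs) = r, s
    common = [a for a in hr if a in hs]
    extra = [a for a in hs if a not in hr]
    body = set()
    for t in br:
        dt = dict(zip(hr, t))
        for u in bs:
            du = dict(zip(hs, u))
            if all(dt[a] == du[a] for a in common):
                body.add(t + tuple(du[a] for a in extra))
    return hr + tuple(extra), body

# Invented sample data; the repeated row collapses because the body is a set.
employees = relation(("name", "dept"),
                     [("alice", "botany"), ("bob", "geology"),
                      ("carol", "geology"), ("bob", "geology")])
depts = relation(("dept", "floor"), [("botany", 1), ("geology", 2)])

print(project(employees, ["dept"]))
print(natural_join(select(employees, lambda row: row["dept"] == "geology"), depts))
```

The "relationships" people usually mean (foreign keys, joins) fall out of operations like natural_join; the relation itself is just the set.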

Re:RDBMS's by Disavian · 2006-04-10 09:02 · Score: 1

I looked it up: http://en.wikipedia.org/wiki/Relational_database However... anyone who can actually _understand_ it is a smarter man than I.
Re:RDBMS's by kfg · 2006-04-10 09:21 · Score: 1

So, just to clear things up, people do know that relational DB's aren't about "relationships" between data

No, they don't.

KFG
Re:RDBMS's by benjamin264 · 2006-04-10 12:23 · Score: 1

It is a fancy way of saying you are only storing things that relate to what you are actually trying to store. You are not going to store a customer's shirt size for a candy store.

To quote C.J. Date, the basis of good database design is "nothing more than common sense, formalized."

In regards to data mining, there are tremendous complexities in trying to line up all the information about one thing. What typically happens is that databases arise from a smaller need. A company will start logging shipments and it becomes a little database. Another department logs the shipments that they make in another database. And accounting tracks the costs of all of these shipments in yet another database. In a perfect world, all three of these databases would be able to 'communicate' with each other. I am sure that you can see the benefits of this.

The problem is that we are talking about related things in different dialects. You would think that something like two plant databases would be easy to sync up; use the scientific names. This assumes that both databases store the scientific names. And that is the main theoretical problem with data mining; identifying like elements. Then you have to worry about the technical problems.
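A small sketch of the "identifying like elements" step for the plant example, assuming both databases really do store a scientific name somewhere. The normalization rule (first two words, ignoring case, spacing and any trailing authority) and the records are invented for illustration and would not survive contact with hybrids, synonyms or subspecies.

```python
def normalize_binomial(name):
    """Reduce a scientific name to 'Genus species', dropping case and authority."""
    parts = name.split()
    if len(parts) < 2:
        raise ValueError(f"not a binomial name: {name!r}")
    return f"{parts[0].capitalize()} {parts[1].lower()}"

def link(db_a, db_b):
    """Pair up records from two databases whose normalized names match."""
    index_b = {normalize_binomial(k): v for k, v in db_b.items()}
    linked = {}
    for raw_name, record in db_a.items():
        key = normalize_binomial(raw_name)
        if key in index_b:
            linked[key] = (record, index_b[key])
    return linked

# Two hypothetical databases that spell the same species differently.
herbarium = {"Quercus robur L.": {"sheet": 101}, "Bellis perennis": {"sheet": 102}}
field_survey = {"quercus  robur": {"site": "A"}, "Taraxacum officinale": {"site": "B"}}

print(link(herbarium, field_survey))
```

Everything that fails to match is where the human has to walk down the hall and ask.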

With people however, you have many more things than a name to line up. You can compare things based on addresses, phone numbers, ID numbers, etc. That is why there is potential with data mining... As long as everyone wants to give up their privacy.

In regards to the article, there are so many large organizations that do not even have a Database Administrator on staff... let alone database analysts. So in many instances, we have a lot of data sitting around (most likely in poorly designed 'databases') doing much less than it could.

I think a better question is, 'Where was that guy that does the "nothing to see here..." posts?'

google... by joe+155 · 2006-04-10 08:56 · Score: 0, Flamebait

aren't they like the brain creatures... I hear when they finally finish all the archiving and indexing in the universe they will blow up everything that is and ever would be in order to not create any new information... just to save the hassle of indexing that too

--
*''I can't believe it's not a hyperlink.''

Re:google... by Anonymous Coward · 2006-04-10 09:13 · Score: 0

Yes, and the only way to save the Universe is for you to join forces with the Nibblonians and go on a mission to trap the brain spawn in an alternate reality!
Re:google... by Anonymous Coward · 2006-04-10 09:48 · Score: 0

Fry to the rescue!

What does the article have to do with the subject? by xxxJonBoyxxx · 2006-04-10 08:56 · Score: 2

"...correlating Henslow's plant collections with the time of collection, the people involved, Darwin's published work and so on using a card index, was woefully inefficient. He designed a database to hold all the information available from Henslow's collections..."

This still looks like a basic, specialized database to me. Where's the great leap to "all your data are belong to us?"

Companies are doing it, but... by deanj · 2006-04-10 08:56 · Score: 3, Insightful

There are companies and research projects that are doing this sort of thing. The trouble is, there are a LOT of people that are freaking out about it, and that's making companies less willing to 1) admit they're doing it, and 2) even think about starting to do it.

Considering how up in arms people are about it, how long before we have people accusing others of "data profiling"?

Re:Companies are doing it, but... by castoridae · 2006-04-10 09:20 · Score: 2, Interesting

Well there are a lot more areas where data mining is useful than just mining for consumer habits. People are freaking out about mining of personal information - ChoicePoint, Locate Plus, LexisNexis, to name a few examples. The article, though, is discussing the lack of data mining in science and actually claims that data mining is commonplace in business.

A snippet from the article:

the tools taken as routine in business are being overlooked in academia

I can't see anybody getting upset about scientific data mining.
Re:Companies are doing it, but... by inKubus · 2006-04-20 11:26 · Score: 1

Legally change your name to John Smith and switch social security numbers once a month. That will teach them.

--
Cool! Amazing Toys.

misunderstood by Anonymous Coward · 2006-04-10 08:56 · Score: 0

it astounds me how little people know about data mining... ffs there is so much more to it than a relational DB....

I tell you why (from a bioinformatics viewpoint) by Neil+Blender · 2006-04-10 08:57 · Score: 5, Insightful

Programmers have no idea of context. Biologists have no idea about programming. It is very hard to mix the two. You can be the shit-hottest dba in the world but if you have no relevant (deep) biology background you are guaranteed to produce crap. Almost every piece of biological software is a POS because of this.

Did you know? by Anonymous Coward · 2006-04-10 08:57 · Score: 0

Francis Bacon was the first to propose that each fact was related to all other information by 6 degrees or less. And he is one of the most famous intellectuals that shares a name with strips of cured pork.

Re:Did you know? by kfg · 2006-04-10 09:56 · Score: 1

Francis Bacon was the first to propose that each fact was related to all other information by 6 degrees or less.

Sure, but it took Kevin to make it popular.

And he is one of the most famous intellectuals that shares a name with strips of cured pork.

Sure, but Roger did it first, and it took Xerox PORK to make practical.

KFG
Re:Did you know? by cndrr · 2006-04-10 10:00 · Score: 1

That suddenly gives new meaning to the six degrees of Kevin Bacon game.

--
cndrr
Re:Did you know? by Anonymous Coward · 2006-04-11 03:58 · Score: 0

Data mining is a field still very very much in active development. I've just started to breach its depths in the last year and all I have to say is that it is HARD. It requires you to be a master statistician, domain expert of whatever it is you are mining, and a darn good dba as well. Let's not even get into the fun of trying to work with 4GB+ datasets.

Now obviously the number of times those three things come together in a single person is few and far between. I'm just the dba, I don't know what questions to ask. The researchers (I work for academia) are the ones with the questions, but they struggle getting the data out. Then we have to bring in a statistician to fit models to what I hope is the relevant data to what I think was the researchers' question and attempt to actually prove some sort of correlation.

Now even if you do all of this you usually just get a predictive model of some sort. What we have started working on now is using the same data to actually show causation (but to get the specifics on that you need to talk to the statisticians).

It's funny, but a huge percentage of the models out there for just about anything are still working on simple linear regression. I can almost guarantee you that Walmart did a simple linear regression of their product sales for the week(s) before a hurricane and just looked for the strongest positively correlated items.
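Roughly what that "look for the strongest positively correlated items" step could look like, as a sketch. The sales figures are invented, the 0/1 hurricane-warning indicator is an assumed input, and nothing here is a claim about what Walmart actually ran.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; for simple linear regression it has the slope's sign."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# 1 = a hurricane warning was in effect that week (made-up indicator).
hurricane_week = [0, 0, 1, 0, 0, 1, 0, 1]

# Made-up weekly unit sales per item.
weekly_sales = {
    "pop-tarts": [20, 22, 70, 21, 19, 65, 23, 72],
    "beer": [100, 105, 180, 98, 102, 170, 101, 175],
    "sunscreen": [40, 42, 10, 41, 39, 12, 43, 9],
    "bread": [60, 58, 61, 59, 60, 62, 58, 60],
}

scores = {item: pearson(hurricane_week, sales) for item, sales in weekly_sales.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for item in ranked:
    print(f"{item:>10}: r = {scores[item]:+.2f}")
```

Note that this only ranks associations. It says nothing about which way the causation runs, which is the statisticians' department.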

Data Mining != RDBMS by EraserMouseMan · 2006-04-10 08:58 · Score: 1

Having the data in an RDBMS is only the first step to being able to mine data for knowledge. Data mining is a whole different discipline that requires statistical analysis of the aggregated data to find trends, etc.

Aristotle by Bacon+Bits · 2006-04-10 08:58 · Score: 1

Huh? Francis Bacon? Didn't Aristotle claim he created logic in his Prior Analytics? With his four types of statements (A is true about all X; A is false about all X; A is true about this X; A is false about this X) and the basic logical syllogism? The whole point of logic is to preserve truth so you can synthesize new knowledge.

--
The road to tyranny has always been paved with claims of necessity.

Re:Aristotle by TRACK-YOUR-POSITION · 2006-04-10 09:15 · Score: 2, Funny

Bah! Aristotle couldn't tell a horse's head from an animal's head!
Re:Aristotle by aminorex · 2006-04-10 10:10 · Score: 1

That's true of one form of logic, but not true of others. "Logic" has come to be a very large and fuzzy thing in the last 100 years or so.

--
-I like my women like I like my tea: green-
Re:Aristotle by Bacon+Bits · 2006-04-10 10:34 · Score: 1

True enough, but Mr. Bacon's statement was 400 years old. 19th century logic, while much more advanced than simple syllogisms, is entirely out.

--
The road to tyranny has always been paved with claims of necessity.

Of the two by black_shadow201 · 2006-04-10 08:58 · Score: 1

Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?

I'm going with the latter...

Re:Of the two by Peter+Mork · 2006-04-10 09:07 · Score: 1

I'm going with the former. Google can do really well when there's textual information. Google hasn't addressed the ability to retrieve data based on numeric constraints. Relational databases, however, do quite well with numbers. For example, how would you query Google for medical databases containing patients with ICD9 code 12345 between the ages of 16 and 18?
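For comparison, the same question against a relational store is a one-liner. A minimal sketch using Python's built-in sqlite3 module; the table name, column names and rows are hypothetical, and "12345" is just the placeholder code from the question above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, icd9_code TEXT)"
)
conn.executemany(
    "INSERT INTO patients (id, age, icd9_code) VALUES (?, ?, ?)",
    [(1, 17, "12345"), (2, 25, "12345"), (3, 16, "99999"), (4, 18, "12345")],
)

# Exact match on the code plus a numeric range on age.
matches = conn.execute(
    "SELECT id, age FROM patients "
    "WHERE icd9_code = ? AND age BETWEEN ? AND ? ORDER BY id",
    ("12345", 16, 18),
).fetchall()
print(matches)
conn.close()
```

BETWEEN is inclusive at both ends, so 16 and 18 both qualify.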
Re:Of the two by GigsVT · 2006-04-10 09:25 · Score: 1

Not just numbers, Google chokes if you want to search for anything outside a-z case insensitive really.

There have been particular error messages and the like where it's a common phrase, but the spacing and punctuation are unique. Google "helpfully" tokenizes it and suppresses all the case and punctuation, even when I quote the string.

Search for "HELLO?" in quotes like that. You'll see how the quotes really don't mean much to google. I'm sure the string "HELLO?" (exactly like that) is on the web somewhere, but you can't use google to find it easily.
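A sketch of why that happens with any index built on normalized tokens. The tokenizer below is a generic assumption about how a text indexer might normalize input, not a description of Google's actual pipeline.

```python
import re

def naive_tokens(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

for query in ['"HELLO?"', "hello", "Hello!"]:
    print(f"{query!r:>12} -> {naive_tokens(query)}")
```

All three inputs come out as the same single token, so once the index is built that way the case and punctuation are simply gone and no amount of quoting brings them back.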

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.

Semantic Web goodness by CRCulver · 2006-04-10 08:59 · Score: 1

Datamining would be a piece of cake if all data were kept in clear, standard XML dialects. See Visualising the Semantic Web, ed. Geroimenko and Chen (Springer Verlag, 2004). Some of the possibilities of combing through information and elucidating it, combining it and converting it described in that book are simply awesome. Too bad that the Semantic Web is a pipe dream at the moment.

Re:Semantic Web goodness by poot_rootbeer · 2006-04-10 09:12 · Score: 1

Too bad that the Semantic Web is a pipe dream at the moment.

Too bad that the Semantic Web will always be a pipe dream, at least until the day comes when it's possible for a computer to understand the semantic content of a document with zero hinting from the author. The potential is there, but the willingness of humans to spend time explaining semantic structures to machines, when they're obvious enough to other humans, is lacking.
Re:Semantic Web goodness by TrappedByMyself · 2006-04-10 09:35 · Score: 2, Insightful

Datamining would be a piece of cake if all data were kept in clear, standard XML dialects. See Visualising the Semantic Web, ed. Geroimenko and Chen (Springer Verlag, 2004). Some of the possibilities of combing through information and elucidating it, combining it and converting it described in that book are simply awesome. Too bad that the Semantic Web is a pipe dream at the moment.

Well, XML is not really important. The problem lies in going from the infinite real world to a well defined ontology or whatever. I can make the greatest data model ever, and the first time someone tries to put a large data set into it, it just won't fit. You hit a bazillion, "I have this as two fields, you have this as one" issues. You can jump a meta-level up to store all the data, but then you just lost a handle on context. The Semantic Web people have tackled the issue, but have yet to solve world hunger. Tossing a bunch of web and AI/ontology experts into a room produces great things, but they haven't gotten there yet. And the stuff they've produced is still academic level. The average high school kid isn't going to be hacking OWL into his web pages.

As with most things, we'll get closer and closer, and better and better things will happen. We'll never find the holy grail, but some pretty cool and useful technologies will eventually emerge. It just takes time.

--

Help me take back Slashdot. When did 'News for Nerds' become 'FUD and Conspiracy Theories for Extremist Nutjobs'?
Re:Semantic Web goodness by Narphorium · 2006-04-10 17:01 · Score: 1

And the stuff they've produced is still academic level. The average high school kid isn't going to be hacking OWL into his web pages.
The average high school kid has an RSS feed on their blog.
The average high school kid listens to MP3s tagged with ID3 metadata.
The average high school kid annotates their photos on Flickr with semantic metadata.
The average web user may not know what the Semantic Web is but that doesn't mean they're not using it.

Privacy by gurps_npc · 2006-04-10 08:59 · Score: 1

Privacy concerns stopped a lot of data mining.

Another thing is that it is only useful for information we don't already know.

We don't exactly need data mining to realize that people that buy diapers also buy baby food.

--
excitingthingstodo.blogspot.com

Re:Privacy by DavidJSimpson · 2006-04-10 09:08 · Score: 1

> We don't exactly need data mining to realize that people that buy diapers also buy baby food.

But did you realize that people who buy diapers often also buy beer? The Business Intelligence Market (PDF)
Re:Privacy by scdeimos · 2006-04-10 09:18 · Score: 1

We don't exactly need data mining to realize that people that buy diapers also buy baby food.
Old people buying diapers tend to go with the generic brand sardines, actually.
Re:Privacy by LiquidCoooled · 2006-04-10 09:19 · Score: 1

No, but it would be good to find out when they come in and what else they buy along with it.
Like supermarkets now have fridges near the doors with sandwiches and drinks for the lunchtime folks, having a display which changes a couple of times a day (the whole display case moves, not item by item) could really improve throughput without blocking the store for the other regular customers.

Or picking out the flows people take around the store (inverse the order the items are placed on the conveyor and you roughly get the order they were placed in the trolley, but requires a large sample set).

Data mining your own data brings no privacy concerns, it's when mega corps begin to bring together multiple company datasets that I start to see these kinds of issues.

--
liqbase :: faster than paper
Re:Privacy by code+addict · 2006-04-10 09:20 · Score: 1

Exactly, if I had mod points I'd mod you up. There have been tons of data mining projects killed by Privacy restrictions, especially where government databases are concerned.

For example there was a big outrage a few years back when the Canadian government tried to link together databases from several different departments.

Perhaps datamining (as it relates to personal information) is one of those things that companies don't want to admit they're doing for fear of persecution?
Re:Privacy by Anonymous Coward · 2006-04-10 09:30 · Score: 0

We don't exactly need data mining to realize that people that buy diapers also buy baby food.

Depends.
Re:Privacy by wickedsteve · 2006-04-10 10:04 · Score: 1

Why would privacy come into this if it is not about anyone's personal information?
Re:Privacy by plover · 2006-04-10 10:50 · Score: 1

Many people refuse to believe it's not personal. And in most cases it is personal. It's long been known that repeat customers are the most profitable, by a wide margin. With nothing else to go on, go back to your previous customers. It doesn't take long for them to feel "picked on".
The other side is that some places use loyalty cards which actually advertise and use the loss of privacy as a selling point: "This is a personal promotion just for you, PHILIP J. FRY!"
Some people are comfortable giving it up, while others never want to. And while it seems like it's an absolute -- either you're mining for private information, or you're not -- the bigger problem is that while you might not be using personal information today, later analysis might turn personal. What if data mining goes back in time and a coupon prints up "SPECIAL -- CONDOMS FOR OUR BEST REPEAT CUSTOMERS AT TWO-FOR-ONE PRICES!" while your new wife is using your credit card at the checkout lane? Not a big selling point.

--
John
Re:Privacy by c.gerritsen · 2006-04-10 12:12 · Score: 1

We don't exactly need data mining to realize that people that buy diapers also buy baby food.
Ah, but I know of a data mining study of super market receipts where they found that people who buy diapers also buy beer.
Now, a lot of people are going to look at that sentence and not believe it.
Together, we have given examples that cover what I think is one of the greater problems with adoption of data mining: the results often fit in two categories, obvious or hard to believe. So when PHB spends a bunch of money on data mining and gets some results, half of which his boss laughs at because he already knew and half of which he isn't willing to act on because he doesn't believe there is a connection, does PHB even still have a job?
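For the curious, this is roughly the arithmetic behind a "people who buy X also buy Y" finding: support, confidence and lift over a pile of receipts. The receipts below are invented for illustration and say nothing about whether the diapers-and-beer result is real.

```python
def rule_stats(receipts, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(receipts)
    has_a = sum(1 for r in receipts if antecedent in r)
    has_c = sum(1 for r in receipts if consequent in r)
    has_both = sum(1 for r in receipts if antecedent in r and consequent in r)
    support = has_both / n            # how often the pair shows up at all
    confidence = has_both / has_a     # P(consequent | antecedent)
    lift = confidence / (has_c / n)   # > 1 means more often than chance
    return support, confidence, lift

# Eight made-up receipts.
receipts = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "baby food"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "baby food"},
    {"milk", "beer"},
    {"bread", "chips"},
]

support, confidence, lift = rule_stats(receipts, "diapers", "beer")
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift only a little above 1 on a handful of receipts is exactly the kind of result the boss is right to be skeptical about. The obvious rules (diapers and baby food) tend to have the high lift.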
Re:Privacy by Anonymous Coward · 2006-04-10 14:17 · Score: 0

It is interesting how people equate data mining with privacy invasion. This supermarket transactional data has already been stored in a database, regardless of whether or not it is going to be used to predict who will buy condoms. This intelligent coupon is alarming because you realize how much data pertaining to you is available. It is just a manifestation of the data already stored about you, and probably the only place where you, the end user, can realize this.

If your wife wanted to pay a private investigator $100 to find out if you have ever bought condoms in the past, the private investigator would just do a "SELECT * WHERE CONDOM_PURCHASE > 0". This is not data mining, but equally bad from a privacy invasion standpoint. I think the real issue here is the data about you actually captured, regardless of the analysis performed.
Re:Privacy by debiansid · 2006-04-10 16:19 · Score: 1

Privacy concerns stopped a lot of data mining.

True, that's because most product/service providers see this as an opportunity to spy on their customers to find out every intricate detail about them so that they can "serve them better".

Surely there must be many other applications of data mining which would change the way of life for many people and do not require them to divulge their SSN at the same time. Off the top of my head, collecting seasonal data to be able to discover some patterns in natural calamities. It's probably being done, but not very visible.
Re:Privacy by Anonymous Coward · 2006-04-10 23:31 · Score: 0

That whole "beer and diapers" thing is a complete urban legend.

It gets quoted all the time but there's never a source.

I tried to track it down myself some years ago and got nowhere. Someone used it as an anecdote in a lecture in the mid 90s but I emailed the guy and he admitted it was just "something he'd heard" and there was no cite for it.
Re:Privacy by gurps_npc · 2006-04-11 02:36 · Score: 1

A lot of it is personal information. Here is a simple one.
You sign up for a grocery datamining card. You give them your name, phone, address, and they give you a card to scan when you buy groceries. Now you use it to buy things. Among other things you buy:
a six pack of beer. Every day.
tampons, even though you are a man.
stop buying tampons, but pick up some penicillin at the pharmacy in the back.
These things are very, very personal. And they have your name, number, address.

--
excitingthingstodo.blogspot.com
Re:Privacy by midnighttoadstool · 2006-04-20 06:09 · Score: 1

"Another thing is that it is only useful for information we don't already know"
I'm working with a guy who is rich because he has taken "what we already know" and shown it to be either not true or, perhaps more commonly, a half truth (which, BTW, is the Christian definition of a heresy, something that maths and science bods don't seem to think applies to them). He does this as a matter of routine, just in the way that he thinks. He never fully accepts the convention. I am one of many he uses to investigate assumed/obvious truths. My job mostly involves pivot tables.
An example might be:
It's obvious that smaller class sizes are better.
Whoever would say otherwise? I reckon few if any have questioned this "truth". Now stop assuming the truth of that. What do you get? Your solution doesn't have to be generally true as long as it shows that this truth isn't always true; the gain can still be big. Try finding a way of justifying that bigger class sizes are better. I have a solution to this which I'll post later, but maybe you can find one yourself.
Lots of radical and wonderful things result from such thinking.

what a silly question... by buddyglass · 2006-04-10 09:00 · Score: 1

The question seems to ask whether, if we just put an amorphous mass of "scientific knowledge" into a big fat RDBMS and let it churn for a while, it would somehow spit out new scientific knowledge. Huh? Imho it displays not only an astounding lack of understanding regarding how knowledge is encoded, but also about the nature (and obvious limitations) of relational databases.

Re:what a silly question... by debiansid · 2006-04-10 16:25 · Score: 1

put an amorphous mass of "scientific knowledge" into a big fat RDBMS and let it churn for a while, it would somehow spit out new scientific knowledge

Unfortunately that is how many people seem to perceive data mining. Managers/decision makers seem to search for scientific technologies that will help them reduce their dependence on the scientists (experts); and perceive data mining as just that.
Re:what a silly question... by Josh+teh+Jenius · 2006-04-11 01:31 · Score: 1

Agreed: Correlation != Causality.

That is all.

--
Math is math. Regular expression is regular expression. The tools are there. The future is now.

The Math by RalphLeon · 2006-04-10 09:00 · Score: 1

Possibly, a reason is that most people don't have the math or statistical skills to learn the concepts. Most schools don't even teach the techniques of data mining until late in a master's of science program.

When [normal non slashdot] people hear data mining they think of querying with something like Google ("computer find bad people in the FBI database"), not Markov chains.

Clarification of your "b". by khasim · 2006-04-10 09:00 · Score: 1

It doesn't work.

How about "It doesn't work the way the vendor/consultant/salesguy/magazine said it would."

The information you get out depends upon the data you put in.

The people looking to "find" information in the data are the same people who decided what data to collect in the first place. And from whom to collect it. Etc.

That means that you'll find out that 2004 was a banner year for bubblegum ice cream. But you won't know what will be popular in the summer of 2006.

Re:I tell you why (from a bioinformatics viewpoint by networkBoy · 2006-04-10 09:01 · Score: 1

So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it and then you will have speciation within the code (AKA a fork) and everything will be as it was.

Balance, restored.
-nB

--
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump

Because it's not sexy by beacher · 2006-04-10 09:02 · Score: 4, Insightful

From my experience - The people who are subject matter experts in their field (outside of computers) typically don't have the time to perform all of the data entry. So you have to get an ETL / Miner to do all of the work for you. ETL and data mining are *NOT* the sexiest jobs in the industry by a long shot. Auditing data makes you want to gouge your eyes out after the fourth day straight of reviewing loads.

Re:Because it's not sexy by Coryoth · 2006-04-10 10:03 · Score: 4, Interesting

As someone who has done datamining, ETL, and data auditing for very large systems (every transaction on every slot machine in a large Las Vegas casino for 5 years or so) I can assure you that the problem is not lack of data or issues with data entry. The problem, simply put, is that analysis is hard. The data is sitting there, but extracting meaningful information from it is far harder than you might imagine. The first hard part is determining what constitutes meaningful information, and yes that requires subject matter experts. Given the amount of money that can be made with even the slightest improvement, getting subject matter experts to sit down and work with the data people was not the problem. The problem is that, in the end, even subject matter experts can't say what is going to be meaningful - they know what sorts of things they currently extract for themselves as meaningful, but they simply don't know what patterns or connections are lying hidden that, if they knew about them, would be exceedingly meaningful. Because the pattern is a subtle one that they never even thought to connect, they most certainly couldn't tell you to look for it. The best you can do, upon finding an interesting pattern, is say "suppose I could tell you ..." and wait for the reaction. Often enough with some of the work I did they simply didn't know how to react: the pattern was beyond their experience; it might be meaningful, it might not, even the subject matter experts couldn't tell immediately.

So how do you arrive at all those possible patterns and connections? If you think the number of different ways of slicing, considering, and analysing a given large dataset is anything but stupendously amazingly big then you're fooling yourself. Aside from millions of ways of slicing and dicing the data there are all kinds of useful ways to transform or reinterpret the data to find other connections: do Fourier transforms to look at frequency spaces, view it as a directed graph or a lattice, perform some manner of clustering or classification against [insert random property here] and reinterpret, and so on, each of which exposes whole new levels of slice and dice that can be done. If you've got subject matter experts working closely with you then you can at least make some constructive guesses as to some directions that will be profitable, and some directions that definitely will not be, but in between is a vast space where you simply cannot know. Data mining, right now, involves an awful lot of fumbling in the dark because there are simply so many ways to analyse the sort of volume of data we have collected, and the only real way to judge any analysis is to present it to a human, because our computers simply aren't good enough at seeing, understanding, and interpreting patterns to be trusted with the job. Anytime a process has to route everything through humans you know it is going to be very very slow.

Jedidiah.

--
Craft Beer Programming T-shirts

Question IN/ Answer Out. by Anonymous Coward · 2006-04-10 09:05 · Score: 0

"Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?"

Data Mining is about asking the right questions. Indexing is only a small part of that.

--
Data Mining Solutions: Methods and Tools for Solving Real-World Problems

Re:I tell you why (from a bioinformatics viewpoint by moochfish · 2006-04-10 09:05 · Score: 1

I'm pretty sure any elegant solution would be blind to the context of the implementation.

How long has the RDBM been around? by Chabil+Ha' · 2006-04-10 09:06 · Score: 1

I can only think patience. It has been only ~35 years since E.F. Codd published the first white paper on the relational model. We have yet to see the full implementation of what he proposed.

While true that the mathematics (theory) has been around for a while now, the application of it is still in its infancy. Give it some time and additional innovation, and we will have what you seek.

--
We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others

Chloe to the rescue by Christopher_G_Lewis · 2006-04-10 09:08 · Score: 1

They use it all the time on 24.

--
www.christopherlewis.com

Re:Chloe to the rescue by patio11 · 2006-04-10 13:23 · Score: 1

Yeah. Chloe is apparently the only person in the office who knows the proper syntax for the all-powerful "cross-reference" operator. And she's hampered by incompetent upper management who, in all the years between the series, never thought to say "Hey, Chloe, cross-reference Los Angeles and upcoming terrorist attack", which would solve most seasons in three minutes or less.
I think Jack tells Tony to keep Chloe in reserve so he can play the hero more.

--
Help poke pirates in the eyepatch, arr.

Re:I tell you why (from a bioinformatics viewpoint by Anonymous+Crowhead · 2006-04-10 09:09 · Score: 3, Insightful

So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it

Well, that's pretty much how it works in academia (+/- the real dba). Problem is that this is a lab by lab (or department) solution to problems that appear in hundreds or thousands of institutions. The wheel is reinvented over and over again because either commercial/free solutions suck or don't exist. The commercial versions suck because they are built by software engineers and the free versions suck because they are built by scientists (who tend to have the mantra of "if it works, it's done").

Re:I tell you why (from a bioinformatics viewpoint by TRACK-YOUR-POSITION · 2006-04-10 09:13 · Score: 1

Is it really still true that Biologists have no idea about programming? That seems like the direction you'd want to solve this problem from--it's gotta be way easier to teach a biologist some SQL than it is to teach a programmer to be a biologist.

Give up by Anonymous Coward · 2006-04-10 09:14 · Score: 0

I ship next week.

Re:Give up by NoMoreNicksLeft · 2006-04-10 09:17 · Score: 0, Troll

Actually, there are a few out there besides my own. I googled for "porn database" and my stuff is only ranked like #20 and higher...

Of course the others are just half-assed tagging schemes.

Scooty Puff Jr!! by X1088LoD · 2006-04-10 09:14 · Score: 1

Once google is finished indexing EVERYTHING, it will then index itself, thus destroying the universe. Unless some hero can stop it before that happens and escape on a Scooty Puff Jr.....

Re:Scooty Puff Jr!! by stinerman · 2006-04-10 09:22 · Score: 1

Who's ready for safe fun?
Re:Scooty Puff Jr!! by maladr0it · 2006-04-10 09:54 · Score: 0

Remember, Scooty Puff Jr sucks!
Re:Scooty Puff Jr!! by tompaulco · 2006-04-11 09:51 · Score: 1

Once google is finished indexing EVERYTHING, it will then index itself, thus destroying the universe.
Better hurry. Google's already indexed 805,000 pages on "Beavers mate for life".

--
If you are not allowed to question your government then the government has answered your question.

Re:I tell you why (from a bioinformatics viewpoint by TrappedByMyself · 2006-04-10 09:15 · Score: 2, Interesting

Hmmm, why don't the developers and biologists...gasp!....work together to design something? Yes, the developers may have to actually listen to the biologists and not spend their days doing cool programming tricks, and the biologists may actually have to do real requirements work. If no one wants to put the effort in, then no one has the right to bitch about the results.

--

Help me take back Slashdot. When did 'News for Nerds' become 'FUD and Conspiracy Theories for Extremist Nutjobs'?

Re:I tell you why (from a bioinformatics viewpoint by Anonymous Coward · 2006-04-10 09:16 · Score: 0

So fund me..

This is just a skills gap problem. Find the right people. FUND the right people. It will happen.

Basically, stop trying to do things on the cheap. Interview hundreds of people. Pay the premium for the one who will make it happen.

FTFA . . . by Dausha · 2006-04-10 09:18 · Score: 1

"Darwin was his pupil (Henslow helped arrange for Darwin's presence on the Beagle), but Darwin made the intellectual leap that allowed him to interpret Henslow's records of variation - not as evidence of a fixed set of created species with variations, but as evidence of the evolution of new species in action."

Hmm, I read recently that Darwin's grandfather was also a Naturalist, as was Chuck. So, I don't think Darwin made the "leap," so much as his family was already in that direction. Methinks the article presumes Darwin was first in a family of thought--rather than merely one in a clan. He is just the first to gain widespread notoriety for it. (See "Darwin Among the Machines.")

--
What those who want activist courts fear is rule by the people.

Data mining is DIFFICULT by GlobalEcho · 2006-04-10 09:19 · Score: 4, Informative

The blurb hit on a fundamental reason data mining is still at (or beyond) the horizon...defining relations between the various elements is hard. Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

Consider the following boring but difficult task I was given: two large organizations were to merge, each with a portfolio of about 100,000 items. Each item had a short history, some descriptive information, and some data such as internal quality ratings or sector assignments. This data was available (for various reasons) as big CSV file dumps. Questions to answer were: (1) how much overlap did the portfolios have? (2) were the sector distributions similar?

These are very simple, concrete questions. But you can imagine that since the categorizations differed, and descriptors differed within the CSV files, let alone between the two, the questions were difficult to answer. It required a lot of approximate matching, governed intelligently (or so I flatter myself).
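
For anyone curious what "approximate matching" can look like in practice, here is a minimal sketch using only the Python standard library. It is not the method actually used on that merger; the CSV contents, column names, and the 0.7 threshold are all made up for illustration, and real data would need far more careful normalization and manual review of borderline scores.

```python
import csv
import io
from difflib import SequenceMatcher

# Hypothetical CSV dumps; the real files, columns, and names are not given in the post.
PORTFOLIO_A = """id,name,sector
A1,Acme Holdings Inc.,Industrials
A2,Globex Corporation,Technology
A3,Initech LLC,Technology
"""
PORTFOLIO_B = """ref,description,category
B7,ACME HOLDINGS INC,Manufacturing
B8,Umbrella Corp,Pharma
B9,Globex Corp.,Tech
"""

def normalize(name):
    """Lowercase and drop punctuation so trivial formatting differences vanish."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def approximate_overlap(rows_a, rows_b, threshold=0.7):
    """Pair each item in A with its best match in B, if it clears the threshold."""
    matches = []
    for a in rows_a:
        best = max(rows_b, key=lambda b: similarity(a["name"], b["description"]))
        score = similarity(a["name"], best["description"])
        if score >= threshold:
            matches.append((a["id"], best["ref"], round(score, 2)))
    return matches

rows_a = list(csv.DictReader(io.StringIO(PORTFOLIO_A)))
rows_b = list(csv.DictReader(io.StringIO(PORTFOLIO_B)))
matches = approximate_overlap(rows_a, rows_b)
for match in matches:
    print(match)
```

The threshold is where the "governed intelligently" part comes in: set it too high and "Globex Corporation" never meets "Globex Corp.", set it too low and unrelated items start pairing up.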

Contrast this situation with what people typically think of as data-mining: answering interesting questions, and you can appreciate that without a whole lot of intelligence, artificial or otherwise, those questions will be unanswerable.

Re:Data mining is DIFFICULT by pbadot · 2006-04-11 09:01 · Score: 1

After my very similar experiences with datamining
I have concluded it is not data mining
but data archeology.
Re:Data mining is DIFFICULT by inKubus · 2006-04-20 11:23 · Score: 1

Standards are the key. I work in the mortgage banking business and they are trying to build standards for data as there are really only a limited number of relevant fields and everyone in the industry uses the same sort of format. This is largely due to extensive government regulation and oversight (which has held the industry back, of course). There are thousands of fields, but it's not a huge deal to make a big list of them. What it will do is help everyone do business more efficiently because banking is all about sending and receiving various information back and forth (money is just a message, in cybernetic terms), and then storing it for later use. When two companies merge, it's easy to move everything over because the namespace is the same.

Of course, we already have standards, such as the English language and stuff like that, but it's so ambiguous, difficult for a computer to understand probably. They should just take a 64 or 128 bit hash and index every word, every object, etc.

--
Cool! Amazing Toys.

Cool idea! by jigjigga · 2006-04-10 09:19 · Score: 1

I'll just have to patent it and somehow get this article erased ;) On a serious note, it is an interesting question. I guess big business and government aren't interested (as everything they have been interested in seems to get done.)

Re:I tell you why (from a bioinformatics viewpoint by Neil+Blender · 2006-04-10 09:19 · Score: 1

Is it really still true that Biologists have no idea about programming?

To make a sweeping generalization: They usually gain just enough knowledge to make them dangerous, as the saying goes. They are not going to build apps with design and usability in mind. They want something that solves their problem. They don't really care how they get there as long as they get there. And they usually don't care if the road to the answer is paved with bugs and work-arounds.

Nothing to do with Technology by wdavies · 2006-04-10 09:21 · Score: 3, Informative

This is a hoary chestnut. I have a masters in AI, and a PhD in machine learning (and had a lot of interest in machine discovery).

The ultimate problem is that for most datasets, there are an infinite (at least) set of relations that can be induced from the data. This doesn't even address the issue that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

Think of this in terms of permutations. Lets say you have variable A, B, and C. They are all binary (have values 1 or 0). Now, you are given a set of these assignments (eg A=1, B=1, C=1; A=1, B=1, C=1; and so on). Now, try to tell me what the correct partition is. Sort them into two sets of any size. See the problem ? I didn't tell you what I wanted as characteristics of those sets - so in effect, they are all possibly good partitions.
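
To put a number on it, here is a small sketch (my own illustration, not part of the original argument) that enumerates every way of splitting the 8 possible A/B/C assignments into two non-empty groups. The helper name `two_way_partitions` and the two example criteria are hypothetical.

```python
from itertools import product

# All 8 possible assignments of three binary variables A, B, C.
rows = list(product([0, 1], repeat=3))

def two_way_partitions(items):
    """Yield every split of items into two non-empty, unordered groups."""
    first, rest = items[0], items[1:]
    # Pin the first item to the left group so mirror-image splits are not counted twice.
    for mask in range(2 ** len(rest)):
        left = [first] + [x for i, x in enumerate(rest) if mask >> i & 1]
        right = [x for i, x in enumerate(rest) if not mask >> i & 1]
        if right:
            yield left, right

partitions = list(two_way_partitions(rows))
print(len(partitions))  # 2**(8-1) - 1 = 127

# Two equally "valid" splits; which one is right depends entirely on the task.
by_a = ([r for r in rows if r[0] == 1], [r for r in rows if r[0] == 0])
by_parity = ([r for r in rows if sum(r) % 2 == 1], [r for r in rows if sum(r) % 2 == 0])
```

Even with three binary variables there are 127 candidate splits, and nothing in the data alone says whether "split on A" beats "split on parity".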

So, data-mining ultimately relies on human's deciding what they want to read from the tea-leaves of the data.

Now, give it up, and start addressing issues of efficient algorithms given that you have a specific performance task :)

Winton

Re:Nothing to do with Technology by Anonymous Coward · 2006-04-10 11:05 · Score: 0

A Masters and a PhD yet still unable to use apostrophes correctly? /shakes head
Re:Nothing to do with Technology by trosenbl · 2006-04-11 01:19 · Score: 1

A Masters and a PhD yet still unable to use apostrophes correctly? /shakes head

You haven't read many research papers, have you?
Re:Nothing to do with Technology by alienmole · 2006-04-12 10:20 · Score: 1

And those are edited...

Re:A SLASHDOT PRAYER by Anonymous Coward · 2006-04-10 09:23 · Score: 0

Is there any particular reason for Flint, MI?

Re:I tell you why (from a bioinformatics viewpoint by Neil+Blender · 2006-04-10 09:24 · Score: 1

Hmmm, why don't the developers and biologists...gasp!....work together to design something?

Well, for starters, you get developers convincing the biologists that they need Oracle...and it only goes downhill from there.

I'm not saying that it can't happen, only in my experience (15+ years worth) it usually doesn't.

Math Quiz! by Telastyn · 2006-04-10 09:26 · Score: 1

How many relations exist for any combination of N pieces of data?

That's right, a shit ton.
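
For the record, a rough sketch of how big that is, assuming "relation" is read in the set-theory sense (a binary relation is any subset of the N x N pairs; that reading is my assumption, not something stated above). The function names are made up.

```python
def count_subsets(n):
    """Number of ways to pick any subset of n items."""
    return 2 ** n

def count_binary_relations(n):
    """Number of binary relations on n items: any subset of the n*n ordered pairs."""
    return 2 ** (n * n)

for n in (1, 2, 3, 5, 10):
    print(n, count_subsets(n), count_binary_relations(n))
```

At N = 10 that is already 2^100 candidate binary relations, before even considering relations over three or more items at a time.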

What's the question? What are the barriers? by g8orade · 2006-04-10 09:27 · Score: 1

I think the issue with Google or other search engines is how to do analytics.
How do I write a multi-variable where clause?
How do I ask a multi-variable question and then hone it or drill into it along one or more parameters, unfolding detail but preserving multiple layers of an outline hierarchy?

So right there is the idea of a different presentation layer, hierarchical and tabular perhaps.

Then, what kind of barriers do I have to getting at the data? Privacy issues? Copyright or patent issues?

If you want to connect two or more points, won't we have to move beyond keyword searches?

title misleading by flynt · 2006-04-10 09:35 · Score: 1

After RingTFA, this doesn't seem to be about data mining in the computer science/statistics sense at all. Instead, the article suggests that scientists in academia aren't using the best database tools and techniques available. This I agree with strongly; there is often a disconnect between experiments done in scientific fields and proper database techniques to store that data efficiently. However, I don't call that data mining.

TFA by wfberg · 2006-04-10 09:37 · Score: 1

What about that TFA? Someone converted a stack of index cards to a relational database? And this warrants a post on regdeveloper AND slashdot, exactly why?
Like there aren't things to write about like the Open Archives Initiative Protocol.. Geez.

--
SCO employee? Check out the bounty

Copyrights in the way? by miffo.swe · 2006-04-10 09:39 · Score: 1

Correct me if I'm wrong, but aren't copyrights the biggest obstacle against this? You can only mine your own data, as IBM and others already do today. I'm interested in when you can mine data from all the various sources and combine those into conclusions. File formats are another thing hampering this kind of technology, especially if you look at it in a longer time frame. Try mining those Lotus 123 documents for historic facts ;D

--
HTTP/1.1 400

Google and Self-joins by CrazedWalrus · 2006-04-10 09:39 · Score: 1

I just want to comment on this question from the summary:

[...]or will Google make the art obsolete once they finish indexing everything?

Isn't the value of relational databases in the ability to "relate" indexed datasets? Google doesn't support a "join" syntax, as far as I know.

Even Google's fantastic text indexing doesn't break the data up into the discrete "fields" that would be needed to do any meaningful relating. It's sort of like having all of your data in a single column in a single table, and trying to self-join on "like" expressions.
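
To make the analogy concrete, here is a toy sketch using Python's built-in sqlite3 module. The table name, rows, and search term are hypothetical, and this says nothing about how Google actually stores anything; it only shows what "one text column, self-joined on LIKE" amounts to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table, one free-text column: no author, date, or topic fields to join on.
conn.execute("CREATE TABLE pages (body TEXT)")
conn.executemany(
    "INSERT INTO pages (body) VALUES (?)",
    [
        ("E.F. Codd published the relational model in 1970",),
        ("Codd's twelve rules describe a relational DBMS",),
        ("Beavers mate for life",),
    ],
)

# "Relate" two pages only because both happen to contain the same substring.
rows = conn.execute(
    """
    SELECT a.body, b.body
    FROM pages AS a
    JOIN pages AS b ON a.rowid < b.rowid
    WHERE a.body LIKE '%codd%' AND b.body LIKE '%codd%'
    """
).fetchall()

for left, right in rows:
    print(left, "<->", right)
```

The join "works", but only for a substring chosen in advance; it has no idea that both rows are about the same person, which is the point being made above.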

Yeah, you can probably make do if your data has some degree of consistency, but the more "chaos" the dataset incorporates (read: different languages, topics, authors' fluency in the language, etc), the more difficult any real relations become.

It's not impossible, given some significant (human) enrichment of the data, but we're nowhere near the ability to "join" conceptual data from widely disparate data sources. Maybe as AI improves to the point that it can read and "understand" natural languages (and forms of them spoken by non-native speakers), this will become more of a realistic concept. Certainly something to work toward, anyway.

What's everybody waiting on? by GOD_ALMIGHTY · 2006-04-10 09:42 · Score: 0

I thought this was why we built the Internet in the first place?

--
Arrogance is Confidence which lacks integrity. -- me

Easy answer by El_Muerte_TDS · 2006-04-10 09:42 · Score: 1

How much do we know that we still don't know?

We don't know

Re:Easy answer by drachenstern · 2006-04-10 16:31 · Score: 1

How much do we know that we still don't know?
We don't know
I knew that already! Sheesh!!

--
2^3 * 31 * 647

A better lighted shot by hackwrench · 2006-04-10 09:43 · Score: 1

The article said that the researchers were prepared to store their database on index cards. With people still prepared to keep vast amounts of data off computers, it seems that there are still a number of hurdles to jump through before much data is in a state suitable for correlation.

dm by Aradorn · 2006-04-10 09:44 · Score: 1

The government and businesses are very very interested in data mining. The government can use data mining to assist law enforcement with tracking frauds, criminal activities, and even possibly stopping terrorist attacks. Businesses can use data mining to analyze sensors which monitor temperature, barometric pressure and a ton of other measurements and then use this information to figure out production procedures for that particular time period. The possibilities for data mining are pretty much endless given you have the processing power to compute all the numbers. Data mining does not just deal with data that is in a DB. Rather, it is about finding some pattern or correlation between sets of data that otherwise seem completely random or pointless given the context. One of the main problems for data mining is privacy, and until we can find a secure way to share data between DBs it will only hinder the advancement. A really good book to read is Data Mining: Next Generation Challenges and Future Directions, which discusses pretty much everything you need to know about data mining. Also it is pretty much impossible for Google to index all the information on the internet, considering there are about ~110 billion webpages and the number grows each year.

The problem is both easier and more difficult by zappepcs · 2006-04-10 09:44 · Score: 3, Insightful

The problem is both easier and more difficult than it first appears, or even second and third times:

Data, whether held in databases (usually nice and tidy) or in flatfiles, or random text files spread all over hell's half acre, is simply data, not the information required to link it to other data. Even meta data about the data held in any data store is not the information required to link it to other data.

One of the things I believe will help (possibly) is ODF (buzzword warning sounds) because it begins to help format data in a universally accepted manner. Though it is not the only way, universal access methods are required for accessible data. Second, the structure of the data must be presented in a universal manner. This second part allows query languages to support cognitive understanding of the structure, and thus (with some work) the value of data held in a storage location, wherever and whatever that location is, be it RDBMS, text files, or phone bills.

Indexing is simply not enough. The ability to retrieve and utilize the index with the most probability of having relevant data is what is needed. We all know that any search engine can get you too many 'hits' that contain useless data. Google or anyone else is helpless until there are accepted methods for applying metadata and data structure descriptions on all data.

When there is far more organization to data storage, there will be a great sucking sound of people actually using data from the internet in brand new ways.... until then, it's all hit and miss.

--
Support NYCountryLawyer RIAA vs People

It's still only an index by Anonymous Coward · 2006-04-10 09:46 · Score: 1, Insightful

A table of contents doth not a book make.

I don't think Google will replace good old fashioned research by humans. I think we're still light years from computers having anything even *close* to intelligence high enough to replace humans in 'connecting the dots' of data libraries.

$0.02

******************
Slow Down, Cowboy! It's been x minutes since you last successfully posted a relevant comment anyone wants to read.

Re:I tell you why (from a bioinformatics viewpoint by quanticle · 2006-04-10 09:47 · Score: 1

Any solution general enough to be blind to the context of implementation would either be so slim that you'd have to add context-specific information to it in order to get anything done, or so fat that it'd try to be everything to everybody and would end up being nothing to nobody.

--
We all know what to do, but we don't know how to get re-elected once we have done it

How do Google do their queries? by caluml · 2006-04-10 09:48 · Score: 1

I want to know how, if I put a random string on my webpage (say ioeuhncio38u9384hynfxiuhfnx847uvh04897x ), and wait for Google to index it, that searching for that string will return my page in milliseconds. It obviously can't be a pre-executed query. So how the hell do they do that? SELECT * FROM index WHERE text ILIKE '%foo%' just won't cut it.
I'd love to know how search engines do do it - anyone reading this worked for one?

--
Get your own free personal location tracker

Re:How do Google do their queries? by Peter+Mork · 2006-04-10 10:04 · Score: 1

Behind virtually every keyword retrieval system is some form of an inverted word index. You first probe the index using the search term. By chopping the index up into pieces (for example one piece for each letter of the alphabet) and replicating each piece across a large number of machines, you can massively parallelize this lookup. The inverted index returns a list of document identifiers (URLs) containing the keyword, probably pre-sorted in descending PageRank. If you provide multiple keywords, you need to compute the intersection of these lists.
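
A toy version of the idea, as a sketch only: the documents, URLs, and "rank" scores below are invented, there is no sharding or replication, and the results are sorted at query time rather than stored pre-sorted as a real engine presumably would.

```python
from collections import defaultdict

# Hypothetical documents keyed by URL; "rank" stands in for a PageRank-like score.
DOCS = {
    "a.example/page1": {"rank": 0.9, "text": "data mining is hard"},
    "b.example/page2": {"rank": 0.5, "text": "mining gold is hard work"},
    "c.example/page3": {"rank": 0.7, "text": "data entry is boring"},
}

def build_inverted_index(docs):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, doc in docs.items():
        for word in doc["text"].lower().split():
            index[word].add(url)
    return index

def search(index, docs, query):
    """Return URLs containing every query word, highest rank first."""
    terms = query.lower().split()
    if not terms:
        return []
    # One posting set per term; a word never seen yields an empty set.
    postings = [index.get(term, set()) for term in terms]
    hits = set.intersection(*postings)
    return sorted(hits, key=lambda url: docs[url]["rank"], reverse=True)

index = build_inverted_index(DOCS)
print(search(index, DOCS, "is hard"))
print(search(index, DOCS, "ioeuhncio38u9384hynfxiuhfnx847uvh04897x"))
```

The lookup cost depends on the size of the posting lists for the query words, not on the total amount of text indexed, which is why a one-off random string comes back almost instantly.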
Re:How do Google do their queries? by Anonymous Coward · 2006-04-10 10:14 · Score: 0

Check out MySQL fulltext search. PostgreSQL implements it too.
Re:How do Google do their queries? by Anonymous Coward · 2006-04-14 06:48 · Score: 0

Holy Cow, you don't have to wait long: Google indexed it already!

Results 1 - 1 of about 2 for ioeuhncio38u9384hynfxiuhfnx847uvh04897x

Slashdot | Why Is Data Mining Still A Frontier?
I want to know how, if I put a random string on my webpage (say ioeuhncio38u9384hynfxiuhfnx847uvh04897x ), and wait for Google to index it, that searching ...
rss.slashdot.org/Slashdot/slashdot?m=4806 - 125k - Cached - Similar pages

In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed.
If you like, you can repeat the search with the omitted results included.

Re:To: Brain Dead U.S. Population by Anonymous Coward · 2006-04-10 09:49 · Score: 0

Fortunately I can data mine the submission IP numbers and find out who you are, you subversive son of a bitch!

- GWB

Spot on, and.. by swordfishBob · 2006-04-10 09:49 · Score: 1

.. and it's (relatively) easy to spend money on a "solution" as a once-off expense, but getting value requires someone to stay in the environment and work with it. How easy is it to justify employing someone with a good mix of background and intelligence (even if you can find them), to deliver, well, their job is to find out what they can deliver..

--
-- All your bass are below two Hz

42 by DesertWolf0132 · 2006-04-10 09:52 · Score: 4, Insightful

"I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is." -Hitchhiker's Guide to the Galaxy

One must remember when undertaking to find answers in the data to first figure out the question. Otherwise the answer you find will be as useful to you as the answer 42.

Without context you only have a neat compilation of arranged meaningless facts.

On the small scale data mining is used daily by marketing people and the like to figure out who would be most receptive to their approach. Webmasters use it to optimize content and respond to user trends. In most large corporations data mining is used on some level.

Data mining on the scale discussed here may be practical at some point in the future once we determine the questions we wish answers to.

Let us hope the answer is more useful than 42.

--
No animals were harmed in the making of this sig.
Well, there was that one puppy, but he is all better now.

Re:42 by drachenstern · 2006-04-10 16:33 · Score: 1

MY ANSWER!!! You have found it. Now, quickly, before they get here, let me tell you what my question wa

--
2^3 * 31 * 647
Re:42 by inKubus · 2006-04-20 11:39 · Score: 1

There's also the element of surprise. People get behind the weirdest shit sometimes, such as All Your Base are Belong to Us or Snakes on a Plane. I think the reason for these outbreaks of madness is the inherent need of the human mind to lash out at the increasing structure of today's lifestyle based on data. The human mind knows that this goes against nature, where everything tends toward chaos (at least during this portion of the cycle).

Norbert Wiener said a few things about this (from The Human Use of Human Beings):

"Just as entropy is a measure of disorganization, the information carried by a set of messages is a measure of organization. In fact, it is possible to interpret the information carried by a message as essentially the negative of its entropy, and the negative logarithm of its probability. That is, the more probable the message, the less information it gives. Cliches, for example, are less illuminating than great poems. As we have said, nature's statistical tendency to disorder, the tendency for entropy to increase in isolated systems, is expressed in the second law of thermodynamics. We as human beings, are not isolated systems. Organism is opposed to chaos, to disintegration, to death, as message is to noise."

"Life is an island here and now in a dying world. The process by which we living beings resist the general stream of corruption and decay is known as homeostasis. We can continue to live in the very special environment which we carry forward with us until we begin to decay more quickly than we reconstitute ourselves. Then we die. We are but whirlpools in a river of ever-flowing water. We are not stuff that abides, but patterns that perpetuate themselves."

Of course, Douglas Adams based a lot of the ideas in that book on The Human Use of Human Beings (which was much in vogue when he was at school in Cambridge).
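Wiener's remark that information is the negative logarithm of a message's probability can be made concrete with a few lines of Python. This is only a sketch: the function name and the two example probabilities are made up for illustration, and base 2 is assumed so the result comes out in bits.

```python
import math

def self_information_bits(p: float) -> float:
    """Information (in bits) carried by a message of probability p: -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)

# Hypothetical probabilities: a likely message (the "cliche")
# versus a very unlikely one (the "great poem").
cliche_bits = self_information_bits(0.5)
poem_bits = self_information_bits(1 / 1024)
```

The more probable message carries fewer bits, which is the "cliches are less illuminating" point in numeric form.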

--
Cool! Amazing Toys.

(Machine Learning == Data Mining) does work ! by copdk4 · 2006-04-10 09:54 · Score: 2, Interesting

What used to be called 'data mining' in the '80s and '90s is now machine learning in the 21st century, and there are several instances where machine learning has shown tremendous success (it is probably the only by-product of AI that has shown promising real-world applications):

- The DARPA Grand Challenge - Stanley, the winning robot from Stanford, used 'adaptive vision', which relied on some real-time learning algorithms
- Clustering and Micro-Array Analysis - Once genetic medicine becomes a reality, physicians will unknowingly be using clustering algorithms underneath..
- Froogle, Clusty, Amazon recommendations etc. all use learning underneath..

I haven't RTFA, but I think the "RDBMS view" is too naive for the given scale of the problem. What one has to understand is that data mining is not a "push-button" technology: one has to have a thorough understanding of the data and of the 'interesting questions' one wants to answer, then choose the right set of algorithms and tune them properly. In biomedicine, there have always been 'bio-statisticians' in the hospital who perform these tasks.

relational data mining by kurtdg · 2006-04-10 10:09 · Score: 1

Someone who "unlocks the keys to understanding" of the data in a relational database is not called a relational database programmer. There is an entire active CS research field specialized in this task; it's called 'relational data mining'. The theoretical foundation is inductive logic programming (http://en.wikipedia.org/wiki/Inductive_logic_programming). The Wikipedia article contains a link to the freely downloadable book of Dzeroski and Lavrac, which is a good start.

Re:or... by symbolic · 2006-04-10 10:11 · Score: 1

Our ability to produce meaningful results, in most cases, is little more than a crapshoot.

Disappointed by Jon+Chatow · 2006-04-10 10:14 · Score: 1

I'm really quite astonishingly disappointed that the summary made no reference to the priceless phrase "unknown knows" to describe data left 'buried' in the dross, presumably from sources left, as it were, under-mined.

--
James F.

GPL data mining software by nanorc · 2006-04-10 10:21 · Score: 1

I have played around with WEKA a bit. I am interested in data classification and WEKA has many classifiers you can try out on your sample datasets.

easy answer: by circletimessquare · 2006-04-10 10:21 · Score: 1

entropy

data mining will always be a frontier, because consolidation and standardization of data will always be a frontier, because simple entropy leads to fragmentation. furthermore, for various reasons, some good, some bad, some data will always be purposefully constrained from consolidation, only to be released into freer usage later, when data mining can commence

it's a permanent frontier

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it

You accuse me of subversion? by Anonymous Coward · 2006-04-10 10:30 · Score: 0

Then, why do the Iraqi oil pipelines continue to burn?

Case closed.

Just like programmers then. by HornWumpus · 2006-04-10 10:35 · Score: 1

They usually gain just enough knowledge to make them dangerous, as the saying goes. They are not going to build apps with design and usability in mind. They want something that solves their problem. They don't really care how they get there as long as they get there. And they usually don't care if the road to the answer is paved with bugs and work-arounds.

Sounds like 90%+ of the programmers I've known.

You don't need to know anything about Databases to know how to produce a unique hash from chaotic biological data. You don't need to know where the big honking hash came from to know it's going to be a big honking index to search.

Going back to the GP post, what does any of this have to do with DBAs? They are glorified backup script maintainers.

--
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'

Semantic Web goodness-Gaming. by Anonymous Coward · 2006-04-10 10:54 · Score: 0

"The potential is there, but the willingness of humans to spend time explaining semantic structures to machines, when they're obvious enough to other humans, is lacking."

That's why you want to make it into a game

I'd be Happy if... by VonSkippy · 2006-04-10 11:17 · Score: 1

I'd be happy if Infoworld Magazine (and the rest of the trade journals) could just remember the last set of lies I told them, instead of making me make up shit every 10 months or so.

SlashDot Article Misses Point of Original Article by Anonymous Coward · 2006-04-10 11:17 · Score: 0

The original article is about moving data from paper and index cards to SQL Server 2005 - that simple.

The SlashDot article spazzes out about data mining - an area unrelated and unmentioned in the original article.

So the SlashDot poster of the article is either a bot or an idiot (I lean toward the latter).

Editors, you aren't doing your job. This article should never have been posted. The OP has nothing of interest and the unsupported additions of the poster appear to be the result of a crack pipe. Enough of these types of posts and SlashDot will become useless.

Aristotle's two bits: by novus+ordo · 2006-04-10 11:24 · Score: 1

"But if it was always true to say that a thing is or will be, it is not possible that it should not be or not be about to be, and when a thing cannot not come to be, it is impossible that it should not come to be, and when it is impossible that it should not come to be, it must come to be." Aristotle, On Interpretation.

--
"You're everywhere. You're omnivorous."

Let the monkeys mine that data by suv4x4 · 2006-04-10 12:01 · Score: 2, Funny

Data Mining is still a frontier for the same reason monkeys are still having trouble reproducing Hamlet despite all the theoretical knowledge of all the incredible opportunities.

Too many assumptions, too many possibilities, too little knowledge, and not enough monkeys. You can never have enough friggin' monkeys.

Its long and hard, just to get started by benow · 2006-04-10 12:18 · Score: 1

Well, I've been collecting data from various sources lately, and most is still in 'data' form, i.e. no real relevant difference from one set of bits to the next. I've been on a push to surface the interactions between the data, but to even get to that point, there is a lot of data massaging to do: decompression, format interchange, subject recognition, etc. In theory, once the data is in an understood format it can be searched and indexed and the searches mined. It requires a general idea of where to go with the idea, then combination leading to a certain interrelated knowledge network, which is of more benefit than a single point. Data interrelationship is important, but hardly present today. Open and extensible metadata tied to exchanged data is vital to bridging systems, providing the foundations for greater useful data... a metadata system interchange. Right now the links are simple... one way and (usually) in a context which only a human can understand.

I've built a tv app that grabs xmltv data and throws it into a db, presents listings and a higher level 'remote control' via web pages on a zaurus (with channel changes via an ir transmitter). As I watch, what I watch is noted, and every half hour what I have noted is matched against what is currently on. It's very simple, only a join, but the results are usually quite relevant. The lesson being that relevant results only come from intentful interaction. I plan on carrying forward the simple mining... spidering out across credits, intersecting with imdb info and recommendations, moderation of selections, etc. I expect the results to be quite person-centric, but a mining of distributed relevancy could lead to better results (i.e. intersecting preferences of many watchers). That might contradict the problem of seeing outside the bubble... suggestions are usually the result of personal decision, and are therefore skewed towards the person.. something relevant and unusual is a beautiful thing.
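For anyone wondering what "only a join" might look like, here is a rough sketch using Python's built-in sqlite3 module. The table names, columns, sample shows and the on_now helper are all made up for illustration; the real app's schema isn't described, and times are simplified to HH:MM strings on a single day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE watched (title TEXT PRIMARY KEY);
    CREATE TABLE listings (channel TEXT, title TEXT, start TEXT, stop TEXT);
""")
# Hypothetical data: shows noted while watching, and a few listings.
conn.executemany("INSERT INTO watched VALUES (?)", [("Nova",), ("Mythbusters",)])
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        ("PBS", "Nova", "20:00", "21:00"),
        ("DSC", "Mythbusters", "21:00", "22:00"),
        ("ABC", "Evening News", "20:00", "20:30"),
    ],
)

def on_now(conn, now):
    """Return (channel, title) for previously watched shows airing at `now` (HH:MM)."""
    rows = conn.execute(
        """
        SELECT l.channel, l.title
        FROM listings AS l
        JOIN watched AS w ON w.title = l.title
        WHERE l.start <= ? AND ? < l.stop
        ORDER BY l.channel
        """,
        (now, now),
    )
    return rows.fetchall()

suggestions = on_now(conn, "20:15")
```

Running the check every half hour would just mean calling on_now with the current time; anything fancier (credits, recommendations) would hang further joins off the same two tables.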

There's also no money in it... trying to sell a new idea to a suit most often results in the death of the idea... skewed to benefit the uninventive status quo, tho there are exceptions, I guess. That often kills even the idea of attempting a feat that requires the acquisition and maintenance of terabytes of data to be useful. AFAIC profit driven crap is just that... inane drivel from a self-serving path. There's promise, surely, and challenge even moreso, but whether there is societal acceptance or encouragement of such un-triviality is another question. The internet is young tho, and if you think there's a lot of data now, you ain't seen nothing yet.

Re:Its long and hard, just to get started by Anonymous Coward · 2006-04-12 20:28 · Score: 0

Ignore this if you already knew about the IMDB database downloads, but just in case you haven't heard about it: you can download the factual IMDB data for personal use from here. You won't get their discussion forum posts, or any pictures, but as far as I know the download is complete with regards to the factual data such as actors, production details etc.
Re:Its long and hard, just to get started by benow · 2006-04-13 11:55 · Score: 1

Yeah, totally. Thanks for the response, tho. I've dl'd the data and have made a .gz to sql importer, which I've yet to fully run... 300M of compressed ascii takes ages to import... 600k+ actors alone. When done and validated, should mean a local imdb cache which should be faster than imdb. I plan an exception handler which queries and fetches from imdb when the data is not available locally, and then to create lightweight pda-friendly dynamic pages for presentation of data. May go live with the 'mobile imdb'... I'm sure there'd be demand tho I'd have to get clearance from imdb.
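The "check locally, fetch remotely on a miss" idea could be sketched roughly like this. Everything here is hypothetical: fake_remote stands in for whatever the real remote query would be (no actual IMDb interface is assumed), and a plain dict stands in for the local database.

```python
def make_cached_lookup(local_cache, fetch_remote):
    """Build a lookup that tries the local cache first and falls back to fetch_remote."""
    def lookup(key):
        try:
            return local_cache[key]
        except KeyError:
            # Not available locally: fetch once, then remember it.
            value = fetch_remote(key)
            local_cache[key] = value
            return value
    return lookup

remote_calls = []

def fake_remote(title):
    """Stand-in for a remote query; records each call so misses are visible."""
    remote_calls.append(title)
    return {"title": title, "source": "remote"}

cache = {"Brazil": {"title": "Brazil", "source": "local"}}
lookup = make_cached_lookup(cache, fake_remote)

first = lookup("Alien")   # miss: goes to the remote stand-in
second = lookup("Alien")  # hit: served from the local cache
local = lookup("Brazil")  # was local all along
```

The second lookup never touches the remote side, which is the whole point of keeping a local copy.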

It's all in the management. by Ruff_ilb · 2006-04-10 13:19 · Score: 1

I used to work a simple job where I did database work for a company doing medical studies. It wasn't a lab, but it wasn't your typical cubicled office either. Although I had very little knowledge of the actual medical component of the studies I was doing, certainly not enough to design the stuff I needed to do, the management was superb - I wasn't REQUIRED to know anything about the medical component, and they trusted me to do the programming. What I didn't know they were happy to fill me in on - I knew enough about medicine for what they were saying to make sense, and they knew enough about programming to give me some idea where to start. If the management can effectively coordinate biologists and programmers, you don't need to have dbas with deep biology backgrounds.

--
http://www.TheGamerNation.com/Forums

Re:I tell you why (from a bioinformatics viewpoint by jlarocco · 2006-04-10 13:32 · Score: 1

Well, for starters, you get developers convincing the biologists that they need Oracle...and it only goes downhill from there.

Is it possible the developers are saying something like "It starts out with the biologists saying they need 30 TB of data available 24/7 with 99.999% uptime and 200-250 concurrent users, and goes downhill from there..."

Simply saying the developers are idiots because they suggest Oracle really doesn't make sense without more context. If more than one group of developers suggests Oracle, they might have a point. Are you sure you're not exaggerating the requirements? If you need really high uptime and lots of concurrent users with really, really large amounts of data, Oracle is one of the better choices. There's a reason it's so expensive...

--
Maybe not

Re:I tell you why (from a bioinformatics viewpoint by TheSpoom · 2006-04-10 13:54 · Score: 1

Indeed. This is why I prefer a compromise: modularity. Generalization in the parent software, specialization in the modules. Plus it allows for third parties, if they so choose, to easily integrate with the parent software.

--
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs

New Use for Google. by Allnighterking · 2006-04-10 14:41 · Score: 1

The Patent administration takes your idea puts it into google, filtering out you and any article talking about you. If they get a hit, prior art, eeeeeeeeh patent rejected!

--

I'm sorry, I'm too tired to be witty at the moment so this message will have to do.

Re:Shot in the dark: Who benefits? by OldBaldGuy · 2006-04-10 14:48 · Score: 1

It's to my benefit if you spend the time to publish your data.

Publishing my data is too bothersome.

Any questions?

As I've Said Repeatedly by Master+of+Transhuman · 2006-04-10 14:52 · Score: 1

without conceptual processing, data is just so much bits and bytes. Some of it can be analyzed as such, but much of it cannot without some conceptual comprehension on the part of the software (if not the analyst - which is the other problem).

A decent (read, relatively effective and efficient) simulation of conceptual processing would change the entire world of computer use from development to databases to computer education to robotics. It is THE world-class issue that needs to be resolved and soon.

--
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!

weft by Anonymous Coward · 2006-04-10 14:52 · Score: 0

skipjak goes to quadrant 4.

ha ha, it's chaos theory by recharged95 · 2006-04-10 14:56 · Score: 1

Mining is like looking at a huge phase diagram, constantly changing as it's very sensitive to conditions (initial ones at least), which are constantly changing too.

Patterns will be recognized, but to extract useful information other than information about nature (i.e. physical laws) will always be subjective, or in other words, chaotic.

Google will never be 'done' with indexing--just another infinite loop in the making :)

McDonald's & Chaos: I'm lovin it.

SQL Server to discover patterns... by JoshRoss · 2006-04-10 14:58 · Score: 1

When using the simplest of models, I get nothing but junk coming out of Analysis Server. I am sure there is something that it will find. I just have not found it yet. This is with sales, time and products as the dimensions.

Diapers & Baby Food by Anonymous Coward · 2006-04-10 15:10 · Score: 0

Next time you're hosting or attending a (potentially) riotous party, gladden anonymous coward's weary heart by pulling the old "irresponsible" parent prank:

Get yourself down to [supermarket of choice]

Load up the trolley with slabs of beer, bottles of spirits, nibbles and smokes.

Add a couple of packs of disposable diapers and a few packs of baby formula.

Go to the checkout, wait your turn.

When the cashier has rung up all your purchases and given you the total, go: "Oh Lordy! that's too much!". Push the diapers and baby formula to one side. "I'll have to leave these!"

Expect a call from child protection in the near future...

maybe Semantic Web is close... by spage · 2006-04-10 15:20 · Score: 1

Too bad that the Semantic Web is a pipe dream at the moment.

You can download the Semantic MediaWiki extension right now and add semantics to a wiki. Currently all the links between pages in a MediaWiki have no meaning, and all the facts in each page can only be extracted by humans reading it. With the upgrade a page can state [[is located in::California]] to explain the type of relationship implied by a link, and can express attribute values like [[population:=1,305,736]]. The current version summarizes all such facts in each page and can export them as RDF. It's a simple extension, but once it's implemented in Wikipedia, you could query for, e.g. the population of every major city in California. Doing such semantic queries using Google is basically impossible, you'll just get a list of pages and have to read and filter each one to create your own list.
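To show how little it takes to pull such facts out of page text, here is a small Python sketch that extracts the two annotation forms quoted above with a regular expression. This is not the extension's own parser; the pattern, the extract_facts name and the sample text are assumptions for illustration only.

```python
import re

# Matches [[property::value]] (typed link) and [[property:=value]] (attribute),
# the two forms quoted in the post. Plain links like [[Main Page]] do not match.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+?)(::|:=)([^\[\]|]+)\]\]")

def extract_facts(page, wikitext):
    """Return (page, property, value, kind) tuples found in the wikitext."""
    facts = []
    for prop, sep, value in ANNOTATION.findall(wikitext):
        kind = "relation" if sep == "::" else "attribute"
        facts.append((page, prop.strip(), value.strip(), kind))
    return facts

sample = (
    "San Diego [[is located in::California]] and has "
    "[[population:=1,305,736]] people. See also [[Main Page]]."
)
facts = extract_facts("San Diego", sample)
```

Each tuple is essentially a subject/predicate/object triple, which is the shape an RDF export would need; querying "every city located in California" then becomes a filter over such tuples rather than a reading exercise.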

Sharing semantics between datastores would require people agreeing on ontologies, which according to people like Clay Shirky is indeed a pipe dream. I'm not so sure, that's like saying categories in Wikipedia are useless because they're disorganized. Just using the Dublin Core metadata to identify authors of information in a common way would be a big breakthrough, and there are simple enough ways to do it in XHTML that I think it'll pick up steam in the next few years.

--
=S

Re:I tell you why (from a bioinformatics viewpoint by espressojim · 2006-04-10 15:23 · Score: 1

This sounds like bioinformatics.

The um...field I've been working in for the last 6 years.

Programming + Biology + Statistics + Algorithm development.

Re:I tell you why (from a bioinformatics viewpoint by drachenstern · 2006-04-10 16:20 · Score: 1

isn't that the whole idea behind well written and conceived code? just thought i would check.

--
2^3 * 31 * 647

Re:I tell you why (from a bioinformatics viewpoint by ebuck · 2006-04-10 18:39 · Score: 1

I happen to be one of those few fools that have both a degree in Biology and in Computer Science. And at one time I relied on my research skills in Biology as my ONLY income, until the dreaded and softly spoken "balancing" of the budget that spelled doom to most low level Biologists of my time.

It is hard to mix the two. This is even more frustrating if you're marginally inclined to understand where things come from and how they are designed. Some of the earliest proponents of object oriented software programming envisioned "cells" of code that "signaled" each other. It's no coincidence that Biological terms were used, because that (I've now forgotten) person held a Biology degree. The iterative (or step-wise) approach identically mirrors the laborious procedures for running most lab jobs. I consider myself blessed to have a firm grasp on both techniques, but there are plenty of others who have managed to master both.

Today, I'm told that my Biology degree has little to do with my work, and that it has no bearing on my career as a software developer. I understand the misconception behind such a statement; however, I do not share that opinion. To some degree, ALL of the natural sciences are related, and ALL research and fact finding skills can be leveraged in other environments. I'm shocked at the poor quality of "professional" Biological software, but then again, it's mostly written by Biologists that have the domain knowledge, but lack the skills to produce polished software. Even when they attempt to hire, there's precious few people out there with enough skill to know if they are hiring good programmers, leading to skill poor shops shipping the best they can produce.

I'm sure that not all shops are skill poor, but I had an opportunity to work for a Biological company that was writing software where I didn't see a real opportunity. The position promised a worse hierarchy in terms of status and prestige, with fewer opportunities for me to contribute than its non-Biology competitors. Maybe I was unlucky, and perhaps I picked the one bad example to interview with. They were telling me things about their software development that made my CS blood run cold, expecting me to admin 80 computers "on the side" and offering to pay me about 70% of the going rate.

Fewer people really want programmers, because it is unclear to a non-development company what a programmer can offer, but it is clear what a DBA can offer. Reliable reporting, offline data storage and retrieval, performance analysis, and data warehousing needs are directly tied to the survival of many businesses. The things a computer science person can offer might be able to exceed that of a DBA, but it's not as certain it will. Given the conservative bent of today, I'd say that most companies feel they can't afford a risk. Perhaps it partially explains the appeal of finding candidates with a 100% buzzword match compliance?

Which Bacon? by Anonymous Coward · 2006-04-10 19:28 · Score: 0

It is interesting to see that some people in this thread still remember Roger, rather than Francis, Bacon.

Francis, of course, wrote his 'New Guide' in 1620, around 400 years ago. Much of his work was a straight copy of ancient authorities - you can see a lot of Cicero in his treatment of philosophy. A lot of his science seems to be a direct repetition of Roger's work. However, he was trying to rediscover the lost Golden Age, the wisdom of the ancients, rather than develop a new field of study, like his predecessor.

Roger, by comparison, was writing in the 1260s, around 400 years before Francis! Though both men proposed a structured, experimental approach to the investigation of Nature, it was Roger, with his iconoclastic approach to 'unworthy' authority who most closely resembled the modern questioning approach to science, and he is usually considered as the inventor of the scientific method. That was what got him locked up in the March of Ancona for 14 years!

It is a shame that Roger's language is so impenetrable to modern ears. There are no readily-available translations, and even less commentary by technically-aware persons who understand classical scientific concepts. As Blish points out, a first glance at 'De multiplicatione specierum' would make you think that Roger was talking about biology rather than physics.

This reference - http://www.nndb.com/people/582/000114240/ - suggests that there are still lots of unpublished Bacon manuscripts in British and French libraries. I think that a complete collection of Roger's works in an accessible format would add more to scientific progress than many studies which receive funding today.

Pardon the self-promotion, but... by dfetter · 2006-04-10 20:03 · Score: 1

I've written a thing called DBI-Link http://pgfoundry.org/projects/dbi-link/

which helps do the job by making data sources easily available, one to the other. Of course, it's not done yet, but it's a long way in the right direction :)

--
What part of "A well regulated militia" do you not understand?

Re:I tell you why (from a bioinformatics viewpoint by dodobh · 2006-04-10 20:04 · Score: 1

Then wouldn't it be useful for the biologists to define the context for the programmers? It shouldn't be impossible to do so (very hard, I will grant you).

--
I can throw myself at the ground, and miss.

Re:I tell you why (from a bioinformatics viewpoint by Hast · 2006-04-10 20:54 · Score: 1

There's a reason it's so expensive...

Because Larry Ellison needs a new sub-woofer?

Re:I tell you why (from a bioinformatics viewpoint by mlush · 2006-04-10 23:03 · Score: 1

Then wouldn't it be useful for the biologists to define the context for the programmers? It shouldn't be impossible to do so (very hard, I will grant you).

It would be very useful.

However, taking a programmer with no biological experience, I'd guess it would take about a year's full-time study to properly define the context to him/her and perhaps 3 months (full time) to give them a reasonable working knowledge.

It's easy enough to give the basics (DNA makes RNA makes Protein(1)); it's that biology is wall-to-wall special cases. Biological systems run the worst spaghetti code you can imagine, written in a language that's barely documented(2), written by a developer who is willing to hack the executable, the source code, the compiler, the operating system and in extreme cases the hardware to get a functional system.

(1) except RNA can 'make' DNA, RNA can act like a protein (enzyme)

(2) Using language only comprehensible if you know the subject already

It's alive and well in government by Anonymous Coward · 2006-04-11 00:24 · Score: 0

We are collecting local, state and IRS tax data, and information from thousands of businesses around the country that do business with citizens of our state. So far, we are at 6TB on our SAN, and growing. We used to use a VFP database just to compare state tax returns against IRS returns. The process took over 40 hours of 486 CPU time just to process the data into a form that could be queried. Now, with P4's, 2GB of RAM, and 3.3GHz processors working against our Oracle DB we can churn data as easily as making ice cream.

We "data mine" by comparing individual and corporate tax returns with spending habits to see if
a) state tax returns match federal tax returns
b) the spending matches the income
c) there is unreported out of state sales on which state taxes can be collected
d) any other information that might help in the collection of unpaid taxes.
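Check (a) in the list above boils down to matching records on a taxpayer key and flagging differences. A toy Python sketch of that comparison follows; the IDs, income figures, dict layout and find_mismatches helper are all invented for illustration and say nothing about how the actual Oracle setup does it.

```python
def find_mismatches(state_returns, federal_returns, tolerance=0):
    """Return {taxpayer_id: (state_income, federal_income)} where the two differ."""
    mismatches = {}
    for taxpayer_id, state_income in state_returns.items():
        federal_income = federal_returns.get(taxpayer_id)
        if federal_income is None:
            continue  # no federal record to compare against
        if abs(state_income - federal_income) > tolerance:
            mismatches[taxpayer_id] = (state_income, federal_income)
    return mismatches

# Hypothetical reported incomes keyed by a made-up taxpayer ID.
state = {"A-001": 52_000, "A-002": 40_000, "A-003": 75_000}
federal = {"A-001": 52_000, "A-002": 140_000}

flagged = find_mismatches(state, federal)
```

Checks (b) and (c) would follow the same pattern with a different second data set (spending or out-of-state sales) in place of the federal figures.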

On the average, 10% lie about their taxable income, some in excess of $100K, and more don't report online purchases. Look for the "StreamLine Internet Tax Initiative" to be passed by a majority of state legislatures in the near future.

math and science by Anonymous Coward · 2006-04-11 00:39 · Score: 0

Yes and no... believe it or not, at least a few people in the biological sciences are utterly incapable of understanding algebra. (I said a few, not necessarily a majority...) So, even basic SQL is a mystery to them. They can become productive workers through just assimilation of information. This is not necessarily easy; quite a few people (including programmers) don't have the memory to do it.

A biotech scientists point of view by cinnamon+colbert · 2006-04-11 00:55 · Score: 1

I don't see problems that are susceptible to data mining. I suspect this view is shared by many of my colleagues. (This is similar to the view of most bio-oriented scientists that design of experiment is not useful.)

What would change the field ?

In science, what usually changes people's minds is a BIOLOGICAL result obtained with a new technique that could not be obtained (easily) another way.
This may just be restating the old truism that success breeds success, but to get biologists interested in large scale database mining sorts of things, you have to convince us that there are questions that can be answered with this technique that can't be answered more easily with other techniques.

The other problem is the high noise of bio experiments; you can't simply aggregate data on rats' blood pressure in France with ozone levels in Arkansas and come up with hypotheses on weather and strokes; the data quality is too low, and it would be prohibitively $$ to design the experiments to have that level of quality.

A word on design of experiment: as I understand it, if you have n interrelated variables (temp, pressure, time, etc.) and each variable has x levels (Temp = 20 deg, 25, 30...), a "complete" experiment would be some sort of n by x dimensional test. DOE says if this number is so large that you can't look at a significant %, you are better off changing two variables at a time to sample the space. None of the DOE people have been able to put this into comprehensible, usable form, and for various reasons, most biologists don't think this sort of thing works (the variables are obvious, and the response surface is highly non-linear and non-smooth).
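To put a number on the "complete" experiment: with n variables at x levels each, a full factorial design has x^n runs. A quick Python sketch that enumerates them is below; the variable names and levels are made-up placeholders, and this shows only the full grid, not any particular DOE sampling scheme.

```python
from itertools import product

def full_factorial(levels_by_variable):
    """Enumerate every combination of levels as a list of {variable: level} dicts."""
    names = list(levels_by_variable)
    return [
        dict(zip(names, combo))
        for combo in product(*levels_by_variable.values())
    ]

# Hypothetical variables, three levels each.
levels = {
    "temp_c": [20, 25, 30],
    "pressure_kpa": [90, 100, 110],
    "time_min": [5, 10, 15],
}

runs = full_factorial(levels)
run_count = len(runs)  # 3 levels ** 3 variables
```

Three variables at three levels is already 27 runs, and ten variables at three levels would be 3**10 = 59049, which is why sampling the space instead of covering it comes up at all.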

Re:Sig Wars by Josh+teh+Jenius · 2006-04-11 01:27 · Score: 1

Hey man, interesting project.

Is this of any use/interest to you? http://joshthejenius.com/experiments/technorati_spam.php

--
Math is math. Regular expression is regular expression. The tools are there. The future is now.

Re:Honest question from serious lackey- by Josh+teh+Jenius · 2006-04-11 01:40 · Score: 1

Your post rings true with me.

In your opinion, is this do-able:

1. Design a series of algorithmic searching functions, each catered to specific datatypes? (i.e. porn, or PHP functions or whatever it might be we are looking at/for).

2. Connect all of these specialized search algorithms together, with a single, simple UI.

3. Use natural language processing (AI) to direct each query to the proper algorithm.

4. Convince 300 million people to stop using Google every day.

5. ???? (NOT ADS!)

6. ALL YOUR PROFITS ARE BELONG TO ME!

Joking aside, as someone smart enough to make it through grad school ("BS" = good description of my education), am I on the right path here, or waaaaaay off?

--
Math is math. Regular expression is regular expression. The tools are there. The future is now.

Please correct your terminology! by Medievalist · 2006-04-11 02:48 · Score: 1

The "shit ton" is an obsolete SAE measurement, derived from the British "Imperial Arse Load".

As the USA is now on the SI system, please update your nomenclature to the currently correct "Metric Fuck Ton".

Thank you in advance for your co-operation.

--US Department of Weights and Measures

Re:Please correct your terminology! by nagora · 2006-04-18 10:59 · Score: 1

As the USA is now on the SI system, please update your nomenclature to the currently correct "Metric Fuck Ton".
You are clearly unaware that the Standard Metric Fuck Ton(ne), which is stored in Paris, France has recently been found to be shrinking at a rate of "Shit-all Squared" per year.
The current US administration has jumped on this as a pretext to move to the new "God-damned Freedom Ton" which is defined to be exactly equal to 1 original Metric Fuck Tonne, except it is not in any way connected with France. It is stored in a freezer in Austin, Texas. Beside the meat.
You must not have been on that mailing list.
TWW

--
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"

Re:I tell you why (from a bioinformatics viewpoint by Miraba · 2006-04-11 03:03 · Score: 1

It's easy enough to give the basics (DNA makes RNA makes Protein(1)); it's that biology is wall-to-wall special cases. Biological systems run the worst spaghetti code you can imagine, written in a language that's barely documented(2), written by a developer who is willing to hack the executable, the source code, the compiler, the operating system and in extreme cases the hardware to get a functional system.
(1) except RNA can 'make' DNA, RNA can act like a protein (enzyme)
(2) Using language only comprehensible if you know the subject already

An accurate and funny analogy!

Re:I tell you why (from a bioinformatics viewpoint by Anonymous Coward · 2006-04-11 03:21 · Score: 0

Well, for starters, you get developers convincing the biologists that they need Oracle...and it only goes downhill from there.

Definitely! A certain biotech department I know that shall remain unnamed bought several Oracle licenses for some database they were going to use.

Then one of their programmers convinced them to use MySQL instead, because free software is good.. Or whatever.

Just insane.. It's like buying a Ferrari and then deciding to drive a Yugo because it has better gas-mileage.

google ties up the loose ends by ashwinds · 2006-04-11 03:58 · Score: 1

Just to illustrate: I read the article, thought the photos were nice and sharp, and wanted to know which camera was used. I saw the footer credit saying who took the pics, searched Google for the photographer's name + Flickr, and came up with a match. I looked through the results and found a pic with EXIF info which said it was an Olympus E-300. Now, I know I may be wrong, but what you should be thinking is: hell, I could be right. The power of Google is tying up the loose ends in unstructured content. It's awesome.

bioinformatics by Anonymous Coward · 2006-04-11 09:10 · Score: 0

I am also a bioinformatician: I am a biologist and I write crappy code. Yet using only public domain data from excellent repositories like NCBI, EBI, DIP etc., I can make new discoveries and publish reports on these discoveries in high impact factor journals. So ... I disagree with this news item. As far as I am concerned there is no problem.

Re:Honest question from serious lackey- by alienmole · 2006-04-12 10:25 · Score: 1

You haven't given enough information to go on. Points 2 & 3 are far too general. No offense, but point 2 reminds me of sales execs who ask software developers to "just give me a single button that does what I want". It's all very well to talk about a "single, simple UI" to do something very complicated, but it's something entirely different to design and implement such a UI. Think of existing applications and tell us which ones do something like your point 2. If there are any, then how is your system going to be better? If there aren't, why do you think that is? Point 3 suffers from similar problems.

You'd be better off just giving people a menu of search algorithms, and letting them pick which ones they want to use. (Maybe after a while, you could use Amazon-style "people who searched for X found good results using algorithm Y" logic to help automate algorithm-picking, but how often have you actually bought a product suggested to you that way?)

Re:Honest question from serious lackey- by Josh+teh+Jenius · 2006-04-12 11:31 · Score: 1

Sorry to sound like a PHB, let me give an example:

One spider is hitting Craigslist in order to gauge the liquidity of real estate (in other words, X properties at an avg price of Y listed FSBO in Albany, NY today). A second spider is hitting various government sites in order to find taxes, appraisals, etc. Then, using some fancy-pants math, I am able to reduce everything into a single search portal which responds with the most valuable leads for my real-estate investor slave masters (this is my current day job).

I took the idea further by combining Wiki and Google News into another hybrid search engine; students enter a topic, and it parses notes, and cites sources in MLA format.

Two applications, same idea: spider *specific* data from *specific* places, so that these "magical algorithms" can actually do something useful with all this data.

You mention Amazon's AI, which many seem to think is marketing fluff. In my experience, when looking at obscure books, or obscure authors, the recommendations ARE excellent, and I HAVE purchased several of them (to be fair, I'll take a chance on just about any non-fiction book). However, when looking at the latest big DVD release, the recommendations are crap.

Back to my objective: as we see with Amazon, and your original post, the more *specific* (and credible) the information, the more we can do with it. As for the single UI, this would basically be the "last step" in tying all these wacky "hybrid engines" together under one name.

And as you mention, this raises several UI questions for which I have no answers. I like the Google "click-and-go" experience...when it works. Sometimes I wish I could "talk with it" and help it understand my search *before* showing me a billion links to crap. Of course they offer hundreds of options, but somehow, it "feels" like Google is losing its edge these days (as far as accuracy goes).

If the above made no sense, please disregard. It's been a long day.

--
Math is math. Regular expression is regular expression. The tools are there. The future is now.

Re:Honest question from serious lackey- by alienmole · 2006-04-12 15:54 · Score: 1

The examples you give could be described as expert systems, of a sort, relying partly on socially-produced data. I think there's little doubt that such systems will proliferate, and at some point it could start making sense to tie them together. In a way, the "semantic web" is working towards supporting that sort of thing. But this approach is at odds with almost everything that's succeeded on the commercial Internet so far: it's all been about, essentially, exploiting simple business models (Amazon) or clever tricks (Google), and leveraging them to the hilt. That's what the short-term VC-funded approach to product- and company-building is good at.

I think your idea is the sort of thing which could evolve over time, but is unlikely to be built by any one team unless they start with a whole bunch of existing systems that are ripe for integrating. Of course, if I'm wrong and you end up founding a billion-dollar dotcom, please post a Slashdot article about it so I can kick myself!

BTW, if you're working on AI-like applications, I hope you've read PAIP. And of course, Norvig's at Google now...

Re:Honest question from serious lackey- by Josh+teh+Jenius · 2006-04-12 16:44 · Score: 1

Thanks for the link. I loved ELIZA- that was the prog that really turned me on to this AI stuff in the first place.

Sadly, I am too stupid to care about fame or money, I just want to make it work. That said, back I go to the land of math and dreams...

Peace.

--
Math is math. Regular expression is regular expression. The tools are there. The future is now.

Re:Honest question from serious lackey- by alienmole · 2006-04-12 19:07 · Score: 1

Math and dreams are fine with me; I'm into functional programming myself, which (right now) is one of the least commercially applicable branches of programming imaginable. The only reason I brought up the dotcom stuff is that you were talking about improving on Google, and my point is just that despite the marketing hype, companies like Google are really doing something more like high-end IT than anything deeper, because the deeper stuff takes decades to develop. I could be wrong, but what you're describing sounds more like it would be the latter.

Re:Honest question from serious lackey- by Josh+teh+Jenius · 2006-04-13 04:16 · Score: 1

OK, I admit it: I had to go read the Wiki on functional programming before I could reply. Having read this article, I feel a bit better about some of the work I've been doing (I have a weird habit of *doing things* and only later learning what the proper name of the activity is).

Would this count as "functional programming"? http://joshthejenius.com/experiments/technorati_spam.php

I'm using PHP, but only because it is the easiest language I've ever worked with. I started coding as a wee youth on a PS/2 286 in Qbasic. As best I can tell, it really doesn't matter *which* language we use, so long as (at the end of the day) x = what we want.

I am a tad insecure here because I never "officially" learned any of this. I am really trying to follow the "standards" better, but I am having a hard time understanding *which* standard(s) I should use to express this.

Case in point: f[0|-1] = (((u/w)-Avg(u/w))/1-Avg(u/w))-1

Is that right? Math aside, have I expressed this formula properly?

I have an MLA guide that I *love* to reference on various "English rules" I get stuck on. Does a similar guide exist for mathematics? I.e. to ignore a "1" we do such and such... Or am I really just discovering that this is a free-for-all and no one is really sure which end is up?

Why is this field one of the least commercially applicable? It would seem to me a simple equation would have unlimited portability; don't all languages use math? If I remember correctly, even Qbasic handled exponents and logs.

As for taking years and years: the way I see it, years and years is what I've got (I'm currently 25). Best case, I get to participate in an entirely new wave of innovation. Worst case, I'll write a book and leave it for the next generation. (Chapters 1-58: Don't waste your time doing the following...).

--
Math is math. Regular expression is regular expression. The tools are there. The future is now.

Re:Honest question from serious lackey- by alienmole · 2006-04-13 14:38 · Score: 1

Sure, your algorithm would count as functional programming, but part of the point of functional programming is that entire programs, not just individual functions, consist of nothing but the composition of pure mathematical functions. That means a lot of what you do in "normal" languages just isn't allowed. See point 2.1 in the comp.lang.functional FAQ, which gives a simple comparison between imperative and functional style, although to really understand it you'd need to learn a bit more about at least one of the functional languages.

Just about any language allows you to implement simple pure mathematical functions - as you say, "a simple equation would have unlimited portability" - but very few languages allow you to construct entire programs that way. None of the mainstream languages (PHP, Perl, Python, Ruby, Java, C, C++, BASIC, etc.) support functional programming to that extent. The languages which do support it are ones like Haskell, ML, OCaml, and Scheme. The reason I say that functional programming is not commercially applicable is not that there aren't commercial applications of it, but that these languages are hardly being used at all commercially. There's a bit of a catch-22 there: people don't know them because they aren't being used, and they aren't being used because people don't know them.

Here's your algorithm implemented in the Haskell language. If you want to try running it, download and install Glasgow Haskell (on Debian, you can do "apt-get install ghc6"), and run "ghci" to get an interactive Haskell prompt.

-- define a list of p values
let ps = [0.825, 0.8868, 1, 0.8542, 0.8889, 0.8, 0.9118, 0.95, 0.9487, 1, 1, 0.8333, 0.8197, 0.6383, 1, 0.8727, 0.875, 0.7879, 0.8667, 0.8636]

-- define a function which averages a list of numbers.
-- The 'fromIntegral' is needed because Haskell is a strongly, statically typed language
let avg xs = sum xs / fromIntegral (length xs)

-- work out the average of all the p's
let avgP = avg ps

-- define the function, f(p)
-- this would usually be written on multiple lines, but the interactive shell doesn't allow that (?)
let z p = x / y - 1 where x = p - avgP; y = 1 - avgP

-- Finally, "map" the function f over the list ps, giving a list of results.
map z ps

The output from this is:

[-1.4721965172036693,-0.9523008328426015,0.0,-1.2265500126188285,-0.9346344746361576,-1.6825103053756212,-0.7419870446706487,-0.4206275763439058,-0.43156389332884704,0.0,0.0,-1.4023723395305803,-1.516783040296123,-3.0428198872718117,0.0,-1.0709178093715828,-1.0515689408597635,-1.7843021788508464,-1.1213931185328516,-1.1474720282661737]

The above is oriented towards experimenting with it at the command line. For a real program, you'd probably ultimately package it into a single function that takes an input list and produces an output list, e.g.:

-- repeating the avg definition for the sake of completeness
let avg xs = sum xs / fromIntegral (length xs)

let f ps = map z ps
      where avgP = avg ps
            z p = x / y - 1
              where x = p - avgP; y = 1 - avgP

("f" is defined over multiple lines for readability, the way it would be written in a program file; to enter it interactively, you'd have to do it on one line.)

This last version defines your entire formula quite clearly, for anyone who knows functional programming. That's one answer to your question of how to express these things mathematically: if you express them using a high-level programming language, the result is not only concise and unambiguous but also checked by the compiler, so if it works you know you haven't made any mistake
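
For reference, here is a rough sketch of how the same thing might look in an actual program file, where top-level definitions drop the "let" and usually carry type signatures. The name "normalize" and the use of Double are illustrative choices, not something from the posts above; the formula is read the way the ghci session computes it, (p - avgP) / (1 - avgP) - 1.

```haskell
-- Program-file form of the formula from the ghci session above.
-- The name `normalize` and the Double types are illustrative assumptions.

-- Average of a list; fromIntegral converts the Int length to a Double.
-- Assumes a non-empty list (an empty list gives NaN from 0/0).
avg :: [Double] -> Double
avg xs = sum xs / fromIntegral (length xs)

-- For each p in the list: (p - avgP) / (1 - avgP) - 1,
-- where avgP is the average of the whole list.
-- Assumes avgP /= 1; otherwise the division is by zero.
normalize :: [Double] -> [Double]
normalize ps = map z ps
  where
    avgP = avg ps
    z p  = (p - avgP) / (1 - avgP) - 1
```

Loading this file in ghci and evaluating "normalize [0.5, 1.0]" gives [-2.0,0.0]: the average is 0.75, so 0.5 maps to (-0.25 / 0.25) - 1 and 1.0 maps to (0.25 / 0.25) - 1.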

Semantic web by inKubus · 2006-04-20 11:11 · Score: 1

Maybe Tim Berners-Lee and his semantic web will make something happen. Here's the real problem: you have to write like 30 or 40 layers of SQL queries to get what you want, then to get a decent report you have to spend 100 hours in Crystal or make compromises, and in the end all you have is more data. What is the MEANING of the data? I think a lot of the knowledge of humanity is stored in words and books and not indexed. Most db data is just statistics, which are useless ;)

What if you could "explain" what "The apple tree is 15 feet tall" means using a structured language?

Then, it would be pretty trivial to search for 15 foot tall things, apple trees that are taller than a man, etc.
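
A toy sketch of that idea in Haskell, the language used earlier in the thread. This is just typed records and filters, not RDF or any actual semantic web technology, and every name here (Thing, kind, heightFeet, facts, tallExactly, tallerThan) plus the sample data is made up for illustration:

```haskell
-- A structured "fact": what kind of thing it is and how tall it is.
-- All names and sample values are hypothetical.
data Thing = Thing
  { kind       :: String   -- e.g. "apple tree"
  , heightFeet :: Double   -- height in feet
  } deriving (Show, Eq)

-- "The apple tree is 15 feet tall", plus two more facts to search through.
facts :: [Thing]
facts =
  [ Thing "apple tree" 15
  , Thing "apple tree" 5
  , Thing "flagpole"   15
  ]

-- Everything that is exactly n feet tall.
tallExactly :: Double -> [Thing] -> [Thing]
tallExactly n = filter ((== n) . heightFeet)

-- Things of a given kind taller than some height (say, a 6-foot man).
tallerThan :: String -> Double -> [Thing] -> [Thing]
tallerThan k h = filter (\t -> kind t == k && heightFeet t > h)
```

Once the height is a number with a known unit instead of a phrase in a sentence, "tallExactly 15 facts" and "tallerThan "apple tree" 6 facts" are one-line queries. The hard part, of course, is getting from the sentence to the record.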

--
Cool! Amazing Toys.

Mining at its worst by woolio · 2006-04-22 18:41 · Score: 1

I have to wonder if data mining isn't the problem -- the real problem seems to be that there are few obvious problems data mining will solve.

Consider WalM*rt. When the 2005 hurricanes were predicted, they mined their sales data for previous hurricanes. They found that in the last hurricane people stocked up on beer, pop tarts and peanut butter, so they sent trucks full of that stuff to the stores in the path of the hurricanes. They made lots of sales, and provided a valuable service to the communities. Capitalism at its finest.

Well, as a resident in a city that was about 200 miles inland, I would disagree.

They managed to run out of coolers, bottled water, battery-powered lights, batteries, propane, camping stoves, lanterns. And when I say out, I mean out: Not even a single "C" battery was to be found in the whole store!!!!

And yes, businesses and schools were closed before it hit -- so this area apparently thought the effects could have been severe. I think a less technological solution involving "common sense" should have been applied.

Re:Mining at its worst by plover · 2006-04-23 16:06 · Score: 1

They shuffled stock down to the affected stores, and they emptied the warehouses in that general direction.
Did they have enough on hand to satisfy demand? Obviously not. Did the manufacturers of batteries and bottled water see the problem coming 12 weeks in advance to ramp up the manufacturing process? Obviously not. Do you think batteries and bottled water appear just because a disaster is on its way? Apparently so.
Walmart did what they did with what they had on hand. Nobody could have satisfied all the demand with a situation like that approaching; at least not without more notice.

--
John

Slashdot Mirror

223 comments