Why Is Data Mining Still A Frontier?
bbsguru writes "How much do we know that we still don't know? A story in The Register points out that little has changed since Francis Bacon proposed combining knowledge to learn new things 400 years ago, despite all the computer power we now have. Scientific (and other) data is still housed in unrelated collections, waiting for some enterprising Relational Database Programmer to unlock the keys to understanding. Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?"
Either
a) There's not enough money in it to make it worthwhile
or
b) It doesn't work.
I'm on it.
--
Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?
--
I hope not. I spent a day searching Slashdot, Digg and Google for something I saw a couple of weeks ago. I couldn't remember if it was here or there and I knew the story linked out, but neither the "search" capabilities nor Google was able to find what I was looking for.
Really, how hard is it to find the story about the whiny bitch who couldn't install ubuntu and wouldn't listen to any of the suggestions given to him?
So, just to clear things up: people do know that relational DBs aren't about "relationships" between data, but are in fact about storing data in mathematical relations? Seriously, go look it up.
Aren't they like the brain creatures... I hear that when they finally finish all the archiving and indexing in the universe, they will blow up everything that is and ever would be so as not to create any new information... just to save the hassle of indexing that too.
*''I can't believe it's not a hyperlink.''
This still looks like a basic, specialized database to me. Where's the great leap to "all your data are belong to us?"
There are companies and research projects that are doing this sort of thing. The trouble is, there are a LOT of people that are freaking out about it, and that's making companies less willing to 1) admit they're doing it, and 2) even think about starting to do it.
Considering how up in arms people are about it, how long before we have people accusing others of "data profiling"?
it astounds me how little people know about data mining... ffs there is so much more to it than a relational DB....
Programmers have no idea of context. Biologists have no idea about programming. It is very hard to mix the two. You can be the shit-hottest dba in the world but if you have no relevant (deep) biology background you are guaranteed to produce crap. Almost every piece of biological software is a POS because of this.
Francis Bacon was the first to propose that each fact was related to all other information by 6 degrees or less. And he is one of the most famous intellectuals that shares a name with strips of cured pork.
Having the data in an RDBMS is only the first step to being able to mine data for knowledge. Data mining is a whole different discipline that requires statistical analysis of the aggregated data to find trends, etc.
Huh? Francis Bacon? Didn't Aristotle claim he created logic in his Prior Analytics? With his four types of statements (A is true about all X; A is false about all X; A is true about this X; A is false about this X) and the basic logical syllogism? The whole point of logic is to preserve truth so you can synthesize new knowledge.
The road to tyranny has always been paved with claims of necessity.
I'm going with the latter...
Datamining would be a piece of cake if all data were kept in clear, standard XML dialects. See Visualising the Semantic Web, ed. Geroimenko and Chen (Springer Verlag, 2004). Some of the possibilities for combing through information and elucidating it, combining it and converting it, as described in that book, are simply awesome. Too bad that the Semantic Web is a pipe dream at the moment.
Another thing is that it is only useful for information we don't already know.
We don't exactly need data mining to realize that people that buy diapers also buy baby food.
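For what it's worth, the diapers-and-baby-food case is the textbook association rule, and the arithmetic behind it is small. A minimal sketch, assuming a made-up list of shopping baskets (the data and the function names are illustrative, not from any real dataset or library):

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"diapers", "baby food", "milk"},
    {"diapers", "baby food"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"baby food", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is anyway; above 1 means a positive association."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"diapers", "baby food"}))
print(confidence({"diapers"}, {"baby food"}))
print(lift({"diapers"}, {"baby food"}))
```

The point of the comment stands: a rule like this is only interesting when the lift is high for a pair nobody would have guessed.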
excitingthingstodo.blogspot.com
The question seems to ask whether, if we just put an amorphous mass of "scientific knowledge" into a big fat RDBMS and let it churn for a while, it would somehow spit out new scientific knowledge. Huh? Imho it displays an astounding lack of understanding, not only of how knowledge is encoded, but also of the nature (and obvious limitations) of relational databases.
Possibly, a reason is that most people don't have the math or statistical skills to learn the concepts. Most schools don't even teach the techniques of data mining until late in a masters of science program.
When [normal non slashdot] people hear data mining they think querying with something like google ("computer find bad people in the FBI database") not markov chains.
The information you get out depends upon the data you put in.
The people looking to "find" information in the data are the same people who decided what data to collect in the first place. And from whom to collect it. Etc.
That means that you'll find out that 2004 was a banner year for bubblegum ice cream. But you won't know what will be popular in the summer of 2006.
So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it and then you will have speciation within the code (AKA a fork) and everything will be as it was.
Balance, restored.
-nB
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
From my experience: the people who are subject matter experts in their field (outside of computers) typically don't have the time to perform all of the data entry. So you have to get an ETL / Miner to do all of the work for you. ETL and data mining are *NOT* the sexiest jobs in the industry by a long shot. Auditing data makes you want to gouge your eyes out after the fourth day straight of reviewing loads.
"Is RDBMS still a Brave New Frontier, or will Google make the art obsolete once they finish indexing everything?"
Data Mining is about asking the right questions. Indexing is only a small part of that.
--
Data Mining Solutions: Methods and Tools for Solving Real-World Problems
I'm pretty sure any elegant solution would be blind to the context of the implementation.
I can only think: patience. It has been only ~35 years since E.F. Codd published the first white paper on the relational model. We have yet to see the full implementation of what he proposed.
While it is true that the mathematics (theory) has been around for a while now, the application of it is still in its infancy. Give it some time and additional innovation, and we will have what you seek.
We're all hypocrites. We all have hidden parts, it's the contrast between them that make us more a hypocrite than others
They use it all the time on 24.
www.christopherlewis.com
So what you need is a so-so dba who has a passionate hobby of biology to hack something together, then the real dba's can tune it and the biologists can hack it
Well, that's pretty much how it works in academia (+/- the real dba). Problem is that this is a lab by lab (or department) solution to problems that appear in hundreds or thousands of institutions. The wheel is reinvented over and over again because either commercial/free solutions suck or don't exist. The commercial versions suck because they are built by software engineers and the free versions suck because they are built by scientists (who tend to have the mantra of "if it works, it's done").
Is it really still true that Biologists have no idea about programming? That seems like the direction you'd want to solve this problem from--it's gotta be way easier to teach a biologist some SQL than it is to teach a programmer to be a biologist.
I ship next week.
Once google is finished indexing EVERYTHING, it will then index itself, thus destroying the universe. Unless some hero can stop it before that happens and escape on a Scooty Puff Jr.....
Hmmm, why don't the developers and biologists...gasp!....work together to design something? Yes, the developers may have to actually listen to the biologists and not spend their days doing cool programming tricks, and the biologists may actually have to do real requirements work. If no one wants to put the effort in, then no one has the right to bitch about the results.
Help me take back Slashdot. When did 'News for Nerds' become 'FUD and Conspiracy Theories for Extremist Nutjobs'?
So fund me..
This is just a skills gap problem. Find the right people. FUND the right people. It will happen.
Basically, stop trying to do things on the cheap. Interview hundreds of people. Pay the premium for the one who will make it happen.
"Darwin was his pupil (Henslow helped arrange for Darwin's presence on the Beagle), but Darwin made the intellectual leap that allowed him to interpret Henslow's records of variation - not as evidence of a fixed set of created species with variations, but as evidence of the evolution of new species in action."
Hmm, I read recently that Darwin's grandfather was also a Naturalist, as was Chuck. So, I don't think Darwin made the "leap," so much as his family was already in that direction. Methinks the article presumes Darwin was first in a family of thought, rather than merely one in a clan. He is just the first to gain widespread notoriety for it. (See "Darwin Among the Machines.")
What those who want activist courts fear is rule by the people.
The blurb hit on a fundamental reason data mining is still at (or beyond) the horizon...defining relations between the various elements is hard. Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.
Consider the following boring but difficult task I was given: two large organizations were to merge, each with a portfolio of about 100,000 items. Each item had a short history, some descriptive information, and some data such as internal quality ratings or sector assignments. This data was available (for various reasons) as big CSV file dumps. Questions to answer were: (1) how much overlap did the portfolios have? (2) were the sector distributions similar?
These are very simple, concrete questions. But you can imagine that since the categorizations differed, and descriptors differed within the CSV files, let alone between the two, the questions were difficult to answer. It required a lot of approximate matching, governed intelligently (or so I flatter myself).
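To give a flavour of the "approximate matching" involved, here is a minimal sketch using Python's standard `difflib`. The item names, the `normalize` rule and the 0.7 cutoff are all assumptions made up for illustration; they are not the data or the method actually used in that merger.

```python
import difflib

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def best_match(name, candidates, cutoff=0.7):
    """Return the candidate most similar to name, or None if nothing clears the cutoff."""
    lookup = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

# Hypothetical descriptors from two CSV dumps.
portfolio_a = ["Acme Widgets, Inc.", "Globex Corporation", "Initech LLC"]
portfolio_b = ["ACME Widgets Inc", "Globex Corp.", "Umbrella Holdings"]

matches = {a: best_match(a, portfolio_b) for a in portfolio_a}
overlap = sum(1 for m in matches.values() if m is not None)
print(matches)
print(overlap, "of", len(portfolio_a), "items appear in both portfolios")
```

The cutoff is where the "governed intelligently" part comes in: set it too low and unrelated items merge, too high and obvious duplicates are missed, so in practice borderline pairs need a human look.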
Contrast this situation with what people typically think of as data-mining: answering interesting questions, and you can appreciate that without a whole lot of intelligence, artificial or otherwise, those questions will be unanswerable.
I'll just have to patent it and somehow get this article erased ;)
On a serious note, it is an interesting question. I guess big business and government aren't interested (as everything they have been interested in seems to get done.)
Is it really still true that Biologists have no idea about programming?
To make a sweeping generalization: They usually gain just enough knowledge to make them dangerous, as the saying goes. They are not going to build apps with design and usability in mind. They want something that solves their problem. They don't really care how they get there as long as they get there. And they usually don't care if the road to the answer is paved with bugs and work-arounds.
This is a hoary chestnut. I have a masters in AI, and a PhD in machine learning (and had a lot of interest in machine discovery).
:)
The ultimate problem is that for most datasets, there is an infinite set (at least) of relations that can be induced from the data. This doesn't even address the issue that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.
Think of this in terms of permutations. Let's say you have variables A, B, and C. They are all binary (have values 1 or 0). Now, you are given a set of these assignments (e.g. A=1, B=1, C=1; A=1, B=1, C=1; and so on). Now, try to tell me what the correct partition is. Sort them into two sets of any size. See the problem? I didn't tell you what I wanted as characteristics of those sets, so in effect every possible partition is a good one.
So, data mining ultimately relies on humans deciding what they want to read from the tea leaves of the data.
Now, give it up, and start addressing issues of efficient algorithms given that you have a specific performance task.
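The partition argument is easy to check by brute force. A small sketch, assuming we take all eight distinct (A, B, C) assignments and count the ways to sort them into two non-empty, unordered sets (the function name is mine, purely illustrative):

```python
from itertools import product

def two_way_partitions(items):
    """Yield every split of items into two non-empty, unordered groups."""
    items = list(items)
    first, rest = items[0], items[1:]
    # Pin the first item to the left group so mirrored splits are not counted twice.
    for mask in range(2 ** len(rest)):
        left, right = [first], []
        for i, item in enumerate(rest):
            (left if (mask >> i) & 1 else right).append(item)
        if right:  # skip the one split that leaves the right group empty
            yield left, right

assignments = list(product([0, 1], repeat=3))  # all (A, B, C) combinations
partitions = list(two_way_partitions(assignments))
print(len(assignments), "assignments,", len(partitions), "possible two-way partitions")
```

In general n distinct items give 2^(n-1) - 1 such splits, so even three binary variables leave 127 candidate groupings with nothing in the data itself to prefer one over another.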
Winton
Is there any particular reason for Flint, MI?
Hmmm, why don't the developers and biologists...gasp!....work together to design something?
Well, for starters, you get developers convincing the biologists that they need Oracle...and it only goes downhill from there.
I'm not saying that it can't happen, only in my experience (15+ years worth) it usually doesn't.
How many relations exist for any combination of N pieces of data?
That's right, a shit ton.
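To put a number on it: if "relation" is read in the mathematical sense as a binary relation on a set of N items, each of the N*N ordered pairs is either in the relation or not, which gives 2^(N*N) possibilities. A one-function sketch (the reading of "relation" is my assumption, and the function name is illustrative):

```python
def count_binary_relations(n):
    """Number of distinct binary relations on a set of n items."""
    return 2 ** (n * n)

for n in (1, 2, 3, 5, 10):
    print(n, count_binary_relations(n))
```

At N = 10 the count already has 31 digits, so exhaustively trying every relation is out of the question.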
I think the issue with Google or other search engines is how to do analytics.
How do I write a multi-variable where clause?
How do I ask a multi-variable question and then hone it or drill into it along one or more parameters, unfolding detail but preserving multiple layers of an outline hierarchy?
So right there is the idea of a different presentation layer, hierarchical and tabular perhaps.
Then, what kind of barriers do I have to getting at the data? Privacy issues? Copyright or patent issues?
If you want to connect two or more points, won't we have to move beyond keyword searches?
After RingTFA, this doesn't seem to be about data mining in the computer science/statistics sense at all. Instead, the article suggests that scientists in academia aren't using the best database tools and techniques available. This I agree with strongly; there is often a disconnect between experiments done in scientific fields and proper database techniques to store that data efficiently. However, I don't call that data mining.
What about that TFA? Someone converted a stack of index cards to a relational database? And this warrants a post on regdeveloper AND slashdot, exactly why?
Like there aren't things to write about like the Open Archives Initiative Protocol.. Geez.
SCO employee? Check out the bounty
Correct me if I'm wrong, but aren't copyrights the biggest obstacle to this? You can only mine your own data, as IBM and others already do today. I'm interested in when you can mine data from all the various sources and combine those into conclusions. File formats are another thing hampering this kind of technology, especially if you look at it over a longer time frame. Try mining those Lotus 123 documents for historic facts ;D
HTTP/1.1 400
[...]or will Google make the art obsolete once they finish indexing everything?
Isn't the value of relational databases in the ability to "relate" indexed datasets? Google doesn't support a "join" syntax, as far as I know.
Even Google's fantastic text indexing doesn't break the data up into the discrete "fields" that would be needed to do any meaningful relating. It's sort of like having all of your data in a single column in a single table, and trying to self-join on "like" expressions.
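The single-column comparison can be made concrete with a throwaway SQLite database. This is only a sketch: the tables, names and rows below are invented for illustration, and it says nothing about how Google actually stores anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fielded data: authorship is an explicit key.
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE papers  (title TEXT, author_id INTEGER);
    INSERT INTO authors VALUES (1, 'Ann Lee'), (2, 'Joann Lee');
    INSERT INTO papers  VALUES ('Finch beaks', 1), ('Orchid variation', 2);

    -- The same facts flattened into one text column.
    CREATE TABLE blobs (body TEXT);
    INSERT INTO blobs VALUES
        ('Ann Lee'), ('Joann Lee'),
        ('Finch beaks by Ann Lee'),
        ('Orchid variation by Joann Lee');
""")

# A real join on discrete fields: exactly one author per paper.
fielded = conn.execute(
    "SELECT p.title, a.name FROM papers p "
    "JOIN authors a ON a.id = p.author_id ORDER BY p.title"
).fetchall()

# A self-join on LIKE: any row whose text contains another row's text.
fuzzy = conn.execute(
    "SELECT doc.body, who.body FROM blobs doc "
    "JOIN blobs who ON doc.body LIKE '%' || who.body || '%' "
    "AND doc.body <> who.body"
).fetchall()

print(fielded)
print(fuzzy)
```

The fielded join returns the two correct pairs. The LIKE self-join also links the orchid paper to Ann Lee, simply because "Joann Lee" happens to contain "ann Lee", and this is with tidy data in one language.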
Yeah, you can probably make-do if your data has some degree of consistency, but as the dataset incorporates a higher degree of "chaos" (read: different languages, topics, author's fluency in the language, etc), the more difficult any real relations become.
It's not impossible, given some significant (human) enrichment of the data, but we're nowhere near the ability to "join" conceptual data from widely disparate data sources. Maybe as AI improves to the point that it can read and "understand" natural languages (and forms of them spoken by non-native speakers), this will become more of a realistic concept. Certainly something to work toward, anyway.
I thought this was why we built the Internet in the first place?
Arrogance is Confidence which lacks integrity. -- me
We don't know
The article said that the researchers were prepared to store their database on index cards. With people still prepared to keep vast amounts of data off computers, it seems that there are still a number of hurdles to jump through before much data is in a state suitable for correlation.
The government and businesses are very, very interested in data mining. The government can use data mining to assist law enforcement with tracking frauds and criminal activities, and possibly even stopping terrorist attacks. Businesses can use data mining to analyze sensors which monitor temperature, barometric pressure and a ton of other measurements, and then use this information to figure out production procedures for that particular time period. The possibilities for data mining are pretty much endless, provided you have the processing power to compute all the numbers. Data mining does not just deal with data that is in a DB. Rather, it is about finding some pattern or correlation between sets of data that otherwise seem completely random or pointless given the context. One of the main problems for data mining is privacy, and until we can find a secure way to share data between DBs it will only hinder the advancement. A really good book to read is Data Mining: Next Generation Challenges and Future Directions, which discusses pretty much everything you need to know about data mining. Also, it is pretty much impossible for Google to index all the information on the internet, considering there are ~110 billion webpages and the number grows each year.
The problem is both easier and more difficult than it first appears, or even second and third times:
Data, whether held in databases (usually nice and tidy) or in flatfiles, or random text files spread all over hell's half acre, is simply data, not the information required to link it to other data. Even meta data about the data held in any data store is not the information required to link it to other data.
One of the things I believe will help (possibly) is ODF (buzzword warning sounds) because it begins to help format data in a universally accepted manner. Though it is not the only way, universal access methods are required for accessible data. Second, the structure of the data must be presented in a universal manner. This second part allows query languages to support cognitive understanding of the structure, and thus (with some work) the value of data held in a storage location, wherever and whatever that location is, be it RDBMS, text files, or phone bills.
Indexing is simply not enough. The ability to retrieve and utilize the index with the most probability of having relevant data is what is needed. We all know that any search engine can get you too many 'hits' that contain useless data. Google or anyone else is helpless until there are accepted methods for applying metadata and data structure descriptions on all data.
When there is far more organization to data storage, there will be a great sucking sound of people actually using data from the internet in brand new ways.... until then, it's all hit and miss.
Support NYCountryLawyer RIAA vs People
I don't think Google will replace good old fashioned research by humans. I think we're still light years from computers having anything even *close* to intelligence high enough to replace humans in 'connecting the dots' of data libraries.
$0.02
******************
Slow Down, Cowboy! It's been x minutes since you last successfully posted a relevant comment anyone wants to read.
Any solution general enough to be blind to the context of implementation would either be so slim that you'd have to add context-specific information to it in order to get anything done, or so fat that it'd try to be everything to everybody and would end up being nothing to nobody.
We all know what to do, but we don't know how to get re-elected once we have done it
I want to know how, if I put a random string on my webpage (say ioeuhncio38u9384hynfxiuhfnx847uvh04897x ), and wait for Google to index it, that searching for that string will return my page in milliseconds. It obviously can't be a pre-executed query. So how the hell do they do that? SELECT * FROM index WHERE text ILIKE '%foo%' just won't cut it.
I'd love to know how search engines do do it - anyone reading this worked for one?
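Not from inside a search engine, but the textbook answer is an inverted index: instead of scanning every page for the string, you keep a map from each token to the set of pages containing it, so a lookup is a dictionary hit rather than a table scan. A toy sketch under that assumption (the class and method names are made up, and real engines add sharding, ranking and a great deal more):

```python
from collections import defaultdict

class InvertedIndex:
    """Toy token -> document-id index; illustrative only."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Record every whitespace-separated token of text under doc_id."""
        for token in text.lower().split():
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every token in the query."""
        tokens = query.lower().split()
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result

index = InvertedIndex()
index.add("mypage", "my random string is ioeuhncio38u9384hynfxiuhfnx847uvh04897x")
index.add("other", "a page about relational databases and random joins")

print(index.search("ioeuhncio38u9384hynfxiuhfnx847uvh04897x"))
print(index.search("random"))
```

A rare token like that random string is actually the easy case: its posting list has one entry, so there is almost nothing to intersect or rank.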
Get your own free personal location tracker
- GWB
.. and it's (relatively) easy to spend money on a "solution" as a once-off expense, but getting value requires someone to stay in the environment and work with it. How easy is it to justify employing someone with a good mix of background and intelligence (even if you can find them), to deliver, well, their job is to find out what they can deliver..
-- All your bass are below two Hz
"I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is." -Hitchhiker's Guide to the Galaxy
One must remember when undertaking to find answers in the data to first figure out the question. Otherwise the answer you find will be as useful to you as the answer 42.
Without context you only have a neat compilation of arranged meaningless facts.
On the small scale data mining is used daily by marketing people and the like to figure out who would be most receptive to their approach. Webmasters use it to optimize content and respond to user trends. In most large corporations data mining is used on some level.
Data mining on the scale discussed here may be practical at some point in the future once we determine the questions we wish answers to.
Let us hope the answer is more useful than 42.
No animals were harmed in the making of this sig.
Well, there was that one puppy, but he is all better now.
What used to be called 'data mining' in the 80s and 90s is now machine learning in the 21st century, and there are several instances where machine learning has shown tremendous success (probably this is the only by-product of AI that has shown promising real-world applications):
- The DARPA Grand Challenge: Stanley, the winning robot from Stanford, used 'adaptive vision', which relied on some real-time learning algorithms.
- Clustering and micro-array analysis: once genetic medicine becomes a reality, physicians will unknowingly be using clustering algorithms underneath.
- Froogle, Clusty, Amazon recommendations etc. all use learning underneath.
I haven't RTFA, but I think the "RDBMS view" is too naive for the given scale of the problem. What one has to understand is that data mining is not a "push-button" technology: one has to have a total understanding of the data and of the 'interesting questions' one wants to answer, then choose the right set of algorithms and tune them properly. In biomedicine, there have always been 'bio-statisticians' in the hospital who perform these tasks.
Someone who "unlocks the keys to understanding" of the data in a relational database is not called a relational database programmer. There is an entire active CS research field specialized in this task; it's called 'relational data mining'. The theoretical foundation is inductive logic programming (http://en.wikipedia.org/wiki/Inductive_logic_programming). The Wikipedia article contains a link to the freely downloadable book of Dzeroski and Lavrac, which is a good start.
Our ability to produce meaningful results, in most cases, is little more than a crapshoot.
I'm really quite astonishingly disappointed that the summary made no reference to the priceless phrase "unknown knows" to describe data left 'buried' in the dross, presumably from sources left, as it were, under-mined.
James F.
I have played around with WEKA a bit. I am interested in data classification and WEKA has many classifiers you can try out on your sample datasets.
entropy
data mining will always be a frontier, because consolidation and standardization of data will always be a frontier, because simple entropy leads to fragmentation. furthermore, for various reasons, some good, some bad, some data will always be purposefully constrained from consolidation, only to be released into freer usage later, when data mining can commence
it's a permanent frontier
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
Then, why do the Iraqi oil pipelines continue to burn?
Case closed.
Sounds like 90%+ of the programmers I've known.
You don't need to know anything about Databases to know how to produce a unique hash from chaotic biological data. You don't need to know where the big honking hash came from to know it's going to be a big honking index to search.
Going back to GP post, what does any of this have to do with DBAs? They are glorified backup script maintainers.
John McAfee 'It was like that time I hired that Bangkok prostitute; to do my taxes, while I fucked my accountant'
"The potential is there, but the willingness of humans to spend time explaining semantic structures to machines, when they're obvious enough to other humans, is lacking."
That's why you want to make it into a game
I'd be happy if Infoworld Magazine (and the rest of the trade journals) could just remember the last set of lies I told them, instead of making me make up shit every 10 months or so.
The SlashDot article spazzes out about data mining - an area unrelated and unmentioned in the original article.
So the SlashDot poster of the article is either a bot or an idiot (I lean toward the latter).
Editors, you aren't doing your job. This article should never have been posted. The OP has nothing of interest and the unsupported additions of the poster appear to be the result of a crack pipe. Enough of these types of posts and SlashDot will become useless.
"But if it was always true to say that a thing is or will be, it is not possible that it should not be or not be about to be, and when a thing cannot not come to be, it is impossible that it should not come to be, and when it is impossible that it should not come to be, it must come to be." Aristotle, On Interpretation.
"You're everywhere. You're omnivorous."
Data Mining is still a frontier for the same reason monkeys are still having trouble reproducing Hamlet despite all the theoretical knowledge of all the incredible opportunities.
Too much assumption, too many possibilities, too little knowledge, and not enough monkeys. You can never have enough friggin' monkeys.
I've built a tv app that grabs xmltv data and throws it into a db, presents listings and a higher level 'remote control' via web pages on a zaurus (with channel changes via an ir transmitter). As I watch, what I watch is noted, and every half hour what I have noted is matched against what is currently on. It's very simple, only a join, but the results are usually quite relevant. The lesson being that relevant results only come from intentful interaction. I plan on carrying forward the simple mining... spidering out across credits, intersecting with imdb info and recommendations, moderation of selections, etc. I expect the results to be quite person-centric, but a mining of distributed relevancy could lead to better results (ie intersecting preferences of many watchers). That might counteract the problem of seeing outside the bubble... suggestions are usually the result of personal decision, and are therefore skewed towards the person.. something relevant and unusual is a beautiful thing.
There's also no money in it... trying to sell a new idea to a suit most often results in the death of the idea... skewed to benefit the uninventive status quo, tho there are exceptions, I guess. That often kills even the idea of attempting a feat that requires the acquisition and maintenance of terabytes of data to be useful. AFAIC profit driven crap is just that... inane drivel from a self-serving path. There's promise, surely, and challenge even moreso, but whether there is societal acceptance or encouragement of such un-triviality is another question. The internet is young tho, and if you think there's a lot of data now, you ain't seen nothing yet.
I used to work a simple job where I did database work for a company doing medical studies. It wasn't a lab, but it wasn't your typical cubicled office either. Although I had very little knowledge of the actual medical component of the studies I was working on, certainly not enough to design the stuff I needed to do, the management was superb - I wasn't REQUIRED to know anything about the medical component, and they trusted me to do the programming. What I didn't know they were happy to fill me in on - I knew enough about medicine for what they were saying to make sense, and they knew enough about programming to give me some idea where to start. If the management can effectively coordinate biologists and programmers, you don't need to have dbas with deep biology backgrounds.
http://www.TheGamerNation.com/Forums
Is it possible the developers are saying something like "It starts out with the biologists saying they need 30 TB of data available 24/7 with 99.999% uptime and 200-250 concurrent users, and goes downhill from there..."
Simply saying the developers are idiots because they suggest Oracle really doesn't make sense without more context. If more than one group of developers suggests Oracle, they might have a point. Are you sure you're not exaggerating the requirements? If you need really high uptime and lots of concurrent users with really, really large amounts of data, Oracle is one of the better choices. There's a reason it's so expensive...
Maybe not
Indeed. This is why I prefer a compromise: modularity. Generalization in the parent software, specialization in the modules. Plus it allows for third parties, if they so choose, to easily integrate with the parent software.
It's better to vote for what you want and not get it than to vote for what you don't want and get it.
- E. Debs
The Patent administration takes your idea puts it into google, filtering out you and any article talking about you. If they get a hit, prior art, eeeeeeeeh patent rejected!
I'm sorry, I'm too tired to be witty at the moment so this message will have to do.
Publishing my data is too bothersome.
Any questions?
Without conceptual processing, data is just so many bits and bytes. Some of it can be analyzed as such, but much of it cannot without some conceptual comprehension on the part of the software (if not the analyst - which is the other problem).
A decent (read, relatively effective and efficient) simulation of conceptual processing would change the entire world of computer use from development to databases to computer education to robotics. It is THE world-class issue that needs to be resolved and soon.
Richard Steven Hack - This sig is TOO GODDAMN SHORT TO DO ANYTHING USEFUL WITH! MORONS!
skipjak goes to quadrant 4.
Patterns will be recognized, but to extract useful information other than information about nature (i.e. physical laws) will always be subjective, or in other words, chaotic.
Google will never be 'done' with indexing--just another infinite loop in the making :)
McDonald's & Chaos: I'm lovin it.
When using the simplest of models, I get nothing but junk coming out of Analysis Server. I am sure there is something that it will find. I just have not found it yet. This is with sales, time and products as the dimensions.
Next time you're hosting or attending a (potentially) riotous party, gladden anonymous coward's weary heart by pulling the old "irresponsible" parent prank:
Get yourself down to [supermarket of choice]
Load up the trolley with slabs of beer, bottles of spirits, nibbles and smokes.
Add a couple of packs of disposable diapers and a few packs of baby formula.
Go to the checkout, wait your turn.
When the cashier has rung up all your purchases and given you the total, go: "Oh Lordy! that's too much!". Push the diapers and baby formula to one side. "I'll have to leave these!"
Expect a call from child protection in the near future...
You can download the Semantic MediaWiki extension right now and add semantics to a wiki. Currently all the links between pages in a MediaWiki have no meaning, and all the facts in each page can only be extracted by humans reading it. With the upgrade a page can state [[is located in::California]] to explain the type of relationship implied by a link, and can express attribute values like [[population:=1,305,736]]. The current version summarizes all such facts in each page and can export them as RDF. It's a simple extension, but once it's implemented in Wikipedia, you could query for, e.g., the population of every major city in California. Doing such semantic queries using Google is basically impossible; you'll just get a list of pages and have to read and filter each one to create your own list.
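To make the "population of every major city in California" query concrete, here is a minimal Haskell sketch of the idea. It is not Semantic MediaWiki's actual data model or query language: the Fact type, the function names and the page names CityA/CityB/CityC are all made up for illustration, and apart from the 1,305,736 figure quoted above the numbers are placeholders.

```haskell
-- A value attached to a page: either a link to another page or a number.
data Value = Page String | Number Integer
  deriving (Eq, Show)

-- One annotated statement: (page, property, value).
type Fact = (String, String, Value)

-- Hypothetical facts; page names and most figures are placeholders.
facts :: [Fact]
facts =
  [ ("CityA", "is located in", Page "California")
  , ("CityA", "population",    Number 1305736)
  , ("CityB", "is located in", Page "California")
  , ("CityB", "population",    Number 90000)
  , ("CityC", "is located in", Page "Oregon")
  , ("CityC", "population",    Number 500000)
  ]

-- All pages that state the given property with the given value.
pagesWith :: String -> Value -> [Fact] -> [String]
pagesWith prop val fs = [page | (page, p, v) <- fs, p == prop, v == val]

-- Numeric value of a property on a page, if the page states one.
numberOf :: String -> String -> [Fact] -> Maybe Integer
numberOf page prop fs =
  case [n | (pg, p, Number n) <- fs, pg == page, p == prop] of
    (n : _) -> Just n
    []      -> Nothing

-- Population of every page located in 'place' with at least 'minPop' people.
populationsIn :: String -> Integer -> [Fact] -> [(String, Integer)]
populationsIn place minPop fs =
  [ (city, n)
  | city <- pagesWith "is located in" (Page place) fs
  , Just n <- [numberOf city "population" fs]
  , n >= minPop
  ]
```

Loaded into ghci, populationsIn "California" 100000 facts gives [("CityA",1305736)]. The point is only that once the link type and the attribute are explicit, the query is a filter over facts rather than a pile of pages to read.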
Sharing semantics between datastores would require people agreeing on ontologies, which according to people like Clay Shirky is indeed a pipe dream. I'm not so sure; that's like saying categories in Wikipedia are useless because they're disorganized. Just using the Dublin Core metadata to identify authors of information in a common way would be a big breakthrough, and there are simple enough ways to do it in XHTML that I think it'll pick up steam in the next few years.
=S
This sounds like bioinformatics.
The um...field I've been working in for the last 6 years.
Programming + Biology + Statistics + Algorithm development.
Isn't that the whole idea behind well written and conceived code? Just thought I would check.
2^3 * 31 * 647
I happen to be one of those few fools that have both a degree in Biology and in Computer Science. And at one time I relied on my research skills in Biology as my ONLY income, until the dreaded and softly spoken "balancing" of the budget that spelled doom to most low level Biologists of my time.
It is hard to mix the two. This is even more frustrating if you're marginally inclined to understand where things come from and how they are designed. Some of the earliest proponents of object oriented software programming envisioned "cells" of code that "signaled" each other. It's no coincidence that Biological terms were used, because that (I've now forgotten) person held a Biology degree. The iterative (or step-wise) approach identically mirrors the laborious procedures for running most lab jobs. I consider myself blessed to have a firm grasp on both techniques, but there are plenty of others who have managed to master both.
Today, I'm told that my Biology degree has little to do with my work, and that it has no bearing on my career as a software developer. I understand the misconception behind such a statement; however, I do not share that opinion. To some degree, ALL of the natural sciences are related, and ALL research and fact finding skills can be leveraged in other environments. I'm shocked at the poor quality of "professional" Biological software, but then again, it's mostly written by Biologists that have the domain knowledge, but lack the skills to produce polished software. Even when they attempt to hire, there's precious few people out there with enough skill to know if they are hiring good programmers, leading to skill poor shops shipping the best they can produce.
I'm sure that not all shops are skill poor, but I had an opportunity to work for a Biological company that was writing software where I didn't see a real opportunity. The position promised a worse hierarchy in terms of status and prestige, with fewer opportunities for me to contribute than its non-Biology competitors. Maybe I was unlucky, and perhaps I picked the one bad example to interview with. They were telling me things about their software development that made my CS blood run cold, expecting me to admin 80 computers "on the side" and offering to pay me about 70% the going rate.
Fewer people really want programmers, because it is unclear to a non-development company what a programmer can offer, but it is clear what a DBA can offer. Reliable reporting, offline data storage and retrieval, performance analysis, and data warehousing needs are directly tied to the survival of many businesses. The things a computer science person can offer might be able to exceed that of a DBA, but it's not as certain it will. Given the conservative bent of today, I'd say that most companies feel they can't afford a risk. Perhaps it partially explains the appeal of finding candidates with a 100% buzzword match compliance?
It is interesting to see that some people in this thread still remember Roger, rather than Francis, Bacon.
Francis, of course, wrote his 'New Guide' in 1620, around 400 years ago. Much of his work was a straight copy of ancient authorities - you can see a lot of Cicero in his treatment of philosophy. A lot of his science seems to be a direct repetition of Roger's work. However, he was trying to rediscover the lost Golden Age, the wisdom of the ancients, rather than develop a new field of study, like his predecessor.
Roger, by comparison, was writing in the 1260s, around 400 years before Francis! Though both men proposed a structured, experimental approach to the investigation of Nature, it was Roger, with his iconoclastic approach to 'unworthy' authority who most closely resembled the modern questioning approach to science, and he is usually considered as the inventor of the scientific method. That was what got him locked up in the March of Ancona for 14 years!
It is a shame that Roger's language is so impenetrable to modern ears. There are no readily-available translations, and even less commentary by technically-aware persons who understand classical scientific concepts. As Blish points out, a first glance at 'De multiplicatione specierum' would make you think that Roger was talking about biology rather than physics.
This reference - http://www.nndb.com/people/582/000114240/ - suggests that there are still lots of unpublished Bacon manuscripts in British and French libraries. I think that a complete collection of Roger's works in an accessible format would add more to scientific progress than many studies which receive funding today.
I've written a thing called DBI-Link (http://pgfoundry.org/projects/dbi-link/) which helps do the job by making data sources easily available, one to the other. Of course, it's not done yet, but it's a long way in the right direction :)
What part of "A well regulated militia" do you not understand?
Then wouldn't it be useful for the biologists to define the context for the programmers? It shouldn't be impossible to do so (very hard, I will grant you).
I can throw myself at the ground, and miss.
Because Larry Ellison needs a new sub-woofer?
It would be very useful.
However, taking a programmer with no biological experience, I'd guess it would take about a year's full-time study to properly define the context to him/her and perhaps 3 months (full time) to give them a reasonable working knowledge.
It's easy enough to give the basics (DNA makes RNA makes Protein(1)); it's that biology is wall to wall special cases. Biological systems run the worst spaghetti code you can imagine, written in a language that's barely documented(2), written by a developer who is willing to hack the executable, the source code, the compiler, the operating system and in extreme cases the hardware to get a functional system.
(1) except RNA can 'make' DNA, RNA can act like a protein (enzyme)
(2) Using language only comprehensible if you know the subject already
We are collecting local, state and IRS tax data, and information from thousands of businesses around the country that do business with citizens of our state. So far, we are at 6TB on our SANS, and growing. We used to use a VFP database just to compare state tax returns against IRS returns. The process took over 40 hours of 486 CPU time just to process the data into a form that could be queried. Now, with P4's, 2GB of RAM, and 3.3GHz processors working against our Oracle DB we can churn data as easily as making ice cream.
We "data mine" by comparing individual and corporate tax returns with spending habits to see if
a) state tax returns match federal tax returns
b) the spending matches the income
c) there are unreported out-of-state sales on which state taxes can be collected
d) any other information that might help in the collection of unpaid taxes.
On the average, 10% lie about their taxable income, some in excess of $100K, and more don't report online purchases. Look for the "StreamLine Internet Tax Initiative" to be passed by a majority of state legislatures in the near future.
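Checks (a) and (b) boil down to joining data sets on a taxpayer key and flagging the rows that disagree. A rough Haskell sketch follows, assuming each data set has already been reduced to (taxpayer id, dollar amount) pairs. The type names, the function names, the margin parameter and the sample records are all invented for illustration; nothing here reflects the actual Oracle schema or queries.

```haskell
type TaxpayerId = String
type Income = Integer   -- whole dollars

-- (a) Taxpayers whose state and federal returns report different incomes,
-- as (id, state figure, federal figure). Taxpayers missing from the
-- federal data are skipped in this sketch.
mismatches :: [(TaxpayerId, Income)] -> [(TaxpayerId, Income)]
           -> [(TaxpayerId, Income, Income)]
mismatches state federal =
  [ (tid, s, f)
  | (tid, s) <- state
  , Just f <- [lookup tid federal]
  , s /= f
  ]

-- (b) Taxpayers whose recorded spending exceeds their reported income
-- by more than 'margin' (an assumed tolerance, not a real rule).
overspenders :: Income -> [(TaxpayerId, Income)] -> [(TaxpayerId, Income)]
             -> [TaxpayerId]
overspenders margin income spending =
  [ tid
  | (tid, spent) <- spending
  , Just earned <- [lookup tid income]
  , spent > earned + margin
  ]

-- Made-up sample records, purely for illustration.
stateReturns, federalReturns, spendingRecords :: [(TaxpayerId, Income)]
stateReturns    = [("T1", 40000), ("T2", 55000), ("T3", 30000)]
federalReturns  = [("T1", 40000), ("T2", 70000), ("T3", 30000)]
spendingRecords = [("T1", 35000), ("T3", 90000)]
```

With the sample data, mismatches stateReturns federalReturns flags T2 and overspenders 10000 stateReturns spendingRecords flags T3. At 6TB the join would live in the database rather than in lists, but the logic is the same.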
Yes and no... believe it or not, at least a few people in the biological sciences are utterly incapable of understanding algebra. (I said a few, not necessarily a majority...) So, even basic SQL is a mystery to them. They can become productive workers through just assimilation of information. This is not necessarily easy; quite a few people (including programmers) don't have the memory to do it.
I don't see problems that are susceptible to data mining. I suspect this view is shared by many of my colleagues. (This is similar to the view of most bio-oriented scientists that design of experiment is not useful.)
What would change the field?
In science, what usually changes people's minds is a BIOLOGICAL result obtained with a new technique that could not be obtained (easily) another way.
This may just be restating the old truism that success breeds success, but to get biologists interested in large scale database mining sorts of things, you have to convince us that there are questions that can be answered with this technique that can't be answered more easily with other techniques.
The other problem is the high noise of bio experiments; you can't simply aggregate data on rats' blood pressure in France with ozone levels in Arkansas and come up with hypotheses on weather and strokes; the data quality is too low, and it would be prohibitively $$ to design the experiments to have that level of quality.
A word on design of experiment: as I understand it, if you have n interrelated variables (temp, pressure, time, etc.) and each variable has x levels (Temp = 20 deg, 25, 30...), a "complete" experiment would be some sort of n by x dimensional test. DOE says if this number is so large that you can't look at a significant %, you are better off changing two variables at a time to sample the space. None of the DOE people have been able to put this into comprehensible, usable form, and for various reasons, most biologists don't think this sort of thing works (the variables are obvious, and the response surface is highly non-linear and non-smooth).
Hey man, interesting project.
Is this of any use/interest to you? http://joshthejenius.com/experiments/technorati_spam.php
Math is math. Regular expression is regular expression. The tools are there. The future is now.
Your post rings true with me.
In your opinion, is this do-able:
1. Design a series of algorithmic searching functions, each catered to specific datatypes? (i.e. porn, or PHP functions or whatever it might be we are looking at/for).
2. Connect all of these specialized search algorithms together, with a single, simple UI.
3. Use natural language processing (AI) to direct each query to the proper algorithm.
4. Convince 300 million people to stop using Google every day.
5. ???? (NOT ADS!)
6. ALL YOUR PROFITS ARE BELONG TO ME!
Joking aside, as someone smart enough to make it through grad school ("BS" = good description of my education), am I on the right path here, or waaaaaay off?
The "shit ton" is an obsolete SAE measurement, derived from the British "Imperial Arse Load".
As the USA is now on the SI system, please update your nomenclature to the currently correct "Metric Fuck Ton".
Thank you in advance for your co-operation.
--US Department of Weights and Measures
Well, for starters, you get developers convincing the biologists that they need Oracle...and it only goes downhill from there.
Definitely! A certain biotech department I know that shall remain unnamed bought several Oracle licenses for some database they were going to use.
Then one of their programmers convinced them to use MySQL instead, because free software is good.. Or whatever.
Just insane.. It's like buying a Ferrari and then deciding to drive a Yugo because it has better gas-mileage.
Just to illustrate: I read the article, thought the photos were nice and sharp, and wanted to know which camera was used. I saw the footer credit telling who took those pics, did a search for the photographer's name + Flickr on Google, and came up with a match. I looked and found a pic with EXIF info which said it was an Olympus E-300. Now I know I may be wrong, but what you should be thinking is: hell, I could be right. The power of Google is tying the loose ends in unstructured content. It's awesome.
I am also a bioinformatician: I am a biologist and I write crappy code. Yet using only public domain data from excellent repositories like ncbi, ebi, dip etc. I can make new discoveries and publish reports on these discoveries in high impact factor journals. So ... I disagree with this news item. As far as I am concerned there is no problem.
You haven't given enough information to go on. Points 2 & 3 are far too general. No offense, but point 2 reminds me of sales execs who ask software developers to "just give me a single button that does what I want". It's all very well to talk about a "single, simple UI" to do something very complicated, but it's something entirely different to design and implement such a UI. Think of existing applications and tell us which ones do something like your point 2. If there are any, then how is your system going to be better? If there aren't, why do you think that is? Point 3 suffers from similar problems.
You'd be better off just giving people a menu of search algorithms, and letting them pick which ones they want to use. (Maybe after a while, you could use Amazon-style "people who searched for X found good results using algorithm Y" logic to help automate algorithm-picking, but how often have you actually bought a product suggested to you that way?)
Sorry to sound like a PHB, let me give an example:
One spider is hitting Craigslist, in order to find the liquidity of real estate (in other words, X properties at an avg price of Y listed FSBO in Albany, NY today). A second spider is hitting various government sites in order to find taxes, appraisals, etc. Then, using some fancy-pants math, I am able to reduce everything into a single search portal which is responding with the most valuable leads for my real-estate investor slave masters (this is my current day job).
I took the idea further by combining Wiki and Google News into another hybrid search engine; students enter a topic, and it parses notes, and cites sources in MLA format.
Two applications, same idea: spider *specific* data from *specific* places, so that these "magical algorithms" can actually do something useful with all this data.
You mention Amazon's AI, which many seem to think is marketing fluff. In my experience, when looking at obscure books, or obscure authors, the recommendations ARE excellent, and I HAVE purchased several of them (to be fair, I'll take a chance on just about any non-fiction book). However, when looking at the latest big DVD release, the recommendations are crap.
Back to my objective: as we see with Amazon, and your original post, the more *specific* (and credible) the information, the more we can do with it. As for the single UI, this would basically be the "last step" in tying all these wacky "hybrid engines" together under one name.
And as you mention, this raises several UI questions for which I have no answers. I like the Google "click-and-go" experience...when it works. Sometimes I wish I could "talk with it" and help it understand my search *before* showing me a billion links to crap. Of course they offer hundreds of options, but somehow, it "feels" like Google is losing its edge these days (as far as accuracy goes).
If the above made no sense, please disregard. It's been a long day.
The examples you give could be described as expert systems, of a sort, relying partly on socially-produced data. I think there's little doubt that such systems will proliferate, and at some point it could start making sense to tie them together. In a way, the "semantic web" is working towards supporting that sort of thing. But this approach is at odds with almost everything that's succeeded on the commercial Internet so far: it's all been about, essentially, exploiting simple business models (Amazon) or clever tricks (Google), and leveraging them to the hilt. That's what the short-term VC-funded approach to product- and company-building is good at.
I think your idea is the sort of thing which could evolve over time, but is unlikely to be built by any one team unless they start with a whole bunch of existing systems that are ripe for integrating. Of course, if I'm wrong and you end up founding a billion-dollar dotcom, please post a Slashdot article about it so I can kick myself!
BTW, if you're working on AI-like applications, I hope you've read PAIP. And of course, Norvig's at Google now...
Thanks for the link. I loved ELIZA- that was the prog that really turned me on to this AI stuff in the first place.
Sadly, I am too stupid to care about fame or money, I just want to make it work. That said, back I go to the land of math and dreams...
Peace.
Math and dreams are fine with me, I'm into functional programming myself, which (right now) is one of the least commercially applicable branches of programming imaginable. The only reason I brought up the dotcom stuff is that you were talking about improving on Google, and my point is just that despite the marketing hype, companies like Google are really doing more like high-end IT than anything deeper, because the deeper stuff takes decades to develop. I could be wrong, but what you're describing sounds more like it would be the latter.
OK, I admit it: I had to go read the Wiki on functional programming before I could reply. Having read this article, I feel a bit better about some of the work I've been doing (I have a weird habit of *doing things* and only later learning what the proper name of the activity is).
Would this count as "functional programming"? http://joshthejenius.com/experiments/technorati_spam.php
I'm using PHP, but only because it is the easiest language I've ever worked with. I started coding as a wee youth on a PS/2 286 in Qbasic. As best I can tell, it really doesn't matter *which* language we use, so long as (at the end of the day) x = what we want.
I am a tad insecure here because I never "officially" learned any of this. I am really trying to follow the "standards" better, but I am having a hard time understanding *which* standard(s) I should use to express this.
Case in point: f[0|-1] = (((u/w)-Avg(u/w))/1-Avg(u/w))-1
Is that right? Math aside, have I expressed this formula properly?
I have an MLA guide that I *love* to reference on various "english rules" I get stuck on. Does a similar guide exist for mathematics? I.e. to ignore a "1" we do such and such...Or am I really just discovering that this is a free-for-all and no one is really sure which end is up?
Why is this field one of the least commercially applicable? It would seem to me a simple equation would have unlimited portability; don't all languages use math? If I remember correctly, even Qbasic handled exponents and logs.
As for taking years and years- the way I see it, years and years is what I got (I'm currently 25). Best case, I get to participate in an entirely new wave of innovation. Worst case, I'll write a book and leave it for the next generation. (Chapters 1-58: Don't waste your time doing the following...).
Sure, your algorithm would count as functional programming, but part of the point about functional programming is that entire programs, not just individual functions, consist of nothing but the composition of pure mathematical functions. Which means that a lot of what you do in "normal" languages just isn't allowed. See point 2.1 in the comp.lang.functional FAQ, which gives a simple comparison between imperative and functional style, although to really understand it you'd need to learn a bit more about at least one of the functional languages.
Just about any language allows you to implement simple pure mathematical functions - as you say, "a simple equation would have unlimited portability" - but very few languages allow you to construct entire programs that way. None of the mainstream languages (PHP, Perl, Python, Ruby, Java, C, C++, BASIC, etc.) support functional programming to that extent. The languages which do support it are ones like Haskell, ML, OCaml, and Scheme. The reason I say that functional programming is not commercially applicable is not because there aren't commercial applications of it, but because these languages are hardly being used at all commercially. There's a bit of a catch-22 there, because people don't know them because they aren't being used, and they aren't being used because people don't know them.
Here's your algorithm implemented in the Haskell language. If you want to try running it, download and install Glasgow Haskell (on Debian, you can do "apt-get install ghc6"), and run "ghci" to get an interactive Haskell prompt.
-- define a list of p values
let ps = [0.825, 0.8868, 1, 0.8542, 0.8889, 0.8, 0.9118, 0.95, 0.9487, 1, 1, 0.8333, 0.8197, 0.6383, 1, 0.8727, 0.875, 0.7879, 0.8667, 0.8636]
-- define a function which averages a list of numbers.
-- The 'fromIntegral' is needed because Haskell is a strongly, statically typed language
let avg xs = sum xs / fromIntegral (length xs)
-- work out the average of all the p's
let avgP = avg ps
-- define the function f(p), named z here
-- this would usually be written on multiple lines, but the interactive shell doesn't allow that (?)
let z p = x / y - 1 where x = p - avgP; y = 1 - avgP
-- Finally, "map" the function z over the list ps, giving a list of results.
map z ps
The output from this is:
[-1.4721965172036693,-0.9523008328426015,0.0,-1.2265500126188285,-0.9346344746361576,-1.6825103053756212,-0.7419870446706487,-0.4206275763439058,-0.43156389332884704,0.0,0.0,-1.4023723395305803,-1.516783040296123,-3.0428198872718117,0.0,-1.0709178093715828,-1.0515689408597635,-1.7843021788508464,-1.1213931185328516,-1.1474720282661737]
The above is oriented towards experimenting with it at the command line. For a real program, you'd probably ultimately package it into a single function that takes an input list and produces an output list, e.g.:
-- repeating the avg definition for the sake of completeness
avg xs = sum xs / fromIntegral (length xs)

-- f takes the whole list of p values and returns the list of results
f ps = map z ps
  where
    avgP = avg ps
    z p = x / y - 1
      where
        x = p - avgP
        y = 1 - avgP
("f" is defined over multiple lines for readability, the way it would be written in a program file; to enter it interactively, you'd have to do it on one line.)
This last version defines your entire formula quite clearly, for anyone who knows functional programming. That's one answer to your question of how to express these things mathematically: if you express them using high-level programming languages, it has the benefit of being concise and unambiguous, but also checked by the compiler, so if it works you know you haven't made any mistake
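For comparison, here is what the Haskell code computes, written in conventional notation. This is only a restatement of the code, not a correction of your formula: p stands for each value in the list (the u/w of your formula), and p-bar for the average of the n values. It assumes the "/1-Avg(u/w)" in your version was meant to put the whole of 1 - Avg(u/w) in the denominator, which is how the code reads it.

```latex
\bar{p} = \frac{1}{n}\sum_{i=1}^{n} p_i,
\qquad
f(p) = \frac{p - \bar{p}}{1 - \bar{p}} - 1
```

Written this way the grouping is explicit, which a one-line formula needs extra parentheses to achieve.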
Maybe Tim Berners-Lee and his semantic web will make something happen. That's the real problem. When you have to write like 30 or 40 layers of SQL queries to get what you want, and then to get a decent report you have to spend 100 hours in crystal or make compromises, and in the end all you have is more data. What is the MEANING of the data? I think a lot of the knowledge of humanity is stored in words and books and not indexed. Most db data is just statistics, which are useless ;)
What if you could "explain" what "The apple tree is 15 feet tall" means using a structured language?
Then, it would be pretty trivial to search for 15 foot tall things, apple trees that are taller than a man, etc.
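As a rough sketch of what "explaining" the sentence in a structured form could look like, here it is as a Haskell record, with the two searches as plain filters. Everything in it is an assumption made for illustration: the Measurement type, its field names, the extra sample statements, and the 6-foot reference height for "a man".

```haskell
-- A measured attribute of a thing, e.g. "The apple tree is 15 feet tall".
data Measurement = Measurement
  { thing     :: String   -- what is being described
  , kind      :: String   -- what sort of thing it is
  , attribute :: String   -- which property is measured
  , feet      :: Double   -- the value, normalised to feet
  } deriving (Eq, Show)

-- Sample statements; only the first comes from the sentence above.
statements :: [Measurement]
statements =
  [ Measurement "the apple tree" "apple tree" "height" 15
  , Measurement "the sapling"    "apple tree" "height" 4
  , Measurement "the flagpole"   "flagpole"   "height" 15
  ]

-- Assumed reference height of a man, in feet (illustrative only).
manHeightFeet :: Double
manHeightFeet = 6

-- "15 foot tall things": everything whose height equals h.
thingsWithHeight :: Double -> [Measurement] -> [String]
thingsWithHeight h ms =
  [thing m | m <- ms, attribute m == "height", feet m == h]

-- "Apple trees that are taller than a man": things of kind k above the reference.
tallerThanAMan :: String -> [Measurement] -> [String]
tallerThanAMan k ms =
  [thing m | m <- ms, kind m == k, attribute m == "height", feet m > manHeightFeet]
```

In ghci, thingsWithHeight 15 statements gives ["the apple tree","the flagpole"] and tallerThanAMan "apple tree" statements gives ["the apple tree"]. The hard part is getting from free text to records like these, not the searching.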
Cool! Amazing Toys.
I have to wonder if data mining isn't the problem -- the real problem seems to be that there are few obvious problems data mining will solve.
Consider WalM*rt. When the 2005 hurricanes were predicted, they mined their sales data for previous hurricanes. They found that in the last hurricane people stocked up on beer, pop tarts and peanut butter, so they sent trucks full of that stuff to the stores in the path of the hurricanes. They made lots of sales, and provided a valuable service to the communities. Capitalism at its finest.
Well, as a resident in a city that was about 200 miles inland, I would disagree.
They managed to run out of coolers, bottled water, battery-powered lights, batteries, propane, camping stoves, lanterns. And when I say out, I mean out: Not even a single "C" battery was to be found in the whole store!!!!
And yes, businesses and schools were closed before it hit -- so this area apparently thought the effects could have been severe. I think a less technological solution involving "common sense" should have been applied.