That is pointless... if it has not already been patented it cannot be because of prior existence... not patenting (useful or other) variations on a theme is another matter!
-- --
FreeNET user? Comfortable with the adverse selection?
Since when has "prior art" stopped someone from filing for and receiving a software patent?
-- "They redundantly repeated themselves over and over again incessantly without end ad infinitum" -- ibid.
Re:Quick...
by
Anonymous Coward
·
· Score: 0
Because they have not replicated, they have made a small variation to make it patentable, just like your patent said.
Hide under a bridge troll.
Re:Quick...
by
Anonymous Coward
·
· Score: 0
Uh, they already ran this thing over the entire slashdot archive.
#/usr/bin/text_mine http://slashdot.org/
Time to complete: 2342 minutes
Parsed text: 124950 pages
Output > no intelligence detected
No need to register! Here's the Text!
by
scumbucket
·
· Score: 1, Informative
MICHAEL N. LIEBMAN knows his limitations. Even with a Ph.D. and a long career in medical research, he cannot keep up with all the developments in his area of interest, breast cancer. Medline, the database that already houses more than 10 million abstracts for journal articles, is adding 7,000 to 8,000 abstracts per week. Only a fraction of these are about cancer, but the volume of information is daunting nonetheless.
"There is just too much literature to be able to go through it all," said Dr. Liebman, the director of biomedical informatics at the Abramson Family Cancer Research Institute at the University of Pennsylvania.
Yet Dr. Liebman is convinced that new cures could someday emerge for breast cancer if only someone could read all the literature and synthesize it. So he has found a solution: enlisting a computer program to read the articles for him.
"The software is not going to get tired," he said. It also happens to be a speed reader: The product he is using, from a Chicago-based software company called SPSS, can zip through 250,000 pages an hour. Another product, from the text-mining company ClearForest, boasts a speed of 15,000 pages an hour, still far surpassing the human rate of a mere 60 pages.
Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is "text mining," a technique that academics have been experimenting with for years but for which tools have only recently become commercially available. The prospect of rapidly scanning through reams of documents is stirring interest among researchers and analysts faced with more material than they can handle.
To the uninitiated, it may seem that Google and other Web search engines do something similar, since they also pore through reams of documents in split-second intervals. But, as experts note, search engines are merely retrieving information, displaying lists of documents that contain certain keywords.
Text-mining programs go further, categorizing information, making links between otherwise unconnected documents and providing visual maps (some look like tree branches or spokes on a wheel) to lead users down new pathways that they might not have been aware of.
Currently these programs are used by academic researchers and companies, but information scientists expect that to change. Lower-cost text-mining tools eventually will be offered to ordinary people who want to dig into medical or political issues using public documents. Madan Pandit, an expert in text analysis in Bangalore, India, who runs a Web site called K-Praxis (k-praxis.com), has suggested that text mining could help people make sense of voluminous documents that are already on the Web, like the 858-page report on the congressional inquiry into intelligence failures regarding the 9/11 terrorist attacks.
"There is a need to make these technologies available for publicly available information," he wrote at his site.
In most cases, text-mining software is built upon the foundations of data mining, which uses statistical analysis to pull information out of structured databases like product inventories and customer demographics. But text mining starts with information that doesn't come in neat rows and columns. It works on unstructured data - e-mail messages, news articles, internal reports, transcripts of phone calls and the like.
To make sense of what it is reading, the software uses algorithms to examine the context behind words. If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations. It can then categorize them and draw connections among the categories.
How well computers truly make sense of what they are reading is, of course, highly questionable, and most of those who use text-mining software say that it works best when guided by smart people with knowledge of the particular subject.
"I was an F.B.
-- CMDRTACO CHECK YOUR EMAIL!
I didn't read the article
by
Mattwolf7
·
· Score: 2, Insightful
Why does slashdot keep linking to articles that require NYT registration? Isn't there some sort of Google news out there?
I think they're catching on to us. I couldn't find a single email address at nytimes.com that wasn't 'already in use'.
I guess they figured out why so many readers are 90-year-old CEOs of religious organizations in Beverly Hills.
Re:I didn't read the article
by
glenrm
·
· Score: 1
Quite frankly, for straight tech news without commentary and without "NYT this, Wired that", check out Google Tech News. A great range of stories without all of the same news outlets being mentioned again and again.
Is that why my: Heywood Jablowme@whitehouse.gov, Dick Hertz@yahoo.com, HarryPNisss@microsoft.com, SudoNymm@slashodot.com, ImaNassHole@whitehouse.gov, HomurSexual@apple.com and others... don't work?
--
There is no spoon or sig.
Re:I didn't read the article
by
Rick+the+Red
·
· Score: 2, Funny
I feel really sorry for luser@aol.com, because I've signed him up for all sorts of things...
-- If all this should have a reason, we would be the last to know.
Re:I didn't read the article
by
wan23
·
· Score: 1
Someone actually has that address you know... some German guy, judging from his profile. How would you like it if someone signed you up for all kinds of junk?
Re:I didn't read the article
by
joshuac
·
· Score: 1
You're kidding, right? Some guy actually has "luser@aol.com"?
Reminds me of one of the companies I worked at, long ago. SMTP addresses went first initial, last name. However they made an exception for a Samuel Hitt.
Re:I didn't read the article
by
whereiswaldo
·
· Score: 1
Reminds me of one of the companies I worked at, long ago. SMTP addresses went first initial, last name. However they made an exception for a Samuel Hitt.
What's his nospam address? noshitt@aol.com? LOL
Re:I didn't read the article
by
ncr53c8xx
·
· Score: 1
AOL has a list of email addresses you can't sign up for (and not because they are offensive or already taken). You cannot, for instance, sign up for aoluser@aol.com.
Bringing Star Trek-like Computing one step closer!
by
GuardianBob420
·
· Score: 1
I've always wanted to ask the computer to find all references to some complex interplay of topics at hand the way those Star Fleet engineers were always able to in TNG...
create large volumes of junk to feed this..
by
joeldg
·
· Score: 1
yea...
text mining is fun until someone creates something to generate a bunch of junk to feed to the text miners..
Re:create large volumes of junk to feed this..
by
Elwood+P+Dowd
·
· Score: 1
That will work if you can get the botfeed-created articles published in a major medical journal.
Otherwise, totally not an issue.
--
There are no trails. There are no trees out here.
Re:create large volumes of junk to feed this..
by
xiopher
·
· Score: 0
Your coldest automobile junk yard with towards his ring.
Some advantageous ohio junk yard after close my couples. Glands may bottomed these done calibrating your punishment. Her boat junk we a leaps cared with any reaction. Code activates runoff. My flattest food in junk school by down its freedom. Program must hitched the is injecting our collisions.
Visit will requiring any atlanta junk yard. Any weights arcing any finished radars. My junk bond extend these humid admiralty. Dollar could dilute an trunk. Their find junk yard the her responsibilities backed at your apple. Her hydraulic junk mail blocker with amid those bars. Nickel popped chicago junk yard. Proofs shall painted these done equals my torpedoes. Buttons accessed junk email. Pole can stabilized an inlets. These diego junk san yard denting my consumable uses. Sentry harden cubes. Any junk silver conflicted her sure commander. Reports should frightening their nonavailability. An dog junk spike yard the an crystal zoned for my circuit. This prisoners is a diaphragms organizes on these orders. The junk yard willie with some cottons performs that her fogs. Drifts should inducts those mentions. His junk car clutched that intrinsic routine. Any surveyors accessed a tactical submarined.
This brighter pa junk yard of plus my banks. The prettiest food in junk school your minus a accrual. Her jeep junk yard on some nouns dived an their syntax. Those roll were those helmets compile instead of his partitions. Boat could detached an atlanta junk yard. That unsatisfactory junk food that off that ships. Plating grasping junk mail blocker. Owner will improving an been banged his snows. Her hyper 1 800 got junk it's beyond his density. Our sunniest junk mega war yard by on the offers. Its sell junk car rework any gray applicant. Characters attend worms. An common 1 800 got junk and along a competitions. His defective funy junk be with an vacuums. Her dog junk spike yard of these acceptor fished his our compilers. Profiles biasing rocks. Trays boiled junk racing. This prettiest illinois junk yard that among these oil. Our junk mail jacking that mudiest multimeter. Our nomenclature exerted an sweeter coast.
Groups tended car junk yard. Its splices distinguished his preventive diver. Its junk faxes risen those hydraulic equipment. Triangle blinked gravel. Union creating houston junk yard. Motions shaped humor. Our auto junk yard approximating its possible truck. Mentions shall papered those is stay your attachment. Songs may hissed their being unslung the junk mail filter. Agreement feathered accident. Some junk mail does that terminators compare for a screw. Her shorter block junk mail on over an gangs. A useless junk that his breads spraying them an bonds. October annotated traces. Arch crumbled funny junk. Her replenishments dehydrated those destructive operability.
Flames may cheating their junk michigan yard. His symbolic find junk yard he close that catches. Investigators might enclosing that being vomiting his block junk mail. These unsatisfactory garden junk his onto his butts.
Consequences should compensate any salvage junk yard. Any fewest florida junk yard be until that pool. Bearings slashing find junk yard. The witnesses describe any all stretch. Her radical jersey junk new yard but down this saddle. Our worlds was any vapor souring he any checkers. A ditty chicago junk yard have till those periods. Gram might slipping his do weakened that authorities. Some junk yard war that our video informing be our manner. Your category rearing any firm swallow. Sleds should balancing my having majoring some funy junk. Settlement unpackage buckets.
References may elimated this be neglected this illinois junk yard. Any preliminaries oiled some dullest byte. Origins could considered these junk race. His abnormal sell junk car they after its dams. This junk yard in my hunks unpackaged she her shout. Any noncommissioned car cash junk than between the puddle. Your sell junk car of your filters speared or my register. This wettest funny junk in atop my executive. This jovial jun
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
Like those ppl who actually RTFA and try to get "FORST PIST!!!"?
--
do() || do_not();// try();
Sound familiar?
by
Anonymous Coward
·
· Score: 0
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
If you replace "miscellaneous" with "garbled" and "refined knowledge" with "coherent sentence", this would be describing slashdot editor postings!
why the "Multiverse" buzzword ?
by
freuddot
·
· Score: 1
Multiverse doesn't appear anywhere in the article. Multiverse is a technical word, for interpreting Quantum Physics. It is totally misplaced in this news submission. Did the poster even know what it means ?
Re:why the "Multiverse" buzzword ?
by
SquadBoy
·
· Score: 1
I've been rereading Snow Crash, but maybe he meant metaverse; that at least makes sense as an attempted joke.
--
Cypherpunks: Civil Liberty Through Complex Mathematics.
Those who live by the sword die by the arrow.
Re:why the "Multiverse" buzzword ?
by
metlin
·
· Score: 1
I'm guessing that this is for data in multiple versions of documents -- spatially and temporally disparate ones.
One of the groups that I work with does some data analysis stuff with how data changes over space (location based) and time (your beliefs yesterday vs. your beliefs today) and the like -- so this could be something along those lines.
Or like you said, it could just be a buzzword! :)
Re:why the "Multiverse" buzzword ?
by
cei
·
· Score: 1
Multiverse could also be the different incarnations of Michael Moorcock's "Eternal Champion" (Elric, Hawkmoon, Corum, etc...)
-- This sig intentionally left justified.
Re:why the "Multiverse" buzzword ?
by
seriv
·
· Score: 1
Why don't you google it to find out :p
-Seriv
Support non-whoring reg-free linkage!
by
Anonymous Coward
·
· Score: 5, Informative
Brought to you by your favorite anonymous non-whoring poster: the Google link.
The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page format for those that don't like to keep hitting "next".
If I apply this to slashdot, I'll have only 3-4 posts to read everyday... What will I do all day at work?
Re:What's up with Slashdot?
by
devphaeton
·
· Score: 0, Offtopic
Fwiw....
I've noticed this as well.
Netcraft confirms that the Beleaguered Slashdot is Dying....
Seriously, though, yes. Not only have I noticed far fewer posts to articles, but much less modding up, or even modding at all in many articles, even for those posts desperately deserving of it.
--
do() || do_not();// try();
Re:What's up with Slashdot?
by
daeley
·
· Score: 0, Offtopic
Re:What's up with Slashdot?
by
joeldg
·
· Score: 0, Offtopic
yes, I have noticed the metamod thing too..
in addition there is also this tasteless group of guys who keep making posts about greased-up yoda dolls which has also forced me to start browsing at +2..
seems that mod points are being handed out with less frequency than they were before.
I think they should start handing them out for people with "excellent karma" and then track if the metamods agree with the point distribution..
Brute forcing the problem
by
metlin
·
· Score: 2, Interesting
To make sense of what it is reading, the software uses algorithms to examine the context behind words.
They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.
That's again brute forcing the problem -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from data.
The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.
You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.
If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
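For anyone wondering what the "preset lexicon" approach might look like in practice, here is a toy sketch. It is only a guess at the general idea, not how SPSS or ClearForest actually work: the `LEXICON` word lists and the `classify` function are made up for illustration, using the article's "computer modeling vs. fashion models" example.

```python
import re

# Made-up context lexicons for two senses of "model" (pure assumption).
LEXICON = {
    "computer modeling": {"simulation", "algorithm", "software", "data", "equation"},
    "fashion": {"runway", "designer", "catwalk", "photo", "agency"},
}

def classify(text):
    """Return the lexicon category whose words overlap the text most, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Score each category by how many of its lexicon words appear in the text.
    scores = {label: len(words & vocab) for label, vocab in LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    print(classify("The model ran a simulation using new software"))
    print(classify("The model walked the runway for a designer"))
```

Note that the lexicon is fixed up front, which is exactly the weakness: a new term that is in neither word list simply scores zero.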
Re:Brute forcing the problem
by
shdragon
·
· Score: 1
I don't believe that medical terminology & medical journals contain a lot of terminology that changes over time. Most medical words are Latin-based and fairly rigid in their usage. A tibia is a tibia is a tibia. What the doctor has created is a very specific application to skim a very large volume of information and report back things that might be of interest. You are correct that an "instinct" would be much more useful. This does not, however, make the doctor's accomplishment any less viable.
Is text mining new? No.
Is text mining through medical journals to help you stay on top of breast cancer research new?
Yes.
-- "...we dont care about the economics; we just want to be able to hack great stuff."
Re:Brute forcing the problem
by
john82
·
· Score: 1
There are more than enough opinions about "the right way" to model data from a semantic or contextual standpoint. Like most things, there's the academic approach and there's one that a company can afford. Whether or not either is appropriate depends on your needs, point of view and the size of the corporate wallet.
Sure, there are those who shortchange some approaches because they have temporal limitations. New data comes in and you need to categorize that too, and determine its context or supremacy relative to data you already have.
It's another variation on the religious argument of whether to tag data proactively as it comes into a system or continually model it on the fly. And should we retroactively tag material that existed in our warehouse before we bought this new data mining system?
One of the problems I have with semantic or ontological systems working on the fly is that they're guessing. What does a person's name look like? A city? How does the software differentiate these from the name of a road or organization? It's not straightforward. And the best semantic discovery system? The human brain. Which is a more cost-effective solution for you: human categorization/tagging with high accuracy, or machine-based tagging that relies on a less perfect recognition system but might run faster?
The answer is not so cut and dried.
Re:Brute forcing the problem
by
metlin
·
· Score: 1
It is true that the answer is neither straightforward nor simple.
However, one thing that I have learnt (the hard way) over a period of time is that Ontology (Specification of data conceptualization) is infinitely more important than Epistemology (Knowledge of the data).
There is nothing wrong with a system that has tags; the trouble is when you classify it either way -- the references of the tags are once again more important than how they are acquired. You could perhaps have a purely automated system, maybe a pseudo-automated system, or a manually supervised system or combinations of all the above. However, what does matter is how the tags fit in the system. Are those the only reference parameters for the data, or is there something more? Does the system go beyond the tag, or is it constrained to the set of tags?
One simple question regarding categorization -- what constitutes similarity? More often than not, it's a reference KB or meta-data. And would merely data-level similarity suffice, or would you need structural similarity too? I feel that without answering these questions, it's pointless to categorize data. If I remember right, Microsoft had done something similar where they used media-based categorization -- which did not work out quite well. I do not quite think the human brain works by categorizing things that way -- it's more of an ontological and semantic mapping.
A poster above remarked about how medical terms change little. Sure, but context does change. When you have something revolutionary that's coming in, your old system is most likely to treat it as anomalous data!
I'm not a medical person, so I cannot think of examples off the top of my head, but there are so many such things in physics -- Gravastars anyone? Their terminology has evolved over a period of time from being types of black holes to being a replacement for black holes to voids in the Universe.
Re:What's up with Slashdot?
by
cK-Gunslinger
·
· Score: 0, Offtopic
and run it with "xxxx" replaced by the name of some large text file that you create by saving email messages, web pages, log files, what have you.
The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
Benford's law is the name of this phenomenon. It's even more interesting because it is independent of base!
There are many ways that this is used, including detecting human tampering with complex systems (such as the accounting statements for a company, where the person modifying the numbers is likely to skew the results without realizing it).
The key is sufficiently representative. And I'm not quite clear on what that means, but I know some examples:
Make a list of the areas of all the lakes in your state. Doesn't matter what the units are. Look at the leading digits: the highest count will be ones, and the lowest count will be nines.
Same for a list of all the house numbers in a city. Same for a list of just about anything you can think of, in whatever units you want.
This can be used to detect fraud. For example, look at the financial statements from a business. Take every number on the page, without regard to what the number is. Count the leading digits, and they will fall into that same distribution pattern. Unless they cook the books. Then, the distribution pattern will not fit what you expect. Your physics prof can do the same thing to detect if you've fudged your experimental data to fit the expected results.
Thus, we conclude that any webpage that has a high number of 2's is inherently evil. '666', step aside!
Oh, I have to create some dumb line to get past the lame-o slashdot filters. Jeez guys. C'mon. Did I get flagged somewhere along the line or something? Short lines are EVIL! *sigh*. Blah blah blah... Slashdot filter sucks blah blah blah. Wheeee. Moof. STUPID FILTER!! GEEZ! Does every freaking line have to be so immaculately conceived? What is with this... PASS ALREADY!
Yep, binaries are a good example. Basically, in any data files that represent large systems with many variables, you should find that the Perl regular expression
/\b(\d)\d*\b/g
should match a 1 most often. In some types of text (especially code), you will find things like "0" show up a lot. That's why in my example, I didn't allow for single-digit numbers, but if you want to, that's cool.
I find that a large pool of USENET posts works best.
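If you'd rather try this without Perl, here is a rough Python sketch of the same leading-digit count. It is only an illustration, not the script from the parent post: the function names are made up, and the pattern is tweaked to count only numbers starting with 1-9, so a lone "0" is skipped. The expected share for leading digit d under Benford's law is log10(1 + 1/d).

```python
import math
import re
from collections import Counter

# Like the Perl pattern above, but the first digit must be 1-9,
# so a lone "0" or a number with a leading zero is not counted.
LEADING_DIGIT = re.compile(r"\b([1-9])\d*\b")

def leading_digit_counts(text):
    """Count how often each digit 1-9 appears as the first digit of a number."""
    return Counter(int(d) for d in LEADING_DIGIT.findall(text))

def benford_expected(digit):
    """Benford's predicted share for a leading digit: log10(1 + 1/d)."""
    return math.log10(1 + 1 / digit)

if __name__ == "__main__":
    # Powers of 2 are a classic sequence that follows Benford's law closely.
    sample = " ".join(str(2 ** n) for n in range(1, 200))
    counts = leading_digit_counts(sample)
    total = sum(counts.values())
    for d in range(1, 10):
        print(d, round(counts[d] / total, 3), round(benford_expected(d), 3))
```

Swap the powers-of-2 sample for the contents of your own big text file and compare the two columns.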
Re:What's up with Slashdot?
by
Anonymous Coward
·
· Score: 0
I'm glad to know I'm not the only one that hasn't gotten mod points in ages.
Re:What's up with Slashdot?
by
Anonymous Coward
·
· Score: 0
Yes, yes and yes...
And slashdot seems to be heavily slashdotted lately... What's up with that?
Re:What's up with Slashdot?
by
TheFlyingGoat
·
· Score: 0, Offtopic
I think part of it has to do with the number of posts lately. The last 10 articles (not including the current) have had an average of 224 posts. I estimated the average in early-mid summer and came up with 550. Twice as many posts = twice as much required moderation? I'm not sure how slash works in this regard.
Now, as far as the reason for fewer posts, I know that the editors have said that late summer-fall tends to be slower for news, but I also think they've been putting up some boring articles lately. Granted, they may be exciting to some people, but I'm just generalizing. Hey editors... instead of posting articles just for the sake of posting articles, how about sticking only with the interesting ones and taking the leftover time and hunting for other interesting websites to feature? Maybe even write an editorial or two? A State-of-the-Slashdot address? Something fun? :)
-- You have enemies? Good. That means you've stood up for something, sometime in your life. --Winston Churchill
Hmmm, isn't there a prerequisite???
by
3seas
·
· Score: 1
That the text has to first contain some knowledge in it to begin with?
Maybe this is just an attempt at getting a machine to generate core knowledge but then haven't they been working on common sense, which is sorta needed first?
Fark Registration. Get in without stupid reg.
by
Anonymous Coward
·
· Score: 1, Informative
Tired of going through their stupid registration?
CLICK HERE
"to extract some sort of refined knowledge from it."
hum....
If you have an infinite number of rednecks.... Infinite number of shotguns & shotgun shells.... And an infinite number of stop signs, you will eventually get Shakespeare in Braille.....
-- Julius Caesar - Act I, Scene i: "What mean'st thou by that? Mend me, thou saucy fellow!"
-- "Can of worms? The can is open... the worms are everywhere."
Re:Could do us a big favor
by
Anonymous Coward
·
· Score: 0
Starting at slashdot.org...
Scanning slashdot.org...
Procedure completed. No trace of knowledge found
$
Re:Could do us a big favor
by
Jace+of+Fuse!
·
· Score: 1
Yeah, but how high should it set its threshold?
Should it filter Funny?
--
"Everything you know is wrong. (And stupid.)"
Moderation Totals: Wrong=2, Stupid=3, Total=5.
Answers?
by
Anonymous Coward
·
· Score: 0
We changed a bunch of stuff and we think something might be broken, but we don't know what or where and we can't reproduce the problem, so stop telling us about it. There are more people using the site so everything must be OK.
That explains some of the problems, but not everything. For instance, why haven't I had mod points in nearly two years, despite having good karma and contributing to conversations (rather than trolling)? Yes, I know the rules about getting moderation points, but even with those I'd expect to get points at least two or three times a year, not every two years. As well, I recently started noticing that Slashdot has popup ads now (I saw one in the last day, and then added slashdot into my popup blocker's blacklist because there's no excuse for that). I usually don't pay attention to trolls, but one has to wonder if the trolls about Slashdot not being able to pay their bandwidth fees may have a kernel of truth to them? That would certainly explain why they've had connectivity issues, as well as why they've added more annoying advertisements in an attempt to scare up more advertising revenue.
sorry, still sounds a lot like text searching
by
cnb
·
· Score: 1
Text-mining programs go further, categorizing information, making links between otherwise unconnected documents
For any google results "Category" is shown right on top of the results. "Links" - try link:slashdot.org & related:slashdot.org as google queries.
If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations
Try Google's advanced search: you can search with "all of these words" and "without these words".
Finally, as the article says, it all comes down to asking the right question.
Re:What's up with Slashdot?
by
cK-Gunslinger
·
· Score: 1
I agree about the quality of recent stories. I used to look forward to refreshing the main page all day and seeing an interesting story pop up every hour or so, one that generates a couple of hundred comments and several deeply-nested threads.
Now I refresh and see a review of a pirate book with ~70 "+2" comments and "Third Anniversary of Bezos-Backed Patent Reform," which went completely ignored. Meh.
Of course, I'm not helping by posting near-useless comments like this...
Hey anyone else think this picture was really cool?
Re:What's up with Slashdot?
by
termos
·
· Score: 1
I have been noticing this as well. I've not had mod points for quite some time, and most of my posts have been modded Redundant for some reason, when I feel they're not. No major problems, and I usually meta-moderate 10 posts, but yes, there are some weird issues.
How well computers truly make sense of what they are reading is, of course, highly questionable, and most of those who use text-mining software say that it works best when guided by smart people with knowledge of the particular subject.
May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.
Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...
-- "Obviously, I'm not an IBM computer any more than I'm an ashtray" (Bob Dylan)
Why does every story linking to a New York Times article have people such as yourself complaining about it? Your comment lends nothing to the discussion and identical sentiments have been expressed countless times. Write a journal entry if you want mindless discourse regarding the New York Times registration requirement. Complain to the Times and tell them what horrible, horrible people they are for making you take ten seconds of your time to provide them with false information.
The fact is that the New York Times often has excellent content and they do not syndicate. This article isn't available elsewhere, but it's worthy of discussion, so either register and read it or do not participate in the discussion.
I find it hard to believe that you're too lazy to fill out their registration form but somehow found the energy to come here and complain.
--
-- the strongest word is still the word "free"
Re:What's up with Slashdot?
by
cK-Gunslinger
·
· Score: 1
And naturally, the few mods that are around have basically wasted a dozen or so points by methodically modding this entire thread Off-Topic. Not that I mind, but for Pete's sake, there's not a single +5 Mod in this topic yet! There are only 4 "+3" posts! At least try to be constructive, for crying out loud! =P
You mean you don't already have one?
by
djeaux
·
· Score: 0, Offtopic
Something tells me at least six /.ers are already working on the case mods :-D
-- "Obviously, I'm not an IBM computer any more than I'm an ashtray" (Bob Dylan)
Re:Bringing Star Trek-like Computing one step clos
by
Anonymous Coward
·
· Score: 0
Like in the complex plots found in comic books and sci-fi? I'm sure you need a lot of help.
Text Mining for Correlation
by
NoSlack913
·
· Score: 1
Mining for data that might be related based on proximity, either temporal or locational, starts to get interesting when you are dealing with millions of interactions, like voice data in a call center (check out www.callminer.com). Suddenly you find out that when a customer says "hurricane" in an insurance call center, your agents are 5x more likely to hand them off to a supervisor. That is real money-saving information. This is what this technology is good for, and it is being bought and used by a lot of companies hoping to find that kind of information and save money by training those agents to handle those calls better.
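To make the "5x more likely" idea concrete, here is a minimal sketch of how such a ratio could be computed from tagged transcripts. This is an assumption about the general approach, not how CallMiner does it; the function name, the data layout, and the sample calls are all hypothetical.

```python
def escalation_rate_ratio(calls, keyword):
    """Return how many times likelier escalation is when `keyword` is spoken.

    `calls` is a list of (transcript, was_escalated) pairs.
    """
    with_kw = [esc for text, esc in calls if keyword in text.lower()]
    without_kw = [esc for text, esc in calls if keyword not in text.lower()]
    if not with_kw or not without_kw:
        raise ValueError("need calls both with and without the keyword")
    # Escalation rate within each group (True counts as 1, False as 0).
    rate_with = sum(with_kw) / len(with_kw)
    rate_without = sum(without_kw) / len(without_kw)
    if rate_without == 0:
        raise ValueError("no escalations without the keyword; ratio undefined")
    return rate_with / rate_without

if __name__ == "__main__":
    # Hypothetical sample: 2 "hurricane" calls (1 escalated), 10 others (1 escalated).
    calls = [("my roof was hit by the hurricane", True),
             ("hurricane damage claim please", False)]
    calls += [("billing question", False)] * 9 + [("policy renewal", True)]
    print(escalation_rate_ratio(calls, "hurricane"))
```

With real data you would want far more calls per group before trusting the ratio.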
shameless plug of parent
by
koekepeer
·
· Score: 1
Slashdot is a business. If they started using Google's partner link, the NYT would hand their asses to them in court.
Perhaps Slashdot should get in touch with the NYT and see if they can get a partnership set up, but stealing someone else's wouldn't be such a hot idea.
--
-- the strongest word is still the word "free"
Text Mining The Drool From W: +1, Patriotic
by
Anonymous Coward
·
· Score: 0
The Slashdot Text Mining Challenge:
Translate into plain English, the never-ending Bushisms.
Thank you and have an Ashcroft-free weekend, W00t
Re:Text Mining The Drool From W: +1, Patriotic
by
Anonymous Coward
·
· Score: 0
We'll get right on that, as soon as we finish the Clinton Dictionary of Definitions of What Definitions Is.
but what about the data itself?
by
koekepeer
·
· Score: 2, Insightful
i always wondered about this
all right, you can take huge amounts of text and apply some smart tricks to extract patterns from it.
but how can you determine whether the original data was trustworthy?
take the example of genome annotation (description of gene function), which would be helped greatly by including more functional descriptions from scientific literature. how do you determine whether the original publication was backed by solid experimental research?
by the reviewers of the articles? i don't think so, peer review is a snakepit filled with politics. by the amount of people who cited it? hmmmm... so hip subjects are more true?
me personally, because i'm experienced, can recognise bullshit articles when i see them. but how to translate this into an algorithm... anyone any ideas about this? or even working solutions?
(of course this is an example from my field of expertise - biology, but it applies to any set of text data/articles IMO)
Re:but what about the data itself?
by
metlin
·
· Score: 1
Hmm, I guess you cannot say that for sure, but most systems today use trust metrics.
For example, an ACM/IEEE source would have a much higher trust metric than, say, one from some local conference in Egypt (no offence to any local conferences in Egypt, but you get the drift :)
Re:but what about the data itself?
by
koekepeer
·
· Score: 1
i see the point, but is this truly representative of reliability?
you rely on peer review, on citation indices, so mostly IM-not-so-HO on matters of politics.
when you scan abstracts yourself, you can dig into the detail when something looks interesting enough, but the decision making process that drives me while scanning abstracts is not much influenced by the fact whether it is in a high impact journal (or any other high impact publishing body) or in something mostly not noteworthy.
to put it in another way: chances are high that you find most of the relevant work when you look in more trustworthy sources, but i think this is not what will solve the reliability problem.
you want to find those things that otherwise would be lost in obscurity. at least i do. not follow the next hip thing, but find new things others don't. unexpected things. boldly go where no-one went before. we're scientists, right!?;)
but then again, it's late here (almost 2 at night) and my writings might be slightly out of focus now, since few pints of guinness stand in between being sober and my current state. *grin*
Re:but what about the data itself?
by
JDevers
·
· Score: 1
Definitely...I've read some very BS papers in Science and Nature and some really good ones in MUCH less respected journals.
I would not apply a trust metric to an article based on the journal alone...
You're a lying sack of shit!
by
Anonymous Coward
·
· Score: 0
Any algorithm that is fed slashdot as input data inevitably produces the following output:
Some notes...
by
ekephart
·
· Score: 2, Interesting
(1)"Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is "text mining,'' "
Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim."... Just a thought.
(2)As noted in the article sarcasm is very hard to detect. If you think about it even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don't even "get" it?
(3)"There is a need to make these technologies available for publicly available information," he wrote at his site.
Yes of course. Anyone who has done research knows how frustrating it is to read through abstract after abstract, let alone the entire publication, to find what you are looking for. In research when you are looking for facts or raw information text mining seems highly promising. Yet, for interpretive processes it grows increasingly difficult to envision a correct system. As noted nuances are difficult to detect. In addition to sarcasm, words like "still" allow for multiple meanings for the bigrams, trigrams, etc. to which they belong. Natural language ambiguity is the most important problem to overcome in NLP. After all, how would you like to write a printf statement and not know whether you would get the intended output or some other arbitrary call.
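To illustrate the point about "still" showing up in n-grams with different meanings, here is a small sketch. The three sentences and the `ngrams` / `contexts` helpers are invented for the example; a real system would use a proper tokenizer and a much larger corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contexts(sentences, word, n=2):
    """Count the n-grams that contain `word` across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for gram in ngrams(tokens, n):
            if word in gram:
                counts[gram] += 1
    return counts


# Hypothetical sentences: adverb, verb complement, and adjective uses.
sentences = [
    "the patient is still alive",
    "please hold still",
    "a still image of the cell",
]

still_bigrams = contexts(sentences, "still")
```

The same token lands in ("is", "still"), ("hold", "still") and ("still", "image"), and nothing in the counts alone tells you these are three different senses.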
-- sig
I AGREE WITH THIS POST!
by
Anonymous Coward
·
· Score: 0
I heartily approve of any "enhancement" to slashdot that effectively reduces the gap between the karma whores and the blatant trolls!
The compelling dream is that you laboriously load up a computer with enough facts so that it can glean understanding of what it's reading, and one glorious day the computer has enough smarts to make sense of things on its own, and two weeks after crawling the entire Internet, it knows everything.
Hence Doug Lenat's Cyc, now partly open source. Unfortunately that glorious day has been "a few years away" for over 13 years.
The knowledge base is built upon a core of over 1,000,000 hand-entered assertions (or "rules") designed to capture a large portion of what we normally consider consensus knowledge about the world.
But I haven't come across any postings from Cyc on Slashdot correcting misinformation and lies.
Clearly this is possible because all those darn human kids do it; maybe you have to use a more complex computer and leave it for a few years crawling on the floor putting things in its mouth.
You know how those guys at MIT are constantly trying to figure out ways to teach their robots how to interact with people? Let the robots roam the Internet with a topic in mind.
If I'm at a party with a bunch of dog groomers I'm probably not going to say much. I'm sure robots have the same issue; they have nothing in common with us. If we start by making a Cancer-Expert-Bot then let it try to have a conversation with an oncologist I think AI will have more success.
-- What if Digg added local news and a Slashdot inspired comment karma system? ---
http://houndwire.com
I spent more years than I care to admit writing natural language processing software that tried to extract semantic information - conceptual dependency, parsers, etc.
I gave up a few years ago, now I mostly use statiscal approaches (markov processes, word counts, huge databases of proper names, etc.)
.. I meant "statistical approaches", not "statiscal approaches"..
(I was trying to type while holding my wife's baby parrot, and he sometimes goes nuclear if you don't pay enough attention to him:-)
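As a rough idea of what the "markov processes, word counts" style of approach mentioned above can look like, here is a minimal first-order Markov (bigram) model built from nothing but word counts. This is a sketch with a made-up training string and hypothetical helper names; it is not taken from the web book mentioned below.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it and how often."""
    tokens = text.lower().split()
    model = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        model[current][following] += 1
    return model


def next_word_probs(model, word):
    """Estimated probability of each word that followed `word` in training."""
    counts = model.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items()}


# Hypothetical training text.
text = ("text mining finds patterns and text mining finds links "
        "and text search finds documents")

model = train_bigram_model(text)
probs = next_word_probs(model, "text")
```

Here "text" is followed by "mining" twice and "search" once, so the model estimates 2/3 and 1/3. No parsing or semantics involved, just counting.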
BTW, pardon the shameless plug, but I added a short chapter on statistical nlp (simple enough example program to understand easily) to my free Java/AI web book.
-Mark
Re:statistical nlp
by
Anonymous Coward
·
· Score: 0
huh.. in my nightmares I had some kind of NeuroLinguistic Programming routines for Natural Language Processing tasks
NYT won't be contributing to this large body of text, because registration is STILL required.
-- These are my friends, See how they glisten. See this one shine, how he smiles in the light.
This isnt fair!!!!
by
Anonymous Coward
·
· Score: 0
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
You mean people are getting paid for this? And I have been doing this for free on /. for years. Guess /. would be the equivalent of open source text mining.
Skimming random information?
by
st0rmshadow
·
· Score: 1
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
but could a person who had sufficient knowledge of the program(s) build a large document (say, the giant 9/11 intelligence-failure doc mentioned in the article) so as to fool the text-miners? Subtle misinformation -- let's say that widespread use of text miners results in larger docs being published, then unscrupulous types bury information in such a way that a ridiculously long human endeavour will turn them up, but the programs won't, so those responsible can say: "See? It's all in there. Your program is at fault."
MS Word has a surprising "summary" feature that has given me impressive results in Portuguese. How the hell do they do that?
NYT again?!?!
by
Anonymous Coward
·
· Score: 0
Do you Americans have other newspapers online WITHOUT the free registration requirement?
Is it "free" as in "I give my personal data to them and don't have to hand them money to keep it" or "free" as in "I have the inalienable right to store my data with them"?
Less archeology and history, more theory please
by
Anonymous Coward
·
· Score: 0
My cable lineup includes a science channel which has what I would consider to be at least halfway decent shows -- the problem is that even on this channel, and even though there have to be 3 or 4 channels a la History Channel, the shows that dominate are about archeology. Now I won't pan archeology as "not a science" of course, but people who want to watch science shows generally get bored with that incredibly fast. Also, though I appreciate the historical background of scientific thought, there is too much of that and not enough theory. TV is a terrific medium and could work very well to convey theory to the public, but I just don't see it being done since documentary makers go for the easy route of regaling us with tales of the past.
STOP MODDING THIS SHIT.
by
Anonymous Coward
·
· Score: 0
I, along with most other Slashdot (adult) readers, am tired of morons like this whining about having to sign up for the NYTimes -- particularly since the whining usually comes from people who are REGISTERED USERS of freakin' Slashdot!!!
But that aside, what in the hell is "insightful" about this shit? He's not the first idiot to spout this immature little complaint. He's the ten-millionth. EVERY fucking time a NYTimes story shows up, one of these dolts (always toting a latecomer's userID) feels compelled to complain. If it was ever insightful -- which I don't think it was -- surely it ceased to be so YEARS AGO.
So I repeat, and I plead: STOP MODDING THIS SHIT. Or at least, use the "Troll" or "Flamebait" options. (Although you should always try to mod up, rather than down...blah, blah, blah.)
Re:bullshit
by
Anonymous Coward
·
· Score: 0
Taco, et al., can simply enter the article into Google News and get a NYT link that does NOT require registration.
For Fuck's sake, why? Just register at NYT and get on with your meaningless life.
The knowledge discovery and datamining cup challenge this year was looking at the arxiv.org papers for this sort of analysis - some very interesting results. The Task 4 winner looked at the structure of the papers as a sort of relational database and uncovered a lot of statistical patterns and metrics that could be quite useful for scientists.
"I was an FBI agent for 20 years," said Randall Murch, now a researcher at the Institute for Defense Analyses, which works for the Office of the Defense Secretary and other government agencies. "And I have yet to see anyone who is able to model the way an agent thinks and works through an investigation."
Apart from suggesting the jibe that, of course, only an ex-fbi dick could think that anyone would want to model his/her behaviour, this misses the point that text-mining is intended to find precisely those connections which are too weak to attract human attention. A human being approaches an investigation with preconceptions that can colour their findings powerfully. The power of text mining lies in the fact that it is non-human and stupid. Software doesn't get tired and is very fast. Attempts to make software "smarter" are misguided.
Complaints at NYT are more useful than your rant
by
Anonymous Coward
·
· Score: 0
If you don't understand the benefit of maintaining pressure against something on a well-known public third-party site as opposed to sending complaints direct into a black hole then you don't understand how the world works.
that the whining cretins are really vastly intelligent creatures railing against the machine? What a relief.
-- Any preoccupation with ideas of what is right or wrong in conduct
shows an arrested intellectual development. (Wilde)
Subrogation - Firemen's Fund would do well to
by
bob_calder
·
· Score: 1
think before they use a hammer. Using software to fix problems that exist within their human intelligence arena is soooo typical. The bit about subrogation is so idiotic, I can't believe it. Any idiot can check a box on the report if there is a basis for subrogation. If there is enough data in the report to determine a basis for subro. then the adjuster obviously knew that it should have been handed to the subro. dept. from the outset. There is obviously an issue here. The adjusters are reluctant to send claims to subrogation. Why? Maybe they got yelled at for making work for the lawyers who are paid much more than adjusters and they need to be able to blame it on the computers.
-- Any preoccupation with ideas of what is right or wrong in conduct
shows an arrested intellectual development. (Wilde)
That's what students are for.
by
bob_calder
·
· Score: 1
At dinner last night, my friend asked how I could justify spending a lot of time putting in massive amounts of information on a project. I told him that's what students are for! (wink wink, nudge)
-- Any preoccupation with ideas of what is right or wrong in conduct
shows an arrested intellectual development. (Wilde)
...short for "Marvel Multiverse". Text-mining all the comic books in existence to find out which timelines conflict with the others would be an excellent research project.
8-PP
Re:What's up with Slashdot?
by
Anonymous Coward
·
· Score: 0
I've been getting 500 errors almost every few pageloads. Isn't anyone else getting the 500 errors?
Quick, someone patent it before Microsoft does or else Slashdot is going to be the next casualty.
Then again, we could just skip the patent and let WWdN die too. Seems like the internet community would break even.
A programmer is a machine for converting coffee into code.
MICHAEL N. LIEBMAN knows his limitations. Even with a Ph.D. and a long career in medical research, he cannot keep up with all the developments in his area of interest, breast cancer. Medline, the database that already houses more than 10 million abstracts for journal articles, is adding 7,000 to 8,000 abstracts per week. Only a fraction of these are about cancer, but the volume of information is daunting nonetheless.
"There is just too much literature to be able to go through it all," said Dr. Liebman, the director of biomedical informatics at the Abramson Family Cancer Research Institute at the University of Pennsylvania.
Yet Dr. Liebman is convinced that new cures could someday emerge for breast cancer if only someone could read all the literature and synthesize it. So he has found a solution: enlisting a computer program to read the articles for him.
"The software is not going to get tired," he said. It also happens to be a speed reader: The product he is using, from a Chicago-based software company called SPSS, can zip through 250,000 pages an hour. Another product, from the text-mining company ClearForest, boasts a speed of 15,000 pages an hour, still far surpassing the human rate of a mere 60 pages.
Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is "text mining,'' a technique that academics have been experimenting with for years but for which tools have only recently become commercially available. The prospect of rapidly scanning through reams of documents is stirring interest among researchers and analysts faced with more material than they can handle.
To the uninitiated, it may seem that Google and other Web search engines do something similar, since they also pore through reams of documents in split-second intervals. But, as experts note, search engines are merely retrieving information, displaying lists of documents that contain certain keywords.
Text-mining programs go further, categorizing information, making links between otherwise unconnected documents and providing visual maps (some look like tree branches or spokes on a wheel) to lead users down new pathways that they might not have been aware of.
Currently these programs are used by academic researchers and companies, but information scientists expect that to change. Lower-cost text-mining tools eventually will be offered to ordinary people who want to dig into medical or political issues using public documents. Madan Pandit, an expert in text analysis in Bangalore, India, who runs a Web site called K-Praxis (k-praxis.com), has suggested that text mining could help people make sense of voluminous documents that are already on the Web, like the 858-page report on the congressional inquiry into intelligence failures regarding the 9/11 terrorist attacks.
"There is a need to make these technologies available for publicly available information," he wrote at his site.
In most cases, text-mining software is built upon the foundations of data mining, which uses statistical analysis to pull information out of structured databases like product inventories and customer demographics. But text mining starts with information that doesn't come in neat rows and columns. It works on unstructured data - e-mail messages, news articles, internal reports, transcripts of phone calls and the like.
To make sense of what it is reading, the software uses algorithms to examine the context behind words. If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations. It can then categorize them and draw connections among the categories.
How well computers truly make sense of what they are reading is, of course, highly questionable, and most of those who use text-mining software say that it works best when guided by smart people with knowledge of the particular subject.
"I was an F.B.
CMDRTACO CHECK YOUR EMAIL!
(Yes I am a lazy /. reader)
I've always wanted to ask the computer to find all references to some complex interplay of topics at hand the way those Star Fleet engineers were always able to in TNG...
yea...
.sig
text mining is fun until someone creates something to generate a bunch of junk to feed to the text miners..
take a look at my
anime+manga together at last.. in real time.
I fail it, lah. :~(
-- The WIPO Avenger
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
Like those ppl who actually RTFA and try to get "FORST PIST!!!"?
do() || do_not();
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
If you replace "miscellaneous" with "garbled" and "refined knowledge" with "coherent sentence", this would be describing slashdot editor postings!
Multiverse doesn't appear anywhere in the article. Multiverse is a technical word, for interpreting Quantum Physics. It is totally misplaced in this news submission.
Did the poster even know what it means ?
The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page format for those that don't like to keep hitting "next".
If I apply this to slashdot, I'll have only 3-4 posts to read everyday... What will I do all day at work?
Fwiw....
I've noticed this as well.
Netcraft confirms that the Beleaguered Slashdot is Dying....
For seriously, yes. Not only have i noticed far fewer posts to articles, but much less modding up, or even modding at all in many articles. Even for those posts desperately deserving of it.
do() || do_not();
This is OT, but read this journal entry from CmdrTaco.
I watched C-beams glitter in the dark near the Tannhauser gate.
Agreed.
Thanks!
yes, I have noticed the metamod thing too..
in addition there is also this tasteless group of guys who keep making posts about greased-up yoda dolls which has also forced me to start browsing at +2..
seems that mod points are being handed out with less frequency than they were before.
I think they should start handing them out for people with "excellent karma" and then track if the metamods agree with the point distribution..
that is just me..
anime+manga together at last.. in real time.
To make sense of what it is reading, the software uses algorithms to examine the context behind words.
They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.
That's again brute forcing the problem -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from data.
The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.
You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.
If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
Answers to your questions: HERE
The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
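The linked answer is not preserved here, so the following is a guess at what kind of distribution is meant: it assumes a word rank-frequency distribution of the Zipf's-law sort, where the most common word is about twice as frequent as the second, three times as frequent as the third, and so on. A minimal sketch for building that rank-frequency table from any text (the sample string and the `rank_frequency` helper are illustrative only):

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Hypothetical sample; far too small to show the real shape of the curve.
sample = "the cat and the dog and the bird"
table = rank_frequency(sample)
```

Run this over a few large, unrelated text samples and plot count against rank on log-log axes; if the assumption holds, the curves should look much alike.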
I'm glad to know I'm not the only one that hasn't gotten mod points in ages.
..until no student ever has to research any topic again?
Just head over to tellmewhatthisthingyisabout.com > Print
Sounds like the start of Xanadu.
I'd like that.
I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia. -- Woody Allen
quick, someone mine this and give us the refined knowledge that the article has
30% Troll, 50% Underrated, 10% Interesting
Score:5, Troll
Yes, yes and yes...
And slashdot seems to be heavily slashdotted lately... What's up with that?
I think part of it has to do with the number of posts lately. The last 10 articles (not including the current) have had an average of 224 posts. I estimated the average in early-mid summer and came up with 550. Twice as many posts = twice as much required moderation? I'm not sure how slash works in this regard.
:)
Now, as far as the reason for fewer posts, I know that the editors have said that late summer-fall tends to be slower for news, but I also think they've been putting up some boring articles lately. Granted, they may be exciting to some people, but I'm just generalizing. Hey editors... instead of posting articles just for the sake of posting articles, how about sticking only with the interesting ones and taking the leftover time and hunting for other interesting websites to feature? Maybe even write an editorial or two? A State-of-the-Slashdot address? Something fun?
You have enemies? Good. That means you've stood up for something, sometime in your life. --Winston Churchill
That the text has to first contain some knowledge in it to begin with?
Maybe this is just an attempt at getting a machine to generate core knowledge but then haven't they been working on common sense, which is sorta needed first?
Tired of going through their stupid registration? CLICK HERE
"to extract some sort of refined knowledge from it." hum.... ....Infinite number of shot guns & shotgun shells.... And an infinite number of stop signs, you will eventually get Shakespeare in brail.....
If you have an infinite number of red necks
Julius Caesar - Act I, Scene i: "What mean'st thou by that? Mend me, thou saucy fellow!"
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
like grep?
I'm sorry, reading this text requires meta-technology.
how long until
...skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
Dear Text Miners,
Please start here: http://slashdot.org
Thanks so much.
Operator, give me the number for 911!
We changed a bunch of stuff and we think something might be broken, but we don't know what or where and we can't reproduce the problem, so stop telling us about it. There are more people using the site so everything must be OK.
That explains some of the problems, but not everything. For instance, why haven't I had mod points in nearly two years, despite having good karma and contributing to conversations (rather than trolling)? Yes, I know the rules about getting moderation points, but even with those I'd expect to get points at least two or three times a year, not every two years. As well, I recently started noticing that Slashdot has popup ads now (I saw one in the last day, and then added slashdot into my popup blocker's blacklist because there's no excuse for that). I usually don't pay attention to trolls, but one has to wonder if the trolls about Slashdot not being able to pay their bandwidth fees may have a kernel of truth to them? That would certainly explain why they've had connectivity issues, as well as why they've added more annoying advertisements in an attempt to scare up more advertising revenue.
Text-mining programs go further, categorizing information, making links between otherwise unconnected documents
For any google results
"Category" is shown right on top of the results.
"Links" - try link:slashdot.org & related:slashdot.org as google queries.
If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations
Try the google advanced search you can search with "all of these words" and "without these words".
Google shows dictionary definitions for every searched term if they exist. There's also already a location based search and a
phone book
Finally as the article says it all comes down to asking the right question.
I agree about the quality of recent stories. I used to look forward to refreshing the main page all day and seeing an interesting story pop up every hour or so, one that generates a couple of hundred comments and several deeply-nested threads.
Now I refresh and see a review of a pirate book with ~70 "+2" comments and "Third Anniversary of Bezos-Backed Patent Reform," which went completely ignored. Meh.
Of course, I'm not helping by posting near-useless comments like this...
Hey anyone else think this picture was really cool?
I have been noticing as well, I've not had modpoints for quite some time, and most of my posts have been modded Redundant for some reason, when I feel they're not. No major problems, I usually meta-moderate 10 posts, but yes, there are some weird issues.
Note to self: get smarter troll to guard door.
May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.
Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...
"Obviously, I'm not an IBM computer any more than I'm an ashtray" (Bob Dylan)
Why does every story linking to a New York Times article have people such as yourself complaining about it? Your comment lends nothing to the discussion and identical sentiments have been expressed countless times. Write a journal entry if you want mindless discourse regarding the New York Times registration requirement. Complain to the Times and tell them what horrible, horrible people they are for making you take ten seconds of your time to provide them with false information.
The fact is that the New York Times often has excellent content and they do not syndicate. This article isn't available elsewhere, but it's worthy of discussion, so either register and read it or do not participate in the discussion.
I find it hard to believe that you're too lazy to fill out their registration form but somehow found the energy to come here and complain.
--
the strongest word is still the word "free"
And naturally, the few mods that are around have basically wasted a dozen or so points by methodically modding this entire thread Off-Topic. Not that I mind, but for Pete's sake, there's not a single +5 Mod in this topic yet! There's only 4 "+3"s posts! At least try to be constructive, for crying-out-loud! =P
Do you have to wear a mining hat? :p
-Seriv
And why should you not have to register for information that someone worked for? Freeloader.
BOO! TERRO
finally a /. comment that makes sense
MOD PARENT UP (and me down i don't care)
never did this before
GOAT!
=S
How about a computer-parsable language like lojban? You can't polish a turd, and you can't systematically extract information from English.
-Libertarian secular transhumanist
That sounds painful!
You know how those guys at MIT are constantly trying to figure out ways to teach their robots how to interact with people? Let the robots roam the Internet with a topic in mind. If I'm at a party with a bunch of dog groomers I'm probably not going to say much. I'm sure robots have the same issue; they have nothing in common with us. If we start by making a Cancer-Expert-Bot then let it try to have a conversation with an oncologist I think AI will have more success.
What if Digg added local news and a Slashdot inspired comment karma system? ---
http://houndwire.com
They just had to get it in somehow
like the 858-page report on the congressional inquiry into intelligence failures regarding the Sept. 11, 2001, terrorist attacks.
There are places where the networks are not touching, and there are places where they are. -Boeing's Lori Gunter
I gave up a few years ago; now I mostly use statistical approaches (Markov processes, word counts, huge databases of proper names, etc.)
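For anyone wondering what that looks like in practice, a rough sketch of two of those ideas, word counts and a first-order Markov model over words, might be the following. This is an illustration only, not the actual code used; the helper names and the sample sentence are made up, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter, defaultdict

def word_counts(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())

def markov_transitions(text):
    """Build a first-order table: word -> Counter of the words that follow it."""
    tokens = text.lower().split()
    table = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        table[current][following] += 1
    return table

def next_word_probs(table, word):
    """Estimate P(next word | word) from the transition counts."""
    followers = table.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Hypothetical sample text.
sample = "text mining is fast and text mining is stupid and text search is old"
counts = word_counts(sample)
table = markov_transitions(sample)
probs = next_word_probs(table, "text")
print(counts.most_common(3))
print(probs)
```

No grammar and no understanding, just counting what follows what. That is the whole appeal: it scales, and it fails in boring, predictable ways.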
-Mark
NYT won't be contributing to this large body of text, because registration is STILL required.
These are my friends, See how they glisten. See this one shine, how he smiles in the light.
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
You mean people are getting paid for this? And I have been doing this for free on /. for years. Guess /. would be the equivalent of open source text mining.
skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
I like to call it High School.
but could a person who had sufficient knowledge of the program(s) build a large document (say, the giant 9/11 intelligence-failure doc mentioned in the article) so as to fool the text-miners? Subtle misinformation -- let's say that widespread use of text miners results in larger docs being published, then unscrupulous types bury information in such a way that a ridiculously long human endeavour will turn them up, but the programs won't, so those responsible can say: "See? It's all in there. Your program is at fault."
MS Word has a surprising "summary" feature that has given me impressive results in Portuguese. How the hell do they do that?
Do you Americans have other newspapers online WITHOUT the free registration requirement?
Is it "free" as in "I give my personal data to them and don't have to hand them money to keep it" or "free" as in "I have the inalienable right to store my data with them"?
My cable lineup includes a science channel which has what I would consider to be at least halfway decent shows -- the problem is that even on this channel, and even though there have to be 3 or 4 channels à la History Channel, the shows that dominate are about archeology. Now I won't pan archeology as "not a science" of course, but people who want to watch science shows generally get bored with that incredibly fast. Also, though I appreciate the historical background of scientific thought, there is too much of that and not enough theory. TV is a terrific medium and could work very well to convey theory to the public, but I just don't see it being done, since documentary makers go for the easy route of regaling us with tales of the past.
I, along with most other Slashdot (adult) readers, am tired of morons like this whining about having to sign up for the NYTimes -- particularly since the whining usually comes from people who are REGISTERED USERS of freakin' Slashdot!!!
But that aside, what in the hell is "insightful" about this shit? He's not the first idiot to spout this immature little complaint. He's the ten-millionth. EVERY fucking time a NYTimes story shows up, one of these dolts (always toting a latecomer's userID) feels compelled to complain. If it was ever insightful -- which I don't think it was -- surely it ceased to be so YEARS AGO.
So I repeat, and I plead: STOP MODDING THIS SHIT. Or at least, use the "Troll" or "Flamebait" options. (Although you should always try to mod up, rather than down...blah, blah, blah.)
Taco, et al., can simply enter the article into Google News and get a NYT link that does NOT require registration.
For Fuck's sake, why? Just register at NYT and get on with your meaningless life.
Pretentious git.
The knowledge discovery and datamining cup challenge this year was looking at the arxiv.org papers for this sort of analysis - some very interesting results. The Task 4 winner looked at the structure of the papers as a sort of relational database and uncovered a lot of statistical patterns and metrics that could be quite useful for scientists.
Energy: time to change the picture.
"I was an FBI agent for 20 years," said Randall Murch, now a researcher at the Institute for Defense Analyses, which works for the Office of the Defense Secretary and other government agencies. "And I have yet to see anyone who is able to model the way an agent thinks and works through an investigation."
Apart from suggesting the jibe that, of course, only an ex-FBI dick could think that anyone would want to model his/her behaviour, this misses the point that text-mining is intended to find precisely those connections which are too weak to attract human attention. A human being approaches an investigation with preconceptions that can colour their findings powerfully. The power of text mining lies in the fact that it is non-human and stupid. Software doesn't get tired and is very fast. Attempts to make software "smarter" are misguided.
Doesn't it bother anyone that copying the article is probably illegal?
If you don't understand the benefit of maintaining pressure against something on a well-known public third-party site as opposed to sending complaints direct into a black hole then you don't understand how the world works.
Michael Moorcock has been text-mining the multiverse for decades now.
One line blog. I hear that they're called Twitters now.
that the whining cretins are really vastly intelligent creatures railing against the machine? What a relief.
Any preoccupation with ideas of what is right or wrong in conduct shows an arrested intellectual development. (Wilde)
think before they use a hammer. Using software to fix problems that exist within their human intelligence arena is soooo typical. The bit about subrogation is so idiotic, I can't believe it. Any idiot can check a box on the report if there is a basis for subrogation. If there is enough data in the report to determine a basis for subro. then the adjuster obviously knew that it should have been handed to the subro. dept. from the outset. There is obviously an issue here. The adjusters are reluctant to send claims to subrogation. Why? Maybe they got yelled at for making work for the lawyers who are paid much more than adjusters and they need to be able to blame it on the computers.
At dinner last night, my friend asked how I could justify spending a lot of time putting in massive amounts of information on a project. I told him that's what students are for! (wink wink, nudge)
...short for "Marvel Multiverse". Text-mining all the comic books in existence to find out which timelines conflict with the others would be an excellent research project.
8-PP
I've been getting 500 errors almost every few pageloads. Isn't anyone else getting the 500 errors?