Mining Unstructured Data

Structurable data vs. Human intellect by Anonymous Coward · 2002-03-15 11:24 · Score: 0

Very interesting read on this (and a little more) over here at UIA. It seems that it's not up to human intellect to structure data; rather, it is easier to recognize information through keywords (true for all acquisition of data, i.e. book reading, watching TV, etc.).

I submitted it to Slashdot a couple of days ago but guess they didn't like my story.

They've discovered Google! by twitchkat · 2002-03-15 11:26 · Score: 2, Insightful

They'd better get ready to pay some Google patent licensing fees:

People also make their feelings known in less direct ways, says Jhingran. "People actually vote their preferences by providing links to different documents," he explains. "You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something." Businesses could use such analytical capability to determine the "buzz" about their products found in chat rooms and forums on the Internet.

Re:They've discovered Google! by yintercept · 2002-03-15 14:29 · Score: 1

The scientific community was fascinated with the topology of indexes long before Google. I can remember in college (1980s) coming across companies that tallied the journal citations on all academic journals and produced various reports on the influence of various writers, and trends. (Of course, I went senile and can't remember the names of the organizations.) In any case, the bibliography of an article is often as interesting as its contents. If I ever had any spare time and cash, my plan was to turn y-intercept.com into a place where people could track the citations in different books. In any case, Google didn't come up with a new idea; they are just applying an old idea to the web.
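
The citation tallying described above boils down to counting incoming references. Here is a minimal Python sketch, assuming the bibliographies are already available as a mapping from each work to the works it cites. The titles and the function name `citation_counts` are made up for illustration, and this is a plain count, not the weighted ranking a real search engine would use.

```python
from collections import Counter

def citation_counts(bibliographies):
    """Count how many distinct works cite each title (self-citations ignored)."""
    counts = Counter()
    for citing, cited_list in bibliographies.items():
        for cited in set(cited_list):  # a work votes at most once per title
            if cited != citing:
                counts[cited] += 1
    return counts

# Hypothetical data: each key is a work, each value is its bibliography.
bibliographies = {
    "paper_a": ["paper_c", "paper_d"],
    "paper_b": ["paper_c", "paper_d"],
    "paper_c": ["paper_d"],
    "paper_d": [],
}

# Most-cited works first.
ranking = citation_counts(bibliographies).most_common()
```

The same counting works whether the "votes" are journal citations or hyperlinks; only the source of the mapping changes.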

answer is perl by Anonymous Coward · 2002-03-15 11:27 · Score: 1, Interesting

http://www.google.com/search?q=learn+perl+for+hu ma nities+student+data+mining

(remove the silly spaces in "humanities")

perl and a lot of thinking, that is.

Re:answer is perl by Anonymous Coward · 2002-03-15 16:45 · Score: 0

Sorry Spanky, but the page looks fine to me. No horizontal scrollbar.

I've been working on a project... by Anonymous Coward · 2002-03-15 11:27 · Score: 1, Interesting

...it basically converts human-readable (unstructured) data into computer-readable, structured data. Parsers are in the works for converting unstructured services into standard services; for example, the inboxes of Yahoo, Lycos, Mailcity, Excite, etc. are converted to an internal form, which is later served via a POP3 server.

Of course, this isn't limited to web-based e-mail; there are parsers for web-based forums and bulletin boards, yes, even for Slashdot. The unstructured data here can be converted and served via NNTP (NetNews) or some other method.

There's a huge amount of unstructured data and services available on the Web; making these available to computers is a huge step forward in information technology.

Oh Man by Tadrith · 2002-03-15 11:28 · Score: 1, Insightful

This is a fascinating article. I'm especially interested in it, because I tend to work with databases - most of which are created from completely unstructured data.

For instance, Company A is tracking all their data in a Microsoft Word document. Frequently I get asked to dynamically work with this data, and pull it directly out programmatically. I can attest to how difficult this can be sometimes, and I frequently find that upper management doesn't understand the challenges behind pulling unstructured data out.

I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand, and *poof* the data moves from his text file into a database.

Re:Oh Man by thelenm · 2002-03-15 11:55 · Score: 1

I think upper management (and the general public) think that a computer is some sort of magic box. "The data is right there! Why can't you just take it from the Word document and put it in the database? I can understand it, so why can't the computer?" But people have been working on automatic language understanding for over 50 years and haven't even come close to solving the problem. I work in natural language processing, and I can attest, it's tough.

--
Use Ctrl-C instead of ESC in Vim!
Re:Oh Man by brer_rabbit · 2002-03-15 12:01 · Score: 2

I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand, and *poof* the data moves from his text file into a database.
And pray your boss hasn't heard of Perl :)

/. as a Turing Test by bravehamster · 2002-03-15 11:28 · Score: 5, Funny

email identified by interpretation rather than keywords

A machine will be considered truly intelligent when it can translate all emails on Slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.

--
---- El diablo esta en mis pantalones! Mire, mire!

Re:/. as a Turing Test by Sunda666 · 2002-03-15 12:44 · Score: 1

if the machine can hack into /. database servers and get all of our real email addresses, then I will consider it really smart (unless /. is running IIS/MSSQL these days, it would require not much intelligence then)

--

``If a program can't rewrite its own code, what good is it?'' - Mel

"Slashdot as a minable database of ideas..." by theonomist · 2002-03-15 11:29 · Score: 4, Funny

Oooookay.

Sir? Please step away from the bong.

I just spent an enjoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest Moments in e-Business. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.

Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.

What the hell are you people smoking?

--
"Offtopic, Inflammatory, Inappropriate, Illegal, or Offensive" -- hey, that's me!

Re:"Slashdot as a minable database of ideas..." by Sloppy · 2002-03-15 12:44 · Score: 1

You bastard! Can you imagine what it's going to do to the miner's self-esteem, when it reads your message?

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Re:"Slashdot as a minable database of ideas..." by Anonymous Coward · 2002-03-15 13:59 · Score: 0

Damn my lack of moderator points, that's one of the funniest posts I've seen on Slashdot. Kudos if not karma to theonomist.
Re:"Slashdot as a minable database of ideas..." by ahde · 2002-03-15 19:03 · Score: 2

What that list doesn't tell you is that all those stupid stock market analysts were doubling their money every month or less, because the rest of us suckers bought their hype and false predictions. Paine Webber, Merrill Lynch, Credit Suisse, Oppenheimer & Co. consolidated the largest percentage of the world's money since the London Bay Company & East India Company. And they haven't lost a penny of it. (WTC offices were insured)
Re:"Slashdot as a minable database of ideas..." by MikeBabcock · 2002-03-16 02:23 · Score: 2

It depends on the article ... when was the last time you read every article in a given week and every message attached thereto?

_Sometimes_ unique or semi-unique or thought-provoking ideas get stated. That's the nature of chat rooms and discussion boards. USENET has unique ideas as well and is much more spam-filled and useless looking on first glance.

--
- Michael T. Babcock (Yes, I blog)
Re:"Slashdot as a minable database of ideas..." by t · 2002-03-17 05:28 · Score: 1

Monkeys, keyboards, shake vigorously.

Slashdot by rbgaynor · 2002-03-15 11:30 · Score: 4, Funny

Interesting, my mining of hot ideas on Slashdot has determined that a Beowulf Cluster of First Posts is the next big thing...

--
"Good things don't end with eum, they end with mania or teria." - H. Simpson

Re:Slashdot by cybermage · 2002-03-15 12:12 · Score: 0, Offtopic

Of the Seven Dwarfs, the only one who shaved was Dopey. That should tell us something of the wisdom of shaving.

Nah. Just too dumb to grow facial hair.

--
Some people have a way with words, and some people, um, thingy.

Doomed Doomed we're all doomed by MosesJones · 2002-03-15 11:31 · Score: 3, Insightful

By definition this is an unsolvable problem, because what it requires is definition of undefinition (if such a term exists). While you can make assumptions on unstructured data and apply Natural Language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved but absolutes are impossible to attain.

Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it but you are always making assumptions.

To put it in simple terms for George W. Bush

All Muslims are Terrorists
All Supporters of Militia (McVeigh) are terrorists

Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.

Welcome to 1984, and a Brave New World, the minority will cease to count.

--
An Eye for an Eye will make the whole world blind - Gandhi

Re:Doomed Doomed we're all doomed by Alien54 · 2002-03-15 12:04 · Score: 2

By definition this is an unsolvable problem, because what it requires is definition of undefinition (if such a term exists). While you can make assumptions on unstructured data and apply Natural Language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved but absolutes are impossible to attain.
Probably it should be randomly structured data, but in any case, the problem still boils down to what you described: trying to decide what is relevant and how. Otherwise you just have a bunch of blobs.
Or else you have a database with links to the random objects (word docs, etc.), but descriptions, etc in the database about the objects. Quick and dirty, but not the best solution.

--
"It is a greater offense to steal men's labor, than their clothes"
Re:Doomed Doomed we're all doomed by gwernol · 2002-03-15 14:02 · Score: 1

Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.

Nothing is destroyed. The original data is still there. These technologies are best used to summarize and search large information archives (e.g. the web). Does Google destroy any data? No it merely indexes it in a certain way. In fact search services often make data more easily accessible, the opposite of what you are arguing.

--
Sailing over the event horizon
Re:Doomed Doomed we're all doomed by dekraved · 2002-03-15 16:45 · Score: 1

The other 20% gets lost anyway. Who really reads an entire discussion on /. carefully? The hope for parsing unstructured data is that redundancies can be aggregated, reducing the amount of time needed to consume the full range of ideas in a given set of documents...
Re:Doomed Doomed we're all doomed by maxpublic · 2002-03-15 18:20 · Score: 1

No, what he's arguing is that the context is effectively destroyed. The methods are good enough to get 80% of the meaning, but the other 20% is lost if all you do is thumb through the search results. The only way to restore that other 20% is to read the actual documents themselves, using human reason and judgement to come to logical conclusions (assuming the reader is capable of such a thing in the first place).

Max

--
My god carries a hammer. Your god died nailed to a tree. Any questions?

Good use of XML by soap.xml · 2002-03-15 11:33 · Score: 2, Informative

From the article:

One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.

This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that would provide the facility to not only classify, but also search the many different types of XML signatures for each type of resource could prove to be a very valuable thing for businesses.

Imagine the amount of time that could be saved if you could simply search all of those images/diagrams that you have for different projects, and all of the audio clips from the conferences that you have attended, for that key idea that you're sure is in there, but just can't remember where!

-ryan

Re:Good use of XML by Pinball+Wizard · 2002-03-15 13:21 · Score: 2

Interestingly enough, relational database technology itself was created to overcome the limitations of hierarchical databases (aka tree-based data structures). The problem back then was that not everything can be organized into a nice little hierarchical tree of data; however, if you could create relations between otherwise unrelated pieces of information you could tie together all sorts of disparate data.

Seems to me like we're coming full circle with OOP and XML - trying to create huge monolithic structures that can handle everything we need to do. Look at Java - everything ultimately is inherited from the almighty Object. XML is no better in this regard, although you can have lots of different XML files describing different pieces of data.

I wonder if the people pushing hierarchical (OOP and XML) data models over relational ones realize that the exact opposite was the case 30 years ago. Perhaps we should have stuck with hierarchical databases in the first place?

--
No, Thursday's out. How about never - is never good for you?
Re:Good use of XML by Mr.+Shiny+And+New · 2002-03-15 15:41 · Score: 2, Insightful

It's really a case of using the right tool for the right job. After all, some data is not well expressed in a tree, while some is not well expressed in a relational database. Does this mean it's more right to use one or the other? Too often I see people using XML just because it's new, and not because it actually makes the data easier to work with.

As for the Object hierarchy in Java, it really doesn't limit what you can do with the objects and classes... you can still have a class with no data and only static methods, which is just like a function in C. The nice thing about the automatic Object superclass is that it makes generic, heterogeneous containers really easy to use.
Re:Good use of XML by WatertonMan · 2002-03-15 21:20 · Score: 1

Actually, depending upon the kind of data you are mining, XML is very poor for this. Consider a simple structure that exists in every book. You have pages, paragraphs, authors, quotes, and so forth. The problem is that different blocks are not always within other blocks (i.e. nested, the way inner loops are always nested in a programming language). Instead, a paragraph block can be half in one page block and half in another.
That doesn't sound like a big problem, but it can be when you are using regions to map out new concepts. (i.e. analyze a class of words in all sentences that contain the concept of Apple computer) In practice writing "concepts" to analyze (data mine) texts of this sort is very hard. Further using tools like Perl can be a pain. Yeah you can do it, but you probably won't do it well.
I know that the company Sageware which I have dealt with does what this article describes. However it supplies various "objects" for mining for concepts. It ends up being tricky stuff which is why mainly large portals use the technology.
The basic notions can apply to Perl or simple C code. Go very complex though and things get messy very quickly.
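
One common way around the nesting problem described above is to keep the regions outside the text as character offsets (often called stand-off annotation) instead of as nested tags. The following is a minimal Python sketch under that assumption; the `Span` type, the offsets, and the `overlapping` helper are all hypothetical, and this is not a description of how Sageware or any other product works.

```python
from collections import namedtuple

# A region of the text: kind of block, start offset (inclusive), end offset (exclusive).
Span = namedtuple("Span", "kind start end")

def overlapping(spans, kind, target):
    """Return the spans of the given kind that share at least one character with target."""
    return [s for s in spans
            if s.kind == kind and s.start < target.end and target.start < s.end]

# Hypothetical book: two pages, and a second paragraph that straddles the page break.
spans = [
    Span("page", 0, 100),
    Span("page", 100, 200),
    Span("paragraph", 0, 60),
    Span("paragraph", 60, 140),
]

straddler = spans[3]
pages_touched = overlapping(spans, "page", straddler)  # both pages
```

Because the spans are independent records, a paragraph can sit half on one page and half on the next without either block having to contain the other.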
Re:Good use of XML by dubl-u · 2002-03-16 05:14 · Score: 2

Interestingly enough, relational database technology itself was created to overcome the limitations of hierarchical databases (aka tree-based data structures). [...] Look at Java - everything ultimately is inherited from the almighty Object.

Don't mistake a hierarchical type structure for a hierarchical data structure.

In Java, one might model things so that Persons and Vehicles are both subclasses of Object, and that Cars and Trucks are subclasses of vehicles. This is indeed strictly hierarchical.

But a Person called Joe can be the owner for a Truck, ride in a Car, and be the spouse of another person Jane simultaneously. That's not a hierarchical relationship; it's a web of connections.

You can still have hierarchical relationships with OO data; if Joe sells his truck, the Engine and the four Wheels would automatically go along with it. But that's just one possible relationship.
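
The distinction above can be sketched in a few lines. The post talks about Java; this illustration uses Python for brevity, and every class and attribute name here is invented for the example. The class definitions form a strict tree, while the objects built from them form a web that includes a cycle.

```python
class Vehicle:
    def __init__(self, name):
        self.name = name
        self.owner = None

class Car(Vehicle):
    pass

class Truck(Vehicle):
    pass

class Person:
    def __init__(self, name):
        self.name = name
        self.spouse = None
        self.owns = []
        self.rides_in = None

# Type structure: a strict hierarchy (Truck -> Vehicle -> object).
joe, jane = Person("Joe"), Person("Jane")
truck, car = Truck("pickup"), Car("sedan")

# Data structure: arbitrary references between instances.
joe.spouse, jane.spouse = jane, joe   # a cycle, impossible in a tree
joe.owns.append(truck)
truck.owner = joe
joe.rides_in = car
car.owner = jane
```

Following `joe.spouse.spouse` leads back to `joe`, which no tree-shaped data model allows, even though the classes themselves still sit in a tidy inheritance tree.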

UK data protection law by skinfitz · 2002-03-15 11:39 · Score: 2, Interesting

This sounds interesting - particularly how essentially this is something that makes an unstructured filing system suddenly become a structured filing system. What implications does this have for UK law?
UK data protection states that copies of email have to be kept for a 28 day minimum period. It advises that "email is a transitory medium" and our company person in charge of such policies has just written a policy that says I'm supposed to program our mail systems to auto-delete mail after a three month period. Staff are supposed to save their emails that they want to keep to their local hard drives, as they suddenly become "documents" rather than emails.
Why? Because in the UK any individual can legally ask for copies of any email that mentions them individually by name. Local hard drives can be searched; however, this is only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system" to the point where I would argue that FAT, NTFS and HFS are structured due to the fact they utilise indexes. Add to this the new MS Object Oriented Filing System (OFS) that is basically going to be a simplified version of SQL Server as a filing system: is the ability to search previously considered "unstructured" data going to complicate the UK law?

Forward this to the Director of IT, stat! by johncheng · 2002-03-15 11:40 · Score: 4, Funny

This article will have great importance to our director of IT, since the way our company stores data seems to be completely unstructured.

Mining /. by saintlupus · 2002-03-15 11:41 · Score: 1

Why is it that the very thought of mining Slashdot makes me think of the goatse.cx guy?

Hell, I've really got to stop reading at -1 so much.

--saint

Google Made to Order by shalunov · 2002-03-15 11:41 · Score: 3, Informative

Some quotes from the press release:

People actually vote their preferences by providing links to different documents. You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something.

This Discoverylink(TM) search engine concept somehow sounds very familiar. Where could I have heard this innovative idea before? Or, as the press release asks, "Where did I read that?" Ah, yes!

--

-- Stanislav Shalunov

Re:Google Made to Order by John+Harrison · 2002-03-15 11:47 · Score: 2

"Where did I read that?" Ah, yes! [google.com]
Of course if you were in IBM Research, as the authors are, you might have been familiar with The Clever Project prior to Google. It is explained very nicely here.
I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.

--
Lasers Controlled Games!
Re:Google Made to Order by shalunov · 2002-03-15 11:56 · Score: 2

I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.
The concept of using links as votes to rate resources is simply not novel anymore. Everyone knows about it. I'm not claiming that Google invented it (not being a very non-obvious idea, this was probably independently developed at a number of places); but presenting stuff familiar to everybody as "Invented Here" news sounds like PR.
But wait, it was a press release. Submitted by someone from IBM, too.

--
-- Stanislav Shalunov
Re:Google Made to Order by John+Harrison · 2002-03-15 12:40 · Score: 2

But wait, it was a press release. Submitted by someone from IBM, too.
Just to make the conspiracy complete, I am from IBM as well.

--
Lasers Controlled Games!
Re:Google Made to Order by dubl-u · 2002-03-16 05:34 · Score: 1

U BM, I BM, we all BM for IBM.

good title, but mismatched content by candot · 2002-03-15 11:41 · Score: 3, Interesting

They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.

I just attended the Knowledge Technologies conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio, for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.

[Skip next section to avoid my self-promotion]

I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 are. Check out the demo of our technology, Waypoint 2.0, which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.

This Is Like Mining Money by Anonymous Coward · 2002-03-15 11:45 · Score: 5, Funny

"email identified by interpretation rather than keywords"

Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.

Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.

Suggestion: Give me a dollar so I can start a business.

Email? by Fucky+the+troll · 2002-03-15 11:53 · Score: 0

Email identified by interpretation rather than keywords? Does that mean email addresses can be identified in the same way? Surely that'd mean spamblocks wouldn't work any more.

--

Roadkill is yummy.

Re:Email? by feloneous+cat · 2002-03-15 12:02 · Score: 1

Interpretation!?!
That would make my new e-mail address you-f**king-b*****d-dont-you-come-around-here-again-you-c**k-s****r-exclamationpoint-exclamationpoint.
Somehow I just can't see telling Mom my new e-mail address...

--
IANAL, but I've seen actors play them on TV

Some interesting technology... by jonfromspace · 2002-03-15 11:56 · Score: 2

These guys have some interesting tech revolving around semantic search... worth a boo anyway...

--
I am become Troll, destroyer of threads

Worked at two startups that do this by voisine · 2002-03-15 12:01 · Score: 2, Insightful

We used Perl regular expressions and lex/yacc-like tools to tease structured data out of semi-structured web pages and other listings. It's doable if you limit your scope to one particular subject, such as job listings. The hardest part is creating contextual lexicons. Does MS mean that a master's degree is required? The job is located in Mississippi? Experience with Microsoft products is required? The HR contact is Ms. Smith? You have to figure it out based on context. Is MS preceded by a city name, that type of thing.
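
A contextual lexicon of this kind can be sketched as an ordered list of regular expressions, each tied to one reading of "MS". This is a minimal Python illustration, not the Perl or lex/yacc code the post refers to; the patterns, labels, and the `classify_ms` name are assumptions, and a real lexicon would need far more rules.

```python
import re

# Ordered rules: the first pattern that matches decides the reading.
RULES = [
    # "Jackson, MS": capitalized word, comma, then MS
    ("state", re.compile(r"\b[A-Z][a-z]+,\s*MS\b")),
    # "Ms. Smith": courtesy title followed by a surname
    ("title", re.compile(r"\bMs\.\s+[A-Z][a-z]+")),
    # "MS degree", "MS in ...", or "BS/MS"
    ("degree", re.compile(r"\bMS\b(?=\s+(?:degree|in)\b)|\b(?:BS|BA)\s*/\s*MS\b")),
    # "MS Office", "MS Windows", and similar product names
    ("vendor", re.compile(r"\bMS\s+(?:Office|Windows|SQL|Word|Excel)\b")),
]

def classify_ms(text):
    """Guess what 'MS' means in a job-listing fragment from its context."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"
```

For example, `classify_ms("Location: Jackson, MS")` yields `"state"` and `classify_ms("Experience with MS Office required")` yields `"vendor"`. Anything the rules do not cover falls through to `"unknown"`, which is where the hand-tuning effort goes.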

inFact by Anonymous Coward · 2002-03-15 12:02 · Score: 0

Insightful has a cool software package called inFact. http://www.insightful.com/solutionslibrary/TextIma geMining/inFactInfoExtraction/Information_bizcase. asp

Nat. Language Understanding != Speech Recognition by thelenm · 2002-03-15 12:02 · Score: 3, Informative

A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.

--
Use Ctrl-C instead of ESC in Vim!

Polymorphic Searching by waimate · 2002-03-15 12:04 · Score: 2, Informative

Of all the information stored in computers, 80% of it is unstructured, and arguably it's the most valuable 80%, too.

Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.

Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect it, so therefore attempting to design a structure, a priori, to hold it is always doomed to failure.

The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.

The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.

Re:Polymorphic Searching by Com2Kid · 2002-03-15 13:35 · Score: 2

The irony is of course that ALL data that we receive and send for human consumption is INDEED structured.

It is just not structured according to how the COMPUTER sees it.

Hell, this posting of mine right here is structured, and beyond the obvious sentences/paragraphs explanation that is most often given.

Almost all written work is designed so as to allow the reader to follow along the author's thought process.

Indeed, writing could be looked at as some sort of bare-level one-shot emulation code for the human brain.

Now for computers this makes NO sense at all.

Uh duh, they don't think.

Now with a lot of work natural languages can indeed be PARTIALLY understood by computers, and there is an artificial language out there (I forget the name) that was designed from the ground up for both computer and human understanding on a quasi-equal level. But even so it cannot match the same. . . . underlying meanings between both parties.

Humans are capable of understanding all of the complexities of modern day computers; it may require a lot of work and some darn good wizardry, but it IS possible.

The issue is that the way that computers 'think' is nothing but a subset of our own thought methods that we have expanded upon and made more complex but ultimately added nothing new to.

And yet it is by the very nature of being a subset that computer 'thinking' (ugh I hate using that term in this context) can only contain a partial set of the abilities of Human thinking.

Ah, to take a related explanation from Dansdata

"But a clever enough algorithmic composition system can get around this, by using a human to direct it through infinite musical space. With any luck, the human will have some idea of what sounds good; that's a really difficult thing to teach a computer."

(speaking about the Korg Karma's composition functions)

Humans have to GUIDE the computer.

For instance the file finder feature on many OSs.

If I tell my Windows box to search for mIRC* it will search my entire computer's hard drive including my Cygwin folder and my C:\corel folder.

Which is obviously highly friggin stupid since mIRC is NOT going to be in either one of those. (well not today at least. :) )

But the COMPUTER does not know that. Despite having a highly refined layout system for my files that has everything compacted into nice small little subsets of subsets as to what types of file it is, the damn computer has:

No idea WTF mIRC is, what IRC is (outside of some sort of program that tells the computer to interpret network packet X with Y evaluation system and display Z depending on X's contents, and oh yeah, shove the word IRC on the window while you're at it. That is ALL computers know of IRC), what the hell a 'program file' is or why in the world (no concept of 'why' either) mIRC would be in C:\program files\

Now if I use a bit of human judgement and direct the computer to search only C:\program files\ it can find the requested files just fine.

But it is STUPID. Period.

What is the BEST possible outcome we can hope for in this situation? Hmm?

Hah. All files in some sort of a database system? Make it 'object based'? Or just add assloads of data to the 'file fork'.

Bah, it would STILL come down to the computer going over each friggin entry in a database until it gets a match with the search string. Hell, even if some more efficient searching algorithm is used besides just going through every item in the database, the fact is that the computer

(pay attention here folks)

STILL HAS NO FRIGGIN IDEA AS TO WHAT IN THE HELL mIRC is.

I can add descriptors to heck to all files associated with the program. And the computer will STILL NOT KNOW WHAT mIRC IS!

Once again.

THE COMPUTER HAS NO IDEA AS TO WHAT THE HELL ANYTHING IS.

For instance.

I know offhand that my copy of virtual dub is in F:\video editing tools\virtual dub\ (actually the version number follows it, but close enough. :) )

Now the computer has no idea as to what 'video editing tools' is (I am using is here folks, plural? Huh, whats that? what is 'what'. The computer does not have an understanding of ANY of these topics.)

In fact, one thing that SO many people seem to forget, is that COMPUTERS UNDERSTAND NOTHING.

Nothing AT ALL.

PERIOD.

So please.

Please.

PLEASE

Understand that the computer will NEVER be able to truly organize or structure your data, because the computer does not even know what the hell a structure is. Sure you can tell it to shove such and such bits into such and such places, but it knows not what those bits are or what those bits mean or what those places mean or what the hell a place is or ANYTHING ELSE AT ALL.

I can make my computer feel happy.

I have it show "I am happy" on the screen.

That is as close as you are ever going to get the current breeds of computers to being able to understand or think about anything at all.

Because everything eventually comes down to that same basic fact.

The computer does what you tell it to and nothing else.

--
Need help treating your acne? Come here!

debian dependencies data structures by Anonymous Coward · 2002-03-15 12:15 · Score: 0

Anyone care to analyse the relationships between debian package dependencies ?

My current understanding is:

There are 8 types of dependencies: 5 that are enforced (pre-depends, depends, conflicts, replaces, provides) and 3 optional types (suggests, recommends, enhances).

Two packages can depend or pre-depend on each other.
A package can conflict, provide or replace itself.

The data structure must be:
- directional
- cyclical (self-loops)
- a multigraph (parallel edges of different dependency types)

So it's a multidigraph (I think).

I'd like to analyse the entire dependency graph and do all sorts of checks to analyse each release as a whole rather than on a package-by-package basis.

I have a good book (Algorithms in C, Part 5: Graph Algorithms by Sedgewick), but I'm still not sure how to analyse, or represent (in a more structured way), this beast.

Anyone have any hints ?
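
One possible starting point, as a minimal sketch rather than anything Debian's own tools do: keep the graph as a dictionary of dictionaries keyed first by source package and then by dependency type, so parallel edges of different types and self-loops both fall out naturally. The class name `DepGraph`, its methods, and the package names `pkg-a`, `pkg-b`, `pkg-c` are made up for illustration.

```python
from collections import defaultdict

# The eight relationship types listed in the post above.
DEP_TYPES = {
    "pre-depends", "depends", "conflicts", "replaces", "provides",
    "suggests", "recommends", "enhances",
}


class DepGraph:
    """Directed multigraph: parallel edges are kept apart by dependency type."""

    def __init__(self):
        # edges[source][dep_type] -> set of target package names
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, source, dep_type, target):
        if dep_type not in DEP_TYPES:
            raise ValueError(f"unknown dependency type: {dep_type}")
        self.edges[source][dep_type].add(target)
        self.edges[target]  # touch so the target exists as a node too

    def targets(self, source, dep_types):
        """Packages reachable from source in one step via the given types."""
        found = set()
        for dep_type in dep_types:
            found |= self.edges.get(source, {}).get(dep_type, set())
        return found

    def self_loops(self):
        """(package, type) pairs where a package points at itself."""
        return sorted(
            (pkg, dep_type)
            for pkg, by_type in self.edges.items()
            for dep_type, targets in by_type.items()
            if pkg in targets
        )

    def mutual(self, dep_types=("depends", "pre-depends")):
        """Pairs of distinct packages that depend on each other."""
        pairs = set()
        for a in list(self.edges):
            for b in self.targets(a, dep_types):
                if a != b and a in self.targets(b, dep_types):
                    pairs.add(tuple(sorted((a, b))))
        return sorted(pairs)


# Hypothetical packages, only to exercise the structure.
g = DepGraph()
g.add("pkg-a", "depends", "pkg-b")
g.add("pkg-b", "pre-depends", "pkg-a")   # mutual dependency
g.add("pkg-a", "recommends", "pkg-b")    # parallel edge, different type
g.add("pkg-c", "provides", "pkg-c")      # self-loop
print(g.mutual())
print(g.self_loops())
```

Whole-release checks (mutual dependencies, self-conflicts, and so on) then become plain traversals over `edges`. The cycle-finding algorithms in the Sedgewick book should apply once you decide which edge types count as "real" dependencies for a given check.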

XML as a solution by Prowl · 2002-03-15 12:16 · Score: 1

Isn't XML only part of the solution? I'm pretty sure RDF comes into play somewhere here.

--
That man tried to kill mah Daddy

XML won't make it by mangu · 2002-03-15 12:16 · Score: 4, Insightful

To encode information in XML is as much work as doing it in SQL or any other language. What is needed is artificial intelligence, to take any data source, be it a picture, text, music, or whatever, and classify it. Some examples of what I have wished for:

- show a text and find other texts about the same subject.

- hum a tune and find an mp3 of the same music.

- show a picture and find other pictures of the same girl.

- better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...

Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.

Re:XML won't make it by Anonymous Coward · 2002-03-15 13:38 · Score: 0

Bell Labs is already doing this stuff for CNN.
Not for the masses yet. But geez, people, don't you think some of this stuff is not so hard really??

I mean, one part of the system they use involves using speech rec to match video with audio, not word for word, but by setting keyword matches every so often. They found that is much more accurate than plain speech rec, plus it is a lot faster.

Re:XML won't make it by foobar104 · 2002-03-15 16:57 · Score: 2

Bell Labs? Are you high? The system you're talking about is commercially available: it's called Virage VideoLogger. (I'd provide a link, but the Virage web site sucks so bad.... Just go to www.virage.com [the www is mandatory].)

VideoLogger has neato features like speech-to-text, speaker identification, face recognition, and keyframe extraction. All of those things happen in real time, if the PC is fast enough for it.

Combined with a half-decent RDBMS back-end, you can do stuff like search on "Saddam Hussein" and get back a reference to a clip that includes a picture of him, but not his actual name anywhere in the voiceover or the CC data. It's pretty cool.

It's also, like, $60,000 a copy, or something.

No, I don't work for Virage, and I've never had a business relationship with them. I've seen their stuff demoed, though.

creative uses by rnd() · 2002-03-15 12:17 · Score: 3, Informative

There are some companies that are doing some creative things with this kind of technology.

It makes you wonder how much of this is based on theoretical linguistics and formal semantics, and how much is based on good old fashioned statistics and optimization.

--

Amazing magic tricks

Re:creative uses by gwernol · 2002-03-15 14:08 · Score: 2

It makes you wonder how much of this is based on theoretical linguistics [stanford.edu] and formal semantics [mit.edu], and how much is based on good old fashioned statistics [nec.com] and optimization.

I can't speak to the work discussed in the original post, but I do know that in the real world a formal linguistics/semantics approach is impractical. These systems require complete or near-complete knowledge structures to work at all. They are brittle, meaning that as the world changes they fail to adapt to the changing lexicon. Formal systems are often computationally expensive and scale poorly to large data sets. The practical problems of constructing and maintaining the formal knowledge structures quickly overwhelm the advantages they have over looser approaches.

So in most cases it is a hybrid of machine learning and statistical techniques that is used in these systems.

--
Sailing over the event horizon

Re:creative uses by WatertonMan · 2002-03-15 21:49 · Score: 1

Often you can mix bits of formal systems with bits of statistical systems. Depending upon what you need, it can get you quite a ways. Of course, formal structure (besides being problematic philosophically) is pretty much beyond anything we could conceive of writing. However, you can do things like write a statistical part-of-speech tagger and then use those structures to find direct objects. Tricks like that are often very helpful in mining data.
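
To make that tagger-plus-rule trick concrete, here is a toy sketch. It assumes a unigram "most frequent tag" tagger, which is about the simplest statistical tagger there is, and a crude hand-written rule for direct objects. The tag names, the tiny training set, and the function names are all invented for illustration; a real system would train on a proper tagged corpus.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn each word's most frequent tag from hand-tagged sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word with its learned tag; unseen words get the default."""
    return [(w, model.get(w.lower(), default)) for w in words]


def direct_objects(tagged):
    """Crude rule: first noun after each verb, skipping determiners/adjectives."""
    objects = []
    for i, (_, t) in enumerate(tagged):
        if t != "VERB":
            continue
        for word, nxt in tagged[i + 1:]:
            if nxt in ("DET", "ADJ"):
                continue
            if nxt == "NOUN":
                objects.append(word)
            break
    return objects


# Made-up training data: two hand-tagged sentences.
training = [
    [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"),
     ("the", "DET"), ("cat", "NOUN")],
    [("a", "DET"), ("user", "NOUN"), ("opened", "VERB"),
     ("the", "DET"), ("old", "ADJ"), ("file", "NOUN")],
]
pos_model = train_unigram_tagger(training)
tagged = tag("the user opened the old document".split(), pos_model)
print(direct_objects(tagged))
```

The statistical part does the fuzzy work (guessing tags, including for the unseen word "document"), and the formal part is a small rule laid on top of its output.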

database of descriptions by Alien54 · 2002-03-15 12:24 · Score: 2

Or else you have a database with links to the random objects (Word docs, etc.), with descriptions and so on about the objects stored in the database. Quick and dirty, but not the best solution.

Which, come to think of it, is what is happening in XML anyhow: you are adding tags in the file instead of keeping descriptive data outside in the database.

YMMV as far as which method will work better for you.

--
"It is a greater offense to steal men's labor, than their clothes"

Poetic justice? by Anonymous Coward · 2002-03-15 12:27 · Score: 0

ATHBT

They are talking about searching by thogard · 2002-03-15 12:36 · Score: 2

What happens when you don't know what you're even looking for? Data mining is more about ways to automatically find interesting ways of indexing and displaying data than simply looking up known values in unstructured data.

There is a package that is good at displaying unstructured data and letting you see strange patterns, since it has tools to find patterns in the data. It's called Partek.

E2? by Anonymous Coward · 2002-03-15 12:44 · Score: 1, Funny

Unstructured, like [Everything2]?

You know.. by Anonymous Coward · 2002-03-15 12:58 · Score: 0

It was a pretty lousy article. Did the editor even skim it before posting this?

This can be handled by rho · 2002-03-15 13:10 · Score: 3, Insightful

This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."

Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.

I relate all this in terms of human-interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data-mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.

Oh, and Apple called--they'd like their Knowledge Navigator back, please.

--
Potato chips are a by-yourself food.

KDD by rusti999 · 2002-03-15 13:22 · Score: 1

There is a field dealing with this. It's called Knowledge Discovery in Databases (KDD). It's been around for a few years now. Go here for a slightly more technical overview. The posted article is aimed more toward the business people rather than the technical people.

Semi structured data, likely the way. by DutchSter · 2002-03-15 13:32 · Score: 2, Interesting

This reminds me of work I did as an undergrad in my advanced database class. At the end of the term we were given group projects to research and present "future" database concepts. One of them was unstructured data. The conclusions drawn were that right now unstructured data has no real value. The value assigned to a particular element can only be assigned by the human who assesses those values. My group was assigned the task of making unstructured data available to standard databases.

Consider:
3822 North Fickle, Frequent Customer - solicit often.

The inherent meaning is obvious to a human. Most likely a street address followed by a remark of some type. For a computer to correctly (and in a real world industry, the demand is that it be 100% correct, always, or forget it) establish the meanings would require some serious AI.

Enter XML. As it exists now, XML is the bridge between unstructured data and data which can be formalized into a structured format. As other people have pointed out, XML does solve some of these issues. Semistructured data via XML is a fairly recent innovation and does a nice job adding meaning and definition to otherwise cryptic (to a computer) strings of ASCII characters.

Going back to the original problem however, XML must be inserted by a human. Until such time as a machine can *establish* intrinsic value on data, there needs to be an intermediate platform. This doesn't help in the example provided where a company keeps data in a Word document. That data is truly unstructured and random. Human interaction can easily destroy the meaning of a document to a computer, without affecting (and perhaps increasing) the comprehension by other people.

In the near future, it looks like the closest we will truly get is semi-structured, not un-structured data. He who ultimately solves this problem will also solve AI.

Re:Semi structured data, likely the way. by WatertonMan · 2002-03-15 21:35 · Score: 1

As I mentioned elsewhere in this discussion, the problem with XML is that it must be fully nested. This is, for many types of unstructured data, a horrible situation. The problem is that when mining for data you often don't have the structure but are creating the structure. This relates various contexts in ways that don't fit the requirements of an XML topology. An example of this is relating pages to paragraphs. Paragraphs aren't always nested within pages. One structure can cross the borders of the other structure.

However, once you have some structures (say basic linguistic units like sentences, words, paragraphs, pages, speakers, etc.) you can then create other ones. From those structures you can then use various techniques to develop more information.

Once again, great in theory, complex in practice. However, many of the techniques used in NLP to understand words can then be expanded for larger units of meaning. Further, you can then start to relate various types of contexts. Of course, how helpful all this is depends on the type of analysis you are making. Some practical problems are very solvable now. Other problems are more complex.

But consider some future "Google" which indexes pages based not on words but on concept spaces. It then uses other methods, such as the links to a page and so forth, to rank not just pages but concept *spaces* within a page. Finding information would be much, much easier.
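
The page/paragraph problem above can be shown with what is sometimes called standoff annotation: keep the structures as offset ranges outside the text instead of as nested tags inside it. This is only a rough sketch under that assumption; the `Span` type, the function names, and the offsets are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labelled region of the text, stored outside the text itself."""
    kind: str    # e.g. "page" or "paragraph"
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive


def overlaps(a, b):
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def nests(inner, outer):
    """True if inner lies entirely within outer (what XML nesting requires)."""
    return outer.start <= inner.start and inner.end <= outer.end


def crossing(spans, kind_a, kind_b):
    """Pairs that overlap without either one containing the other."""
    return [
        (a, b)
        for a in spans if a.kind == kind_a
        for b in spans if b.kind == kind_b
        if overlaps(a, b) and not nests(a, b) and not nests(b, a)
    ]


# Made-up offsets: two pages and three paragraphs over 200 characters.
spans = [
    Span("page", 0, 100),
    Span("page", 100, 200),
    Span("paragraph", 0, 60),
    Span("paragraph", 60, 140),   # runs across the page break at offset 100
    Span("paragraph", 140, 200),
]
for para, page in crossing(spans, "paragraph", "page"):
    print(para, "crosses", page)
```

The middle paragraph crosses both pages, so no single element tree can hold pages and paragraphs at once, while the flat list of spans has no trouble with it. New structures can be added later as more spans without touching the ones already there.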

FlipDog uses this for job-hunting by bruckie · 2002-03-15 13:36 · Score: 1

FlipDog crawls the web and uses machine-learning technology to extract job listings from companies' web sites. You can then browse through them at their web site, filtering based on location, position category, etc.

The technology was developed by WhizBang Labs, and is quite cool. They basically take a small set of job listings that their crawler finds, have a human classify parts of the web page (job title, location, description, etc.) and then let their software program loose on it. It analyzes the human-filtered web pages, "learns" how to extract relevant data, and then uses that to classify all of the other crawled pages. Of course, this is over-simplified, but that's the basic idea.
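
To give a rough feel for the "human labels a few, software learns the rest" idea, here is a toy sketch using a naive Bayes word classifier. This is a stand-in technique chosen for brevity, not a description of what WhizBang actually runs; the labels, example lines, and function names are all made up.

```python
import math
from collections import Counter, defaultdict


def train(labelled_lines):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_lines:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the best naive Bayes log-score (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hand-labelled fragments of job pages (invented for this sketch).
examples = [
    ("Senior Software Engineer", "title"),
    ("Database Administrator", "title"),
    ("Salt Lake City, Utah", "location"),
    ("Boston, Massachusetts", "location"),
    ("You will design and maintain database systems", "description"),
    ("You will work with a team of engineers", "description"),
]
job_model = train(examples)
print(classify("Junior Software Engineer", *job_model))
print(classify("Provo, Utah", *job_model))
```

With only six training lines this is obviously fragile, but the shape is the same: a person classifies a sample, and the counts learned from that sample are applied to everything else the crawler finds.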

(I'm not affiliated with them, other than successfully finding a job there.)

--Bruce

--
There are 10 kinds of people in the world: those who understand binary, and those who don't.

Re:FlipDog uses this for job-hunting by candot · 2002-03-15 13:48 · Score: 1

The problem with this kind of approach is that it doesn't scale well to growing repositories of content where the conceptual span changes over time. For job listings and resumes, this works well because the set of concepts encoded in the content changes very slowly. If I recall correctly, WhizBang Labs recently partnered with LexisNexis to classify legal stuff. It'll probably work there too, as long as they've got a room full of monkeys to keep the training up to date.

But for dynamic environments (email, Usenet, news/weblog RSS feeds, knowledge bases, etc.), the WhizBang approach, and just about all approaches that rely on sample-based training or hand-built taxonomies, just don't scale.

But at least you found a job :)

Re:FlipDog uses this for job-hunting by WatertonMan · 2002-03-15 21:45 · Score: 1

This works because resumes have a structure. The structure varies a fair bit and is somewhat vague in implementation, but it is there. Consider the problem akin to finding word breaks in text if you weren't given such things. Obviously a slightly different problem, but the reason we can solve it is because there is structure to what you are looking for. (I bring it up just because that's the problem I'm working on at work.)

Concepts and so forth are far more unstructured. Consider the problem of finding all references to Apple executives. Now you can get part way there with complex queries. But somehow you have to take some information (say executive names gleaned from connection to terms about executives near terms related to the company name) and then use that info to define spaces in a text or information in text to get you further information. That is a much more complex problem than simply tagging text with XML or so forth. The final output might possibly be taggable. However, generating that final output involves many intermediate steps that require complex views of both terms and space.

You end up requiring a way of querying documents so that you can use complex boolean and ranked queries and complex notions about position and space ranges. Thus you might have a complex boolean query that finds all terms with a certain rank (to do fuzzy match or more complex notions of belonging to a set). Then with those results you create a region and then use those regions for further calculations.

My caveat for all this is that I did work on a project for Lextek International (Lextek.com) that did do all this. So I'm somewhat biased. Probably no one here (given the Open Source nature of things here) would likely be a client. So hopefully I can say all this without anyone thinking I'm just tooting my own horn. Besides - I hardly ever see anything on slashdot I can actually say anything about.
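
The "query, then build regions, then query the regions" idea can be sketched in a few lines. This is a deliberately crude illustration and not Lextek's implementation: the window size, the capitalisation heuristic, the function names, and the example text (with the made-up company "Acme" and person "Pat Example") are all assumptions.

```python
def positions(tokens, terms):
    """Token indexes where any of the given (lowercase) terms occurs."""
    return [i for i, tok in enumerate(tokens)
            if tok.lower().strip(".,") in terms]


def regions_near(tokens, anchor_terms, context_terms, window=8):
    """(start, end) token ranges around each anchor that hold a context term."""
    anchors = positions(tokens, anchor_terms)
    contexts = positions(tokens, context_terms)
    regions = []
    for a in anchors:
        start, end = max(0, a - window), min(len(tokens), a + window + 1)
        if any(start <= c < end for c in contexts):
            regions.append((start, end))
    return regions


def capitalised_in(tokens, regions, exclude):
    """Second pass: pull capitalised tokens out of the regions found above."""
    found = []
    for start, end in regions:
        for tok in tokens[start:end]:
            word = tok.strip(".,")
            if (word[:1].isupper() and word.lower() not in exclude
                    and word not in found):
                found.append(word)
    return found


text = ("Acme said yesterday that its chief executive Pat Example will "
        "retire next year. The picnic at Acme went ahead as planned "
        "despite the rain.")
tokens = text.split()
regions = regions_near(tokens, {"acme"}, {"executive", "ceo", "president"})
print(regions)
print(capitalised_in(tokens, regions, exclude={"acme"}))
```

The first query finds company mentions that sit near executive-type terms and turns each hit into a region. The second step only looks inside those regions, so the second "Acme" (the picnic) never contributes candidates. A real system would rank the hits and use far better name detection than "starts with a capital letter".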

Knowledge Discovery in Databases by sl956 · 2002-03-15 14:07 · Score: 1

This article refers to something called KDD. Knowledge Discovery in Databases [KDD], like its counterpart Knowledge Discovery in Texts [KDT], is a whole field of computer science. It's been around for more than 20 years now. The ACM even has a Special Interest Group: ACM-SIGKDD.

For an introduction you should read _Introduction to Machine Learning_ (Kodratoff, Yves; Morgan Kaufmann Pub; 1988), or for a more recent and complete survey, _Advances in Knowledge Discovery and Data Mining_ (Fayyad, Usama; AAAI/MIT Press; 1996).

Re:Knowledge Discovery in Databases by Anonymous Coward · 2002-03-16 04:54 · Score: 0

The parent post makes a good connection. On a side note, the layman's term (to a PHB) is Data Mining. From the little that I know, KDD is a lot more than just Data Mining. In the cases of military or large corporate databases containing terabytes of data, mining isn't the problem. It's mining it efficiently to get usable results.

In the case of the article mentioned in the post, it would appear IBM is going further. Here is an excerpt:

Unstructured emotions

Helping computers understand human emotions is the goal of another unstructured data research project. This work is taking natural language understanding -- which allows a computer to recognize natural speech rather than specific commands -- a significant step further. In this case, a computer will review text and determine the sentiment of the individual who created it.

So how can a plain piece of paper express feelings like anger or enthusiasm? "Many of the cues that authors provide to human readers are cues that are available to machines" explains Jhingran. "And if a machine takes those cues into account it doesn't necessarily have to do very deep natural language understanding in order to comprehend what the document is about, what the sentiment of the document is, what the important features of the document are."

If I am reading the article correctly, IBM's goal is much broader than the usual RDBMS data mining. Trying to capture and organize all sorts of data is a tremendous task. Normalizing that data in a mine-friendly format is another challenge by itself. In some ways, this post is related to earlier posts in the week about Text/News summaries. In this particular case, a facial expression is the combination of several facial movements. The computer needs to be able to recognize particular movements/patterns as "happy" or "possibly happy". Once a system has collected all this information, organizing it into meta data or a knowledgebase is pretty challenging. To me, all of this seems to be part of a broader goal of changing computer interfaces to be more "human."

There's a research group in Germany focusing on natural human interface systems. I don't remember the URL at this moment, but it's part of a grand scheme to move away from the keyboard. When we'll get there is anyone's guess.

This was my final year project thesis by Beliskner · 2002-03-15 14:59 · Score: 2, Informative

This was my final year project thesis. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC=pipes) to add XML tags to English, and then index them into a search engine which would use the lingual data stored in the XML tags to help the search.

NIST does a MASSIVE competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go

It took me 200 hours to fish out all these links (before the Google days), I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old so pardon me for the odd broken link, armed with these you could probably turn jello into XML ;-)

My favourite bookmarx
PROJect[21 links]
Beginners' Guide[13 links]
Berkeley Linguistics Dept. Course Summaries, general stuff
Cryptic IR Vocabulary defined
Explanations of weird words like hypernym
How do we produce and understand speech
How Inverted Files are Created - University of Berkeley
NLP Univ. of Indiana, very good basics e.g. word sense d
Simple language - useful....
What is Natural Language Processing, links
What is POS tagging........
Word Sense Disambiguation defined
Word Sense Disambiguation in detail, scroll down far
Word Sense Disambiguator - LOLITA (tested at MUC-7 and SENSEVAL competition as best)
XML for the absolute beginner

HTML, XML stuff + parsers[19 links]
Apache plug-in that uhhh does stuff with XML
Convert COM to XML
convert XML, HTML to Unix pipeable formats
converters to and from HTML
expat XML parser
HTML Tidy - converts HTML 2 XML + source code!!
Parse DB (RDBMS, whatever) to XML
Perl-XML Module List
PHP Manual XML parser functions - what the hell are they talking about, PHP Virtual M...
Public SGML-XML Software
Pyxie - XML Processor for Python, Perl, etc.
SGML+XML tools.org
The XML Resource Centre - massive number of links
W4F wrapper - wrapper converts XML to HTML
XFlat - convert flat file into XML
XML Parsers and other XML stuff
XML.com - Parsers, etc.
XML-Data Catalog System - uhhhh looks close
XTAL's general converter - convert anything 2 XML

other Background[8 links]
Is Linux ready for the Enterprise, scalable...
Linux reliability
Linux Versus Windows NT, Mark(sysinternals bloke)
PC reliability (pcworld)
SPEC - Standard Performance Evaluation Corp.
Systems benchmarks
TPC - Transaction Processing Performance Council
Unix Beats Back NT In EDA Workstation Arena

Proper TREC(-8) QA systems[2 links]

pg. 387 LIMSI-CNRS pretty deep parsing[2 links]
More links....
NLP, IR links - lots to corpii, etc.

pg. 575 U. of Ottawa and NRL (shit system, got 0%)[1 links]
LAKE Lab
pg. 607! University of Sheffield (crap system, but OPEN SOURCE!)[2 links]
GATE - FREE IE app w/ source code
LaSIE - ER, coreference, template (cv)

pg. 617 Univ of Surrey (inconclusive matches)[2 links]
System Quirk - Or is this their search system..... Hmmmmmm
Univ of Surrey - pointers (hopefully this is their WILDER search system...)

SMU - Pg. 65[1 links]
Natural Language Processing Laboratory at SMU

Textract[2 links]
Cymfony - Technology
Textract - State of the Art Information Extraction

Xerox uhhhhh maybe[1 links]
Xerox Palo Alto Research Center
(OVERVIEW) 1999 TREC-8 Q&A Track Home Page
NLP bloke, Univ Sussex

Tcl-Tk[4 links]
Tcl tutorial
Tcl-Tk Contributed Programs Index
Tcl-Tk Resources, sources
TclXML - manipulating XML using Tcl-Tk
Artificial Natural Language - Is this what I'm trying to parse into...
Comparison of Indexers - Prise vs. Inquery vs. MG, etc.
Eagles - Language Engineering Standards
Language Technology Group - lots of modules!
LDC - Linguistic Data Consortium, lots of corpora
Lexical Resources
Links 2 resources, indexers.....
Lots of IR stuff, University of uhhh
Managing Gigabytes Indexer
Managing Gigabytes Manuals and stuff
Htdig search system
NLP & IR (NLPIR, NIST) Group
OVERVIEW OF MUC-7-MET-2
Perl XML Indexing - XML search engine type thing
Phrasys Language Processing Software Components (money)
QA HCI bullshit
SIGIR - TREC-type thing, resources
SMART indexer system documentation
Text REtrieval Conference (TREC) Home Page
The Natural Language Software Registry
Thunderstone IE and IR products
WordNet - FREE DOWNLOADABLE lexical English database

Page created with URL+, nice utility for working with internet shortcuts

--
A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?

Re:This was my final year project thesis by maxpublic · 2002-03-15 18:28 · Score: 1

Thanks for the links! Can't wait to go through them.

Max

--
My god carries a hammer. Your god died nailed to a tree. Any questions?

Re:This was my final year project thesis by Beliskner · 2002-03-15 23:14 · Score: 1

Yeah great, knock yourself out.

--
A caveman dreams of being us, the incalculable power and riches. We dream of being Q, then what?

voice recognition by Anonymous Coward · 2002-03-15 18:34 · Score: 0

"One financial institution is using IBM ViaVoice® voice recognition software to convert complaint calls--considered unstructured data--into text."

That's interesting, because the IntelliStation I bought a while back came with ViaVoice, and it was excellent at converting unstructured voice data to unstructured text.

Discovery Link by littleRedFriend · 2002-03-15 20:17 · Score: 1

I have worked with Discovery Link. It contains wrappers around heterogeneous data sources, like Oracle and flat text files, and tries to integrate everything into a single representation.

In life sciences data sources are huge and plentiful. This thing is a monster, it's slow and it needs lots of dedicated people integrating and maintaining it. I'm not even talking about the (IBM) hardware you need for this.

No, I'm a pragmatic guy. I will integrate on the fly whatever I need to know. The idea is nice and all, but it is unworkable at the moment.

--
IANAL, but imagine a beowulf cluster of in Soviet Russia all your belong are base to us welcoming the new SCO overlords.

I would have to HIGHLY disagree... by cr0sh · 2002-03-16 04:18 · Score: 2

While I would say that the vast majority of posts on /. are mere discussion, etc - there is a small but useful subset buried deep within that arguably contains useful information, or at the very least would serve as a starting point for further research.

There are a TON of "Ask Slashdot" articles with very valuable information. I also see in this article many valuable posts (especially that one with the tons of links on machine learning and mining). I also remember some valid ideas bandied about back on the homebrew rollercoaster posting. I also remember seeing information on a posting about mozilla yesterday talking about how to get nice looking fonts in X. Finally, I remember quite some time back (possibly up to 2 years ago) an article on AI, in which two individuals, who seemed to know their shit at minimum, and at best were both neuroscientists, were arguing about how neurons worked and how the brain "thinks" - most of that went WAAAY over my head, but it was valuable information (or at least it might be a stepping stone).

I see this all the time here on /. - true, there is a ton of SPAM and troll posts, etc to wade through, but that is what we are discussing here - how do you "mine" through the ore to get to that nugget of "gold"?

--
Reason is the Path to God - Anon

Wow! But don't you have to train this system? by ondelette · 2002-03-16 05:31 · Score: 1

How can it know who Saddam is?

Re:Wow! But don't you have to train this system? by foobar104 · 2002-03-16 06:31 · Score: 2

It's not "training," really. In the demo I saw, they gave VideoLogger four or five video frames that had Saddam Hussein in them, and drew little boxes around his face to identify it, then assigned a keyword to it.

Then they ran some news footage through the system that had other pictures of Hussein in it. VideoLogger picked him out and assigned the keyword "Saddam Hussein" to the clip. It did this on face recognition, not speech or CC recognition, because the video clip was from the Russian TV news!

It was pretty cool, even though it was just a demo.

105 comments