Using the Semantic Web to Enhance Search
RobMcCool writes "At Stanford KSL, we really like the Semantic Web. So we've taken many of our favorite web sites, scraped them, and put together a huge pile of RDF, which we'll let you download. We've used that RDF to create a search application, in the spirit of Google Q & A or Microsoft's recently announced MSN Search extensions. Our search can answer simple factual queries like the previously discussed population of Portugal but can also answer some more complex ones. We also have a smart autocomplete system: type "tom hanks birth" slowly to see it in action (best with Firefox). We're looking for people to be a part of this search system by running their own search sites, and by putting their data on the Semantic Web. Come check it out!"
Semantic-driven search engines have awesome potential. However, they place a lot of demand on the content provider to supply metadata-rich content, or to provide intelligent mining tools that create metadata from existing sites.
This is definitely one to watch...
As soon as you begin to type it starts loading something, and it keeps loading with each character. I'm guessing it is the autocomplete "feature", but it loads too slowly for me to tell. Has anyone else had any luck?
That's nice and all, but who shot first, and is there a mash-up of both scenes with crazy alien bar music mixed with '20s sinister piano?
You're linking a 600MB file from slashdot?
(oh, and I'm getting 503s for the searches)
http://sp11.stanford.edu:8000/valve?query=it+is+slashdotted&ckb=Aggregated+Data
"All great things are simple & expressed in a single word: freedom, justice, honor, duty, mercy, hope." --Churchill
Autocomplete is a useless feature that nobody wants to see when they type "a"...and watch it load everything that begins with "a". The user is not interested in items starting with "a". Perhaps they're interested in terms beginning with "anon" or something, which has far fewer items to load, making the load time much faster and not annoying the user in the process.
Or, even better, never have autocomplete turned on automatically. Do something VB-like, where if you want to see the possibilities at a certain point, you hit a specific key that makes the list drop down.
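For what it's worth, both suggestions above (wait for a longer prefix, or only complete on request) fit in a few lines. This is only a rough sketch and not how the Stanford site does it: the `suggest` function, its `min_chars` and `force` parameters, and the `TERMS` list are all made up for illustration.

```python
from bisect import bisect_left

def suggest(terms, prefix, min_chars=3, limit=10, force=False):
    """Return up to `limit` entries of the sorted list `terms` that start with `prefix`.

    Short prefixes return nothing unless `force` is set, which stands in
    for the user pressing an explicit "show completions" key.
    """
    prefix = prefix.lower()
    if not force and len(prefix) < min_chars:
        return []
    # Binary search for the first term that could match the prefix.
    start = bisect_left(terms, prefix)
    matches = []
    for term in terms[start:start + limit]:
        if not term.startswith(prefix):
            break  # sorted input: once one term fails, the rest fail too
        matches.append(term)
    return matches

# Hypothetical vocabulary, kept sorted so the binary search is valid.
TERMS = sorted([
    "anonymity", "anonymous", "antarctica", "apple",
    "tom hanks", "tom hanks birth date",
])

print(suggest(TERMS, "a"))              # too short, nothing is loaded
print(suggest(TERMS, "anon"))           # long enough to be worth a lookup
print(suggest(TERMS, "a", force=True))  # explicit request from the user
```

The threshold keeps the single-letter case from hitting the server at all, and the `force` flag covers the "hit a key to see the list" idea.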
The Stanford research is interesting, but I'm still trying to make up my mind about the Semantic Web, learning about RDF, and whether I need to bake in ways of handling these kinds of assertions in my web app. The Stanford group writes, "Our hope is that our search application spurs development of the Semantic Web, and leads to sites publishing their data in this format so that we don't have to." It obviously takes more work to encode such information and to get user contributions automatically marked up for the Semantic Web. For a counter viewpoint, take a look at some of Clay Shirky's work -- in particular:
Will the semantic web be supported by future versions of Drupal, phpBB, and other grass-roots content management web apps? Not sure. Since a lot of the content is visitor generated, you would have to build in ways of providing easy markup. Would be interested to hear /. thoughts on the matter.
No, 'works best with Firefox' is just as bad as 'works best with IE'. What would be nice would be to see 'works best with any standards compliant browser'.
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth.
What truth?
There is no dupe
faster than a thousand speeding gazelles
$ strings FTP.EXE | grep Copyright
@(#) Copyright (c) 1983 The Regents of the University of California.
While the idea of the semantic web has been legitimately lambasted, I think it's a bit far from DOA. While I agree that it's not exactly practical, I think that if you get enough sites displaying their content in such a manner, you'll eventually reach a point at which others will do the same.
I mean, think about it this way - while laziness or inertia might initially win out, once someone's competitors start to explore the idea of the semantic web, interest will start to be shown in it, especially once it becomes profitable to do so.
concrete5: a cms made for marketing, but strong enough for geeks.
Secondly, scraping doesn't always work, and you will surely have low-grade porno and get-rich-quick schemes/scams littering your semantic data.
But let us suppose that the main benefits of a semantic web are (A) access to reference data [which may be falsified, oops], and (B) access to product availability data [which may be falsified, oops, like mail order companies that pretend they have something in stock but don't and yet still charge your credit card].
It just won't work.
It will always be a rough approximation of reality.
It's just a bad way of caching the results of scraping.
You have a good point here. But in my experience "works best with Firefox" is equivalent to "it follows standards".
sigbldr is currently in pre-alpha.
Check out QuASM (Question Answering using Semi-Structured Meta-data)...we used similar processes and approaches to get "answers" out of a large (40 TB) collection of .gov, .edu, .org web pages. The demo page is no longer available (we completed work on this in 2002) but you can check out the paper at ACM:
http://portal.acm.org/citation.cfm?id=544220.544228
It was a really interesting project to be a part of!
Go UMass!
The Semantic Web appears to be a budding server-side solution to the paradigm of information glut online. Social bookmarking appears to be a client-side solution to the paradigm of information glut online.
It is refreshing to see exciting new solutions to the problems we have at present of targeted information retrieval on the internet. I can remember years of stagnation in this field (read: early 90's), and any change from today's google-and-pray searching mentality among the majority of end-users will be welcome.
The Crimson Dragon
...towards the future.
IGB: More fun than eating oatmeal!
"everything but IE"
not entirely, but pretty close -- if you write compliant html/js, it has an excellent chance of working in all of {firefox, opera, safari}
...now I can finally search for "images of women with breasts larger than 36D"!
"works worst on MSIE" that would do
The only things certain in war are Propaganda and Death. You can never be sure which is which though
We've used that RDF to create a search application... Steve Jobs is the only one using any RDF to get applications made. Has he finally gotten a distortion field so big that others think they have them? hmmmm....
Four roommates. No microwave. You do the math.
This looks like it will broaden the volume of useful searches. Right now, there are at least two limits that show up when searching:
1. For really popular subjects, the useful links are swamped in the noise of sites trying to make a buck off of getting you to look at their ads before directing you to somewhere else, that might have the actual content or might not.
2. For many less popular subjects, there is some oddity, like an unusual term being borrowed by some other field, so that it is something most people have never heard of, but people in two or more specialties use it frequently, in very different ways, resulting in strangeness. (e.g. the search engine throws up 23,003 links for a search on "Sator Resartus". 30% are esoteric literary criticism, 20% relate to apoptosis (cell biology), 20% relate to building moral inhibitions into A.I., 10% to Keith Laumer novels, and the rest are probably noise).
(I'm sure there are more than these two limits. Someone else may want to comment on some others).
This is likely to help with the second case, oddities in the data set grouping. (It could sort links into the larger sub-categories, ask the user which one(s) seemed most applicable, and maybe even sort out a small set of links that explain, for the previous example, how a highbrow literary term got borrowed by the other fields).
It's not as likely it would help with the first case, though, as sites that don't have actual content are actively duplicitous. Something that is actively trying to fool humans is still likely to be very successful at fooling our tools.
Who is John Cabal?
The best part is the W3C looks down on the business rules world and openly snubs them. For a long time, the W3C camp snubbed the RETE algorithm, claiming RDF graphs are better. Once people saw how horribly RDF engines perform as rule count and data increase, they rushed to hack together junk and label it RETE. Sorry, but you have to first understand RETE to implement it. A clueless bunch of impractical day dreamers.
Does it have a countermeasure against 'semantic spam'?
The average starting salary offer for Stanford graduate students has risen 30% in the last hour, as Microsoft, Google, and Yahoo each fought tooth and nail for their services.
(starts filling in application)
My little site.
Of course, it's a beta feature at Google Labs. FYI...
Trust me. This is an inactive account. Regardless of what the
That second link goes to http://www.google.com/url?sa=U&start=1&q=http://www.w3.org/2001/sw/&e=9707
How is that different from linking to http://www.w3.org/2001/sw/?
Is Slashdot trying to improve someone's Google ranking?
(Also, did Slashdot always linkify URLs entered as plaintext? I didn't write any "a href" for those two.)
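If anyone wants to clean up a pasted link like that, the real destination is just sitting in the `q` query parameter of the redirect. A quick sketch using only the Python standard library; `unwrap_redirect` is a name I made up, and the assumption that the target always lives in `q` comes from the URL quoted above, not from any documented Google behaviour.

```python
from urllib.parse import urlsplit, parse_qs

def unwrap_redirect(url):
    """Return the target of a google.com/url redirect, or the URL unchanged."""
    parts = urlsplit(url)
    if parts.netloc in ("google.com", "www.google.com") and parts.path == "/url":
        # parse_qs maps each parameter name to a list of its values.
        targets = parse_qs(parts.query).get("q")
        if targets:
            return targets[0]
    return url

# The redirect quoted in the comment, with the stray space removed.
wrapped = "http://www.google.com/url?sa=U&start=1&q=http://www.w3.org/2001/sw/&e=9707"
print(unwrap_redirect(wrapped))
```

Anything that is not a redirect of that shape is passed through untouched, so it is safe to run over ordinary links too.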
# cat
Damn, my RAM is full of llamas.
Well, maybe Fx can do whatever they need fastest? Maybe they use pipelining or something?
...not only what the Semantic Web is about, but more pragmatically why this is in "Hardware." :)
Mit der Dummheit kämpfen Götter selbst vergebens.
One word: Context.
Currently keywords are used to search for relevant matches and yes, this seems to work ok for lots of things but imagine if you could add context:
Imagine searching for the title of a piece of music that you heard in a certain film.
Currently this could involve some digging, but a semantic search engine could very quickly narrow this search. Have a look at this (there's a demo somewhere on the site). It's a research project run by Southampton Uni. It's pretty basic but hopefully you'll get the idea.
Silly rabbit
Although I find the Semantic Web project intriguing, the idea of tagging data to define it is somewhat of a cop-out. The "meaning" of any given page is already there: in the page. Instead of spending so much time tagging pages, how about working on algorithms to derive meaning from the content? Surely those in the field of Computational Linguistics can make a real push at this. "Artificial" corpora aren't needed anymore: the web offers more data than you'll ever need.
Shameless promotion: for OS X users, theConcept offers an example of mining key words and phrases, and contextual elements automatically from pages returned by Google queries.
Seems to me there's a useful metadata resource available now due to the way that OSX-Tiger is now allowing metadata to be attached to a file (either as xattribs, or via the Spotlight keyword field). See here.
Does anyone know if web crawlers/gatherers (Google, Harvest, Combine, etc.) have the ability to access that information and associate it with the file?
I would love an automatic gatherer extracting my metadata from the filesystem and allowing searches on it, in combination with the full text option.
RDFS and OWL are both RDF formats.
Badass Resumes
The Semantic Web is about describing resources, not tagging pages.
Indeed, you might output RDF from your processing of Web pages.
Extracting information from semi-structured text is very different to making logical assertions about resources.
As for RSS, it is limited, but it took off rapidly. RSS v1.0 introduced RDF. That is another step in the right direction.
BTW RDF isn't that complicated. Think of it as a triplet: Subject, Verb, Object.
So the Semantic Web is coming, one little step at a time.
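To make the triplet idea concrete, here is a toy sketch in plain Python. It is not an RDF library and not how TAP stores its data; the `Triple` type, the `match` helper, and the identifiers in `FACTS` are invented for illustration (real RDF would use URIs for subjects and predicates).

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# Hypothetical facts, each one a (subject, verb, object) statement.
FACTS = [
    Triple("Tom_Hanks", "birthDate", "1956-07-09"),
    Triple("Tom_Hanks", "type", "Actor"),
    Triple("Portugal", "capital", "Lisbon"),
]

# "tom hanks birth" becomes a pattern with the object left open.
for t in match(FACTS, subject="Tom_Hanks", predicate="birthDate"):
    print(t.obj)
```

A factual query is then just a triple with one slot left blank, which is roughly why simple question answering falls out of this kind of data.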
I don't think the evidence on the RDF mailing list supports that opinion. Look at the literature in the bookstores about the semantic web. If anything, it is full of confusion, and the specification is poorly written compared to the HTML and XML specifications.
Triplet does not equal (Subject verb object). What the RDF spec describes is closer to Natural Language parsing concepts. There are many similarities between what the RDF describes as RDF Model graph and dependency grammar techniques http://w3.msi.vxu.se/~nivre/research/sdg.html.
Anyone remotely interested in NLP knows the problem is very hard to solve using dependency grammar techniques. Statistical approaches have been shown to perform much better.
The Semantic Web is essentially repeating the same mistakes already made in the AI world with NLP. The W3C seems blind to these facts and that's why the semantic web is doomed to fail.
In any case, for Japanese/Chinese/Korean - autocomplete is almost a natural part of using a web search engine, so it's not a "useless feature that nobody wants to see."
Those languages use alphabet-based inputs which are then converted into native text. Why bother converting if you can take the direct alphabetical input and start showing native text autocompletes?
The fact is that RDF is really just triplets. Not surprisingly, it can be represented in N3 (where 3 stands for triplet). Take a look at this example taken from Wikipedia:
which can also be represented in XML/RDF like this (the output isn't pretty, see the Wikipedia link). So take another look at RDF, you'll be surprised.
if it would mean that their sites would rank higher in the search results, I'd say that they all would...
I really don't know what you mean by this nor how can it be good.
P.S.: Everyone has to "confirm you're not a script", even logged-in users.
Phrasing it that way, that it works best with any standards compliant browser, doesn't get the point across to those who think IE is a standards compliant browser.
Search on TAP has been tested with Firefox on Linux, Windows, and OS/X, and with IE on Windows. I think Andy might have tried it with Safari. I haven't tested it with Opera. With IE, I had to redo how the dynamic HTML was being generated twice to get around its limitations, and it's still ignoring my alignment tags.
Saying it works with standards compliant browsers assumes the reader knows that IE sucks, which isn't always the case.
Besides, I'm ex-Netscape, we're supposed to cheese people off with our browser rah-rahs.
Actually, I saw a preso for this project a while ago. It was pretty neat, showed a lot of promise, and I see that it's been progressing nicely. Stanford KSL actually DOES like the Semantic Web. Sure, they receive DARPA funding, but that's not why they like the Semantic Web. Also, some of the features/scrapers have been built as requested by the gov't, but it's not like the entire project is for the gov't.
I haven't RTFA yet, but I wanted to link to Piggy Bank, which is a Firefox plugin by the Simile MIT group, which seems to be making a large step forward in bringing the usefulness of the semantic web to the users.
It contains an RDF engine, and allows you to install "screen scrapers" for different sites, plus it knows automatically how to read FOAF and some other ontologies that have spread on the net a little bit. When you see the "Semantic web coin" icon in your status bar, you can click on it and it will extract what semantic information it can about the given page. Using JavaScript or XSL based screen scrapers makes this a bit like developing for Greasemonkey.
As examples, they have screen scrapers for Craig's List Jobs, and they can merge the location (lat/long) information pulled from that along with other info pulled from other sites and display it all on a Googlemap.
It's just getting started, but it seems very cool.
"What thou shalt not, I shalt did!" -Bart Simpson
It would be great if people would say their website works with any compliant browser. But much of the world doesn't care. In my opinion that's because "standards" doesn't carry connotations with anybody besides web/standards geeks.
Now the cute little Firefox plush toy (have you seen it?) - that's what people will remember. As long as you keep the FF designers on the straight and narrow with regards to implementing web standards, then everybody gets what they want.
Course, some will argue that Firefox isn't very compliant, or that it could be more compliant, or whatever predilection their brain dreams up.
I think we agree that we are really not there and only the future will tell.
Apparently, you missed the point that I was not talking about Al Aqsa Martyr Brigades.
The filesystem is the package manager
The most basic requirement for any application to qualify as a "Semantic Web" app (from the SW challenge, http://www-agki.tzi.de/swc/swapplication.html) is that the application should use "some formal description of the meaning of the data"! RDF by itself doesn't give any *meaning* or *semantics* to the data. You need to associate your RDF data with RDFS/OWL for that purpose (TAP doesn't have a published OWL ontology, http://tap.stanford.edu/tap/tapkb.html).
Also, given that you don't have any 'meaning' attached to the nodes and links in your RDF, I presume your searching again boils down to 'keyword' based searching! People find it cool to term their search "Semantic Search", but I find it difficult to see any 'semantics' in the current application.
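To show the distinction being drawn here, the sketch below adds two RDFS-style rules (subclass transitivity and type inheritance) on top of bare triples. It is a hand-rolled toy, not an RDFS or OWL reasoner and nothing to do with TAP's code; the class names and the `infer` function are invented, and only the two rules named in the comments are implemented.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def infer(triples):
    """Forward-chain two RDFS-style rules until no new triples appear."""
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in facts:
                # Subclass transitivity: s <= o and o <= o2 gives s <= o2.
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # Type inheritance: s2 is an s and s <= o gives s2 is an o.
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        if new <= facts:
            return facts
        facts |= new

# Hypothetical data: one instance statement plus a tiny class hierarchy.
DATA = {
    ("Tom_Hanks", TYPE, "Actor"),
    ("Actor", SUBCLASS, "Person"),
    ("Person", SUBCLASS, "Agent"),
}

closure = infer(DATA)
print(("Tom_Hanks", TYPE, "Person") in closure)
```

A keyword match on "Person" would never find Tom_Hanks in the raw data; with the schema triples and the two rules, a query for every Person does. That gap is the 'semantics' the comment says is missing.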
Isn't this basically what HTML is supposed to do kind of?
Maybe the Semantic Web can work someday, maybe not.
:)
However, anyone who thinks this is a utopia in the making should read the infamous MetaCrap essay by Cory Doctorow:
Metacrap: Putting the torch to seven straw-men of the meta-utopia.
After you are done reading, go to eBay and pick yourself up a cheap Palm Pilot.
1. Introduction
2. The problems
2.1 People lie
2.2 People are lazy
2.3 People are stupid
2.4 Mission: Impossible -- know thyself
2.5 Schemas aren't neutral
2.6 Metrics influence results
2.7 There's more than one way to describe something
3. Reliable metadata
-braddock gaskill
'type "tom hanks birth" slowly to see it in action' hmm
I wouldn't like anybody, especially a machine, telling me what I mean. The possibility of censorship is enormous.
Check out the link in the story to the Semantic Web... it's a redirect from Google to w3c.org.
I'm sure this isn't what the submitter meant to do.
For those who haven't been paying attention, Google has recently begun giving redirects from their own site as search results, so, in effect, they get to record every site you end up visiting.
I was on the fence as to whether this was a good or bad thing, but now I see that clearly it's the latter, simply because when the link is copied-and-pasted, as it is here, the person who visits the link won't have any idea that Google is recording that visit.