Using the Semantic Web to Enhance Search
RobMcCool writes "At Stanford KSL, we really like the Semantic Web. So we've taken many of our favorite web sites, scraped them, and put together a huge pile of RDF, which we'll let you download. We've used that RDF to create a search application, in the spirit of Google Q & A or Microsoft's recently announced MSN Search extensions. Our search can answer simple factual queries like the previously discussed population of Portugal but can also answer some more complex ones. We also have a smart autocomplete system, type "tom hanks birth" slowly to see it in action (best with Firefox). We're looking for people to be a part of this search system by running their own search sites, and by putting their data on the Semantic Web. Come check it out!"
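For readers wondering what "answering a factual query from RDF" could look like, here is a minimal sketch. It is not the Stanford application's code: the triple layout, the function names, and every country name and number below are made-up placeholders for illustration.

```python
# Toy triple store: (subject, predicate, object).
# All entities and numbers are invented placeholders, not real statistics.
TRIPLES = [
    ("Atlantis", "type", "Country"),
    ("Atlantis", "population", 1_200_000),
    ("Lilliput", "type", "Country"),
    ("Lilliput", "population", 300_000),
    ("Brobdingnag", "type", "Country"),
    ("Brobdingnag", "population", 5_000_000),
]


def value_of(subject, predicate):
    """Return the first object stored for (subject, predicate), or None."""
    for s, p, o in TRIPLES:
        if s == subject and p == predicate:
            return o
    return None


def countries_more_populous_than(name):
    """Return sorted names of countries whose population exceeds that of `name`."""
    threshold = value_of(name, "population")
    if threshold is None:
        return []
    countries = {s for s, p, o in TRIPLES if p == "type" and o == "Country"}
    return sorted(
        c for c in countries
        if c != name and (value_of(c, "population") or 0) > threshold
    )
```

A simple lookup is `value_of("Atlantis", "population")`; the comparison query is the kind of thing that needs the numbers as data rather than as text on a page.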
Semantic-driven search engines have awesome potential. However, they place a lot of demand on the content provider to supply metadata-rich content, or to provide intelligent mining tools that create metadata from existing sites.
This is definitely one to watch...
As soon as you even begin to type it is loading something, and it keeps loading with each character. I'm guessing it is the autocomplete "feature", but it loads too slowly for me to tell. Has anyone else had any luck?
That's nice and all but who shot first and is there a mash up of both scenes with crazy alien bar music mixed with 20's sinister piano.
We also have a smart autocomplete system, type "tom hanks birth" slowly to see it in action (best with Firefox).
In the early days one could see lots of "Best with Internet Explorer". It is nice to see "best with Firefox" for a change.
sigbldr is currently in pre-alpha.
You're linking a 600MB file from slashdot?
(oh, and I'm getting 503's for the searches)
http://sp11.stanford.edu:8000/valve?query=it+is+slashdotted&ckb=Aggregated+Data
"All great things are simple & expressed in a single word: freedom, justice, honor, duty, mercy, hope." --Churchill
Autocomplete is a useless feature that nobody wants to see when they type "a"... and see it load everything that begins with "a". The user is not interested in items starting with "a". Perhaps they're interested in terms beginning with "anon" or something, which has many fewer items to load, therefore making the load time much faster and not annoying the user in the process.
Or, even better, never have any autocomplete turned on automatically. Do a VB-like idea, where if you want to see possibilities at a certain point, hit a specific key that will register for the list to pop down.
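Both suggestions above (wait for a longer prefix, or only complete on an explicit keypress) can be sketched in a few lines. This is an illustration only, not how the Stanford autocomplete works; the class name, the `min_chars` threshold of 3, and the `forced` flag are assumptions made up for the example.

```python
from bisect import bisect_left


class PrefixCompleter:
    """Prefix autocomplete that stays quiet until the prefix is long enough."""

    def __init__(self, terms, min_chars=3, max_results=10):
        # Sorted, de-duplicated, lower-cased terms allow a binary search
        # for the first candidate instead of scanning everything.
        self.terms = sorted({t.lower() for t in terms})
        self.min_chars = min_chars
        self.max_results = max_results

    def suggest(self, prefix, forced=False):
        """Return completions for `prefix`.

        With forced=False, short prefixes return nothing (no lookup at all).
        forced=True models the "hit a specific key" idea: the user asked
        for the list explicitly, so the length threshold is skipped.
        """
        prefix = prefix.lower()
        if not prefix:
            return []
        if not forced and len(prefix) < self.min_chars:
            return []
        start = bisect_left(self.terms, prefix)
        matches = []
        for term in self.terms[start:]:
            if not term.startswith(prefix) or len(matches) >= self.max_results:
                break
            matches.append(term)
        return matches
```

With terms like "anonymous", "anonymity" and "apple", typing "a" triggers no lookup, "anon" returns the two matching terms, and `suggest("a", forced=True)` returns all three on demand.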
The Stanford research is interesting, but I'm still trying to make up my mind about the Semantic Web, learning about RDF, and whether I need to bake in ways of handling these kinds of assertions in my web app. The Stanford group writes, "Our hope is that our search application spurs development of the Semantic Web, and leads to sites publishing their data in this format so that we don't have to." It obviously takes more work to encode such information and to get user contributions auto-marked for the semantic web. For a counter viewpoint, take a look at some of Clay Shirky's work -- in particular:
Will the semantic web be supported by future versions of Drupal, phpBB, and other grass-roots content management web apps? Not sure. Since a lot of the content is visitor generated, you would have to build in ways of providing easy markup. Would be interested to hear /. thoughts on the matter.
I'm tech-illiterate but interested none-the-less.
Don't post an ill-prepared, university-hosted laboratory website on Slashdot.
faster than a thousand speeding gazelles
$ strings FTP.EXE | grep Copyright
@(#) Copyright (c) 1983 The Regents of the University of California.
While the idea of the semantic web has been legitimately lambasted, I think it's a bit far from DOA. While I agree that it's not exactly practical, I think that if you get enough sites displaying their content in such a manner, you'll eventually reach a point at which others will do the same.
I mean, think about it this way - while laziness or inertia might initially win out, once someone's competitors start to explore the idea of the semantic web, interest will start to be shown in it, especially once it becomes profitable to do so.
concrete5: a cms made for marketing, but strong enough for geeks.
Secondly, scraping doesn't always work and you will surely have low-grade porno and get-rich-quick schemes/scams littering your semantic data.
But let us suppose that the main benefits of a semantic web are (A) access to reference data [which may be falsified, oops], and (B) access to product availability data [which may be falsified, oops, like mail order companies that pretend they have something in stock but don't and yet still charge your credit card].
It just won't work.
It will always be a rough approximation of reality.
It's just a bad way of caching the results of scraping.
Check out QuASM (Question Answering using Semi-Structured Meta-data)... we used similar processes and approaches to getting "answers" out of a large (40 TB) collection of .gov, .edu, .org web pages. The demo page is no longer available (we completed work on this in 2002) but you can check out the paper at ACM:
http://portal.acm.org/citation.cfm?id=544220.544228
It was a really interesting project to be a part of!
Go UMass!
The Semantic Web appears to be a budding server-side solution to the paradigm of information glut online. Social bookmarking appears to be a client-side solution to the paradigm of information glut online.
It is refreshing to see exciting new solutions to the problems we have at present of targeted information retrieval on the internet. I can remember years of stagnation in this field (read: early 90's), and any change from today's google-and-pray searching mentality among the majority of end-users will be welcome.
The Crimson Dragon
...towards the future.
IGB: More fun than eating oatmeal!
What does this give you over HTML + Search Engine, except that it's 10x harder to code (and crumbles under the Slashdot effect)? Using Google with the first keywords that popped into my head:
Population of Portugal: portugal population [first match]
Buildings Taller than Sears Tower: tallest buildings world [first match]
Roller coasters faster than 80mph: fastest roller coasters [fourth match contains link to table of top-10 fastest roller coasters]
Countries with population greater than Indonesia: coutries population indonesia [first match]
Color me unimpressed with the Semantic Web...
"everything but IE"
not entirely, but pretty close -- if you write compliant html/js, it has an excellent chance of working in all of {firefox, opera, safari}
...now I can finally search for "images of women with breasts larger than 36D"!
God I hope they don't include an image search...that could be worse than goatse...
It is interesting to note that they are only using RDF. What about RDF Schema, DAML, or OWL (the successor to all of these)? It has been my experience with all of these that they are too complicated for the everyday person to use and thus only a select few will be able to perform the markup. Only some excellent ubiquitous markup tools could alleviate this. My guess is that these will end up being academic solutions at best. Not to knock academia, but sometimes they overcomplicate things in the spirit of completeness or correctness and thus make it too impractical for everyday use.
One should also point out, "At Stanford KSL, we are paid by DARPA to like the Semantic Web."
We've used that RDF to create a search application... Steve Jobs is the only one using any RDF to get applications made. Has he finally gotten a distortion field so big that others think they have them? hmmmm....
Four roommates. No microwave. You do the math.
This looks like it will broaden the volume of useful searches. Right now, there are at least two limits that show up when searching:
1. For really popular subjects, the useful links are swamped in the noise of sites trying to make a buck off of getting you to look at their ads before directing you to somewhere else, that might have the actual content or might not.
2. For many less popular subjects, there is some oddity, like an unusual term being borrowed by some other field, so that it is something most people have never heard of, but people in two or more specialties use it frequently, in very different ways, resulting in strangeness. (E.g., the search engine throws up 23,003 links for a search on "Sator Resartus": 30% are esoteric literary criticism, 20% relate to apoptosis (cell biology), 20% relate to building moral inhibitions into A.I., 10% to Keith Laumer novels, and the rest are probably noise.)
(I'm sure there are more than these two limits. Someone else may want to comment on some others).
This is likely to help with the second case, oddities in the data set grouping. (it could sort links into the larger sub-categories, query the user which one(s) seemed most applicable, and maybe even sort out a small set of links that explain, for the previous example, how a high brow literary term got borrowed by the other fields).
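The "sort links into sub-categories, then ask the user" idea is easy to sketch once each result carries a category label. That label is the big assumption here: in this toy example it is simply given, whereas in practice it would have to come from semantic metadata or a classifier. The function name and the example.com URLs are hypothetical.

```python
from collections import defaultdict


def group_by_category(results):
    """Group (url, category) pairs; return groups ordered largest first.

    `results` is an iterable of (url, category) tuples. A missing category
    (None or "") is filed under "uncategorized". Ties in group size are
    broken alphabetically by category name.
    """
    groups = defaultdict(list)
    for url, category in results:
        groups[category or "uncategorized"].append(url)
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


results = [
    ("http://example.com/a", "literary criticism"),
    ("http://example.com/b", "cell biology"),
    ("http://example.com/c", "literary criticism"),
    ("http://example.com/d", None),
]
for category, urls in group_by_category(results):
    print(f"{category}: {len(urls)} link(s)")
```

The search front end could then show the category list first and only expand the group the user picks.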
It's not as likely it would help with the first case, though, as sites that don't have actual content are actively duplicitous. Something that is actively trying to fool humans is still likely to be very successful at fooling our tools.
Who is John Cabal?
The best part is the W3C looks down on the business rules world and openly snubs them. For a long time, the W3C camp snubbed the RETE algorithm, claiming RDF graphs are better. Once people saw how horribly RDF engines perform as rule count and data increase, they rushed to hack together junk and label it RETE. Sorry, but you have to first understand RETE to implement it. A clueless bunch of impractical day dreamers.
Does it have a countermeasure against 'semantic spam'?
The average starting salary offer for Stanford graduate students has raised 30% in the last hour, as Microsoft, Google, and Yahoo each vied tooth and nail for their services.
(starts filling in application)
My little site.
Of course, it's a beta feature at Google Labs. FYI...
Trust me. This is an inactive account. Regardless of what the
Piggy Bank is an eleet RDF creating, greasemonkey web scraping, meta plugin.
http://simile.mit.edu/piggy-bank/
props to waxy 4 the link
http://waxy.org/links
check out Sir Tim Berners Lee the Knight that goes nee rap on the semantic future of the web at the Royal Society London - total futurosity.
http://www.royalsoc.ac.uk/page.asp?id=3110
Do you think theres a porn site somewhere using sign ups to process secretly referred slashdot catchems?
That second link goes to http://www.google.com/url?sa=U&start=1&q=http://www.w3.org/2001/sw/&e=9707
How is that different to linking to http://www.w3.org/2001/sw/?
Is Slashdot trying to improve someone's Google ranking?
(Also, did Slashdot always linkify URLs entered as plaintext? I didn't write any "a href" for those two.)
# cat
Damn, my RAM is full of llamas.
...not only what the Semantic Web is about, but more pragmatically why this is in "Hardware." :)
Mit der Dummheit kämpfen Götter selbst vergebens.
Although I find the Semantic Web project intriguing, the idea of tagging data to define it is somewhat of a cop-out. The "meaning" of any given page is already there: in the page. Instead of spending so much time tagging pages, how about working on algorithms to derive meaning from the content. Surely those in the field of Computational Linguistics can make a real push at this: "artificial" corpora aren't needed anymore: the web offers more data than you'll ever need.
Shameless promotion: for OS X users, theConcept offers an example of mining key words and phrases, and contextual elements automatically from pages returned by Google queries.
FOAF and mindsap do not represent mainstream usage. If you consider that to be the Semantic Web being a reality today, then sure. But that doesn't mean it's useful for those not in the semantic web research world. It's fringe technology looking for a problem.
Seems to me there's a useful metadata resource available now due to the way that OSX-Tiger is now allowing metadata to be attached to a file (either as xattribs, or via the Spotlight keyword field). See here.
Does anyone know if web crawlers/gatherers (Google, Harvest, Combine, etc.) have the ability to access that information and associate it with the file?
I would love an automatic gatherer extracting my metadata from the filesystem and allowing searches on it, in combination with the full text option.
"Note to self. Dreaming about the world tagging all their data isn't going to happen. It takes too much damn time."
Note to self. Dreaming about the world writing their web sites in notepad isn't going to happen.
"Semantic driven search using google's technique works."
It's spotty.
"Producing a RDF graph is crap."
I guess you do everything by hand.
"Nothing to watch here."
Nothing to understand here.
The semantic web is a backwards step. We should be working on being able to use even more unstructured data on the web.
And the "confirm you are not a script" on the Slashdot anonymous submissions page is an underhanded way of coercing "membership," viz. tracking and sales opportunities.
The Semantic Web is about describing resources, not tagging pages.
Indeed, you might output RDF from your processing of Web pages.
Extracting information from semi-structured text is very different to making logical assertions about resources.
In any case, for Japanese/Chinese/Korean - autocomplete is almost a natural part of using a web search engine, so it's not a "useless feature that nobody wants to see."
Those languages use alphabet-based inputs which are then converted into native text. Why bother converting if you can take the direct alphabetical input and start showing native text autocompletes?
In the examples page, PLO and Al Fatah are listed under "Terror Organizations". This is a horrible misrepresentation.
The PLO is the organization representing the Palestinian people that eventually evolved into the Palestinian Authority. It had observer status in the UN General Assembly and even special permission to participate on Security Council debates (sans voting rights). Al Fatah is a political party which was involved in guerilla activities in the 70s, but that has, since the Oslo Accords, accepted the statehood of Israel.
The filesystem is the package manager
My explanation was poor. According to the RDF spec, RDF consists of two parts: model and graph. The model represents an object, like car, cat, boat, house, etc. The graph represents the relationship between the objects, like honda->car->vehicle. The graph is supposed to allow the system to "infer" facts which are not explicitly stated. In other words, an RDF engine would be able to infer a Honda is a type of vehicle.
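To make the honda->car->vehicle example concrete, here is a rough sketch of that kind of inference as a plain transitive walk over "is a" links. It is loosely analogous to how rdfs:subClassOf is treated as transitive, but it is not an RDF reasoner and not taken from any spec; the `IS_A` table and the function name are invented for illustration.

```python
# Explicitly stated facts only: each key "is a" kind of each listed parent.
IS_A = {
    "honda": ["car"],
    "car": ["vehicle"],
    "boat": ["vehicle"],
}


def infer_types(is_a, thing):
    """Return every type reachable from `thing` by following is-a links.

    Includes facts that were never stated directly, e.g. honda -> vehicle.
    The `seen` set also protects against cycles in dirty data.
    """
    seen = set()
    stack = [thing]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print("vehicle" in infer_types(IS_A, "honda"))
```

Nothing in `IS_A` says a Honda is a vehicle; the walk derives it from the two stated links.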
If you look at what the spec describes in terms of building the Graph, it is very similar to dependency grammar techniques. After all, both attempt to interpret data.
I read the spec plenty of times, but it is still a horrible specification. RDF engines (reasoners, as RDF people like to call them) are attempting to do the same thing AI researchers have been working on for 3 decades. The only difference is the W3C RDF people have a huge chip on their shoulders and refuse to see reality is dirty and messy. Trying to infer anything from dirty data is an unbounded problem. There's no getting around that.
I haven't RTFA yet, but I wanted to link to Piggy Bank, which is a Firefox plugin by the Simile MIT group, which seems to be making a large step forward in bringing the usefulness of the semantic web to the users.
It contains an RDF engine, and allows you to install "screen scrapers" for different sites, plus it knows automatically how to read FOAF and some other ontologies that have spread on the net a little bit. When you see the "Semantic web coin" icon in your status bar, you can click on it and it will extract what semantic information it can about the given page. Using JavaScript or XSL based screen scrapers makes this a bit like developing for Greasemonkey.
As examples, they have screen scrapers for Craig's List Jobs, and they can merge the location (lat/long) information pulled from that along with other info pulled from other sites and display it all on a Googlemap.
It's just getting started, but it seems very cool.
"What thou shalt not, I shalt did!" -Bart Simpson
I think we agree that we are really not there and only the future will tell.
http://www.co-ode.org/resources/tutorials/ProtegeOWLTutorial.pdf
http://protege.stanford.edu/plugins/owl/
BTW:
"Slow Down Cowboy!
Slashdot requires you to wait 2 minutes between each successful posting of a comment to allow everyone a fair chance at posting a comment.
It's been 15 minutes since you last successfully posted a comment
Chances are, you're behind a firewall or proxy, or clicked the Back button to accidentally reuse a form. Please try again. If the problem persists, and all other options have been tried, contact the site administrator."
I don't agree. Data (aka assertion) that is not explicit fact is an interpretation of explicitly stated facts. NLP tries to label each part of a sentence into nouns, verbs, adjectives, pronouns, adverbs and so on. Its goal is to form a graph that describes what the sentence means. The NLP term for the relationship is valence. Here is a random paper from Google on the topic: http://nats-www.informatik.uni-hamburg.de/~ingo/papers/tal2000.pdf.gz
RDF Graph (not RDF schema) attempts to describe data in a way that computers don't have to "figure out" (ie parse) the subject predicate object. This is achieved through writing rules about a given topic. Dependency grammar "can be considered" techniques and frameworks for representing and generating knowledge. NLP also uses rules to process/parse sentences. In the case of NLP, it's grammatical and contextual rules.
The two topics are much closer and have much more in common than superficial similarities. One difference is that RDF does not address the issue of building knowledge through an automated process. Humans have to do it. People are lazy. Ignoring the problem doesn't make the problem go away.
I agree a lot more is needed to achieve the goal of the semantic web, but the question is, will the W3C listen to others and change? So far it hasn't.
The most basic aspect for any application to qualify as a "Semantic Web" app (from SW challenge, http://www-agki.tzi.de/swc/swapplication.html) is that the application should use "some formal description of the meaning of the data"! RDF by itself doesn't give any *meaning* or *semantics* to the data. You need to associate your RDF data to RDFS/OWL for that purpose (TAP doesn't have a published OWL ontology: http://tap.stanford.edu/tap/tapkb.html).
Also, given that you don't have any 'meaning' attached to nodes and links in your RDF, I presume your searching again boils down to 'keyword' based searching! People find it cool to term their search "Semantic Search", but I find it difficult to see any 'semantics' in the current application.
Isn't this basically what HTML is supposed to do kind of?
Maybe the Semantic Web can work someday, maybe not.
:)
However, anyone who thinks this is a utopia in the making should read the infamous MetaCrap essay by Cory Doctorow:
Metacrap: Putting the torch to seven straw-men of the meta-utopia.
After you are done reading, go to eBay and pick yourself up a cheap Palm Pilot.
1. Introduction
2. The problems
2.1 People lie
2.2 People are lazy
2.3 People are stupid
2.4 Mission: Impossible -- know thyself
2.5 Schemas aren't neutral
2.6 Metrics influence results
2.7 There's more than one way to describe something
3. Reliable metadata
-braddock gaskill
'type "tom hanks birth" slowly to see it in action' hmm
Have you ever tried to type Chinese in Windows? It is a pain in the arse compared to typing English. The Chinese writing system is very different from English and autocomplete is much harder. In English you have A-Z. In Chinese the number of radicals ranges in the hundreds for the most common. Then when you get into the esoteric radicals, no autocomplete is going to be able to handle it today.
"Should I reject using the Encyclopedia Brittanica just because, someone somewhere will post false information."
Your point would be even stronger if you had used Wikipedia instead of Britannica. There's a great deal of fannism surrounding it, even though trust issues involve both, and some of the same "It's OK" arguments likewise apply to both.
I wouldn't like anybody, especially a machine, telling me what I mean. The possibility of censorship is enormous.
The two topics are much closer and have much more in common than superficial similarities.
Not only close, but identical from a formal point of view.
Metadata is data; the 'meta' part vanishes the moment you try to describe it.
Plain old philosophical knowledge, always forgotten.
Check out the link in the story to the Semantic Web... it's a redirect from Google to w3c.org.
I'm sure this isn't what the submitter meant to do.
For those who haven't been paying attention, Google has recently begun giving redirects from their own site as search results, so, in effect, they get to record every site you end up visiting.
I was on the fence as to whether this was a good or bad thing, but now I see that clearly it's the latter, simply because when the link is copied-and-pasted, as it is here, the person who visits the link won't have any idea that Google is recording that visit.
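For anyone who wants to strip the redirect before pasting a link, the destination is sitting in the `q` query parameter of the wrapped URL quoted earlier in the thread. A small sketch using only Python's standard `urllib.parse`; the function name is made up, and the assumption that the target always lives in `q` under the `/url` path is based solely on that one example link.

```python
from urllib.parse import urlparse, parse_qs


def unwrap_redirect(url, param="q"):
    """Return the destination of a google.com/url redirect link.

    If `url` is not such a redirect, or has no `param` value,
    it is returned unchanged.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    is_google = host == "google.com" or host.endswith(".google.com")
    if is_google and parsed.path == "/url":
        targets = parse_qs(parsed.query).get(param)
        if targets:
            return targets[0]
    return url


wrapped = "http://www.google.com/url?sa=U&start=1&q=http://www.w3.org/2001/sw/&e=9707"
print(unwrap_redirect(wrapped))
```

Links that are not redirects pass through untouched, so it is safe to run on every URL you copy.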