Slashdot Mirror


Semantic Search Points To Better Relevancy

ReadWriteWeb writes in to tell us about an article by Dr. Riza C. Berkan, founder and CEO of hakia.com, describing the promise of and potential for semantic search. This approach to providing more on-target search results contrasts with the dream of the semantic Web. Semantic search doesn't require all the Web page authors in the world to begin adding metadata; but it's not a sure thing that the researchers now developing the idea will get it right.

25 of 90 comments

  1. So what does he offer? by javilon · · Score: 4, Interesting

    From TFA:

    "There are so many ways of doing it improperly, and only one way of doing it right."

    But he doesn't say what the right way is, or how it could be, or even if he thinks his company is on the right track. There is no information at all.

    --


    When his defense asked, "Which computer has Jon Johansen trespassed upon?" the answer was: "His own."
  2. The semantic web is still a Good Thing by Max+Romantschuk · · Score: 3, Interesting

    The semantic web is about more than search. Rich semantics will enable applications of a completely different nature than today. Aggregating and mashing up data could be taken to a whole new level. Just because someone comes up with better indexing we shouldn't give up on the semantic web.

    Just my 2 cents, anyway.

    --
    .: Max Romantschuk :: http://max.romantschuk.fi/
    1. Re:The semantic web is still a Good Thing by kahei · · Score: 5, Insightful


      Honestly, if some Marxist state from the 60s produced propaganda like that, everyone would laugh:

      "The People's Revolution is about more than nationalism! New communal agricultural techniques will enable a standard of living of a completely different nature than today! Manufacturing and distributing goods for the Workers could be taken to a whole new level!"

      It's the same fallacy: "If only everyone spontaneously got together and did what I think they should, all problems would go away!"

      Yet just because the fictional utopia in question is the 'Semantic Web' rather than the 'Workers Paradise', everybody takes it really seriously. And nobody mocks it at all. Nope, nobody ever laughs at the Semantic Web.

      Ok, ok, I'm just being mean, I should go and do something useful.

      --
      Whence? Hence. Whither? Thither.
    2. Re:The semantic web is still a Good Thing by frank_adrian314159 · · Score: 2, Insightful
      Ok, ok, I'm just being mean, I should go and do something useful.

      No. Actually, you're being accurate. Unless folks can solve the multiple taxonomy problem (and, no, deciding on a common taxonomy and taxonomy translation approaches have not worked in the past) and the metadata cheating problem, the "Semantic Web" is BS promulgated by someone who probably doesn't know the history of epistemology, taxology, or why hard AI problems really are hard, even if he has been knighted. And the people who think that this is worthwhile are the same techno-utopians who probably don't know much about the problem either. When you have a robot that can actually return a Dewey Decimal System classification to four digits to the right of the decimal for a set of randomly selected web pages (and, no, just returning the word "pr0n" doesn't count, although it would probably have the best score of most algorithms you can think of) then you can come and talk about having a start. Otherwise, it's all just BS.

      --
      That is all.
  3. Man promotes own company by DrSkwid · · Score: 2, Insightful

    Hear the outlandish claims ladies and gentlemen, of how the brave doctor wants us just to have better searches.

    --
    There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
  4. Semantics don't work on a global scale by FredDC · · Score: 3, Insightful

    IMHO semantics don't work on a global scale; they only work if you restrict yourself to trusted sources. If everyone can create data and place semantics on it, it becomes useless. You can't trust everyone to place correct semantics on it: either they don't have the knowledge to place correct semantics on data, or they maliciously place the wrong semantics on it.

    --
    09 f9 11 02 9d 74 e3 5b d8 41 56 c5 63
    1. Re:Semantics don't work on a global scale by epine · · Score: 2, Insightful


      This society goes to great lengths to cultivate learned helplessness. Attitudes toward brands are a good example. Many people wish to simplify their decision making by forming an emotional bond with their favorite brands, rather than exercising rational judgement, which involves wading into the frustrations involved in finding information you can trust about the products you wish to purchase.

      I have no time for Sanger, either, who is busy trying to brand knowledge with the warm glow of credentialed expertise.

      If the purpose of semantic search is to return search results that lull the sleepy sheep into the warm glow of suspended judgement, it will be a long time coming, and the road will be paved with broken promises.

      The reason Google already works so well is that many of us actually *want* to enter into the larger context of the search terms we query. The various manifestations of my keywords are of interest to me. Once I've dialed into the subcontext I'm most interested in, it's usually an easy matter to refine the search. In rare instances, such as the metabolic cofactor SAMe, it proves almost impossible. This is a highly specialized meaning, masked by an everyday word.

      It's also annoying that Google won't accept roots, or form clusters of common spellings / misspellings. When I was working with the HC12 microcontroller, I wanted to search all the forms as a set, which included variant forms such as MC68HC12 and HCS12 and 68HC12 as well as forests of related part numbers, all of which specified an HC12 variant. Sometimes I wish to search "color/colour" as equivalent lexemes.
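
      A minimal sketch of what searching "all the forms as a set" could look like. The variant groups below are hand-written assumptions for illustration, and `expand_query` and `matches` are hypothetical helpers, not anything Google exposes:

```python
# Hypothetical equivalence classes; a real engine would have to learn or curate these.
VARIANT_SETS = [
    {"hc12", "mc68hc12", "hcs12", "68hc12"},
    {"color", "colour"},
]


def expand_query(query: str) -> list[set[str]]:
    """Replace each query term with the full set of its known variants."""
    expanded = []
    for term in query.lower().split():
        variants = {term}
        for group in VARIANT_SETS:
            if term in group:
                variants = set(group)
                break
        expanded.append(variants)
    return expanded


def matches(document: str, expanded: list[set[str]]) -> bool:
    """A document matches if it contains at least one variant of every term."""
    words = set(document.lower().split())
    return all(words & variants for variants in expanded)
```

      With this, a query for "HC12 timer" would also match a page that only ever says "MC68HC12 timer".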

      Google already works spectacularly well for any purpose except selling learned helplessness. Many weaknesses exist, and as these weaknesses become more apparent, the worst of the problems ought to be addressed by pragmatic refinements (of the existing search algorithms). Google already has the "google suggests" mechanism to propose more specialized search in the cases where they develop the capacity to support this.

      The other problem with the semantic grail is that even after undesired contexts are filtered out, you still don't have a unique answer. Now the question becomes "whose answer?". There are good business models to be had in controlling the answer to that question, and you might still get away with calling it "search", but it would totally suck as an instrument for harvesting knowledge.

      I did a lot of work in the nineties in the area of statistical NLP, and I spent a lot of time wrestling with the boundary between what statistical methods could ultimately accomplish, and what the allure of semantic methods really amounted to. Often the "long tail" itself is a fiction of surface forms. For example, "fuchsia deck chair" might be a statistical singleton on surface form, but if colour words are clustered it becomes [colour-word] deck chair, which probably isn't a statistical singleton. This level of statistical analysis is rarely employed, because the payoffs are marginal, which is yet again a testament to how well the basic (Google) algorithm already works.
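
      The deck chair example can be sketched in a few lines. The colour list and the sample phrases are made up for illustration, and `to_class_form` and `count_forms` are hypothetical names; the only point is that a phrase seen once on its surface form stops being a singleton after clustering:

```python
from collections import Counter

# Assumed, hand-written colour list; a real system would derive the cluster statistically.
COLOUR_WORDS = {"fuchsia", "red", "green", "blue", "teal"}


def to_class_form(phrase: str) -> str:
    """Replace every colour word in the phrase with a single class token."""
    return " ".join(
        "[colour-word]" if word in COLOUR_WORDS else word
        for word in phrase.lower().split()
    )


def count_forms(phrases):
    """Count phrases by raw surface form and by clustered class form."""
    surface = Counter(p.lower() for p in phrases)
    clustered = Counter(to_class_form(p) for p in phrases)
    return surface, clustered
```

      Given "fuchsia deck chair", "red deck chair" and "blue deck chair", each surface form has a count of one, while "[colour-word] deck chair" has a count of three.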

      One of the reasons statistical methods have proven so successful is that these methods nicely complement what the brain already does well (unless disabled by brand preferences). Humans don't have the patience to scan millions of documents to establish statistical patterns, but we do excel at filtering a nugget of usefulness out of a small pool of crap. This is the biological reality of Sturgeon's law. Any organism that can't identify the one nugget out of ten worth pursuing has relinquished self-destiny.

      If Google attacks the clustering and disambiguation problems, slowly but surely one thing will lead to another, and a semantic-like system will finally emerge, but one quite different than one might discover having set out to achieve the semantic grail by direct means. As Douglas Adams put it, "I may not have gone where I intended to go, but I think I have ended up where I intended to be."

  5. Re:metadata worst idea ever by $RANDOMLUSER · · Score: 4, Informative

    You're confusing the word "metadata" with the HTML <meta> tag. In this case (the semantic web) metadata would be in RDF. More clues here. What TFA is proposing is to semantically process and index websites' content, rather than have the websites (or a third party) tag the content with RDF. What both of them are lacking is any kind of a universal ontology (or even standardized specialty ontologies).

    --
    No folly is more costly than the folly of intolerant idealism. - Winston Churchill
  6. That's good by suv4x4 · · Score: 4, Interesting

    While this is not strictly a PR piece for Hakia.com, it mentions the site (and some others) and I just had to try it. I gotta be honest, it does produce more interesting results than Google in some cases (i.e. more accurate), while in others it produces worse results. But the company's young.

    Overall, this is the direction we should be taking. The semantic web is indeed just that: a shiny dream.

    Today, we're talking about anyone having the ability to create a web page, using pre-made online page/blog tools, or easy to use WYSIWYG desktop apps.

    You can't ask people who can't tell the difference between typing a query in the search engine and typing a URL in the address bar to add proper meta information to their blogs. Not to mention the abuse potential.

    I can already hear someone saying "If you don't know the XHTML/CSS specs by heart you shouldn't be making pages" but that's just arrogant. Technology should destroy barriers, not create them; the technology which implements this idea better will succeed. Look at Google: it will parse even the most horrendous code and extract proper information from it. This is why they are number 1.

    BTW, Google already extracts semantic information from both the site and query, but this is quite primitive compared to the potential mentioned in the article. Google looks for term context, meaning context, synonyms, related words etc. I hope Hakia.com and businesses like them take this idea further, so there's finally some innovation happening in search (something that has only enjoyed gradual and minuscule improvements for the last 9 years, since Google introduced pagerank).

    1. Re:That's good by suv4x4 · · Score: 2, Interesting

      Adhering to standards and accessibility may give you the edge in the business while letting a hundred monkeys bang away in Frontpage '97 won't. It's probably more arrogant to say you don't need that edge.

      There are two things here: actually there isn't a "business" behind every page. This is like saying we should all have proper automated phone answer systems on our phones, as this gives us edge in our business: but phones are used for more than business, and I certainly don't need all those fancy things on my home phone.

      The web is large enough, there's place for all kinds of sites: amateur sites with poor code and interesting content, web dev blogs with ultra accurate code and amusingly somewhat boring content, huge site portals with terrible code but a strong CMS system to make up for it, huge site portals with great code, bad CMS system and hundreds of monkeys who do manual edits on the pages every day.

      Standards, as defined by W3C are just a way to make multiple agents compatible (search engines, clients, servers). If they are compatible, you've achieved the goal of a standard. A standard isn't the goal itself, it's the means. And sometimes you need to be more flexible about the means.

      Now, I'm authoring pages strictly compliant with the standards; it's more of a geek-ish inner requirement, since I have a good knowledge of how the browsers handle all this internally (and by the time the browsers change drastically, the site would be redesigned a few times already, or dead). I don't however care about inserting empty alt tags on images without meaning, or avoiding "target" since it was supposedly bad for something. I need to use a feature, it works, it's not going away: I use it. It's my means. I achieved my goal, on time, and with great results.

    2. Re:That's good by fermion · · Score: 2, Interesting
      The points are valid within a certain context, but we have to define what that context is. First, who is going to pay for the service. Second, who is going to use the service. Third how is the service actually going to be built. Fourth how is the profit going to be derived.

      In the Google model, advertising pays the bill, the masses use it, the service is built on sound statistical principles, and profit is driven by focusing on making the process relatively simple and cheap. The web is crawled, links are counted, a bit of intelligence is added, and results are displayed.

      Overall this method has proven useful. The problems are mainly that the pagerank has proven easy to hack. I do not believe the problem is that users look for Madonna and get the pop star by mistake. Since google is meant to be used by the masses, as it is the cheap mass searches that generate revenue, the popularity ranking is not an issue. Make no mistake, google results are ofttimes crap, but they are still usable for common searches.

      The semantic web, as discussed, seems to be something different. It in fact seems to be the standard revolt of a linguist against the mathematician. The linguist says translation must be in meaning. The statistician says I can do it without understanding anything. They are both correct, but Google has shown the latter can provide reasonable and cheap results. Likewise, this guy tries to compare the long tail to the iceberg. Of course, the long tail are the minority underserved, who are underserved because they lack the means or desire to pay for the service. The hidden iceberg is the majority that sinks large ships. Not someone who understands statistics, or, for that matter, is likely to make a generous profit.

      What I think this guy is talking about is the specialized services that people might pay for directly, not a booming industry, as the nation provides librarians for free. A program that will take a search, and we assume that the user is competent enough to form the search using valid english, as there is no librarian to help construct the search, and know enough about the language, about the context, and about the subject matter, to return the exactly proper few results. It would then have to do this cheaply enough to drive a profit. This would in fact be a grand piece of software, but would it compete with Google or MSN or any mass search engine?

      I am disappointed as even simple semantic search engines could get rid of the clutter we have on google, and if someone were willing to invest, even MS for that matter, the link farms could be a thing of the past. A lot of this, I believe, is due to the battle between the mathematicians and the linguists.

      --
      "She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
  7. in the defense of meta-data by spectrokid · · Score: 4, Interesting

    Yes, people will abuse it in any way they can. Mostly to try and get higher up in the search engines. But this does not mean it is by definition useless. It is useless to do ranking, but once you (the search engine) have decided to list a site, you could use the metadata for semantic web-stuff. How about allowing for a physical address, phone number, opening hours (for brick & mortar )... This would e.g. allow for a "copy address to contacts" button. Make an easy (web based) program to generate the HTML so mom&pop shops can include it in their website, and refrain from using it for ranking purposes, and you should be ok.
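
    A rough sketch of the kind of generator described above. The function name `contact_block` is hypothetical; the class names are loosely modelled on the hCard microformat, except "opening-hours", which is made up here for illustration:

```python
from html import escape


def contact_block(name: str, street: str, city: str, phone: str, hours: str) -> str:
    """Return an HTML fragment with contact details in labelled spans."""
    fields = [
        ("fn org", name),
        ("street-address", street),
        ("locality", city),
        ("tel", phone),
        ("opening-hours", hours),  # hypothetical class name, not part of hCard
    ]
    spans = "\n".join(
        f'  <span class="{cls}">{escape(value)}</span>' for cls, value in fields
    )
    return f'<div class="vcard">\n{spans}\n</div>'
```

    The shop owner fills in five form fields and pastes the result into the page; a browser or search engine that recognises the class names could then offer the "copy address to contacts" button.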

    --

    10 ?"Hello World" life was simple then

  8. Tiresome and wrong by dread · · Score: 5, Insightful

    There is a huge problem with the argument made in the article - one which is plainly visible in the "Palladium" example. The meaning of "Palladium" is related to an internal state (i.e. my internal state). What am *I* thinking about when I write "Palladium"? Am I referring to the element Palladium? Am I referring to the DRM technologies from Microsoft? This is dependent on three things primarily:
    1: my "role". What am I? Am I a journalist at a newspaper? Am I a private citizen with a large collection of illgotten mp3s?
    2: my "context". Am I discussing something? Is this a query related to a conversation I am having with someone else? God only knows how many Google queries actually stem from ongoing IM-conversations where a, to the reader, previously unknown term/subject is brought forward.
    3: my "personality". What am I primarily interested in? What is my preferred format of consumption? If I am 7 years old - what the hell does "Palladium" really mean?

    To me it is obvious that the idea of a semantic web, the promise if you will, can never be delivered upon without a framework that is usercentric rather than centralistic in the current Googlefashion. Desktop search is interesting to some extent as a way of tying our personal space with the dataspace outside of our local control but that is still a very limited tool. Since much of what is very simplistically covered in 1 and 2 above is related to interpersonal communication it becomes obvious that what is necessary is data structures that learn from ongoing conversation, eg the intersection of Person A and Person B is described in a way that can give us guidance as to what the appropriate (or most likely) interpretation of the term used is.

    There is much that can be said about this but suffice to say that the semantic web people are ignoring the real needs that have to be met in order to create something that is truly semantic and carries a knowledge of what the end user actually intends. Because if we don't understand the intent, we don't really understand anything.

    --
    I've had a wonderful time, but this wasn't it -- Groucho Marx
    1. Re:Tiresome and wrong by illaqueate · · Score: 2, Interesting

      Yeah, pretty much. I set out to make a data assistant program in high school (c 1996-1999) and was thinking about how to get a correspondence between what I was thinking and how data would be retrieved and figured it would have to be so generic to be worthless. And then I read Hilary Putnam's Representation and Reality and felt sick about the entire thing. But now that I think back on it I did have a lot of fun testing out different kinds of data retrieval on structured and unstructured data (and thinking up weird semantic hypertext languages).

      http://slashdot.org/comments.pl?sid=142985&cid=11986906 -- lol

    2. Re:Tiresome and wrong by PPH · · Score: 2, Insightful
      Well, humans don't understand intent. They have to ask. Call up a reference librarian and ask for information on "Palladium". Odds are s/he will reply with one or more questions. I don't expect a semantic search engine to do any better.

      There are two parts to this problem. The UI, or how a user will interact with the system to describe the context within which a search is to be performed, and the web crawler, which must extract semantics from web pages based on either metadata, linking algorithms (ala Google), natural language processing, or some combination of these.

      Within restricted knowledge domains, some of these techniques work quite well already. Document management systems can enforce metadata and linking conventions and the knowledge domain is already understood to some extent. Transferring this to the WWW might be simpler than many people imagine. Just crawl the pages with the same techniques and index those where the metadata/language/linking is consistent. Ignore the rest as garbage. Odds are that what is most easily parsed and properly tagged will be the most useful to the end user. Owners of pages who wish them to be found will clean them up so as to make them appear in searches.

      --
      Have gnu, will travel.
    3. Re:Tiresome and wrong by dread · · Score: 2, Insightful

      Humans certainly understand intent. They will - as you point out - ask if they don't know the intent. You always know what you intend. If someone you know asks you a question, chances are you will have enough commonality, so to speak, to intuitively grasp the intent (or context). Your example with the librarian is interesting but pointless since you are talking about another centralised knowledge solution whereas I am talking about a decentralised model that starts with the user and - if you will - a "context model".

      --
      I've had a wonderful time, but this wasn't it -- Groucho Marx
  9. Re:Would someone please cut and paste here... by regular_gonzalez · · Score: 4, Interesting

    MovieLens is perhaps kind of similar-but-different. You go there and rate movies. Based on similarities to how other people rated movies, it then suggests movies for you and your likely rating of them. It's pretty neat actually -- my wife and I both have accounts there, and you can cross-reference with other people. So now when we go to the video store, instead of each of us picking one movie we like and potentially forcing the other person to suffer through it, we can find a movie that (in theory) we will both like. Seems fairly accurate so far.
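
    For the curious, the general idea can be sketched as user-based collaborative filtering. This is only an illustration of the technique, not MovieLens's actual algorithm; the function names and the cosine similarity measure are assumptions, and ratings are assumed to be positive numbers:

```python
from math import sqrt


def similarity(a: dict, b: dict) -> float:
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user: dict, others: list, movie: str):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted = total = 0.0
    for other in others:
        if movie in other:
            weight = similarity(user, other)
            weighted += weight * other[movie]
            total += weight
    return weighted / total if total else None


def pick_for_both(a: dict, b: dict, others: list, candidates: list) -> str:
    """Choose the candidate whose lower predicted rating is highest."""
    def worst_case(movie):
        pa, pb = predict(a, others, movie), predict(b, others, movie)
        if pa is None or pb is None:
            return float("-inf")
        return min(pa, pb)
    return max(candidates, key=worst_case)
```

    Maximising the lower of the two predictions is one simple way to avoid a movie that one person loves and the other has to suffer through.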

    --
    Due to circumstances beyond my control, I am master of my fate and captain of my soul.
  10. Semantic Search Points To Better Relevancy by robably · · Score: 4, Funny

    Quick! Tag this story as "Goldfish" and "Hairdressing".

  11. Re:metadata worst idea ever by monk.e.boy · · Score: 2, Interesting

    Semantic Web = the promise that never quite delivers

    Such a good idea in theory, but where does trust come from? Who can we trust to mark anything?

    And by the time any of this is solved google will have evolved so it can understand plain text better than mark up. How do you markup something as ambiguous? Unsure? Rumor? It's pretty easy in plain English:

    "I hear Joe is living in Cornwall". There you go, easy to use and no angle brackets.

    monk.e.boy

  12. Availability? by Anonymous Coward · · Score: 2, Funny

    Is 'Semantic Web' already included in Web 2.0? Or will that be the 3.0 version?

    BWAAAHAHAHAAAAHAAA

  13. Re:metadata worst idea ever by maxume · · Score: 2, Informative

    How do you propose enforcing any sort of universal or specialized ontology?

    If I have a turd, and I add metadata to it that says it's pure gold, it's still a turd; you have to trust me to trust my metadata. That's what the op is talking about, not the container.

    --
    Nerd rage is the funniest rage.
  14. Re:metadata worst idea ever by danbri · · Score: 2, Interesting

    "Susan saw the dog in the window. She pressed her nose against it. She wanted to buy it."

    The SW project exists *because* machines are too dumb to read English. Or Chinese. And will probably stay that way for the foreseeable future.

    So W3C's RDF is positioned half-way between the world of dumb computers and smart people. It structures data in terms of classes and properties, and allows different groups to define sets of class and property names that can be freely mixed together without the need for heavyweight standardisation. And it gives us an SQL-ish querying framework, SPARQL, for asking questions of this data, and getting back tables of results. Despite the myths, RDF doesn't oblige people to put metadata "inside every Web page". It just defines a common data model that information from various sources and formats can be mapped to, so that what they say can be processed with less regard for fiddly detail of file formats and encodings. And RDF certainly doesn't require that you believe everything you read: the SPARQL spec, unlike SQL, provides built-in machinery for querying properties of the data source, inline in your query, so you can filter the data down to the bits you decide to trust in some specific app.
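
    To make the "filter down to the bits you trust" point concrete, here is a plain-Python stand-in. It is neither RDF nor SPARQL syntax; `Statement`, `query` and the example.org sources are all invented for illustration. Each statement simply remembers where it came from, and the query only returns statements from sources the application has chosen to trust:

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    prop: str
    value: str
    source: str  # where the statement was found


def query(statements, prop, trusted_sources):
    """Return (subject, value) rows for `prop`, restricted to trusted sources."""
    return [
        (s.subject, s.value)
        for s in statements
        if s.prop == prop and s.source in trusted_sources
    ]


DATA = [
    Statement("thing1", "madeOf", "gold", "http://example.org/seller"),
    Statement("thing1", "madeOf", "brass", "http://example.org/assay-office"),
    Statement("thing2", "madeOf", "gold", "http://example.org/assay-office"),
]
```

    Asking for "madeOf" with only the assay office trusted drops the seller's claim without needing to delete it from the data.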

  15. Re:Missing the target by msporny · · Score: 2, Interesting

    If you are interested in a real solution to semantic web markup that works (and is being used) right now, you might want to check out the Microformats website. There is a growing following that is working on getting the semantic web working properly. The Firefox and Songbird guys are looking at using Microformats to make browsing the web a much richer experience - NOW, not 10 years from now.

    There are currently Microformats for marking up people, places, events, geographic locations, music, and many other widely used data items on the web. For more information on what Microformats are, check out the info page on Microformats.

    -- manu
    --
    Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
    Founder/CEO - Digital Bazaar, Inc.
  16. Semantics derivable from web corpus statistics by presidenteloco · · Score: 2, Informative

    Some (if not all) of the concept relation semantics needed for doing "semantic search"
    or "machine comprehension" of text on the web can be gleaned by
    doing statistical analysis of the relationships between words and phrases
    across the entire web. Aggregating across a large corpus eliminates "noise"
    in usage and draws out the semantic "signal" about how people relate the
    concepts to each other.
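
    One simple version of this kind of analysis is pointwise mutual information (PMI) over document-level co-occurrence, sketched below. PMI is swapped in here as one common choice, not necessarily the method meant above, and `pmi_table` is a hypothetical name:

```python
from collections import Counter
from itertools import combinations
from math import log2


def pmi_table(documents):
    """Pointwise mutual information for word pairs that share a document."""
    n = len(documents)
    word_docs = Counter()  # number of documents containing each word
    pair_docs = Counter()  # number of documents containing both words
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        word_docs.update(words)
        pair_docs.update(combinations(words, 2))
    return {
        pair: log2(
            (count / n) / ((word_docs[pair[0]] / n) * (word_docs[pair[1]] / n))
        )
        for pair, count in pair_docs.items()
    }
```

    Pairs that turn up together more often than chance would predict get a positive score, and pairs that tend to avoid each other get a negative one. Keys are alphabetically ordered word pairs.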

    --

    Where are we going and why are we in a handbasket?
  17. Re:Would someone please cut and paste here... by srussell · · Score: 2, Insightful

    instead of each of us picking one movie we like and potentially forcing the other person to suffer through it, we can find a movie that (in theory) we will both like.

    If it weren't for my wife, my media consumption would consist entirely of science fiction and WWI/II movies; thanks to my wife, I've been exposed to a much broader swath of media genres -- some of which has been painful, and some of which I've regretted... but in the balance, I think I'm a better person for it. But, then, I possess an abundance of room for improvement.

    Actually, this issue is something that bothers me. This increasing ability to narrow our exposure to data which we find unpleasant, to filter out the world so that we only see what we want to see, is vaguely disturbing. I see what I think are consequences of this increasingly in my own country, and evidence of it in the form of rising fundamentalism around the world. I'm afraid that I do it, too. It is limiting and dangerous, and increasingly easy to do.

    I don't have a solution, and maybe there isn't one. Perhaps, someday, we'll all live in virtual realities where all of the facts are shaped to what we want to believe, and we'll never have to interact with anybody who disagrees with us, and we'll find that this is the utopia that humans have been searching for.

    Maybe.

    --- SER