Slashdot Mirror


Semantic Web Under Suspicion

Dr Occult writes "Much of the talk at the 2006 World Wide Web conference has been about the technologies behind the so-called semantic web. The idea is to make the web intelligent by storing data such that it can be analyzed better by our machines, instead of the user having to sort and analyze the data from search engines. From the article: 'Big business, whose motto has always been time is money, is looking forward to the day when multiple sources of financial information can be cross-referenced to show market patterns almost instantly.' However, concern is also growing about the misuses of this intelligent web as an affront to privacy and security."

8 of 79 comments

  1. All Talk by eldavojohn · · Score: 5, Informative

    So I know a lot of people that get all excited when they read articles on the "semantic web."

    I think that we are all missing some very important aspects of what it takes to make something capable of what they speak of. In all the projects I have worked on, to create something geared toward this sort of solution, you need two things: training data & a robust taxonomy.

    First things first, how would we define or even agree on a taxonomy? By taxonomy, I mean something with breadth & depth that has been used and verified. By breadth I mean that it must be capable of normalization (pharmaceutical concoctions, drugs & pills are all the same concept), stemming (go & went are the same action, dog & dogs are the same concept) and also important is how many tokens wide a concept can be. By depth I mean that we must be able to define specificity and use it to our advantage (a site about 747s is scored higher than a site about airline jets which is scored higher than a site about planes). By rigorous I mean that it must be tried and true ... you start with a corpus of documents to "seed" it and have experts (or web surfers) contribute little by little until it is accurate. Oh, it must also be able to adapt quickly and stay current.
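    A minimal sketch of the breadth and depth ideas above, assuming a hand-built toy taxonomy. Every table, name and value here (PARENT, SYNONYMS, IRREGULAR, stem, normalize, depth) is hypothetical, and the stemmer is deliberately crude; a real system would need far more than this.

```python
# Toy taxonomy: every name and value here is hypothetical.
# Each concept maps to its parent; depth in the tree stands in for specificity.
PARENT = {
    "vehicle": None,
    "plane": "vehicle",
    "airline jet": "plane",
    "747": "airline jet",
    "drug": None,
}

# Normalization: surface forms (some more than one token wide) naming one concept.
SYNONYMS = {
    "pharmaceutical concoction": "drug",
    "pill": "drug",
    "drug": "drug",
}

# Irregular forms that a suffix-stripping stemmer cannot recover.
IRREGULAR = {"went": "go"}


def stem(token: str) -> str:
    """Very crude stemmer: irregular lookup, then strip a trailing plural 's'."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def normalize(phrase: str) -> str:
    """Stem each token, then map the whole phrase onto its concept if known."""
    stemmed = " ".join(stem(t) for t in phrase.split())
    return SYNONYMS.get(stemmed, stemmed)


def depth(concept: str) -> int:
    """Number of steps from the concept up to the root of its tree."""
    steps = 0
    while PARENT.get(concept) is not None:
        concept = PARENT[concept]
        steps += 1
    return steps


if __name__ == "__main__":
    for query in ("747s", "airline jets", "planes"):
        concept = normalize(query)
        print(query, "->", concept, "specificity", depth(concept))
```

    Scoring by depth is only one possible reading of "specificity"; it is used here because it is the simplest thing that orders 747s above airline jets above planes.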

    Without a taxonomy, how will we index sites and be able to tell the difference between "water tanks" and "panzer tanks"? I think that this is one of the great things that Google is missing to really improve its searching abilities. If you suggest an ontology to replace it, the problems encountered in developing it only multiply.
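    To illustrate the "tanks" problem, here is a rough sketch that picks a sense by counting overlapping context words. The SENSES table and its cue words are made up for this example; this is a simple overlap heuristic, not how any particular search engine works.

```python
# Hypothetical cue words for two senses of "tank".
SENSES = {
    "tank (container)": {"water", "storage", "gallon", "fuel", "plumbing"},
    "tank (armored vehicle)": {"panzer", "army", "armor", "turret", "battle"},
}


def disambiguate(text):
    """Return the sense whose cue words overlap the text most, or None."""
    words = set(text.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("water tanks"))
    print(disambiguate("panzer tanks"))
    print(disambiguate("tanks"))  # no context, so no decision
```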

    Where is the training data? Well, one may argue that the web content out there will suffice as training data but I think that more importantly, they need collections of traffic for these sites and user behavioral patterns to quickly and adequately deduce what the surfer is in need of.

    I feel that these two aspects are missing and the taxonomy may be impossible to achieve.

    Why are we even concerned with security if we can't even lay the foundations for the semantic web? I would argue that once we plan it out and determine it's viable, then we concern ourselves with everyone's rights.

    --
    My work here is dung.
    1. Re:All Talk by $RANDOMLUSER · · Score: 4, Interesting

      I've always thought that the Table of Contents for Roget's Thesaurus was one of the greatest works of mankind. I don't think many people realize just how difficult the problem really is, and how long it's going to take.

      --
      No folly is more costly than the folly of intolerant idealism. - Winston Churchill
  2. Smarter Machines by jekewa · · Score: 4, Interesting
    I personally fear the day that a machine or algorithm can better determine the purpose for my keyword-based search than I can. Sure, there's a lot of improvement that can be done to make the searches more precise, but certainly in the end it'll be my decision what's important and what isn't.

    What I really want to see is the search engine reduce the duplicated content to single entries (try Googling for a Java classname and you'll see how many Google-searched websites have the API on them), or order them by recurrence of the word or phrase, giving the context more value than the popularity of the page.
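    A minimal sketch of the "reduce duplicated content to single entries" idea, assuming exact duplicates only. The URLs and the helper names fingerprint and collapse_duplicates are hypothetical; near-duplicates (same API page with a different site header) would need something fuzzier than a hash.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def collapse_duplicates(results):
    """Group (url, text) results that share a fingerprint, keeping first-seen order."""
    groups = {}
    for url, text in results:
        groups.setdefault(fingerprint(text), []).append(url)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical search results: two mirrors of one page plus a distinct page.
    results = [
        ("mirror-a.example/String.html", "Class String  represents character strings"),
        ("mirror-b.example/String.html", "class string represents character strings"),
        ("tutorial.example/strings", "How to use strings in practice"),
    ]
    for group in collapse_duplicates(results):
        print(group[0], "(+%d duplicates)" % (len(group) - 1))
```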

    --
    End the FUD
    1. Re:Smarter Machines by Irish_Samurai · · Score: 5, Insightful

      What I really want to see is the search engine reduce the duplicated content to single entries (try Googling for a Java classname and you'll see how many Google-searched websites have the API on them), or order them by recurrence of the word or phrase, giving the context more value than the popularity of the page.

      There is a huge problem with this, and it goes back to the days of people jamming 1000 instances of their keywords at the bottom of their pages in the same font color as the background. Also, your desire to rate the pages on context requires an ontology type algo, which is NOT easy. Google has been working on this for a little while now, but it is a big hill to climb. They are using popularity as a substitution for this. It is not the most effective, but it is a pretty decent second option.
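      A small sketch of why ordering by raw recurrence is easy to game. The page names and contents are invented for illustration, and term_count and rank_by_count are hypothetical helpers, not any search engine's actual ranking.

```python
def term_count(page_text: str, term: str) -> int:
    """Count whole-word occurrences of term, case-insensitively."""
    return page_text.lower().split().count(term.lower())


def rank_by_count(pages: dict, term: str) -> list:
    """Order page names by how often the term occurs, most frequent first."""
    return sorted(pages, key=lambda name: term_count(pages[name], term), reverse=True)


# Hypothetical pages: one honest, one with the keyword jammed in 1000 times.
PAGES = {
    "honest-docs": "the java string class represents character strings",
    "stuffed-spam": "buy pills " + "java " * 1000,
}

if __name__ == "__main__":
    # The stuffed page wins on raw count despite saying nothing about Java.
    print(rank_by_count(PAGES, "java"))
```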

      There is another issue with the approach you suggest. If Google decides that javapage.htm is the end all be all of JAVA knowledge, and removes all other listings from their database - then everyone and their grandmother will be fed information from this one source. That will ultimately reduce the effectiveness of Google to return valid responses to people who do not use search like a robot.

      There is a human element at play here that Google is attempting to cater to through sheer numbers. Not everyone knows how to use search properly, hell most people have no idea. Keyword order, booleans, quotes - these will all affect the results given back, but very few people use them right off the bat. If you reduce the number of returned listings for a single word search to one area that was determined to be the authority, you have just made your search engine less effective in the eyes of the less skilled. I would be willing to bet that this less skilled group composes most of Google's userbase.

      If you don't cater to these people, then you lose marketshare, and then you lose revenue from advertisers, and then you go out of business.

  3. It's already happening... by gravyface · · Score: 4, Insightful

    ...and growing and evolving.

    Take a look at the "blogosphere" and the tagging/classification initiative that's happening there.

    Sure, it seems crude and unrefined but it's working, like most grass-roots initiatives do when compared with grandiose "industry standards" and the big, bulky workgroups that try to define them.

    --
    body massage!
  4. Semantic Web ~- evil by tbriggs6 · · Score: 5, Informative
    The article does a pretty bad job at explaining the situation. The idea behind the Semantic Web is simply to provide a framework for information to be marked up for machines rather than human eyes. The idea is that using an agreed upon frame of reference for the symbols contained in the page (an ontology), agents are able to make use of the information contained there. Further, an agent can collect data from several different ontologies and (hopefully) perform basic reasoning tasks over that data, and (even better) complete some advanced tasks for the agent's user.
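    As a rough sketch of what "marked up for machines" plus "basic reasoning" can look like, assume facts are published as (subject, predicate, object) triples. The two sites, all identifiers, and the predicate names is_a and subclass_of are made up for this example; real Semantic Web data would use RDF and shared vocabularies rather than bare Python tuples.

```python
# Hypothetical facts published by two different sites.
SITE_A = {
    ("hotel:Seaview", "is_a", "Hotel"),
    ("hotel:Seaview", "price_usd", 54),
}
SITE_B = {
    ("Hotel", "subclass_of", "Lodging"),
    ("Hostel", "subclass_of", "Lodging"),
    ("hostel:Dunes", "is_a", "Hostel"),
}


def instances_of(facts, cls):
    """Find everything that is_a cls or is_a any transitive subclass of cls."""
    classes = {cls}
    changed = True
    while changed:  # keep expanding until no new subclass is found
        changed = False
        for subject, predicate, obj in facts:
            if predicate == "subclass_of" and obj in classes and subject not in classes:
                classes.add(subject)
                changed = True
    return sorted(s for s, p, o in facts if p == "is_a" and o in classes)


if __name__ == "__main__":
    merged = SITE_A | SITE_B  # the agent collects data from both sources
    # Neither site alone says that hotel:Seaview is Lodging; the merge does.
    print(instances_of(merged, "Lodging"))
```

    The inference only works because both sites use the symbol "Hotel" for the same concept, which is exactly the agreement an ontology is supposed to provide.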

    The article would have us believe that this is going to expose everyone to massive amounts of privacy invasion. This is not necessarily the case. It is already the case that there are privacy mechanisms to protect information in the SW (e.g. require agents to authenticate to a site to retrieve restricted information). Beyond simple mechanisms, there is a lot of research being conducted on the idea of trust in the semantic web - e.g. how does my agent know to trust a slashdot article as absolute truth and a wikipedia article as outright fabrication (or vice versa).

    As for making the content of the internet widely available, some researchers feel this will never happen. As another commenter noted, it is essential that there is agreement in the definition of concepts (ontologies) to enable the SW to work (if my agent believes the symbol "apple" refers to the concept Computer, and your agent believes it refers to "garbage", we may have some interesting but less than useful results). I am researching ontology generation using information extraction / NLP techniques, and it is certainly a difficult problem, and one that isn't likely to have a trivial solution (in some respects, this goes back to the origins of AI in the 1950's, and we're still hacking at it today).

    For some good references on the Semantic Web (beyond Wikipedia), check out some of these links

  5. Pfff, the problem is marketing by SmallFurryCreature · · Score: 4, Insightful
    Let's use the holiday example given in the article. So I got a hotel that is 54 dollars per night. That means I am not going to be included in the below 50 dollar search. Hmmm, I don't want that. I want maximum exposure. So I lower my price to 49 dollars + 10 dollars in extra fees that are a surprise when you receive the bill (what you say? 49+10 > 54? Of course you idiot, any price cut must be offset by higher charges elsewhere.)
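    The arithmetic above as a small sketch, assuming a search that can only filter on whatever price the hotel chooses to publish. The Listing class, its field names and the under helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    name: str
    advertised: int       # price per night the hotel publishes, in dollars
    hidden_fees: int = 0  # extra charges that only show up on the bill

    @property
    def total(self) -> int:
        return self.advertised + self.hidden_fees


def under(listings, limit, key):
    """Names of listings whose chosen price figure is below the limit."""
    return [item.name for item in listings if key(item) < limit]


HONEST = Listing("honest hotel", 54)
GAMED = Listing("gamed hotel", 49, 10)

if __name__ == "__main__":
    listings = [HONEST, GAMED]
    # Filtering on the published price rewards the hotel that hides its fees.
    print(under(listings, 50, key=lambda item: item.advertised))
    # Filtering on the real total shows neither hotel is under 50.
    print(under(listings, 50, key=lambda item: item.total))
```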

    You could already do this semantic web nonsense if people would just stick to a standard and be honest with what they publish.

    Nobody wants to do that however. Mobile phone companies always try to make their offering sound as attractive as possible by highlighting the good points and hiding the bad ones. Phone stores try to cut through this by making their own charts for comparing phone companies but in turn try to hide the fact that they get a bigger cut from some companies than others.

    It wouldn't be at all hard to set up a standard that would make it very easy to tell what cell phone subscription is best for you. Getting the companies involved to participate is impossible however.

    This is the real problem with searching the web right now. It wouldn't be at all hard to use Google today if everyone was honest with their site content. For instance, remove the word "review" from a product page if no review is available.

    Do you think this is going to happen any day soon? No? Then the semantic web will not be with us any day soon either.

    --

    MMO Quests are like orgasms:

    You may solo them, I prefer them in a group.

  6. Glass Houses by Baavgai · · Score: 4, Insightful

    "All of this data is public data already," said Mr Glaser. "The problem comes when it is processed."

    The privacy and security concerns are bizarre. They're saying that there is currently an implicit "security through obscurity" and that's ok. However, if someone were to make the available data easier to find, then it would be less secure?

    Here's a radical thought: don't make any data public you don't want someone to see. Blaming Google because you put your home address on your blog and "bad people" found you is absurd. If data is sensitive it shouldn't be there now.

    You can't really bitch about peeping Toms if you built the glass house.