Slashdot Mirror


On Finding Semantic Web Documents

Anonymous Coward writes "A research group at University of Maryland has published a blog describing the latest approach for finding and indexing Semantic Web Documents. They have published it in reaction to Peter Norvig's (director of search quality at Google) view on the Semantic Web (Semantic Web Ontologies: What Works and What Doesn't): 'A friend of mine [from UMBC] just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go.'"

8 of 67 comments

  1. It's not about the filename by Simon+Brooke · · Score: 3, Insightful

    It's not about the filename extension (if any), silly. It's about the data. Valid RDF data may be stored in files with a wide range of extensions, or even (how radical is this?) generated on the fly.

    What matters is first the MIME type (which is most likely application/xml or preferably text/xml), and then the data in it.
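A minimal sketch of the check described above, ignoring the URL's extension and looking only at the MIME type and the parsed content. The function name `looks_like_rdf` and the set of accepted MIME types are assumptions for illustration (application/rdf+xml is added here as the registered type for RDF/XML); it only recognizes RDF/XML whose root element is rdf:RDF.

```python
import xml.etree.ElementTree as ET

# Namespace that identifies the rdf:RDF root element in RDF/XML.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumed list of MIME types worth inspecting; adjust as needed.
XML_TYPES = {"application/rdf+xml", "application/xml", "text/xml"}


def looks_like_rdf(content_type: str, body: str) -> bool:
    """Guess whether a response is RDF/XML from its MIME type and content."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    if mime not in XML_TYPES:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # ElementTree spells a namespaced tag as "{namespace}localname".
    return root.tag == "{%s}RDF" % RDF_NS
```

A crawler using something like this would count a document served from any URL, including one generated on the fly, as long as the response itself is RDF.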

    Oh, and, First Post, BTW.

    --
    I'm old enough to remember when discussions on Slashdot were well informed.
    1. Re:It's not about the filename by old_guys_can_code · · Score: 3, Interesting

      I work at one of the few places that crawls billions of URLs each month, and I observed exactly the same thing as Peter. There just isn't that much xml/rdf/daml/owl on the web. At the point when we had crawled 6 billion URLs, I found only 180,000 URLs that had a mime type or extension to indicate that they were machine-readable metadata.
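As a quick check of the proportions quoted in this thread, the arithmetic can be sketched as follows. The helper name `share_percent` is hypothetical; the figures are the ones given above (180,000 of 6 billion URLs) and in the story summary (200,000 files, about 0.005% of the web).

```python
def share_percent(found: int, total: int) -> float:
    """Return found/total expressed as a percentage."""
    return 100.0 * found / total


# Figures from this comment: 180,000 metadata URLs out of 6 billion crawled.
crawl_share = share_percent(180_000, 6_000_000_000)

# Figures from the story summary: 200,000 files said to be about 0.005%
# of the web, which implies an index of roughly this many pages.
implied_web_size = 200_000 / (0.005 / 100.0)

print(f"crawl share: {crawl_share:.4f}%")
print(f"implied web size: {implied_web_size:.2e} pages")
```

Both sets of numbers land in the same range: a few thousandths of a percent of an index of several billion pages.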

      The reason is something that people in the semantic web community are loath to talk about - that there isn't enough incentive for people to create metadata that they put out for others to read. When we write web pages or blogs, we are able to express ourselves to other humans, but when we put out data there is no clear incentive (economic or otherwise) to justify the effort. This is probably why there is so little metadata being published.

      If you wish to dispute the small amount of data, feel free to put up a web server showing a million URLs of metadata created by others.

  2. What about... by Apreche · · Score: 4, Insightful

    What about all the pages that are .rss but are actually RSS 1.0? Those are RDF-based. And what about all the RDF that sits in the comments of .html files and others? My Creative Commons license is RDF, but it's inside a .html file. Sure, we do have a long way to go, but the semantic web is bigger than a few file extensions findable by Google.
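To illustrate the point about RDF hiding inside .html files, here is a rough sketch that pulls rdf:RDF blocks out of a page, including ones wrapped in an HTML comment. The function name `embedded_rdf` and the sample page are made up for illustration, and the regular expression assumes the conventional `rdf` prefix; a real crawler would need something more robust.

```python
import re

# Matches an <rdf:RDF>...</rdf:RDF> element wherever it appears, including
# inside an HTML comment. Assumes the conventional "rdf" prefix is used.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def embedded_rdf(html: str) -> list[str]:
    """Return every rdf:RDF block found in an HTML page."""
    return RDF_BLOCK.findall(html)


# Hypothetical page with license metadata tucked into a comment.
page = """<html><body>
<p>Some rights reserved.</p>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.org/page"/>
</rdf:RDF>
-->
</body></html>"""

blocks = embedded_rdf(page)
print(len(blocks), "embedded RDF block(s) found")
```

A search by file extension would never see this page as RDF, which is the gap the comment is pointing at.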

    --
    The GeekNights podcast is going strong. Listen!
  3. unexpected? by AnonymousCactus · · Score: 3, Insightful

    Without a large number of widely used tools out there that make use of semantic information there won't be that much content designed for them...and without content designed for them the tools won't exist and certainly won't be widely used. Currently it's more of an academic exercise - if we somehow knew what all this information on the web actually was, what could we do with it? More interesting, it seems, are approaches that bypass marking things up by hand and do something equivalent automatically.

  4. Solution without a problem? by faust2097 · · Score: 4, Interesting

    Semantic web stuff is cool and all but I honestly don't believe that it will ever really take off in any meaningful way. For one, it takes a paradigm that people know and understand and adds a lot of complexity to it, both on the user end and the engineering end.

    Plus a lot of the rah-rah booster club that's grown up around it sounds a whole lot like the Royal Society folks in Quicksilver who keep trying to catalog everything in the world into a 'natural' organization.

    What it basically comes down to for me is that it seems like a great framework for single-topic information organization but at some point we need to keep our focus on the actual content of what we're producing more than the packaging. For this to be ready for prime time the value proposition needs to move from a 30-minute explanation involving diagrams and made-up words ending in '-sphere' to something even shorter than an "elevator pitch", like 2 sentences.

  5. LiveJournal and other weblogging services by crschmidt · · Score: 3, Informative

    Every user of a LiveJournal-based website running recent code has a FOAF file. Let's look at how many users that is:

    * LiveJournal.com: 5751567
    * GreatestJournal.com: 717406
    * DeadJournal.com: 474435
    * Weedweb.net: 22650
    * InsaneJournal.com: 12970
    * JournalFen.net: 7629
    * Plogs.net: 7086
    * journal.bad.lv: 4530

    (This list is most likely incomplete.)

    In addition to this, every Typepad user has an account: according to the 6A merger stories, that's another million users. Add in the RDF from all the Typepad RSS files, and that's another 1 million.

    All Wordpress blogs have a feed, located at /feed/rdf or /wp-rdf.php, which is in RDF. Movable Type comes preinstalled with an RSS 1.0 feed. Each of these has at least a couple thousand users.

    So, we've got, just as a guess, about 9 million RDF files out there in the blogging world alone. Throw in a hell of a lot of scientific data, and everything on RDFdata.org, and you start to get an idea that the world is a lot more Semantic Web enabled than you seem to think it is.
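The "about 9 million" guess can be reproduced from the numbers listed above. This is only a tally of the figures quoted in this comment; the variable names are made up, and the WordPress and Movable Type feeds ("at least a couple thousand" each) are left out because no exact counts are given.

```python
# User counts for LiveJournal-based sites, as listed in the comment.
livejournal_sites = {
    "LiveJournal.com": 5_751_567,
    "GreatestJournal.com": 717_406,
    "DeadJournal.com": 474_435,
    "Weedweb.net": 22_650,
    "InsaneJournal.com": 12_970,
    "JournalFen.net": 7_629,
    "Plogs.net": 7_086,
    "journal.bad.lv": 4_530,
}

# Rough Typepad figures from the comment: a million users, plus another
# million RDF documents from their RSS files.
typepad_users = 1_000_000
typepad_rss = 1_000_000

foaf_total = sum(livejournal_sites.values())
estimate = foaf_total + typepad_users + typepad_rss

print(f"LiveJournal-based FOAF files: {foaf_total:,}")
print(f"Estimated RDF files in blogging: {estimate:,}")
```

The LiveJournal-based sites alone come to just under 7 million, so the two Typepad figures bring the total to just under 9 million.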

    --
    -- Christopher Schmidt YouTube Quality of Experience
  6. Just my opinion, but... by crazyphilman · · Score: 3, Insightful

    I think the "Semantic Web" sounds great on paper, and is the next big thing in university research departments and etc, etc, BUT I don't think it's going to end up seeing wide use. Here are my reasons, basically a list of things that I as a web developer would hesitate on.

    1. The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met. I can submit the main web page to search engines, prevent the rest from being indexed, figure out how to advertise my page's existence... I'm pretty much set. The extra stuff doesn't buy me anything. In fact, I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.

    2. Let's say people start using this tech, which I imagine would involve all sorts of extra tagging in pages, extra metadata, etc. Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow. On top of that, you have to trust the tool vendors to write bug-free code, which isn't going to happen. What I'm saying is that all these extra layers of complexity are places for bugs, screw-ups, and booby traps to hide.

    3. And, the real beneficiary of these sorts of systems seems to be the tool vendors themselves. Because what this REALLY seems to be about is software vendors figuring out a new thing they can charge money for. Don't write those web pages using HTML, XML, and such! No, code them up with our special sauce, and use our special toolset to bake them into buttery goodness! Suddenly, you're not just writing HTML, you're going through a whole development process for the simplest of web pages.

    Maybe I'm getting crusty in my old age, but it seems that every single year, some guy comes up with some new layer of complexity that we all "must have". It's never enough for a technology to simply work with no muss and no fuss. Nothing must ever be left alone! We must change everything every year or two! Because otherwise, what would college kids do with their excess energy, eh?

    Sigh... Anyway, no matter what you try and do to prevent the Semantic Web from turning out just like meta tags, the inevitable will happen. You watch.

    --
    Farewell! It's been a fine buncha years!
    1. Re:Just my opinion, but... by l0b0 · · Score: 3, Insightful
      The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met.

      How about the needs of the people actually using the page? If you don't care about the viewers, why bother putting it on the web?

      I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.

      That sounds just like the kind of site I get pissed off at, when being redirected to the main page after finding the page I really want via Google. Forcing visitors to jump through hoops has never been popular.

      Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow.

      As a web developer, you probably already know what kinds of ugly designs there are out there. And yet, by some kind of magic, there are companies which create searchable indexes of these pages, and it just works. One of the benefits of this technology I expect to see in search engines shortly is the possibility of semantic searches. How would you go about, today, looking for a bike magazine called "Encyclopedia" (I've tried)? Or research results relevant to your latest blog entry? Or the cheapest direct or indirect first class return ticket from London to New Delhi departing between one hour from now and 9 a.m. Thursday, with return between three and five days later, no smoking all the way?
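A toy sketch of what such a semantic search could look like for the bike magazine example, assuming pages published their metadata as subject-predicate-object triples. Everything here is hypothetical: the `ex:` identifiers, the sample triples, and the `subjects_matching` helper are invented for illustration and do not correspond to any real vocabulary or search engine.

```python
# Hypothetical metadata as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:mag1", "rdf:type", "ex:Magazine"),
    ("ex:mag1", "ex:title", "Encyclopedia"),
    ("ex:mag1", "ex:topic", "ex:Cycling"),
    ("ex:book1", "rdf:type", "ex:ReferenceBook"),
    ("ex:book1", "ex:title", "Encyclopedia"),
    ("ex:mag2", "rdf:type", "ex:Magazine"),
    ("ex:mag2", "ex:title", "Bike Weekly"),
    ("ex:mag2", "ex:topic", "ex:Cycling"),
]


def subjects_matching(triples, constraints):
    """Return subjects that carry every (predicate, object) pair in constraints."""
    facts = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, set()).add((predicate, obj))
    # A subject matches when the constraints are a subset of its facts.
    return sorted(s for s, pairs in facts.items() if constraints <= pairs)


# "A bike magazine called Encyclopedia", expressed as constraints.
hits = subjects_matching(
    TRIPLES,
    {
        ("rdf:type", "ex:Magazine"),
        ("ex:title", "Encyclopedia"),
        ("ex:topic", "ex:Cycling"),
    },
)
print(hits)
```

A keyword search for "Encyclopedia" would return both the magazine and the reference book; the typed constraints are what narrow it down to the magazine.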