On Finding Semantic Web Documents
Anonymous Coward writes "A research group at University of Maryland has published a blog describing the latest approach for finding and indexing Semantic Web Documents. They have published it in reaction to Peter Norvig's (director of search quality at Google) view on the Semantic Web (Semantic Web Ontologies: What Works and What Doesn't): 'A friend of mine [from UMBC] just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go.'"
It's not about the filename extension (if any), silly. It's about the data. Valid RDF data may be stored in files with a wide range of extensions, or even (how radical is this?) generated on the fly.
What matters is first the mime type (which is most likely application/xml or preferably text/xml), and the data in it.
Oh, and, First Post, BTW.
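A rough sketch of the point made above about judging by the data instead of the filename: the check below ignores the extension entirely and only asks whether the payload parses as XML with an rdf:RDF root element. The function name `looks_like_rdf_xml` and the sample document are made up for illustration, and the root-element test is a simplification (RDF/XML allows the rdf:RDF wrapper to be omitted, and other RDF serializations exist).

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RDF syntax vocabulary.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def looks_like_rdf_xml(payload: bytes) -> bool:
    """Return True if payload is well-formed XML whose root element is rdf:RDF.

    Hypothetical helper: it looks only at the content, never at a filename.
    """
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        return False
    # ElementTree spells qualified names as "{namespace-uri}localname".
    return root.tag == f"{{{RDF_NS}}}RDF"


# Illustrative document; it could be served from any URL, with any extension.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page">
    <dc:title>Example</dc:title>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    print(looks_like_rdf_xml(SAMPLE))
    print(looks_like_rdf_xml(b"<html><body/></html>"))
```

A crawler built this way would still have to fetch every candidate document, which is exactly why counting by extension undercounts.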
I'm old enough to remember when discussions on Slashdot were well informed.
I used to love their Norton Utilities.
What about all the pages that are .rss but are actually RSS 1.0? Those are RDF-based. And what about all the RDF which is in the comments of .html files and others? My Creative Commons license is RDF, but it's inside a .html file. Sure, we do have a long ways to go, but the Semantic Web is bigger than a few file extensions findable by Google.
The GeekNights podcast is going strong. Listen!
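To illustrate the remark above about RDF hiding in the comments of .html files, here is a minimal sketch that pulls rdf:RDF blocks out of HTML comments using only the Python standard library. The class and function names are hypothetical, the sample page is invented, and the pattern assumes the conventional `rdf:` prefix, which a real document is free to change.

```python
import re
from html.parser import HTMLParser

# Assumption: the embedded block uses the conventional "rdf:" prefix.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


class CommentRDFExtractor(HTMLParser):
    """Collects rdf:RDF blocks found inside HTML comments."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->".
        self.blocks.extend(RDF_BLOCK.findall(data))


def extract_rdf_from_comments(html: str) -> list:
    """Return every rdf:RDF block embedded in a comment of the given page."""
    parser = CommentRDFExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Invented page with license-style metadata tucked into a comment.
PAGE = """<html><body>
<p>My photos</p>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cc="http://web.resource.org/cc/">
  <cc:Work rdf:about="http://example.org/photos"/>
</rdf:RDF>
-->
</body></html>"""

if __name__ == "__main__":
    for block in extract_rdf_from_comments(PAGE):
        print(block)
```

None of this metadata would show up in a search restricted to .rdf or .owl filenames.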
What's all this about finding Semetic web documents... Oh... Never mind.
Without a large number of widely used tools out there that make use of semantic information, there won't be that much content designed for them... and without content designed for them the tools won't exist and certainly won't be widely used. Currently it's more of an academic exercise: if we somehow knew what all this information on the web actually was, what could we do with it? More interesting, it seems, are approaches that bypass marking things up by hand and do something equivalent automatically.
Semantic web stuff is cool and all but I honestly don't believe that it will ever really take off in any meaningful way. For one, it takes a paradigm that people know and understand and adds a lot of complexity to it, both on the user end and the engineering end.
Plus a lot of the rah-rah booster club that's grown up around it sounds a whole lot like the Royal Society folks in Quicksilver who keep trying to catalog everything in the world into a 'natural' organization.
What it basically comes down to for me is that it seems like a great framework for single-topic information organization but at some point we need to keep our focus on the actual content of what we're producing more than the packaging. For this to be ready for prime time the value proposition needs to move from a 30-minute explanation involving diagrams and made-up words ending in '-sphere' to something even less than an "elevator pitch", like 2 sentences.
Who cares about the semantic web or any new web technology if it's going to be deluged by spam within 5 days of deciding to use it, and thus become unusable / untrustworthy as a resource? Deal with the spam problem, then come back to me about these great new technologies that are vulnerable to it.
Caesar si viveret, ad remum dareris.
Dude! It's got to be the funniest thing I've heard so far. Here is an idea for your friend to save him some embarrassment. Ask him to turn on the TV, hook up the laptop and pause some video (non-porn); that way he can cover up the burn-in... actually burn the remaining stuff so they can't make out Man! My buddies are laughing their A** off too.. Good one, will remember for a long time to come.
I thought I knew what these articles were supposed to be talking about, but it turns out I had no clue.
Thinkin' Lincoln - a web comic of presidential proportions
Hahahahahah .... this shit is funny ahhahahahaha
I wonder if he will have nightmares?
... rules the jungle without fear.
Norton Antivirus got to do with this web technology?
When you can honestly show me how any sane person can support the RIAA's stance on anything - yet remain a money-gouging, price-fixing cartel who pay their representatives (the artists) a pittance - I may actually listen to you.
Every user of a LiveJournal-based website running recent code has a FOAF file. Let's look how many users that is:
* LiveJournal.com: 5751567
* GreatestJournal.com: 717406
* DeadJournal.com: 474435
* Weedweb.net: 22650
* InsaneJournal.com: 12970
* JournalFen.net: 7629
* Plogs.net: 7086
* journal.bad.lv: 4530
(This list is most likely incomplete.)
In addition to this, every Typepad user has an account: according to the 6A merger stories, that's another million users. Add in the RDF from all the Typepad RSS files, and that's another 1 million.
All Wordpress blogs have a feed, located at /feed/rdf or /wp-rdf.php, which is in RDF. Movable Type comes preinstalled with an RSS 1.0 feed. Each of these has at least a couple thousand users.
So, we've got, just as a guess, about 9 million RDF files out there in the blogging world alone. Throw in a hell of a lot of scientific data, and everything on RDFdata.org, and you start to get an idea that the world is a lot more Semantic Web enabled than you seem to think it is.
-- Christopher Schmidt YouTube Quality of Experience
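For anyone who wants to check the "about 9 million" guess in the comment above, the arithmetic can be reproduced in a few lines. The per-site counts are copied from the list as posted. The two 1,000,000 figures are the commenter's round estimates for Typepad, and leaving out WordPress and Movable Type (described only as "at least a couple thousand users" each) is a simplification made here, not in the comment.

```python
# User counts for LiveJournal-based sites, as listed in the comment above.
livejournal_based = {
    "LiveJournal.com": 5751567,
    "GreatestJournal.com": 717406,
    "DeadJournal.com": 474435,
    "Weedweb.net": 22650,
    "InsaneJournal.com": 12970,
    "JournalFen.net": 7629,
    "Plogs.net": 7086,
    "journal.bad.lv": 4530,
}

# Round estimates from the comment: one FOAF file and one RSS feed per Typepad user.
typepad_foaf = 1_000_000
typepad_rss = 1_000_000

livejournal_total = sum(livejournal_based.values())
grand_total = livejournal_total + typepad_foaf + typepad_rss

print(f"LiveJournal-based FOAF files: {livejournal_total:,}")  # 6,998,273
print(f"Rough total RDF files:        {grand_total:,}")        # 8,998,273
```

So the listed sites alone account for roughly 7 million FOAF files, and the Typepad estimates bring the total to just under 9 million, consistent with the guess.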
A few sites I have worked on that are run by MKDoc are listed in their top 500, since MKDoc generates an RDF metadata file for every HTML document, but the biggest and most interesting are missing. I expect that there are perhaps several hundred times more RDF documents out there than they have found...
Check out MKDoc a mod_perl CMS
How's censorware.org doing?
* Slashdot editors are abusive. We all remember The Post.
Anyone know what he's talking about here?
Any company ending in AA is evil. Especially if it doesn't want you distributing its works without paying for it. Somehow, this mindset is supposed to make sense.
That's harsh, man! I have nothing against the AAA. Why, just last week they came and changed my tire when I had a flat and was without a spare.
Anyway... did you say anything else, or was that pretty much it? Oh, yeah! Almost forgot:
Slashdot is dead.
Wow! That's pretty big news. But has Netcraft confirmed this???
Don't worry, I'm laughing with you, not at you. No... no, really.
From the Google TOS: You may not send automated queries of any sort to Google's system without express permission in advance from Google.
I am serious. These researchers just used a lot of resources from Google that they had no permission to use. Researchers especially should try to be good citizens on the net and not do tons of automated querying to websites without permission, especially when it is specifically prohibited.
Google has spent a lot of time and money to get the information that they wanted; and when asked for copies of it Google didn't give it to them, so instead they just took it without permission.
I would call that stealing, except I won't because that will start a whole other thread telling me that information cannot be stolen.
My point is, if you want to do research, at least play by the rules that you are given. It may take longer and require more work, but that seems better than using information that you don't have permission to use.
No, you should offer to attempt to "repair" it yourself then do what LMA suggests YOURSELF. If this works, charge him $$$. Of course, I don't know enough about plasma screens to know if it would really work or not. You might try pausing on an all-white screen.
That's about 0.005% of the web. We've got a ways to go.
I dunno about you, but I'm not going to do this to any of my data, unless I'm forced to (i.e., my editor saves it that way, or Firefox 5.0 doesn't read it otherwise).
So don't hold yer breath.
So you don't deny that they are "money-gouging, price-fixing cartel" then? I guess we're in agreement then ...
You're on.
1) A simple human- and machine-readable schema is defined for marking up descriptions of items for sale or wanted.
2) Google learns how to read them, thereby putting eBay, Craigslist, and other sundry companies out of business and putting your data back in your hands.
Okay, so the second sentence is a bit of a run-on, and this use case has a whole lot of hairy details I'm leaving out. But the possibilities are pretty exciting nonetheless.
If you don't pretend to be anyone, are you?
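The "simple human- and machine-readable schema" for items for sale or wanted, proposed a few lines up, is left undefined in the comment, so the following is only a guess at what a bare-bones version might look like. Every field name, the `Listing` class, and the choice of JSON as the markup are assumptions made for illustration. The hairy details the commenter mentions (trust, identity, payments) are not addressed.

```python
import json
from dataclasses import asdict, dataclass

ALLOWED_KINDS = ("for_sale", "wanted")


@dataclass
class Listing:
    """Hypothetical minimal description of an item for sale or wanted."""

    kind: str       # "for_sale" or "wanted"
    title: str
    price: float
    currency: str
    location: str

    def __post_init__(self):
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"kind must be one of {ALLOWED_KINDS}")


def to_markup(listing: Listing) -> str:
    """Serialize a listing to text that people and crawlers can both read."""
    return json.dumps(asdict(listing), indent=2, sort_keys=True)


def from_markup(text: str) -> Listing:
    """Parse the text form back into a Listing, as a crawler might."""
    return Listing(**json.loads(text))


if __name__ == "__main__":
    bike = Listing("for_sale", "Used road bike", 150.0, "USD", "Baltimore, MD")
    markup = to_markup(bike)
    print(markup)
    print(from_markup(markup) == bike)
```

The point of the sketch is only that the publisher keeps the data on their own page and any search engine that understands the format can index it.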
Apart from RSS feeds, how can I use this data? I mean, I have RDF metadata available for pretty much every page on my website, but I haven't yet noticed anyone who actually reads it.
The semantic web seems like a good idea in principle, but I would really like to know just how I could use it in real life! Seriously, can anyone name a useful tool that relies on RDF feeds (again, aside from RSS-style stuff) or propose one that could? Perhaps if I saw a real application of the semantic web I would actually understand what RDF is actually all about.
He is talking about this comment.
There is additional background information and historical perspective available at the following sites:
Sllort's journal
Kuro5hin article
Hear recorded Slashdot headlines on your phone! New service beta testing. Just call (248) 434-5508
I think the "Semantic Web" sounds great on paper, and is the next big thing in university research departments and so on, BUT I don't think it's going to end up seeing wide use. Here are my reasons, basically a list of things that I as a web developer would hesitate on.
1. The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met. I can submit the main web page to search engines, prevent the rest from being indexed, figure out how to advertise my page's existence... I'm pretty much set. The extra stuff doesn't buy me anything. In fact, I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.
2. Let's say people start using this tech, which I imagine would involve all sorts of extra tagging in pages, extra metadata, etc. Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow. On top of that, you have to trust the tool vendors to write bug-free code, which isn't going to happen. What I'm saying is that all these extra layers of complexity are places for bugs, screw-ups, and booby traps to hide.
3. And, the real beneficiary of these sorts of systems seems to be the tool vendors themselves. Because what this REALLY seems to be about is software vendors figuring out a new thing they can charge money for. Don't write those web pages using HTML, XML, and such! No, code them up with our special sauce, and use our special toolset to bake them into buttery goodness! Suddenly, you're not just writing HTML, you're going through a whole development process for the simplest of web pages.
Maybe I'm getting crusty in my old age, but it seems that every single year, some guy comes up with some new layer of complexity that we all "must have". It's never enough for a technology to simply work with no muss and no fuss. Nothing must ever be left alone! We must change everything every year or two! Because otherwise, what would college kids do with their excess energy, eh?
Sigh... Anyway, no matter what you try and do to prevent the Semantic Web from turning out just like meta tags, the inevitable will happen. You watch.
Farewell! It's been a fine buncha years!
The Google TOS you are talking about is for the Google website. We used the Google web service API; please read the Google API TOS.
The Google API was built to allow automated queries, so we were not "violating" the TOS.
So I think it is wrong on your part to comment on someone without having the full information. Of course it may take longer and require more work, but that seems better than using wrong information.
Any news on that down there?