On Finding Semantic Web Documents
Anonymous Coward writes "A research group at University of Maryland has published a blog describing the latest approach for finding and indexing Semantic Web Documents. They have published it in reaction to Peter Norvig's (director of search quality at Google) view on the Semantic Web (Semantic Web Ontologies: What Works and What Doesn't): 'A friend of mine [from UMBC] just asked can I send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple other extensions on them; he couldn't find them all. I looked, and it turns out there's only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go.'"
This has to be one of the funniest things I've ever read.
....
I used to work with this guy until he found a better job making about three times what he used to. With his first paycheck he got the latest Plasma TV and started to rub it in about how cool it is etc. I received the following email from him today.
-----
----- Original Message -----
From: xxx
To: xxx ; xxx
Sent: Friday, January 14, 2005 1:32 PM
Subject: My PLASMA TV
You fools need to listen to dis....
Your peeps this high roller right...... so he gets himself a
plasma screen. So hes watching regular cable on it and then
he gets an idea how cool would it be to hook up the computer
and stream porn on it. So i get the gear that will allow the
computer to display the shiiiiiiiiiat on the tv.
Dawgs listen to this shit... Im watching milf on it and boy was
i enjoying it when my dad picks up the right time to call me
and touch base with me abt the wedding. So put the damn thing
on pause and start talking to him. 10 mins... 20 mins.. 45 mins im
still talking to my homie on the phone. So I hang up on him after
talking to him for an hour. Now when I come back and see the tv
is still on pause... so i start seeing it. Just when i thought i had
enough of hunter getting on some hillbilly... im switching between
inputs i realized that the image where i paused is burnt on the tv.
Dude its so embarrasing... Ive called in for a replacement but
im scared the guys would turn this shit on find an imprint of
hunter in a doggy position.
THIS SHIT AINT FAIR....
xxxx
It's not about the filename extension (if any), silly. It's about the data. Valid RDF data may be stored in files with a wide range of extensions, or even (how radical is this?) generated on the fly.
What matters is first the mime type (which is most likely application/xml or preferably text/xml), and the data in it.
Oh, and, First Post, BTW.
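The "look at the data, not the extension" point above can be sketched roughly as follows. This is only an illustration: the function name `looks_like_rdf` and the set of accepted MIME types are assumptions, not anything the UMBC group or Google is known to use. `application/rdf+xml` is the registered media type for RDF/XML; the two generic XML types are the ones mentioned in the comment.

```python
import xml.etree.ElementTree as ET

# Namespace that every RDF/XML document uses for its rdf:RDF root element.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: MIME types we are willing to inspect further.
RDF_MIME_TYPES = {"application/rdf+xml", "application/xml", "text/xml"}


def looks_like_rdf(body: bytes, content_type: str = "") -> bool:
    """Guess whether a response is RDF/XML from its MIME type and content."""
    mime = content_type.split(";")[0].strip().lower()
    # If the server declared a type and it is not XML-ish, give up early.
    if mime and mime not in RDF_MIME_TYPES:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # The filename never enters into it: only the root element matters.
    return root.tag == f"{{{RDF_NS}}}RDF"


sample = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"></rdf:RDF>'
print(looks_like_rdf(sample, "text/xml; charset=utf-8"))
```

Note that this sketch deliberately rejects `text/html`, so RDF embedded inside HTML pages would need separate handling.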
I'm old enough to remember when discussions on Slashdot were well informed.
I used to love their Norton Utilities.
What about all the pages that are .rss but are actually rss 1.0, those are rdf-based. And what about all the rdf which is in the comments of .html files and others? My creative commons license is rdf, but it's inside a .html file. Sure, we do have a long ways to go, but the semantic web is bigger than a few file extensions findable by google.
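For the RDF-in-HTML-comments case, here is a rough sketch of how a crawler might dig those blocks out. It assumes the block is written with the literal `rdf:` prefix inside an HTML comment, which is how Creative Commons license metadata was commonly embedded; the function name is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# HTML comments, and rdf:RDF blocks inside them (assumes the "rdf:" prefix).
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
RDF_BLOCK_RE = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def rdf_blocks_in_html_comments(html: str) -> list:
    """Return every well-formed rdf:RDF block hidden in an HTML comment."""
    blocks = []
    for comment in COMMENT_RE.findall(html):
        for candidate in RDF_BLOCK_RE.findall(comment):
            try:
                root = ET.fromstring(candidate)
            except ET.ParseError:
                continue  # not well-formed XML, skip it
            if root.tag == f"{{{RDF_NS}}}RDF":
                blocks.append(candidate)
    return blocks


page = (
    "<html><!-- "
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about="http://example.org/"/>'
    "</rdf:RDF>"
    " --><body>hi</body></html>"
)
print(len(rdf_blocks_in_html_comments(page)))
```

None of these documents would show up in a search restricted to .rdf or .owl extensions.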
The GeekNights podcast is going strong. Listen!
What's all this about finding Semetic web documents... Oh... Never mind.
Without a large number of widely used tools out there that make use of semantic information there won't be that much content designed for them...and without content designed for them the tools won't exist and certainly won't be widely used. Currently it's more of an academic exercise - if we somehow knew what all this information on the web actually was, what could we do with it? More interesting, it seems, are approaches that bypass marking things up by hand and do something equivalent automatically.
Semantic web stuff is cool and all but I honestly don't believe that it will ever really take off in any meaningful way. For one, it takes a paradigm that people know and understand and adds a lot of complexity to it, both on the user end and the engineering end.
Plus a lot of the rah-rah booster club that's grown up around it sound a whole lot like the Royal Society folks in Quicksilver who keep trying to catalog everything in the world into a 'natural' organization.
What it basically comes down to for me is that it seems like a great framework for single-topic information organization but at a point we need to keep our focus on the actual content of what we're producing more than the packaging. For this to be ready for prime time the value proposition needs to move from a 30-minute explanation involving diagrams and made-up words ending in '-sphere' to something even less than an "elevator pitch" like 2 sentences.
If you agree with any of this, feel free to repost it in the future.
* If you expect companies to follow the copyright of the GPL, you should support the RIAA going after infringers of its copyright. If not, you're a hypocrite.
* There is absolutely nothing wrong with a company being upset that its product is being pirated freely over online networks. A recent Slashdot poll showed that the majority of Slashdotters are unemployed or are students ("academics"), which explains a lot. Try getting a real job sometime and see what it feels like when your work is everywhere, and you start worrying that your days are numbered. Does John Carmack want you to "sample" his new game via the "free advertising" happening on eMule?
* Artists "deserve their money" only in cases in which the RIAA is the bad guy. When it's a P2P article, suddenly ripping artists off and not paying for their music via piracy is magically different from some record companies not paying royalties. This mindset is supposed to make sense.
* At the 2004 WinHEC, Allchin demonstrated an alpha version of Longhorn that played six high-resolution videos at the same time while playing Quake III in the background. An equivalent XP machine couldn't play more than four videos. Meanwhile, I can't even get xmms to play without skipping, and windows to drag without visual tearing! That's because KDE and GNOME are hacks to emulate a desktop on top of the crufty XFree86 architecture that people won't let die (the majority of Linux users absolutely fear change...there are rational ones, but they are outnumbered by zealots).
* OSTG-owned Slashdot thinks its niche opinion represents the majority of the world. This is a result of people visiting every day and buying into the groupthink. Nobody outside of Slashdot knows or cares about "Linux," "RIAA", "M$," or anything else Slashdotters think is such a huge issue in today's society. Go to a mall or coffee shop sometime and see what people actually talk about.
* Speaking of OSTG--it's a Linux company...that owns a "tech news" site...that posts news stories negative toward competitors like Microsoft. If a Windows company or even Microsoft itself owned a "tech news" site and posted anti-Linux articles all the time, everyone would be up in arms. But with OSTG, it's okay.
* Slashbots think people don't like the music coming out these days, which is the cause of the piracy. Never mind that if people didn't like the music they wouldn't be pirating it, most Slashbots--again, this goes back to the niche opinion thing--don't realize that most people these days love the music coming out and want to hear all of it. Probing around, you discover that Slashdot is made up of nerds and fogies who listen to things like The Who and Blind Guardian and techno--not what mainstream society enjoys.
* Any company ending in "AA" is evil. Especially if it doesn't want you distributing its works without paying for it. Somehow, this mindset is supposed to make sense.
* The inevitable result of all this is a world in which nothing can be profitable because people simply pirate free copies. Is that really what Slashbots want? OSS and free-ness in general reminds me of the hippie era of the 60s--idealistic socialism that only exists because of the surrounding capitalism around it that provides the environment for it to exist. We all know what happened to that idea.
* Linux rules the desktop, when in reality: Windows = 91%; Mac = 4%; Linux = 1%
* Slashdot editors are abusive. We all remember The Post. It's amusing the editors never mention the issue. The worst editor is michael, who will mod you down, insult you for your post count, and post unprofessional color commentary along with the article. This is the same bizarre person who cybersquatted Censorware for years--even as Slashdot posted articles negative toward cybersquatting! Michael played it off as though he was a stalking victim, which made it all the more biz
Who cares about the semantic web or any new web technology if it's going to be deluged by spam within 5 days of deciding to use it, and thus become unusable / untrustworthy as a resource? Deal with the spam problem, then come back to me about these great new technologies that are vulnerable to it.
Caesar si viveret, ad remum dareris.
Manual for the Modern Slashdotter
Golden Rule: You must base your worldview entirely on Slashdot headlines. You must ignore the inaccuracy and editorial shortcomings of the Slashdot staff. You must buy into the groupthink of the comment threads. This is of UTMOST IMPORTANCE.
- Post the lamest, most obvious, and most unfunny jokes imaginable. They will be modded up "+5 Funny." Even Malda couldn't stand it any longer and made Funny mods not count toward karma.
- Everything involving Linux is flawless and perfect.
- Anything involving Mozilla is flawless and perfect. Ignore that Mozilla marks security flaws as "confidential" and keeps them secret. Ignore that this is something Microsoft is endlessly bashed for. Ignore that Firefox has had several severe security flaws, especially for a browser used by so little of the market (1% according to Google Zeitgeist).
- Whenever someone has a criticism of the current moderation system, refer to Taco's "future moderation system."
- You must lean left. You must obsess over George W. Bush and make Bush jokes whenever possible, no matter how irrelevant to the topic. In political articles, you must upmod anti-Bush comments and downmod independent or pro-Bush comments. Use the "Overrated" moderator whenever possible. Remember, Taco is going to fix this in "the future moderation system."
- Use the term "FUD" religiously in everyday conversation. When someone puts out something that disagrees with your worldview, call it FUD matter-of-factly as a way to dismiss the points it raises. Demonization is far easier than debating the issues.
- Whenever Linus Torvalds says anything, it is newsworthy and infallible. Linus is perfect, just so practical and is the "Alpha-Geek." Linus does not make mistakes. Basically, you must behave as though you are in love with Linus Torvalds. When he says he doesn't bother looking at the source code of competitors like Solaris because he's not interested, herald it as the "wonderful attitude of Linus" even though such a comment coming from a Microsoft employee would get flamed as an example of their arrogance and closed-minded attitude. When giant kernel holes go unpatched, ignore it and continue to suck the teat of the Linus Torvalds hype machine like a good sheep should.
- Believe articles like "Microsoft Violates Human Rights In China," based entirely on the idea that Microsoft is evil because Windows is used by the government there. Ignore the fact that China has its own custom Linux distribution called Red Flag Linux. Slashdot is unbiased and holy.
- Ignore that Slashdot is corporate-owned, by a company called OSTG that employs Rob Malda and makes money off selling OSS products. Ignore the conflict of interests in running a "tech news" site that coincidentally posts articles critical of competitors. Ignore that if Microsoft owned a tech news site that did the same, it would be criticized for it.
- Pretend that Linux is ready for the desktop, even though it took you two hours to set up your soundcard, mouse scroll wheel, and 3D card. Ignore that the real reason you refuse to acknowledge that Linux sucks on the desktop is because you don't want to diminish your sense of accomplishment in getting it up and running. Make sure to confuse this sense of accomplishment with the feeling that you have "more control" in a Linux system compared to a Windows system.
- Pretend there's nothing wrong with endless submissions accepted from Roland Piquepaille, who makes several thousands a month thanks to Slashdot's linking to his blog which links to the original article--rather than Slashdot just linking to the original article and cutting out the pointless middle-man. It's okay for Malda to shrug it off as though Slashdot should never consider ethics or morals.
- Pretend there's nothing wrong with Michael cybersquatting Censorware.org, even though Slashdot champions itself as the voice of online rights, anti-spa
Is a bastard on wheels and he spreads chicken shit over himself!
The Linux Revolution Is Dying
In light of the disastrous 2.6 development model that has given sysadmins everywhere a headache by introducing development code into a production line, Linux has signed its own death knell. With more and more people looking to alternatives like FreeBSD 5.x, OS X, and DragonflyBSD, Linux is slowly shovelling the dirt beneath its feet to dig its own grave.
Linux And Windows
Quite simply, the revolution against Windows has run out of steam. While Linux was a viable alternative in the days of Windows 98, when the rallying cry of geeks everywhere was "Down with M$, Linux never crashes," we now have the majority of the Windows userbase running NT-based operating systems. Except in cases of hardware or driver issues, reliability is no longer an issue in the comparison between Linux and Windows.
Eventually, the movement became one of security. In the years after its release, Windows XP was discovered to have several high-profile security flaws. Microsoft underwent a major code audit and released SP2. The rallying cry for OSS was now about security.
However, the community has discovered major flaws in the Mozilla software suite, including bugs marked "confidential" for years at a time. Additionally, major security holes have been appearing in the 2.6 line of Linux kernels, some having existed for years and affecting the 2.4 line. Declaring Linux to be the secure alternative is no longer as true.
Worst of all, the Linux kernel developers have no clear process, nor any clear contact person, when it comes to security issues.
Evidence: http://lwn.net/Articles/118251/
Evidence: Long-time shell-provider SDF used Linux until they got hacked into. Now, it's a 64-bit version of NetBSD.
Evidence: PaX discovered the mlockall hole. It was fixed in PaX for two years. Linux just now (2005) caught up.
Evidence: "Using 'advanced static analysis': 'cd drivers; grep copy_from_user -r ./* | grep -v sizeof', I discovered 4 exploitable vulnerabilities in a matter of 15 minutes. More vulnerabilities were found in 2.6 than in 2.4. It's a pretty sad state of affairs for Linux security when someone can find 4 exploitable vulnerabilities in a matter of minutes." - Brad Spengler
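For readers unfamiliar with the grep pipeline in the Spengler quote: it lists every line under the drivers directory that mentions copy_from_user and then throws away the ones that also mention sizeof, leaving calls whose length argument may come from somewhere less trustworthy. A rough Python approximation is sketched below; the function names are made up, and like the original one-liner it is only a crude filter that produces candidates to review, not proof of a vulnerability.

```python
from pathlib import Path


def flag_lines(lines):
    """Return (line number, text) for copy_from_user lines without sizeof."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if "copy_from_user" in line and "sizeof" not in line:
            hits.append((lineno, line.strip()))
    return hits


def scan_tree(root):
    """Apply flag_lines to every regular file below root."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file, skip it
        hits = flag_lines(text.splitlines())
        if hits:
            results[str(path)] = hits
    return results


sample = [
    "copy_from_user(&req, arg, sizeof(req));",
    "copy_from_user(buf, arg, len);",
    "return 0;",
]
print(flag_lines(sample))
```

Each flagged line still has to be read by a person to see whether the length is actually attacker-controlled.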
The New Linux Development Model
With the 2.6 line of kernels, a new model has been adopted that is considered easier for the kernel developers. Instead of branching a 2.7 line, following the model of odd-numbered version numbers denoting development code, everything is now being thrown into 2.6.
"Not all 2.6.x kernels will be good; but if we do releases every 1 or 2 weeks, some of them *will* be good. The problem with the -rc releases is that we try to predict in advance which releases in advance will be stable, and we don't seem to be able to do a good job of that. If we do a release every week, my guess is that at least 1 in 3 releases will turn out to be stable enough for most purposes. But we won't know until after 2 or 3 days which releases will be the good ones." -- Ted Ts'o
In other words, this Linux kernel developer believes it is perfectly fine for one in three kernels of the stable line to actually be stable. The new development process is anti-user. "Release early, release often" has outlived its reliability and applicability to the real world.
The excuse given is that Linus is only one man, and there are only 24 hours in a day. If that is true, then Linus needs to address this shortcoming of the process; otherwise, the process is poorly managed.
The Community Has Regurgitated Itself
In a frenzy of newbies, the Linux community has grown, with Slashdot as its rallying center. The cycle of self-feeding groupthink has created a userbase unable to see outside its own perceptions. This leads to unrealistic attitudes about the safety and stability of Linux and its applicability to various solutions.
Contrast to the BSD community which employs a more academic approach.
That the 'net has to be segregated into semantic and non-semantic.
Haven't the Jewish people been through enough without this digital persecution?
More proof that michael's a Nazi.
Disgusting.
I don't need no instructions to know how to rock!!!!
I thought I knew what these articles were supposed to be talking about, but it turns out I had no clue.
Thinkin' Lincoln - a web comic of presidential proportions
Norton Antivirus got to do with this web technology?
I think Google should not spend time finding anything anti-Semantic.
Every user of a LiveJournal-based website running recent code has a FOAF file. Let's look how many users that is:
* LiveJournal.com: 5751567
* GreatestJournal.com: 717406
* DeadJournal.com: 474435
* Weedweb.net: 22650
* InsaneJournal.com: 12970
* JournalFen.net: 7629
* Plogs.net: 7086
* journal.bad.lv: 4530
(This list is most likely incomplete.)
In addition to this, every Typepad user has an account: according to the 6A merger stories, that's another million users. Add in the RDF from all the Typepad RSS files, and that's another 1 million.
All Wordpress blogs have a feed, located at /feed/rdf or /wp-rdf.php, which is in RDF. Movable Type comes preinstalled with an RSS 1.0 feed. Each of these has at least a couple thousand users.
So, we've got, just as a guess, about 9 million RDF files out there in the blogging world alone. Throw in a hell of a lot of scientific data, and everything on RDFdata.org, and you start to get an idea that the world is a lot more Semantic Web enabled than you seem to think it is.
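For anyone who wants to check the "about 9 million" guess, the arithmetic can be worked through as below. The per-site counts are the ones listed above; treating each user as exactly one FOAF file, and taking the two Typepad figures as a flat million each, are simplifying assumptions. The WordPress and Movable Type feeds are left out because no firm number is given for them. The last two figures use Norvig's quoted 200,000 files at 0.005% of the web to back out the web size he implies.

```python
from fractions import Fraction

# User counts from the list above (assumed: one FOAF file per user).
foaf_users = {
    "LiveJournal.com": 5_751_567,
    "GreatestJournal.com": 717_406,
    "DeadJournal.com": 474_435,
    "Weedweb.net": 22_650,
    "InsaneJournal.com": 12_970,
    "JournalFen.net": 7_629,
    "Plogs.net": 7_086,
    "journal.bad.lv": 4_530,
}
typepad_foaf = 1_000_000  # "another million users"
typepad_rss = 1_000_000   # "another 1 million" RSS files

livejournal_family = sum(foaf_users.values())
blog_total = livejournal_family + typepad_foaf + typepad_rss

# Norvig's figures from the story: 200,000 files is 0.005% of the web.
google_count = 200_000
google_share = Fraction(5, 100_000)
implied_web_size = google_count / google_share
blog_share_percent = float(blog_total / implied_web_size * 100)

print(f"LiveJournal-based sites: {livejournal_family:,}")   # 6,998,273
print(f"Blog RDF files (guess):  {blog_total:,}")           # 8,998,273
print(f"Implied web size:        {int(implied_web_size):,}")  # 4,000,000,000
print(f"Blog share of the web:   {blog_share_percent:.3f}%")  # 0.225%
```

So even on these rough numbers, blog-generated RDF alone would be around 45 times the figure found by searching on file extensions.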
-- Christopher Schmidt YouTube Quality of Experience
A few sites I have worked on that are run by MKDoc are listed in their top 500, since MKDoc generates an RDF metadata file for every HTML document, but the biggest and most interesting are missing. I expect that there are perhaps several hundred times more RDF documents out there than they have found...
Check out MKDoc a mod_perl CMS
How's censorware.org doing?
* Slashdot editors are abusive. We all remember The Post.
Anyone know what he's talking about here?
From the Google TOS: You may not send automated queries of any sort to Google's system without express permission in advance from Google.
I am serious. These researchers just used a lot of resources from Google that they had no permission to use. Researchers especially should try to be good citizens on the net and not do tons of automated querying to websites without permission--especially when it is specifically prohibited.
Google has spent a lot of time and money to get the information that they wanted; and when asked for copies of it Google didn't give it to them--so instead they just took it without permission.
I would call that stealing, except I won't because that will start a whole other thread telling me that information cannot be stolen.
My point is, if you want to do research, at least play by the rules that you are given. It may take longer and require more work, but that seems better than using information that you don't have permission to use.
That's about 0.005% of the web. We've got a ways to go.
I dunno about you, but I'm not going to do this to any of my data, unless I'm forced to (i.e., my editor saves it that way, or Firefox 5.0 doesn't read it otherwise).
So don't hold yer breath.
to deliVer what, host what the house of prograaming
You're on.
1) A simple human- and machine-readable schema is defined for marking up descriptions of items for sale or wanted.
2) Google learns how to read them, thereby putting eBay, Craigslist, and other sundry companies out of business and putting your data back in your hands.
Okay, so the second sentence is a bit of a run-on, and this use case has a whole lot of hairy details I'm leaving out. But the possibilities are pretty exciting nonetheless.
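To make step 1 a little more concrete, here is a sketch of what a machine-readable "for sale" description could look like when written out as RDF triples in N-Triples form. The vocabulary under `http://example.org/classifieds#` and the `Listing` fields are entirely hypothetical; no such schema is defined in this thread.

```python
from dataclasses import dataclass

# Hypothetical vocabulary, invented purely for this illustration.
EX = "http://example.org/classifieds#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Listing:
    url: str       # page describing the item
    kind: str      # "ForSale" or "Wanted"
    title: str
    price: str
    currency: str


def _literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(item: Listing) -> str:
    """Serialize one listing as four N-Triples statements."""
    subject = f"<{item.url}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{EX}{item.kind}> .",
        f"{subject} <{EX}title> {_literal(item.title)} .",
        f"{subject} <{EX}price> {_literal(item.price)} .",
        f"{subject} <{EX}currency> {_literal(item.currency)} .",
    ]
    return "\n".join(lines)


ad = Listing("http://example.org/ads/1", "ForSale", 'Used 17" monitor', "40", "USD")
print(to_ntriples(ad))
```

A crawler that understood the vocabulary could then index the listing from the seller's own page, which is the point of step 2. The hairy details (trust, spam, agreeing on the vocabulary) are exactly the parts this sketch skips.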
If you don't pretend to be anyone, are you?
Apart from RSS feeds, how can I use this data? I mean, I have RDF metadata available for pretty much every page on my website, but I haven't yet noticed anyone who actually reads it.
The semantic web seems like a good idea in principle, but I would really like to know just how I could use it in real life! Seriously, can anyone name a useful tool that relies on RDF feeds (again, aside from RSS-style stuff) or propose one that could? Perhaps if I saw a real application of the semantic web I would actually understand what RDF is actually all about.
He is talking about this comment.
There is additional background information and historical perspective available at the following sites:
Sllort's journal
Kuro5hin article
Hear recorded Slashdot headlines on your phone! New service beta testing. Just call (248) 434-5508
I think the "Semantic Web" sounds great on paper, and is the next big thing in university research departments and etc, etc, BUT I don't think it's going to end up seeing wide use. Here are my reasons, basically a list of things that I as a web developer would hesitate on.
1. The Semantic web seems to require a lot of extra complexity without much "bang for my buck". If I build a page normally, all my needs are already met. I can submit the main web page to search engines, prevent the rest from being indexed, figure out how to advertise my page's existence... I'm pretty much set. The extra stuff doesn't buy me anything. In fact, I definitely would NOT want people being able to find information on my site without going through my standard user interface. I WANT them to come in through the front door and ask for it.
2. Let's say people start using this tech, which I imagine would involve all sorts of extra tagging in pages, extra metadata, etc. Now you have to trust people to A) actually know what they're doing and set things up properly, which is a long shot at best, and B) not try to game the system somehow. On top of that, you have to trust the tool vendors to write bug-free code, which isn't going to happen. What I'm saying is that all these extra layers of complexity are places for bugs, screw-ups, and booby traps to hide.
3. And, the real beneficiary of these sorts of systems seems to be the tool vendors themselves. Because what this REALLY seems to be about is software vendors figuring out a new thing they can charge money for. Don't write those web pages using HTML, XML, and such! No, code them up with our special sauce, and use our special toolset to bake them into buttery goodness! Suddenly, you're not just writing HTML, you're going through a whole development process for the simplest of web pages.
Maybe I'm getting crusty in my old age, but it seems that every single year, some guy comes up with some new layer of complexity that we all "must have". It's never enough for a technology to simply work with no muss and no fuss. Nothing must ever be left alone! We must change everything every year or two! Because otherwise, what would college kids do with their excess energy, eh?
Sigh... Anyway, no matter what you try and do to prevent the Semantic Web from turning out just like meta tags, the inevitable will happen. You watch.
Farewell! It's been a fine buncha years!
The Google TOS you are talking about is for the Google website. We used the Google web service API; please read the Google API TOS.
The Google API was built to allow automated queries, so we were not "violating" the TOS.
So I think it is wrong on your part to comment on someone without having the full information. Of course it may take longer and require more work, but that seems better than using wrong information.
Any news on that down there?