WebQL Turns the Web Into A Giant Database
An anonymous reader says "This article was posted on ZDNet by Bill Machrone on a new type of query language for aggregating information from the Web." Somewhat light on the details, but definitely something to think about.
Remember, remote content is not under your control. It will change (often) and is very very likely to not have a nice structure, and is even more likely to contain mismatched tags and other errors.
OK, it's in its infancy, but IMHO if/when XHTML is widely adopted, a special query language or tool will be largely irrelevant, because most of what is alleged in that brief article could be done in the magical wonderful world of XML.
The breakthru is that you notice almost everyone (those significant enough) has a web frontend to their database. Now if you can just go via that web frontend, you don't have to go direct to the database and can bypass all the above issues!
The first company that I know of that does this is an Israeli company: Orsus.com. Since then, OnePage.com also does it.
What they (at least Orsus) did was build a language (based on XML) that instructs a web spidering engine that has the ability to parse HTML and Javascript. A GUI IDE (no less) is used by the lay person to write the XML-based code.
Never use select * from... for production queries if you can help it. It's bad style. If you change your schema to include more columns, you can wind up returning more data to the front end than you need to. This causes errors in display at worst and wastes bandwidth at best.
If you have different developers working on the front end and the database, this will really make them hate each other. It also makes the query optimizer work harder than it needs to (the amount of CPU wasted this way is totally insignificant, but it's bad form anyway).
Also, if you're going to run select * from internet without a where clause, be prepared for an extremely long running query.
--Shoeboy
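The point above about select * and schema changes can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the `users` table, its columns, and the helper names are made up for the example and come from nowhere in the thread.

```python
import sqlite3

# In-memory database with a hypothetical two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")


def fetch_star(conn):
    # Returns whatever columns the table happens to have today.
    return conn.execute("SELECT * FROM users").fetchall()


def fetch_named(conn):
    # Returns only the columns the front end asked for.
    return conn.execute("SELECT id, name FROM users").fetchall()


before_star = fetch_star(conn)
before_named = fetch_named(conn)

# Someone on the database side adds a column later.
conn.execute("ALTER TABLE users ADD COLUMN password_hash TEXT")

after_star = fetch_star(conn)    # rows now carry an extra field
after_named = fetch_named(conn)  # row shape is unchanged
```

A front end that unpacks two values per row keeps working with the named-column query and breaks (or leaks a column it never wanted) with the select * one.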
ROTFLMAO!!!!!!!
Good god! An "Al Gore invented the internet" joke is combined with a "stupid patent idea" joke! The originality of the average slashbot never ceases to amaze me! You should send some of your jokes to Illiad so he can put them in User Friendly!
--Shoeboy
How will this be different from Google's back-end query interface? I ask because I can't imagine someone making a "screen-scraping" search engine that returns bits of data and not just a link. They will probably get sued by the owners of the purloined content. Plus, parsing HTML to extract one little field of data is tricky, and highly dependent on the layout of the page. I've written a number of things to do just that, from Amazon, IMDB, Borders, finance.Yahoo.com, etc., for my own purposes. I wrote them in both C and Perl. It's a job keeping the filters updated to accommodate the changes in page layout style, regardless of language. Good luck to them and all, but until we have an XML + XSL web, with standard DTDs for the XML, forget it.
________________________________________
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
WQL - wackle
SQL - squeal
If tits were wings it'd be flying around.
The appropriate response to an ASK is a NASK.
I downloaded the WebQL Business Edition manual. Here's an abbreviated version of the first example query:
select
text("(\(206\)\s+\d{3}-\d{4})","","","T")
from
http://foo/bar.html
where
approach=sequence("1","10","1","XX")
The select clause accepts a variety of functions, of which text() seems to be the most useful. You can see that the first argument is a regex designed to match phone numbers. The from clause is an URL. The where clause primarily takes the approach "descriptor," which can crawl or guess new URLs.
So basically, it doesn't do anything a Perl script can't. It just presents a simpler interface. ;-)
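For comparison, here is a rough sketch of the scripted equivalent of that query's select clause, in Python rather than Perl. It only reuses the regex from the example; the other arguments to text() and the approach=sequence(...) crawling are not modeled because their semantics aren't given here, and the sample page and the `extract_phones` helper are invented for illustration. Fetching the page (for instance with urllib.request) would be a separate step.

```python
import re

# Same pattern as the first argument to text() in the example query.
PHONE_RE = re.compile(r"(\(206\)\s+\d{3}-\d{4})")


def extract_phones(page_text):
    """Return every (206) phone number found in the page text."""
    return PHONE_RE.findall(page_text)


# Hypothetical page content standing in for http://foo/bar.html.
sample_page = """
<html><body>
<p>Front desk: (206) 555-0100</p>
<p>Fax: (425) 555-0199</p>
<p>Sales: (206)  555-0142</p>
</body></html>
"""

phones = extract_phones(sample_page)
```

Only the two (206) numbers match; the (425) one is skipped, and the extra space in the last number is absorbed by `\s+`.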
I wouldn't exactly call that a simple interface!
cpeterso
For my thesis, I created a Web query system called ParaSite. The best introduction is the paper Squeal: A Structured Query Language for the Web, which I presented at the World-Wide Web Conference. Anybody is welcome to use my code, algorithms, or ideas.
See also WebSQL and W3QL, which also come from academia.
is a link to the ZD review of the Biz version.
meow
Diffs between this and Google, for instance, abound. Central is the fact that it's not limited to URLs.
"Version 1.0 of WebQL uses a wizard to simplify writing queries, but only users with SQL experience will be able to create useful queries. (Ordinary-language queries will be supported in future versions.) The wizard lets you select whether to return text, URLS, table rows or columns, or any combination thereof. You can then specify to search for text, regular expressions, or table cells, and you can add refinements such as case sensitivity and the number of matches returned per page."
I will buy this when it supports ordinary language queries.
Through its access to directories, will this thing allow you to bypass registrations on all sites? Pay sites?
How about an image search? (Since people don't name their files informatively all the time..)
Goat sex free since 2001
especially with the web running at well over a billion pages by now. Just think of the time to query a billion pages all around the planet, never mind on a small business line, with, say, a DSL line (forget modem!)
but then I don't get the big bucks for this either....
"It is a greater offense to steal men's labor, than their clothes"
>drop table internet;
OK, 135454265363565609860398636678346496 rows affected.
"oh fuck"
FluX
After 16 years, MTV has finally completed its deevolution into the shiny things network
"It is seldom that liberty of any kind is lost all at once." -David Hume
The Semantic Web Page is a good starting point.
TBL's personal notes is another one. Probably the best one, actually.
"The Semantic Web" is a term coined by Tim Berners-Lee (we all know who that is, don't we?) to describe a WWW-like global knowledge base, which when combined with some simple logic forms a really interesting knowledge representation (KR) system. His thesis is that early hypertext systems died of too much structure limiting scalability, and current KR systems (like CYC) have largely failed for similar reasons. The Semantic Web is an attempt to do KR in a web-like way.
This really could be the next major leap in the evolution of the web. Do yourself a favour and check it out. And it's not based on hacks for screen-scraping HTML, it's based on real KR infrastructure.
With simple techniques (no learning or any kind of intelligence), such as regular expressions to extract or label contents from web pages, you can't expect good coverage of pages written in all kinds of templates and with so many types of errors.
Right now I'm writing a Java program to extract links from Google search results (easy, don't shoot! Academic use only). What I'm using is OROMatcher, one of the best regular expression packages for Java. I'd say it's still mission impossible to get 100% recall and be error-free even for this simple task.
The formal name of such a program (labelling and extracting contents) is a "wrapper". Probably the only way to improve the efficiency of a wrapper is to apply machine learning techniques. A well-trained wrapper program with a good learning algorithm could be smart enough to adapt to HTML coding formats with small variances. A good example is in this paper.
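To make the link-extraction task above concrete, here is a minimal hand-written wrapper sketch. It is not a learned wrapper and not the OROMatcher approach: it swaps in Python's standard html.parser, which tolerates unclosed and mismatched tags, instead of regular expressions. The `LinkExtractor` class, the `extract_links` helper, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately sloppy markup: unclosed <p>, mismatched </b>,
# uppercase tag, unquoted attribute, and an anchor without href.
messy = (
    '<p><b>Results<p><a href="http://example.com/one">One</b>'
    '<A HREF=http://example.com/two>Two<a name="anchor"> end'
)

links = extract_links(messy)
```

This handles broken markup, but it still hard-codes what counts as a result link, so it would need updating by hand whenever the page template changes. That is the maintenance burden a learned wrapper is meant to reduce.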
The webql site info reads
Sounds like nothing but a spam e-mail address collector to me.
However, a proprietary piece of software sold for $450 is not the best way to surface an excellent idea. What we need is a protocol: a common query language for searching the web that will be easily supported by today's available search engines. Something like this would enable programmers to easily interface their programs with web search engines (which I guess is a good thing).
Also, if their manual is correct, no inserts, updates or deletes are allowed. A carefully drafted protocol like the one mentioned above should support all these, e.g. for adding documents into search engines, removing deleted web sites, coping with new URLs and so on.
Imagine:
delete *
from Yahoo
where errcode = 404
update Yahoo
set url = redirected_url
where redirection = True
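No such protocol exists in the article or the manual, so the following is only a sketch of what those two statements would do, run against a local stand-in table with Python's sqlite3 module. The `directory` table, its columns, and the sample rows are invented for the example, and the SQL is adjusted to syntax SQLite actually accepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical local stand-in for a search engine's URL directory.
conn.execute(
    "CREATE TABLE directory ("
    " url TEXT PRIMARY KEY,"
    " errcode INTEGER,"
    " redirected_url TEXT)"
)
conn.executemany(
    "INSERT INTO directory VALUES (?, ?, ?)",
    [
        ("http://example.com/alive", 200, None),
        ("http://example.com/gone", 404, None),
        ("http://example.com/old", 301, "http://example.com/new"),
    ],
)

# Drop dead pages.
conn.execute("DELETE FROM directory WHERE errcode = 404")

# Follow redirects by rewriting the stored URL.
conn.execute(
    "UPDATE directory SET url = redirected_url "
    "WHERE redirected_url IS NOT NULL"
)

remaining = sorted(row[0] for row in conn.execute("SELECT url FROM directory"))
```

After both statements run, the 404 entry is gone and the redirected entry points at its new address.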
--
I am appalled by the large number of posts I have already read bashing this thing. Did you guys just read the news article? If that is all you did, shame on you. Go to the site and download the manual: http://www.webql.com/webqlmanual.zip (sorry, I don't create clickable links; cut and paste it in). Anyway, this is a nice idea. Back when I was addicted to eBay, I wanted to gain an edge, so I wrote a program that let me query it. With my program I can query all ended auctions and find out which items were in demand by the number of bids and which items sold the most. Using that knowledge, I can try to find such items and sell them on eBay. With such a program you can also query all ended auctions, find out which ones are not in demand, and then see if there is anything you could use from those auctions.
:)
What I am pondering tho, is whether someone will soon make an open source implementation. If so, will that be fair? I mean, if I started a company with a neat idea, and 3 months later someone cranked out an open source version of my product, I'd be heartbroken. Ah well...
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
Here is the intro:
The future of the Internet is in what I call "rational programming" derived from a revival of Bertrand Russell's Relation Arithmetic. Rational programming is a classically applicable branch of relation arithmetic's sub theory of quantum software (as opposed to the hardware-oriented technology of quantum computing). By classically applicable I mean it applies to conventional computing systems -- not just quantum information systems. Rational programming will subsume what Tim Berners-Lee calls the semantic web. The basic problem Tim (and just about everyone back through Bertrand Russell) fails to perceive is that logic is irrational. John McCarthy's signature line says it all about this kind of approach: "He who refuses to do arithmetic is doomed to talk nonsense."
Seastead this.