WebQL Turns the Web Into A Giant Database
An anonymous reader says "This article was posted on ZDNet by Bill Machrone on a new type of query language for aggregating information from the Web." Somewhat light on the details, but definitely something to think about.
--
A mind is a terrible thing to taste.
System Requirements:
6GB HD suggested (10MB required to install)
Must use 5.99 GB of virtual memory I guess.
Wait a second... so I'm supposed to believe that this goes onto the web and takes information off of it from other computers? Whoa... that's a great idea! Someone should patent it if Al Gore hasn't already. Who would have known that something like this could happen to the Internet.
"I have not failed. I've simply found 10,000 ways that won't work." --Thomas Edison
Remember, remote content is not under your control. It will change (often) and is very very likely to not have a nice structure, and is even more likely to contain mismatched tags and other errors.
OK, it's in its infancy, but IMHO if/when XHTML is widely adopted, a special query language or tool will be largely irrelevant, because most of what is alleged in that brief article could be done in the magical wonderful world of XML.
Seriously guys,
A company comes out with a product that is at least fairly cool. Even if it does nothing that it says it will, it will at least turn some heads and get some attention. So at its worst, this is AMAZING PR for Linux in general. Can't you just picture it now:
MS Droid: It's good, we don't have anything like it, let's buy the company.
Tech Dude: Uhh, Sir, this runs on Linux. We *CAN'T* acquire them.
Beyond that, this reaffirms the idea that Linux is a valid operating system for the enterprise, and MySQL a valid database solution. Even though I know this, and you probably know this, what matters is that the VPs and VCs know this. My company is forced to use Oracle because our clients "don't trust PostgreSQL nor MySQL."
So the way I see it, we win!
--Alex the Fishman, (very proud fishman)
I followed the ZD links at the bottom to this page and it waxes lyrical about the ability to pull every phone number off a web site. Replace phone number with email address, and it seems a bit worrying. Couple that with the SQL programmability, and we could be looking at something that can auto-harvest de-anti-SPAMmed addresses, perhaps?
WebQL Turns the Web into a Giant Database
Isn't the Web really a giant database without 'WebQL'??? If the web isn't already a database of sorts, then what is it???
------
Random, useless fact: I type in startx entirely with my left hand.
What's great is that this program could, if scaled down into a personal edition, make the Internet much more accessible and usable for novices. An acquaintance once asked me to show him how to do basically the same thing: he wanted a macro to automatically import his stock portfolio information from a web page into a spreadsheet. He was kind of spoiled by a simple TRS-80 BASIC program he had once used in the heyday of computing, which automatically dialed up CompuServe (or some other online service, I don't recall which) and then downloaded and parsed stock information from one of the CompuServe forums. I had to tell him that there's no simple way to parse HTML like that, but it would have been much better if I could have pointed him to a personal version of WebQL to use for exactly this. Just imagine if Web sites could make ready-made WebQL scripts for their portals for users to download and use with their favorite spreadsheet or database.
Not to mention what I could do with such a fun toy. :) If they make a personal edition, and it works as advertised, I'll buy a copy for sure.
The breakthrough is noticing that almost everyone (everyone significant enough, anyway) has a web front end to their database. If you go in through that web front end, you don't have to go directly to the database, and you can bypass all the above issues!
The first company that I know of that does this is an Israeli company: Orsus.com. Since then, OnePage.com also does it.
What they (at least Orsus) did was build a language (based on XML) that instructs a web spidering engine that has the ability to parse HTML and Javascript. A GUI IDE (no less) is used by the lay person to write the XML-based code.
An excerpt - I'm too tired to find the link...
THE COPYRIGHT CONUNDRUM
Another problem is with copyrights and other protections of intellectual property. As we have learned from the recent Napster battles, this can be a real sticky wicket on the Net, where users can so easily and freely trade files regardless of any such protection. Because the current system is intentionally oblivious to what's in those Internet packets being transferred, there's no easy way to protect copyrighted data. With all that in mind, Kahn decided it was necessary to develop a new framework for dealing with all the information on the Internet that would act as a layer above the existing infrastructure but deal with the "what" as much as the "where." So through the 90s, while most of the world was just discovering the Internet, Kahn was working on how to reinvent it. His new system is called the "Handle System." Instead of identifying the place a file is going to or coming from, it assigns an identifier called a "handle" to the information itself, called "digital objects." A digital object is anything that can be stored on a computer: a web page, a music file, a video file, a book or chapter of a book, your dental x-rays - you name it. Similar to the way a host name is resolved to an IP address, the handle will be resolved into information the computer needs to know about the object. Only since the information is now about the object, the location of the object is just one of the bits of information that is important. The handle record will also tell the computer things like what kind of file the object is, how often it will be updated and how the object is allowed to be used - whether there are any copyright or privacy protections. The record can also have any industry-specific information about the object, like a book's International Standard Book Number (ISBN) code.
There are two other crucial things about these records: First, each handle can have multiple records associated with it - allowing multiple copies of the same information to be stored on different servers or for different systems. Second, the handle record is updated by the owner of the information - something in stark contrast to the host-name data, which is updated by central repository companies like Network Solutions. This will make things like changing the location of the file much more seamless, rather than waiting days for a new IP address to be updated in the DNS server.
Wouldn't merging the querying features with the above "Handle System" seem a wise thing to do? Maybe that's what it already does...
Please stop APK.. you're only hurting yourself.
Never use select * from... for production queries if you can help it. It's bad style. If you change your schema to include more columns, you can wind up returning more data to the front end than you need to. This causes display errors at worst and wastes bandwidth at best.
If you have different developers working on the front end and the database, this will really make them hate each other. It also makes the query optimizer work harder than it needs to (the amount of CPU wasted this way is totally insignificant, but it's bad form anyway).
Also, if you're going to run select * from internet without a where clause, be prepared for an extremely long running query.
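To make the schema-change point concrete, here's a minimal sketch in Python using the standard sqlite3 module. The users table and its columns are made up for illustration; nothing here comes from WebQL or any real schema. It just shows that the column list of a select * result grows when the schema does, while an explicit column list stays put.

```python
import sqlite3

# Hypothetical schema, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users (name, email) VALUES ('alice', 'alice@example.com')")

def column_names(cursor):
    """Return the names of the columns in a cursor's result set."""
    return [col[0] for col in cursor.description]

star_before = column_names(conn.execute("SELECT * FROM users"))
explicit_before = column_names(conn.execute("SELECT id, name FROM users"))

# A later schema change adds a column the front end never asked for.
conn.execute("ALTER TABLE users ADD COLUMN password_hash TEXT")

star_after = column_names(conn.execute("SELECT * FROM users"))
explicit_after = column_names(conn.execute("SELECT id, name FROM users"))

print(star_before, "->", star_after)          # the SELECT * result grew
print(explicit_before, "->", explicit_after)  # the explicit list is unchanged
```

The front end that asked for id and name keeps getting exactly that, no matter what the database folks add later.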
--Shoeboy
SELECT * FROM Internet WHERE SubjectOfPic = "Natale Portman" AND Grits = "Hot Pouring"
How will this be different from Google's back-end query interface? I ask, because I can't imagine someone making a "screen-scraping" search engine that returns bits of data and not just a link. They will probably get sued by the owners of the purloined content. Plus, parsing HTML to extract one little field of data is tricky, and highly dependent on the layout of the page. I've written a number of things to do just that, from Amazon, IMDB, Borders, finance.Yahoo.com, etc., for my own purposes. I wrote them in both C and Perl. It's a job keeping the filters updated to accommodate the changes in page layout style, regardless of language. Good luck to them and all, but until we have an XML + XSL web, with standard DTDs for the XML, forget it.
________________________________________
Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
Correction. You may GNU/FreeASK for forgiveness.
Please please please mod me up! I'm serious! Pathetic, but serious!
The appropriate response to an ASK is a NASK.
If tits were wings it'd be flying around.
I dunno, just seems to do the same stuff...
I downloaded the WebQL Business Edition manual. Here's an abbreviated version of the first example query:
The select clause accepts a variety of functions, of which text() seems to be the most useful. You can see that the first argument is a regex designed to match phone numbers. The from clause is a URL. The where clause primarily takes the approach "descriptor," which can crawl or guess new URLs. So basically, it doesn't do anything a Perl script can't. It just presents a simpler interface.
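For comparison, here's roughly what the select part boils down to in a scripting language. This is a minimal sketch in Python rather than Perl, and it is not how WebQL works internally; I'm only reusing the phone-number regex from the manual's example query, (\(206\)\s+\d{3}-\d{4}). The sample page text and the extract_phones name are made up, and the fetch step is left out so the snippet runs without a network.

```python
import re

# The phone-number pattern from the manual's example query.
PHONE_RE = re.compile(r"(\(206\)\s+\d{3}-\d{4})")

def extract_phones(page_text):
    """Return every substring of page_text that matches PHONE_RE, in order."""
    return PHONE_RE.findall(page_text)

# Stand-in for a fetched page; a real script would download the URL first.
sample_page = """
<html><body>
<p>Sales: (206) 555-0100</p>
<p>Support: (206) 555-0199, fax (425) 555-0123</p>
</body></html>
"""

phones = extract_phones(sample_page)
print(phones)
```

Only the two (206) numbers come back; the (425) fax number doesn't match the pattern. The crawling part (the where clause) is where the real work would be.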
What this reminds me of is those wolf programs from a couple of years back. They made FTPWolf, WebWolf, MP3Wolf, WarezWolf and some others. They were just little client spiders that scoured webpages for keywords and followed links. They started with search engines and followed the results to other pages and followed those pages, and on and on. Nothing that hasn't been done before.
The part that would make it useful, and which they claim to do, is comparative searches. i.e. Show me all the latest P4 prices by vendor, or what's the difference between these 3 drills. They mention on the products page that anyone familiar with perl and HTML can use it in no time. I would think anyone familiar with perl and the LWP or Net::FTP modules could create this system in no time. They say there are some wizards, but when have wizards been able to really do what a user wants?
Free Online Woodworking Resources Directory
;-)
select
    text("(\(206\)\s+\d{3}-\d{4})","","","T")
from
    http://foo/bar.html
where
    approach=sequence("1","10","1","XX")
I wouldn't exactly call that a simple interface!
cpeterso
Oh, ye gods, if only I had some mod points! And if I could use them all to mod this post up to its rightful level! Curse you, fate! CURSE YOU!!!!
- Have a picture
For my thesis, I created a Web query system called ParaSite. The best introduction is the paper Squeal: A Structured Query Language for the Web, which I presented at the World-Wide Web Conference. Anybody is welcome to use my code, algorithms, or ideas.
See also WebSQL and W3QL, which also come from academia.
is a link to the ZD review of the Biz version.
meow
Diffs between this and Google, for instance, abound. Central is the fact that it's not limited to urls.
"Version 1.0 of WebQL uses a wizard to simplify writing queries, but only users with SQL experience will be able to create useful queries. (Ordinary-language queries will be supported in future versions.) The wizard lets you select whether to return text, URLS, table rows or columns, or any combination thereof. You can then specify to search for text, regular expressions, or table cells, and you can add refinements such as case sensitivity and the number of matches returned per page."
I will buy this when it supports ordinary language queries.
Through its access to directories will this thing allow you to bypass registrations on all sites? Pay sites?
How about an image search? (Since people don't name their files informatively all the time..)
Goat sex free since 2001
Too much like Microsoft.NET?
I don't see any indication that Caesius intends to start such a search engine. WebQL is just a web crawler.
If someone did, the primary defense would be fair use. Some search engines already display an abstract in the search results. On the other hand, I think eBay won a case against (or bullied into submission) a site that crawled their auctions. U.S. courts don't seem to like deep linking, let alone data extraction. Something about the God-given right to banner ad impressions. Next thing you know, a U.S. Marshal will break down my door because I'm using the Internet Junkbuster proxy. I did post anonymously, right?
Like they say about Perl: it makes the easy things easy, and the hard things possible. The average SQL user could probably learn WebQL syntax, although regexes can be complicated. (I didn't realize how complicated until I read Mastering Regular Expressions.) On the other hand, writing a web crawler in Perl may be beyond his reach.
That said, it wouldn't take much work to cook up a little language like this to wrap a Perl web crawler. I certainly wouldn't pay $500 for this proprietary package.
I'm sorry but I just can't seem to get excited. I think time would be better spent trying to improve the existing ones, rather than introducing another set of bugs and security holes into the mix.
especially with the web running at well over a billion pages by now. Just think of the time to query a billion pages all around the planet, never mind on a small business line, with say a dsl line (forget modem!)
but then I don't get the big bucks for this either....
"It is a greater offense to steal men's labor, than their clothes"
Posting goatse.cx links makes baby Anne Marie cry.
Thanks for the links. This looks more compelling than the WebQL product.
The first example in the WebQL manual demonstrates a regex to extract phone numbers. One of your first examples appears to implement the Google technique. I find the latter more interesting.
Thunderstone Texis... web spider, regex scraper and SQL-compliant RDBMS. Oh, and a _tad_ more maturity and market experience. Of course, the price tag is a bit steep...
WebQL seems kinda like a friendlier name for an already-established market.
PDHoss
======================================
Writers get in shape by pumping irony.
Um, how about Unix? Basically, it was created by the open source community, though it wasn't called that back then.
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger
SELECT sex.image, text.description
FROM web_images AS sex, web_text AS text
WHERE sex.primary_key = text.primary_key
AND text.description LIKE UPPER('%NATALIE%')
AND text.description LIKE UPPER('%PORTMAN%')
AND text.description NOT LIKE UPPER('%GRITS%')
Woo-hoo! Our sweet mother of Akamai accelerated download, don't fail me now!
It's worth mentioning that BSD is a descendant, and Linux is a clone, of Unix. The UNIX source was available, but by no means could its license fit the Open Source definition. MANY people had illegal copies of the copyrighted code, however, which was likely one of the primary inspirations for Free/Open Source software later on.
>drop table internet;
OK, 135454265363565609860398636678346496
rows affected.
"oh fuck"
FluX
After 16 years, MTV has finally completed its deevolution into the shiny things network
"It is seldom that liberty of any kind is lost all at once." -David Hume
Define "Open Source definition" :) And AFAIK, legal versions of the source code were initially given to universities almost for free (the commercial source licenses cost an arm and a leg though), which was one reason why it spread so quickly.
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger
Interesting one this. I note that it is based on MySQL, a lovely, wonderful, useful toy if ever there was one - may the contributors to it have fast ping times, high data transfer rates and few system crashes.
I am curious to know exactly how it sorts through the data though: does it refer to some kind of externally held central database server via the 'net which is continually updated? I fail to see how else such a system could be truly usefully maintained otherwise to an acceptable standard of accuracy.
Elgon
The Semantic Web Page is a good starting point.
TBL's personal notes is another one. Probably the best one, actually.
"The Semantic Web" was a term coined by Tim Berners-Lee (we all know who that is, don't we?) to describe a www-like global knowledge base, which when combined with some simple logic forms a really interesting KR system. His thesis is that early hypertext systems died of too much structure limiting scalability, and current KR systems (like CYC) have largely failed for similar reasons. The Semantic Web is an attempt to do KR in a web-like way.
This really could be the next major leap in the evolution of the web. Do yourself a favour and check it out. And it's not based on hacks for screen-scraping HTML, it's based on real KR infrastructure.
With simple techniques (no learning or any kind of intelligence), such as regular expressions, to extract or label content from web pages, you can't expect good coverage across pages written with all kinds of templates and with so many types of errors.
Right now I'm writing a Java program to extract links from Google search results (easy, don't shoot! Academic use only). What I'm using is OROMatcher, one of the best regular expression packages for Java. I'll say it's still mission impossible to get 100% recall and be error-free even for this simple task.
The formal name of such a program (labelling and extracting contents) is a "wrapper". Probably the only way to improve the efficiency of a wrapper is to apply machine learning techniques. A well-trained wrapper program with a good learning algorithm could be smart enough to adapt to HTML coding formats with small variances. A good example is in this paper.
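For the link-extraction case specifically, one alternative to pure regexes is an event-based HTML parser that shrugs off sloppy markup. Below is a minimal hand-written sketch in Python (not Java, and not the learning-based wrapper the paper describes), using the standard html.parser module. The LinkExtractor class, the extract_links helper, and the messy sample markup are all made up for illustration; it makes no claim about recall on real search result pages.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return the href of every <a> tag in html, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links

# Unclosed tags, uppercase names, and unquoted attributes on purpose.
messy = """
<p><b>Results<p>
<A HREF="http://example.com/one">one</a>
<a class=r href=http://example.com/two>two
<a name="anchor-only">no link here</a>
"""

links = extract_links(messy)
print(links)
```

This handles the tag-soup problem but not the layout problem: deciding which links are results and which are navigation still needs rules or learning.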
just think... if the commercial licenses had been cheaper perhaps we wouldn't be seeing all these windows boxen today, as more things would have been developed on and for unix. then again, maybe not. *shrug*
eudas
Blessed is he who expects the worst, for he shall not be disappointed.
webql is a data extraction/data aggregation/web crawling/data mining tool.
See the unbroken link here for proof that this is on-topic and funny.
The webql site info reads
Sounds like nothing but a spam e-mail address collector to me.
However, a proprietary piece of software sold for $450 is not the best way to surface an excellent idea. What we need is a protocol: a common query language for searching the web that will be easily supported by today's available search engines. Something like this would enable programmers to easily interface their programs with web search engines (which I guess is a good thing).
Also, if their manual is correct, no inserts, updates or deletes are allowed. A carefully drafted protocol like the one mentioned above should support all these, e.g. for adding documents into search engines, removing deleted web sites, coping with new URLs and so on.
Imagine:
delete from Yahoo
where errcode = 404;
update Yahoo
set url = redirected_url
where redirection = True;
if you haven't looked into RDF and the importance of metadata on the web, there's no time like the present.
it wouldn't hurt to read weaving the web, by tim berners-lee, the inventor of the world-wide web, either. he has chapter 1 online.
I am appalled by the large number of posts I have read already bashing this thing. Did you guys just read the news article? If that is all you did, shame on you. Go to the site and download the manual: http://www.webql.com/webqlmanual.zip (sorry, I don't create clickable links; cut and paste it in). Anyway, this is a nice idea. Back when I was addicted to eBay, I wanted to gain an edge, so I wrote a program that let me query eBay. With it I can query all ended auctions and find out which items were in demand by the number of bids and which items sold the most; using that knowledge, I can try to find such items and sell them on eBay. With such a program, you can also query all ended auctions, find out which are not in demand, and then see if there is anything you could use from those auctions.
:)
What I am pondering, though, is whether someone will soon make an open source implementation, and if so, whether that will be fair. I mean, if I started a company with a neat idea, and 3 months later someone cranked out an open source version of my product, I'd be heartbroken. Ah well...
------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind}
%returnhash = sendhash('domain.com','port',%hash);
or something like that.
I've written an interface to CyberCash and to the Tucows OpenSRS system, and they implement this in entirely different ways that both require installing their own Perl libraries and learning their own syntax. They could easily have been implemented in one standard way though. There just doesn't seem to be one for everybody to use.
Basically this seems to be what MS is trying to come up with with .NET. It doesn't seem particularly difficult though. Really that's all we need: to be able to pass a hash to a domain:port and to get a hash back (probably there'd be a standard status field... 200, 404, etc., but the rest would be whatever fields are defined by their API). Services like MapQuest could return a jpg when you pass it an address in your hash. Services like slashdot could return an article (or all articles, or whatever). Services like etrade could return stock prices. We can get all this information already, but with this standard interface there'd be no parsing HTML or crazy hacks like WebQL!
Does anyone know if there's anything like this already? And if so, why nobody uses it? And if not, ideas on how to get it out there?
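The hash-in, hash-out idea above could be sketched like this. It's a minimal illustration in Python, not an existing standard: JSON as the wire format is my assumption, the quote_service and its prices are made up, and the network hop to domain:port is replaced by a direct function call so the sketch runs on its own.

```python
import json

def encode_hash(fields):
    """Serialize a dict to bytes for the wire (JSON is an assumption)."""
    return json.dumps(fields, sort_keys=True).encode("utf-8")

def decode_hash(payload):
    """Turn bytes from the wire back into a dict."""
    return json.loads(payload.decode("utf-8"))

# Hypothetical service data: made-up symbols and prices.
QUOTES = {"AAA": "1.00", "BBB": "2.50"}

def quote_service(payload):
    """Server side: bytes in, bytes out, always with a 'status' field."""
    request = decode_hash(payload)
    symbol = request.get("symbol")
    if symbol in QUOTES:
        reply = {"status": 200, "symbol": symbol, "price": QUOTES[symbol]}
    else:
        reply = {"status": 404}
    return encode_hash(reply)

def send_hash(service, fields):
    """Client side: stand-in for sendhash(domain, port, hash).

    A real client would open a connection to domain:port here; this
    sketch calls the service function directly instead.
    """
    return decode_hash(service(encode_hash(fields)))

found = send_hash(quote_service, {"symbol": "AAA"})
missing = send_hash(quote_service, {"symbol": "NOPE"})
print(found)
print(missing)
```

The only thing both sides have to agree on is the encoding and the status field; everything else is whatever the service's API defines.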
There's a chapter or two written by Charles Allen about WIDL in the XML Handbook (Goldfarb, et al).
But it's a technology that is dated now -- webMethods has moved on to B2B, and anyone who is jumping up and down about screen scraping in 2000 is just a little bit behind the times.
--brian
"quietbit", you are all over this page writing little snide comments followed by ":)" , saying that WebQL is "obviously" so good and everything else is "obviously" shit, and that you edited their webpage, and so on.
Funny how you were silent when the guy who wrote an open source Web SQL thing posted his work earlier.
Here is the intro:
The future of the Internet is in what I call "rational programming" derived from a revival of Bertrand Russell's Relation Arithmetic. Rational programming is a classically applicable branch of relation arithmetic's sub theory of quantum software (as opposed to the hardware-oriented technology of quantum computing). By classically applicable I mean it applies to conventional computing systems -- not just quantum information systems. Rational programming will subsume what Tim Berners Lee calls the semantic web. The basic problem Tim (and just about everyone back through Bertrand Russell) fails to perceive is that logic is irrational. John McCarthy's signature line says it all about this kind of approach: "He who refuses to do arithmetic is doomed to talk nonsense."
Seastead this.
and see also : http://www.cs.wisc.edu/niagara/
Unix was created at Bell, for sale, by people who at first did it more for sport than as work. The source was readily available, very cheaply for academic institutions. Those are the facts. What's your problem?
The illegal we do immediately. The unconstitutional takes a little longer.
--Henry Kissinger