Slashdot Mirror


WebQL Turns the Web Into A Giant Database

An anonymous reader says "This article was posted on ZDNet by Bill Machrone on a new type of query language for aggregating information from the Web." Somewhat light on the details, but definitely something to think about.

24 of 84 comments

  1. I have my doubts by superid · · Score: 2
    I've done my share of "screen scraping"...that new buzzword where I grab the html and apply various forms of on the fly text processing in an attempt to grab the meat of whatever content is being presented.

    Remember, remote content is not under your control. It will change (often) and is very very likely to not have a nice structure, and is even more likely to contain mismatched tags and other errors.

    OK, it's in its infancy, but IMHO if/when XHTML is widely adopted, a special query language or tool will be largely irrelevant, because most of what is alleged in that brief article could be done in the magical wonderful world of XML.

  2. Databases, web, syndication by ivanl · · Score: 2
    The idea is this: you want to link everyone's databases together. But linking databases is a sensitive issue (security-wise, for one) and you have to have agreements on a one-by-one basis.

    The breakthrough is that you notice almost everyone (those significant enough) has a web frontend to their database. Now if you can just go via that web front, you don't have to go direct to the database and can bypass all the above issues!

    The first company that I know of that does this is an Israeli company: Orsus.com. Since then, OnePage.com also does it.

    What they (at least Orsus) did was build a language (based on XML) that instructs a web spidering engine that has the ability to parse HTML and Javascript. A GUI IDE (no less) is used by the lay person to write the XML-based code.

  3. Hey Taco by Shoeboy · · Score: 2

    Never use select * from... for production queries if you can help it. It's bad style. If you change your schema to include more columns you can wind up returning more data to the front end than you need to. This causes display errors at worst and wastes bandwidth at best.
    If you have different developers working on the front end and the database, this will really make them hate each other. It also makes the query optimizer work harder than it needs to (the amount of CPU wasted this way is totally insignificant, but it's bad form anyway).
    Also, if you're going to run select * from internet without a where clause, be prepared for an extremely long running query.
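
    A minimal sketch of the schema-change problem, using Python's sqlite3 module with an in-memory database. The table and column names (stories, title, body) are made up for illustration:

```python
import sqlite3

# In-memory database with a hypothetical two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO stories (title) VALUES ('WebQL')")

def row_width(query):
    """Return the number of columns the query hands to the front end."""
    cursor = conn.execute(query)
    return len(cursor.description)

star_before = row_width("SELECT * FROM stories")
named_before = row_width("SELECT id, title FROM stories")

# Schema change: someone adds a column the front end never asked for.
conn.execute("ALTER TABLE stories ADD COLUMN body TEXT")

star_after = row_width("SELECT * FROM stories")
named_after = row_width("SELECT id, title FROM stories")

print("select *      :", star_before, "->", star_after)
print("named columns :", named_before, "->", named_after)
```

    The star query silently grows with the schema, while the query that names its columns keeps returning the same shape.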
    --Shoeboy

  4. Re:Ingenius! by Shoeboy · · Score: 3

    ROTFLMAO!!!!!!!
    Good god! An "Al Gore invented the internet" joke is combined with a "stupid patent idea" joke! The originality of the average slashbot never ceases to amaze me! You should send some of your jokes to illiad so he can put them in User Friendly!
    --Shoeboy

  5. Yeah, okay. by 1010011010 · · Score: 4

    How will this be different from Google's back-end query interface? I ask, because I can't imagine someone making a "screen-scraping" search engine that returns bits of data and not just a link. They will probably get sued by the owners of the purloined content. Plus, parsing HTML to extract one little field of data is tricky, and highly dependent on the layout of the page. I've written a number of things to do just that, from Amazon, IMDB, Borders, finance.Yahoo.com, etc., for my own purposes. I wrote them in both C and Perl. It's a job keeping the filters updated to accommodate the changes in page layout style, regardless of language. Good luck to them and all, but until we have an XML + XSL web, with standard DTDs for the XML, forget it.


    ________________________________________

    --
    Napster-to-go says "Fill and refill your compatible MP3 player", which is a lie. It's not MP3. It's WMA with DRM.
    1. Re:Yeah, okay. by Nexx · · Score: 2

      Actually, I don't think you need standard DTDs, as long as each site stuck with their own DTDs long enough to make it worthwhile. At that point, you can build a custom XSL for each site to transform their content to conform to your DTD, and you can parse that.
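
      A rough sketch of the per-site idea. It uses Python's xml.etree.ElementTree with a small lookup table per site as a stand-in for real XSL stylesheets, and the site names, tag names, and sample documents are all invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: which tag holds a record, and which child
# tags hold the fields we want in our own common format.
SITE_RULES = {
    "bookshop-a": {"record": "item", "title": "name", "price": "cost"},
    "bookshop-b": {"record": "product", "title": "heading", "price": "amount"},
}

def to_common(site, xml_text):
    """Rewrite one site's XML into a common <catalog>/<entry> layout."""
    rules = SITE_RULES[site]
    root = ET.fromstring(xml_text)
    catalog = ET.Element("catalog")
    for rec in root.iter(rules["record"]):
        entry = ET.SubElement(catalog, "entry")
        for field in ("title", "price"):
            ET.SubElement(entry, field).text = rec.findtext(rules[field])
    return catalog

# Made-up documents from two sites that each use their own tag names.
site_a = "<items><item><name>Dune</name><cost>7.99</cost></item></items>"
site_b = "<shop><product><heading>Emma</heading><amount>5.50</amount></product></shop>"

merged = [
    (e.findtext("title"), e.findtext("price"))
    for site, doc in (("bookshop-a", site_a), ("bookshop-b", site_b))
    for e in to_common(site, doc).iter("entry")
]
print(merged)
```

      Once every site is mapped into the same layout, a single parser handles all of them, and only the per-site rules need updating when a site changes its format.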


  6. Re:but how do you pronounce it? by PD · · Score: 2

    WQL - wackle
    SQL - squeal

  7. Re:FreeQL ? by PD · · Score: 2

    The appropriate response to an ASK is a NASK.

  8. just a pretty interface by _|()|\| · · Score: 4
    "I grab the html and apply various forms of on the fly text processing"

    I downloaded the WebQL Business Edition manual. Here's an abbreviated version of the first example query:

    select
    text("(\(206\)\s+\d{3}-\d{4})","","","T")
    from
    http://foo/bar.html
    where
    approach=sequence("1","10","1","XX")

    The select clause accepts a variety of functions, of which text() seems to be the most useful. You can see that the first argument is a regex designed to match phone numbers. The from clause is a URL. The where clause primarily takes the approach "descriptor," which can crawl or guess new URLs.

    So basically, it doesn't do anything a Perl script can't. It just presents a simpler interface.
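
    For comparison, here is roughly what the regex half of that query looks like as a script, sketched in Python rather than Perl. The page contents and the second URL are made up, the pages are supplied as strings instead of being fetched, and the approach=sequence(...) crawling is not reproduced because the manual excerpt above doesn't spell out its semantics:

```python
import re

# Same pattern as the text() call in the query: a (206) area code,
# whitespace, then a number written as ddd-dddd.
PHONE = re.compile(r"(\(206\)\s+\d{3}-\d{4})")

def select_text(pattern, pages):
    """Yield (url, match) for every regex hit in already-fetched pages."""
    for url, html in pages.items():
        for match in pattern.findall(html):
            yield url, match

# Made-up page content standing in for fetched HTML.
pages = {
    "http://foo/bar.html": "<p>Call (206) 555-0100 or (425) 555-0199.</p>",
    "http://foo/baz.html": "<td>Fax: (206)  555-0123</td>",
}

rows = list(select_text(PHONE, pages))
for url, phone in rows:
    print(url, phone)
```

    The script needs more lines than the query, but it does the same thing: one regex applied to each page in turn.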

  9. simple interface? by cpeterso · · Score: 2

    select
    text("(\(206\)\s+\d{3}-\d{4})","","","T")
    from
    http://foo/bar.html
    where
    approach=sequence("1","10","1","XX")


    I wouldn't exactly call that a simple interface! ;-)

    1. Re:simple interface? by dodobh · · Score: 2

      Compared with a Perl script, the interface is simple. :)

      --
      I can throw myself at the ground, and miss.
  10. Freely-Available Web Query Languages by Ellen+Spertus · · Score: 5

    For my thesis, I created a Web query system called ParaSite. The best introduction is the paper Squeal: A Structured Query Language for the Web, which I presented at the World-Wide Web Conference. Anybody is welcome to use my code, algorithms, or ideas.

    See also WebSQL and W3QL, which also come from academia.

  11. Here by perdida · · Score: 2

    is a link to the ZD review of the Biz version.

    meow

    Diffs between this and Google, for instance, abound. Central is the fact that it's not limited to URLs.

    "Version 1.0 of WebQL uses a wizard to simplify writing queries, but only users with SQL experience will be able to create useful queries. (Ordinary-language queries will be supported in future versions.) The wizard lets you select whether to return text, URLS, table rows or columns, or any combination thereof. You can then specify to search for text, regular expressions, or table cells, and you can add refinements such as case sensitivity and the number of matches returned per page."

    I will buy this when it supports ordinary language queries.

    Through its access to directories will this thing allow you to bypass registrations on all sites? Pay sites?

    How about an image search? (Since people don't name their files informatively all the time..)

  12. I don't know about this by Alien54 · · Score: 3
    With all of the variant site structures, never mind security issues, pay sites, and things like Microsoft constantly rebuilding/breaking its website, it is hard to see how it would give better results than any meta search across the common search engine sites with prebuilt indexing, etc.

    Especially with the web running at well over a billion pages by now. Just think of the time to query a billion pages all around the planet, never mind on a small business line, with, say, a DSL line (forget modem!)

    but then I don't get the big bucks for this either....

    --
    "It is a greater offense to steal men's labor, than their clothes"
  13. Re:How much better could this be in terms of PR by vectro · · Score: 2
    Erhm, perhaps you might want to visit the product info, where it mentions:
    Server Component System Requirements: Linux
    The client runs on Windows, but the server is for Linux.
  14. goof? by fluxrad · · Score: 5

    >drop table internet;
    OK, 135454265363565609860398636678346496
    rows affected.

    "oh fuck"


    FluX
    After 16 years, MTV has finally completed its deevolution into the shiny things network

    --
    "It is seldom that liberty of any kind is lost all at once." -David Hume
  15. The Semantic Web by hemul · · Score: 3
    WebQL looks like an interesting hack, but have a look at the semantic web project for people trying to do it properly.

    The Semantic Web Page is a good starting point.
    TBL's personal notes is another one. Probably the best one, actually.

    "The Semantic Web" was a term coined by Tim Berners-Lee (we all know who that is, don't we?) to describe a www-like global knowledge base, which when combined with some simple logic forms a really interesting KR system. His thesis is that early hypertext systems died of too much structure limiting scalability, and current KR systems (like CYC) have largely failed for similar reasons. The Semantic Web is an attempt to do KR in a web-like way.

    This really could be the next major leap in the evolution of the web. Do yourself a favour and check it out. And it's not based on hacks for screen-scraping HTML, it's based on real KR infrastructure.

  16. Parsing Information from the web page by jayhop · · Score: 2

    With simple techniques (without learning or any kind of intelligence) such as regular expressions to extract or label contents from web pages, you can't expect good coverage of pages written in all kinds of templates and with so many types of errors.

    Right now I'm writing a Java program to extract links from Google search results (easy, don't shoot! Academic use only). What I'm using is OROMatcher, one of the best regular expression packages for Java. I'd say it's still mission impossible to get 100% recall and be error-free even for this simple task.

    The formal name of such a program (labelling and extracting contents) is a "wrapper". Probably the only way to improve the efficiency of a wrapper is to apply machine learning techniques. A well-trained wrapper program with a good learning algorithm could be smart enough to adapt to HTML coding formats with small variations. A good example is in this paper.

    1. Re:Parsing Information from the web page by Bazzargh · · Score: 2

      Put the pages through a normalisation stage first, e.g. the HTML Tidy utility at
      http://www.w3.org/People/Raggett/tidy/
      Unless what you are searching for is broken html, your life will be improved by this step...

      BTW, using a regular expression matcher to pull out information from HTML is not the smartest idea. You should use a parser to do the job. I can see why you would do what you've done (e.g. the HTML doesn't parse, and you don't want to guess all the tricks that MS/NS use to fix luser code), but still, you're better off passing the HTML through a tidying step, then using a proper parser. It's not like you can't get HTML parser code for free these days. Since you use Java, look at javax.swing.text.html.parser.
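
      To illustrate the parser approach, here is a small sketch using Python's standard html.parser module rather than the Java class mentioned above, and without the tidying step. The markup is made up and deliberately messy (unclosed tags, mixed case, inconsistent quoting):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up sample page with the kinds of errors real pages contain.
messy = """
<HTML><body><p>Results
<A HREF="http://example.com/one">first<b>bold
<a class=x href='http://example.com/two'>second</a>
<a name="anchor-only">no link here</a>
"""

collector = LinkCollector()
collector.feed(messy)
collector.close()
print(collector.links)
```

      The parser deals with case, quoting, and attribute order itself, so the extraction code only has to say which tag and attribute it wants.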

      -Baz

  17. Spam? by mindriot · · Score: 2

    The webql site info reads

    • Market Research
    • Aggregate Information of any Kind
    • Develop Targeted Contact Lists

    Sounds like nothing but a spam e-mail address collector to me.

    ...and, it's not free. So forget it.

  18. Excellent, but... by laoman · · Score: 2
    I agree this is an excellent idea. Personally, I enjoy working both with on-line stuff and with databases (although my grades in both DB courses I've taken while at school were among my lowest).

    However, a proprietary piece of software, sold for $450, is not the best way to surface an excellent idea. What we need is a protocol: a common query language for searching the web that will be easily supported by today's available search engines. Something like this would enable programmers to easily interface their programs with web search engines (which I guess is a good thing).

    Also, if their manual is correct, no inserts, updates or deletes are allowed. A carefully drafted protocol like the one mentioned above should support all these, e.g. for adding documents into search engines, removing deleted web sites, coping with new URLs and so on.

    Imagine:

    delete *
    from Yahoo
    where errcode = 404

    update Yahoo
    set url = redirected_url
    where redirection = True

  19. Re:Looks like a SPAMmer's dream by pen · · Score: 2
    IMHO, 99% of all Web/Internet users post their plain email address one time or another. That provides sufficient volume for the spammers. I highly doubt that they care about the obfuscated addresses, especially because a person who obfuscates her email address is much more likely (in terms of probability) to be a person who reports spammers to their ISPs repeatedly.


  20. Whoa, take it easy peeps by segmond · · Score: 2

    I am appalled by the large number of posts I have already read bashing this thing. Did you guys just read the news article? If that is all you did, shame on you. Go to the site and download the manual: http://www.webql.com/webqlmanual.zip (sorry, I don't create clickable links; cut and paste it in). Anyway, this is a nice idea. When I was addicted to eBay, I wanted to gain an edge, so I wrote a program to let me query eBay. With my program I can query all ended auctions and find out which items were in demand by the number of bids and which items sold the most; using that knowledge, I can try to find such items and sell them on eBay. Using such a program, you can also query all ended auctions, find out which auctions are not in demand, and then see if there is anything you could use from those auctions.

    What I am pondering, tho, is whether someone will soon make an open source implementation. If so, will that be fair? I mean, if I started a company with a neat idea, and 3 months later someone cranked out an open source version of my product, I'd be heartbroken. Ah well... :)

    --
    ------ Curiosity killed the cat. {satisfaction brought it back | it didn't die ignorant | lack of it is killing mankind
  21. The Relation Arithmetic Alternative by Baldrson · · Score: 2
    A while back, I posted an article on an alternative to Tim Berners-Lee's Semantic Web based on the aspect of Bertrand Russell's work that Russell thought was his most under-rated achievement: Relation Arithmetic.

    Here is the intro:

    The future of the Internet is in what I call "rational programming" derived from a revival of Bertrand Russell's Relation Arithmetic. Rational programming is a classically applicable branch of relation arithmetic's sub theory of quantum software (as opposed to the hardware-oriented technology of quantum computing). By classically applicable I mean it applies to conventional computing systems -- not just quantum information systems. Rational programming will subsume what Tim Berners-Lee calls the semantic web. The basic problem Tim (and just about everyone back through Bertrand Russell) fails to perceive is that logic is irrational. John McCarthy's signature line says it all about this kind of approach: "He who refuses to do arithmetic is doomed to talk nonsense."