Company Offers Customizable Web Spidering
TechReviewAl writes "A company called 80legs has come up with an interesting new web business model: customized, on-demand web spidering. The company sells access to its spidering system, charging $2 for every million pages crawled, plus a fee of three cents per hour of processing used. The idea is to offer Web startups a way to build their own web indexes without requiring huge server farms. 'Many startups struggle to find the funding needed to build large data centers, but that's not the approach 80legs took to construct its Web crawling infrastructure. The company instead runs its software on a distributed network of personal computers, much like the ones used for projects such as SETI@home. The distributed computing network is put together by Plura Processing, which rents it to 80legs. Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and other rewards.'"
Let's assume that spidering a page costs 10 kB of data.
So that's $2 for 1M pages, or 10 GB of data download.
So that's at least $1 of data transfer being shifted onto the suckers, err, "volunteers" whose home networks are running this app.
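The arithmetic in this comment can be sketched in a few lines of Python. The prices come from the story summary and the 10 kB page size is the commenter's assumption. The $0.10 per GB transfer cost is not stated anywhere in the thread; it is only the rate implied by "$1 for 10 GB", and the function names are made up for illustration.

```python
# Back-of-the-envelope numbers for the comment above.
PRICE_PER_MILLION_PAGES = 2.00  # dollars, from the story summary
PRICE_PER_CPU_HOUR = 0.03       # dollars, from the story summary
PAGE_SIZE_KB = 10               # the commenter's assumption
TRANSFER_COST_PER_GB = 0.10     # ASSUMPTION: rate implied by "$1 for 10 GB"


def crawl_fee(pages, cpu_hours=0.0):
    """What the customer pays 80legs for a crawl, in dollars."""
    return pages / 1_000_000 * PRICE_PER_MILLION_PAGES + cpu_hours * PRICE_PER_CPU_HOUR


def data_downloaded_gb(pages, page_size_kb=PAGE_SIZE_KB):
    """Total download volume in GB, using decimal units (1 GB = 1,000,000 kB)."""
    return pages * page_size_kb / 1_000_000


def volunteer_transfer_cost(pages, cost_per_gb=TRANSFER_COST_PER_GB):
    """Bandwidth cost carried by the machines doing the crawling, in dollars."""
    return data_downloaded_gb(pages) * cost_per_gb


if __name__ == "__main__":
    pages = 1_000_000
    print("fee paid to 80legs: $", crawl_fee(pages))
    print("data downloaded, GB:", data_downloaded_gb(pages))
    print("transfer cost to volunteers: $", volunteer_transfer_cost(pages))
```

Under these assumptions, roughly half of the $2 page fee corresponds to bandwidth that the volunteers pay for, before any per-hour processing charge is counted.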
But whenever I see something that is nifty combined with the internet, I immediately think "now how will this be used to spam and/or infect people..."
Sounds like a legitimate front for identity thieves, spammers, or even worse... marketers.
I suppose it's easier than running your own botnet.
"I am the king of the Romans, and am superior to rules of grammar!"
-Sigismund, Holy Roman Emperor (1368-1437)
Free games for spare cycles/bandwidth? That's more interesting to me than that spidering stuff. How do I sign up?
Seems like an awfully cheap way to spider millions of pages of porn. It would be worthwhile if Google didn't do it already for free.
This is apparently the service that caused a lot of controversy when people discovered it was somewhat hidden in Digsby.
They are currently recruiting only Flash game developers, but I can imagine this getting as big as advertising is right now. It could even keep newspapers alive. "Do you want to access my free content? Sure, but gimme 10% of your processing power." As long as there is demand for this computing power, we are quite able to harness it.
There is a spider crawling the web that claims to be building a free, downloadable web index for similar purposes.
Torrent link for the index and information at http://www.dotnetdotcom.org/.
Another rationalization to spend more money on my computer hardware next upgrade.
I can see how they might get a fair number of people to donate their spare cycles for this, if the rewards are seen as sufficiently interesting. But are there really a whole bunch of startups (or other companies) that are really champing at the bit to create a new search engine? Other than marketers or malware purveyors, I mean. And do these searches honor robots.txt exclusions?
BTW I took a quick look at 80legs' website in an attempt to get these answers. I came up empty in that regard - so I will comment on how the CEO's hair makes him look like an in-disguise member of the Conehead family. Seriously, what's with the hair?
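On the robots.txt question raised above: the thread does not say whether 80legs honors exclusions, but a well-behaved crawler can make the check with Python's standard `urllib.robotparser`. This is a minimal sketch; the sample rules and the `ExampleSpider` and `OtherBot` user-agent names are hypothetical and have nothing to do with 80legs' actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleSpider
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for agent, url in [
        ("OtherBot", "http://example.com/index.html"),
        ("OtherBot", "http://example.com/private/x.html"),
        ("ExampleSpider/1.0", "http://example.com/index.html"),
    ]:
        print(agent, url, allowed(rules, agent, url))
```

Against a real site, a crawler would typically call `set_url()` and `read()` on the parser to fetch the live file before asking `can_fetch()`.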
Congratulations on the proper use of the word "champing". I hear people use "chomping" in that context all the time, and can't recall the last time I heard the correct word.
The levels of indirection needed to support this system -- distributed clients, incentives for running a client, supply of computing power vs. demand, payment for custom spidering -- make it many things at once and unnecessarily complex, because those things already exist for free and in less complex forms. Many needs are met by the simpler mechanisms and always have been.
Plura gets computer users to supply unused processing power in exchange for access to games, donations to charities, and spyware.
I am surprised that a post containing the words "SETI", "80 legs", "crawling", "computer", "spider", "farm", and "unused power" does not have the plot of Jodie Foster listening to a radio telescope and discovering that evil giant mutant cyborg space spiders are trying to invade Earth and capture humans as batteries.
Is there really a big demand out there for outsourced spidering? I had not heard of this market. They seem to be implying that there are all these start-up outfits out there who have invented really amazing, unique UIs that allow people to find exactly what they need on the Web, and all they need to be successful is access to a searchable index. Huh??
I mean, if you're going to be some kind of start-up search engine or "semantic company" (whatever that means), shouldn't Web spidering be your core competency? If you're going to differentiate yourself in the market, how can you buy spidering as a commodity? How do you expect to attract any investment if you're telling potential investors that you rent your spidering capability from another start-up -- let alone one that uses some kind of half-baked P2P technology to do the work?
Seriously, in a world where Google seems willing to partner with just about anybody who needs any kind of searching for reasonable rates, what is this company's proposed customer base? (And no, the Technology Review article includes no quotes from customers at all.)
This looks like an attempt to monetize a botnet. What, exactly, do the people running their "client" get out of this? Do they know they're sucking bandwidth, and possibly being billed for it, on behalf of someone else?
I run a web spider of sorts. And I know the people who run a big search engine. Reading the web sites isn't the bottleneck. Analyzing the results and building the database is. Outsourcing the reading part doesn't buy you much. If this just did a crawl, it would be of very limited value. That's not what it does.
What they're really doing is offering a service that lets their customers run the customer's Java code on other people's machines in the botnet. That's worrisome. There are some security limits, which might even work. Supposedly, all the Java apps can do is look at crawled pages and phone results home. Right.
This thing uses the Plura botnet. "Plura® is a grid computing system. We contract with affiliates, who are owners of web pages, software, and other services, to distribute our grid computing code. We utilize the excess resources of peripheral computers that are browsing the internet when such browsing leads to a web page of one of our affiliates. That web page has imbedded code that allows the visitor to participate in the grid computing process. We also utilize embedded code in software and other services to allow such participation." Not good.
The main infection vector is apparently the Digsby chat client, which comes bundled with various crapware. The Digsby feature list does not mention that Plura is in their package.
This thing needs to be treated as hostile code by firewalls and virus scanners.
Can we generate a list of applications known to use Plura? Or does one already exist?
It is really easy to make a web crawler in Java. (Look at the java.net package, e.g. java.net.URL and java.net.HttpURLConnection.) I made a decent one by myself in about a week. Okay, so my web crawler only does text/html. No images, no ActiveX, no video. From experience, an average web page is about 10 KB. Now, anyone's specific application will probably be looking for keywords, or else you are just re-creating Google. A keyword crawl would return a LOT less information, but would still require a lot of bandwidth and processing power to do the work. So bandwidth and processing time are what they are selling -- the place where this company's services would be most useful.
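The kind of crawler described above, text/html only and reporting just keyword hits, can be sketched briefly. This is an illustration in Python rather than the commenter's Java, and it is not their code. `fetch` is a hypothetical callable supplied by the caller that returns `(content_type, body)` or `None`, which keeps the sketch free of network code and lets it run on in-memory pages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndTextParser(HTMLParser):
    """Collects <a href> targets and all text nodes from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def crawl(start_url, fetch, keywords, max_pages=100):
    """Breadth-first crawl that returns {url: set of keywords found on it}.

    fetch(url) is a hypothetical callable returning (content_type, body),
    or None when the page cannot be retrieved.
    """
    wanted = {k.lower() for k in keywords}
    seen = {start_url}
    queue = deque([start_url])
    hits = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        result = fetch(url)
        fetched += 1
        if result is None:
            continue
        content_type, body = result
        # Only text/html is processed: no images, video, or other media.
        if not content_type.lower().startswith("text/html"):
            continue
        parser = LinkAndTextParser()
        parser.feed(body)
        text = " ".join(parser.text_parts).lower()
        found = {k for k in wanted if k in text}
        if found:
            hits[url] = found
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return hits


if __name__ == "__main__":
    pages = {
        "http://example.com/": ("text/html", '<a href="/a.html">A</a> welcome'),
        "http://example.com/a.html": ("text/html", "<p>car restoration tips</p>"),
    }
    print(crawl("http://example.com/", pages.get, ["restoration"]))
```

Only the small `hits` mapping is reported back, yet every page still has to be downloaded and parsed, which matches the point that bandwidth and processing time are the real costs.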
- I live the greatest adventure anyone could possibly desire. - Tosk the Hunted
http://www.insuma.de/ offers a similar service
I'm starting a website for car restoration of a specific kind of car. There are a few other sites out there that talk about their own cars or clubs you can join. Some even have a few links to other sites.
I'd love to use a service like this to search for tons of links to all sorts of places on the web - but do it without just copy/paste of someone's links page. I'd rather do my own work with my own tool and not spend tons of time sorting through thousands of Google hits.
This might work for me.