Google Indexing In Near-Realtime
krou writes "ReadWriteWeb is covering Google's embrace of a system that would enable any Web publisher to 'automatically submit new content to Google for indexing within seconds of that content being published.' Google's Brett Slatkin is lead developer of PuSH, or PubSubHubbub, a real-time syndication protocol based on ATOM, where 'a publisher tells the world about a Hub that it will notify every time new content is published.' Subscribers then wait for the hub to notify them of the new content. Says RWW: 'If Google can implement an Indexing by PuSH program, it would ask every website to implement the technology and declare which Hub they push to at the top of each document, just like they declare where the RSS feeds they publish can be found. Then Google would subscribe to those PuSH feeds to discover new content when it's published. PuSH wouldn't likely replace crawling, in fact a crawl would be needed to discover PuSH feeds to subscribe to, but the real-time format would be used to augment Google's existing index.' PuSH is an open protocol, and Slatkin says that 'I am being told by my engineering bosses to openly promote this open approach even to our competitors.'"
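For anyone wondering what "notify the Hub" looks like on the wire: as far as I can tell from the PubSubHubbub 0.3 draft, the publisher just sends a form-encoded POST with hub.mode=publish and hub.url set to the feed that changed. A rough sketch follows; the hub and feed URLs are made up, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical URLs, for illustration only.
HUB_URL = "https://hub.example.com/"
FEED_URL = "https://blog.example.com/atom.xml"


def build_publish_ping(hub_url: str, topic_url: str) -> Request:
    """Build (but do not send) the 'new content' ping a publisher POSTs to its hub."""
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


ping = build_publish_ping(HUB_URL, FEED_URL)
print(ping.get_method(), ping.full_url)
print(ping.data.decode("ascii"))
```

Passing the request to urllib.request.urlopen would actually send it. The hub then fetches the feed and fans the new entries out to its subscribers.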
...someone help me out here. People can still find my articles through google before I see the googlebot hit any new articles I post...how is that possible? How would my pages show up on google before the bot actually crawls them?
Living With a Nerd
What's there to stop spammers from stuffing their crap to Google?
Google's #1 problem, which they appear to have done nothing about, is that over 99.99% of their search results are absolutely useless.
I would actually pay money to have that fixed.
If Google notices your site/blog updates frequently, the bot will come around more often, especially if it's a high page rank site.
How is this any different from sitemaps? Sitemaps are supported by the major search engines and have been in use for years now.
GOTO Subject
If you were blocking sigs, you wouldn't have to read this.
I typed 'google' into the google box on the page and I ended up at this readwriteweb place. GIVE ME MY GOGLLE BKCA!!11!
This strikes me as another solution looking for a problem. Those who would use this don't need it, and those who would need it won't use it.
If Google made a new-site queue, you could request that your URL be put on that queue. Then the Googlebot would not have to find your content from links on old URLs.
The result: your content scanned in seconds, not hours or days.
Yahoo! BINGo!
This sounds a bit like Twitter. Put your content in one hole and it comes out lots of places.
"If a tree falls in the forest and no one is around to hear it, does it make a noise?"
internet era update:
"If a webpage is published on the web and no google spider notices it, does it exist?"
near future update:
"If a thought enters your mind that is not already indexed by google, is it real?"
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
As usual I tried to make a tongue in cheek remark and ended up chewing my tongue. I meant Google’s indexer is so fast. Original posting was made at March 3, 2010 2:09 PM. It was in the index by March 3, 2010 5:08 PM. And it was not even from news.google.com, it is the general web search. Pretty soon Google will tell me that I’m out of milk even before I open the fridge door.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
Keep in mind Google is quickly becoming an all controlling entity.
I have concerns that this technology could expose users to additional threats.
Likely I see it as one more way for Google to corner the search market.
Lastly I ponder the legal implications of a direct tying to a web site's content. What if there is a copyright violation?
Generally I find this to be a dud tech.
Long ago we had to publish to search engines then the crawlers came and life was good.
Again automation is what made things better.
Diving into this without thinking would be suicide.
Often I fear technology more now than I did before.
Simply put I think this is a cryptic way to pillage and trick web developers into forking over additional metrics for page ranking, since more frequent updates, I would assume, get you a higher page rank.
-=[ Who Is John Galt? ]=-
I remember when I worked on this back at the turn of the century; it was called GridIR back then. http://www.gir-wg.org/index.html A subscription-based indexing/search/collection engine.
Amusingly, since this is based on Atom, the client still has to poll. It just has to poll fewer sources. The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.
Also, the "pushsubhub" caches and redistributes the feeds, which means the feed operator no longer sees their own clients.
They don't seem to have addressed the general RSS problem of "server timestamp/ID changed, but content did not". Some RSS feeds get this right; some don't. Reuters is good, but not perfect. Other sites vary; there's a common problem where the RSS feed is provided from multiple servers on a load balancer, and the servers don't coordinate on timestamps and IDs. Twitter is awful. An RSS feed from Twitter appears to change on each poll even when the content has not changed.
Actually determining that RSS content really hasn't changed currently requires computing a message digest on the content. If you're going to aggregate RSS feeds, it's probably necessary to do that.
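A minimal sketch of that digest check, assuming each entry has already been parsed into title, link, and content strings. The function and class names are mine, and SHA-256 is just one reasonable choice of digest. The point is that the server-supplied ID and timestamp are never consulted.

```python
import hashlib


def entry_digest(title: str, link: str, content: str) -> str:
    """Digest of what a reader would actually see; IDs and timestamps are ignored."""
    h = hashlib.sha256()
    for field in (title, link, content):
        # Collapse whitespace so trivial reformatting does not count as a change.
        data = " ".join(field.split()).encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


class SeenEntries:
    """Remembers digests of entries already processed by the aggregator."""

    def __init__(self) -> None:
        self._digests = set()

    def is_new(self, title: str, link: str, content: str) -> bool:
        digest = entry_digest(title, link, content)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True


seen = SeenEntries()
print(seen.is_new("Hello", "http://example.com/1", "First post."))
print(seen.is_new("Hello", "http://example.com/1", "First  post.\n"))
```

The second call differs only in whitespace, so it is treated as unchanged even if the feed served it with a fresh timestamp or ID.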
I updated my site on 15th Feb and today, 4th March, I can see the old links in Google and none of the new ones. It seems that they also need more servers...
please drink more vodak
k thx
It would be unreal how fast spammers would exploit this.
I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
http://www.urbandictionary.com/define.php?term=vodak
The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.
This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.
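To make that concrete, here is roughly what the subscriber side has to do, as I read the 0.3 draft: echo back hub.challenge when the hub verifies the subscription with a GET, then accept the POSTs the hub sends for each update. This is a sketch with the HTTP server stripped away; the class name, method signature, and exact status codes are my own assumptions, not anything the spec names.

```python
from urllib.parse import parse_qs


class Subscriber:
    """Callback logic for one subscription, independent of any web framework."""

    def __init__(self, topic: str) -> None:
        self.topic = topic   # feed URL we asked the hub to watch
        self.updates = []    # raw Atom payloads pushed by the hub

    def handle(self, method: str, query: str = "", body: bytes = b""):
        """Return (status_code, response_body) for a request to the callback URL."""
        if method == "GET":
            # Verification of intent: confirm only subscriptions we asked for.
            params = parse_qs(query)
            mode = params.get("hub.mode", [""])[0]
            topic = params.get("hub.topic", [""])[0]
            challenge = params.get("hub.challenge", [""])[0]
            if mode == "subscribe" and topic == self.topic and challenge:
                return 200, challenge.encode("utf-8")
            return 404, b""
        if method == "POST":
            # Content distribution: the hub pushes the new entries to us.
            self.updates.append(body)
            return 204, b""
        return 405, b""
```

Nothing in there polls. Once the GET handshake succeeds, the subscriber only ever reacts to incoming POSTs.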
Right now Google seems to spend a couple of minutes a day crawling my website, but even once it knows what has changed, it only updates its index and the SERPs once a month or so. I realize larger sites get indexed far more quickly, and often, but this seems like it is only augmenting the crawling of websites with a client site push. I'm unsure how this all relates to the long lag in indexing that data. Any ideas?
Seems to me that push-publishing already is implemented on the web via services like Ping-o-Matic and such. I can't see why a new push-publishing method would be needed since the blog ping works elegantly. Obviously, the system is abused by spammers, but Google's solution would suffer from the same problem too.
Football Odds