Google Indexing In Near-Realtime
krou writes "ReadWriteWeb is covering Google's embrace of a system that would enable any Web publisher to 'automatically submit new content to Google for indexing within seconds of that content being published.' Google's Brett Slatkin is lead developer of PuSH, or PubSubHubbub, a real-time syndication protocol based on Atom, where 'a publisher tells the world about a Hub that it will notify every time new content is published.' Subscribers then wait for the hub to notify them of the new content. Says RWW: 'If Google can implement an Indexing by PuSH program, it would ask every website to implement the technology and declare which Hub they push to at the top of each document, just like they declare where the RSS feeds they publish can be found. Then Google would subscribe to those PuSH feeds to discover new content when it's published. PuSH wouldn't likely replace crawling, in fact a crawl would be needed to discover PuSH feeds to subscribe to, but the real-time format would be used to augment Google's existing index.' PuSH is an open protocol, and Slatkin says that 'I am being told by my engineering bosses to openly promote this open approach even to our competitors.'"
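For anyone wondering what "declare which Hub they push to" looks like in practice: in an Atom feed it is a link element with rel="hub". Here is a minimal Python sketch of how a crawler might pick that up. The feed contents, the example.com URLs, and the find_hubs name are made up for illustration; this is not Google's code.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def find_hubs(atom_xml):
    """Return the href of every top-level <link rel="hub"> in an Atom feed."""
    root = ET.fromstring(atom_xml)
    return [
        link.get("href")
        for link in root.findall(ATOM_NS + "link")
        if link.get("rel") == "hub" and link.get("href")
    ]

# Hypothetical feed: the publisher advertises one hub next to its self link.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link rel="self" href="http://blog.example.com/atom.xml"/>
  <link rel="hub" href="http://hub.example.com/"/>
  <entry><title>First post</title><id>tag:example.com,2010:1</id></entry>
</feed>"""

if __name__ == "__main__":
    print(find_hubs(SAMPLE_FEED))
```

A crawl would still be needed to find the feed in the first place, as the summary says; after that, the subscriber only has to talk to the hub.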
...someone help me out here. People can still find my articles through Google before I see Googlebot hit any new articles I post. How is that possible? How would my pages show up on Google before the bot actually crawls them?
Living With a Nerd
If Google notices that your site or blog updates frequently, the bot will come around more often, especially if it's a high-PageRank site.
How is this any different from sitemaps? Sitemaps are supported by the major search engines and have been in use for years now.
GOTO Subject
If you were blocking sigs, you wouldn't have to read this.
If Google made a new-site queue, you could request that your URL be put on that queue. Then Googlebot would not have to find your content from links on old URLs.
The result: your content scanned in seconds, not hours or days.
Yahoo! BINGo!
This sounds a bit like Twitter. Put your content in one hole and it comes out lots of places.
"If a tree falls in the forest and no one is around to hear it, does it make a noise?"
internet era update:
"If a webpage is published on the web and no google spider notices it, does it exist?"
near future update:
"If a thought enters your mind that is not already indexed by google, is it real?"
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
As usual I tried to make a tongue-in-cheek remark and ended up chewing my tongue. I meant that Google’s indexer is already that fast. The original posting was made on March 3, 2010 at 2:09 PM. It was in the index by 5:08 PM the same day. And that was not even news.google.com; it was the general web search. Pretty soon Google will tell me that I’m out of milk even before I open the fridge door.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
I remember working on this back around the turn of the century. It was called GridIR back then: http://www.gir-wg.org/index.html. A subscription-based indexing/search/collection engine.
Amusingly, since this is based on Atom, the client still has to poll. It just has to poll fewer sources. The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.
Also, the "pushsubhub" caches and redistributes the feeds, which means the feed operator no longer sees their own clients.
They don't seem to have addressed the general RSS problem of "server timestamp/ID changed, but content did not". Some RSS feeds get this right; some don't. Reuters is good, but not perfect. Other sites vary; there's a common problem where the RSS feed is provided from multiple servers on a load balancer, and the servers don't coordinate on timestamps and IDs. Twitter is awful. An RSS feed from Twitter appears to change on each poll even when the content has not changed.
Actually determining that RSS content really hasn't changed currently requires computing a message digest on the content. If you're going to aggregate RSS feeds, it's probably necessary to do that.
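A rough sketch of that digest check in Python, assuming each entry boils down to a title, a link, and a body. The field choice, the whitespace normalization, and the names entry_digest and SeenEntries are my own assumptions, not anything PubSubHubbub or a particular aggregator specifies.

```python
import hashlib

def entry_digest(title, link, body):
    """Hash the content of an entry, ignoring its server-assigned timestamp and ID."""
    h = hashlib.sha256()
    for field in (title, link, body):
        # Collapse runs of whitespace so trivial re-serialization does not change the digest.
        data = " ".join(field.split()).encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

class SeenEntries:
    """Remembers digests of entries already processed."""

    def __init__(self):
        self._digests = set()

    def is_new(self, title, link, body):
        digest = entry_digest(title, link, body)
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True
```

With this, an entry that comes back with a fresh timestamp or ID but the same title, link, and body is reported as not new, which is the load-balancer case described above.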
This was a triumph... I'm making a note here: HUGE SUCCESS!
(For the uninitiated: read the first letters of each sentence downwards.)
Unto the upright there arises light in the darkness...
I updated my site on 15th Feb and today, 4th March, I can still see the old links in Google and none of the new ones. It seems that they also need more servers...
please drink more vodak
k thx
It would be unreal how fast spammers would exploit this.
I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
http://www.urbandictionary.com/define.php?term=vodak
The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.
This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.
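To make that concrete, here is a minimal Python sketch of the subscriber side: the hub first verifies the subscription with a GET carrying hub.mode, hub.topic and hub.challenge, and later delivers updates as POSTs to the same callback URL. The handle_callback function, its arguments, and the example.com topic are hypothetical; a real subscriber would hang this off an HTTP server and would also handle unsubscribes, lease renewal, and signed deliveries.

```python
from urllib.parse import parse_qs

def handle_callback(method, query_string, body, subscribed_topics, inbox):
    """Return (status_code, response_body) for one request the hub sends to the callback."""
    if method == "GET":
        # Subscription verification: echo hub.challenge only for topics we asked for.
        params = parse_qs(query_string)
        mode = params.get("hub.mode", [""])[0]
        topic = params.get("hub.topic", [""])[0]
        challenge = params.get("hub.challenge", [""])[0]
        if mode == "subscribe" and topic in subscribed_topics and challenge:
            return 200, challenge
        return 404, ""
    if method == "POST":
        # Content distribution: the hub pushes the new entries; nothing is polled.
        inbox.append(body)
        return 204, ""
    return 405, ""
```

The subscriber never fetches the feed on a timer here; it only reacts to requests that the hub initiates.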
Seems to me that push-publishing already is implemented on the web via services like Ping-o-Matic and such. I can't see why a new push-publishing method would be needed since the blog ping works elegantly. Obviously, the system is abused by spammers, but Google's solution would suffer from the same problem too.