Google Indexing In Near-Realtime

Posted by kdawson on Thursday March 4, 2010 @04:58AM from the faster-pussycat dept.

krou writes "ReadWriteWeb is covering Google's embrace of a system that would enable any Web publisher to 'automatically submit new content to Google for indexing within seconds of that content being published.' Google's Brett Slatkin is lead developer of PuSH, or PubSubHubbub, a real-time syndication protocol based on Atom, where 'a publisher tells the world about a Hub that it will notify every time new content is published.' Subscribers then wait for the hub to notify them of the new content. Says RWW: 'If Google can implement an Indexing by PuSH program, it would ask every website to implement the technology and declare which Hub they push to at the top of each document, just like they declare where the RSS feeds they publish can be found. Then Google would subscribe to those PuSH feeds to discover new content when it's published. PuSH wouldn't likely replace crawling, in fact a crawl would be needed to discover PuSH feeds to subscribe to, but the real-time format would be used to augment Google's existing index.' PuSH is an open protocol, and Slatkin says that 'I am being told by my engineering bosses to openly promote this open approach even to our competitors.'"
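
As a rough illustration of the discovery step RWW describes, an Atom feed can name its hub with a `<link rel="hub">` element. The sketch below uses only Python's standard library; the feed content, the URLs, and the `find_hubs` helper are made-up examples for illustration, not anything Google has published.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: the hub and feed URLs are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link rel="hub" href="https://hub.example.com/"/>
  <link rel="self" href="https://blog.example.com/atom.xml"/>
</feed>"""


def find_hubs(feed_xml):
    """Return the href of every <link rel="hub"> directly under <feed>."""
    root = ET.fromstring(feed_xml)
    return [
        link.get("href")
        for link in root.findall(ATOM_NS + "link")
        if link.get("rel") == "hub"
    ]


print(find_hubs(SAMPLE_FEED))
```

A crawler that finds such a link once could then subscribe at the hub instead of re-fetching the feed on a timer.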

Maybe I'm just a noob, but... by Pojut · 2010-03-04 05:02 · Score: 3, Interesting

...someone help me out here. People can still find my articles through google before I see the googlebot hit any new articles I post...how is that possible? How would my pages show up on google before the bot actually crawls them?

--
Living With a Nerd
1. Re:Maybe I'm just a noob, but... by NovTest · 2010-03-04 05:15 · Score: 2
  
  Test
  
  --
  This is a temporary sig
2. Re:Maybe I'm just a noob, but... by garcia · 2010-03-04 05:17 · Score: 3, Interesting
  
  My site is by no means something high traffic but Googlebot indexes my pages (and shows them in search results) within three minutes:
  crawl-66-249-65-232.googlebot.com - - [04/Mar/2010:10:33:34 -0600] "GET /current-crime-decline-to-cause-public-safety-cuts HTTP/1.1" 200 47330 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?
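
For anyone who wants to measure this on their own site, here is a minimal sketch that pulls the crawl time and path out of a log line like the one above. It assumes the Apache combined log format; `googlebot_hit` is a hypothetical helper, and matching on the user-agent string alone is only a heuristic, since any client can claim to be Googlebot.

```python
import re
from datetime import datetime

LOG_LINE = (
    'crawl-66-249-65-232.googlebot.com - - [04/Mar/2010:10:33:34 -0600] '
    '"GET /current-crime-decline-to-cause-public-safety-cuts HTTP/1.1" 200 47330 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# Apache combined log format: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hit(line):
    """Return (crawl_time, path) if the line looks like a Googlebot request."""
    match = COMBINED.match(line)
    if match is None or "Googlebot" not in match.group("agent"):
        return None
    crawl_time = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return crawl_time, match.group("path")


print(googlebot_hit(LOG_LINE))
```

Subtracting the publish time from the returned crawl time gives the indexing lag for that page.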
3. Re:Maybe I'm just a noob, but... by Anonymous Monkey · 2010-03-04 05:17 · Score: 1
  
  Google can see the future. Didn't I tell you about that tomorrow?
  
  --
  We are the Borg...
4. Re:Maybe I'm just a noob, but... by Entrope · 2010-03-04 05:25 · Score: 1
  
  Absolutely. With this breakthrough technology, a cutting-edge new media purveyor can ensure that their reportage, opinions and commentary are easily accessible to the general public with a minimal delay. In today's fast-paced Internet, a few minutes' delay can make the difference between being on the breaking edge of news and being a Johnny-come-lately.
  (To be more succinct, PuSH lets bloggers make sure they have the first post.)
5. Re:Maybe I'm just a noob, but... by K. S. Kyosuke · 2010-03-04 05:25 · Score: 4, Funny
  
  I have just found your test comment using Google.
  
  --
  Ezekiel 23:20
6. Re:Maybe I'm just a noob, but... by mmkkbb · 2010-03-04 05:26 · Score: 1
  
  Some blog engines will automatically notify search engines of an updated site map upon publishing new content.
  
  --
  -mkb
7. Re:Maybe I'm just a noob, but... by Pojut · 2010-03-04 05:27 · Score: 1
  
  I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?
  Not really...I generally update my site 4-6 times per week, but when I update it I'm only posting one article a day with the odd site announcement every so often. Maybe I just suck, I don't know, but it seems like it takes a week or two before people start really reading what I write; they always seem to read what I wrote a week or so ago instead of the new content. This happens even if they land on my main page (linked in my sig) rather than on an actual article. ::shrug:: whatever. I average between 100 and 300 people per day, and that is fine by me :-)
  
  --
  Living With a Nerd
8. Re:Maybe I'm just a noob, but... by sfraggle · 2010-03-04 05:32 · Score: 1
  
  I've noticed that when I post a new blog entry on Livejournal, it appears in Google's results within 2-3 minutes. I know that Livejournal has a public feed for all new blog entries across the site, so I assume Google must be indexing this (and presumably others).
  
  --
  were you expecting to see a sig here? perhaps you'd rather see the inside of an ambulance!
9. Re:Maybe I'm just a noob, but... by garcia · 2010-03-04 05:37 · Score: 2, Interesting
  
  maybe I just suck, I don't know, but it seems like it takes a week or two before people start really reading what I write, they always seem to read what I wrote a week or so ago instead of the new content.
  As you write more often (say on a specific time schedule and daily) the people who don't read via RSS (which in my case is the majority of my readers) will learn to make going to your site a part of their daily routine and thus your visits on new material will go up.
  I watched visiting trends, by hour, over the last two years in Google Analytics and picked 7:30 AM and 10:30 AM as the times to post material. It seemed as if most people were checking once in the morning when they got to the office and once at breaktime/lunchtime around 11 AM. To account for some of the time variance seen across those two years I went with 15 minutes earlier than the stats showed. Seems to work for me.
  Good luck.
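
The scheduling rule described above (find the busiest times, then post 15 minutes earlier) is simple enough to sketch. The visit counts below are invented for illustration and `post_times` is a hypothetical helper; real numbers would come from an analytics export.

```python
from datetime import datetime, timedelta


def post_times(visits_by_time, top_n=2, lead_minutes=15):
    """Pick the top_n busiest time slots and shift each earlier by lead_minutes."""
    busiest = sorted(visits_by_time, key=visits_by_time.get, reverse=True)[:top_n]
    lead = timedelta(minutes=lead_minutes)
    shifted = [datetime.strptime(slot, "%H:%M") - lead for slot in busiest]
    return sorted(t.strftime("%H:%M") for t in shifted)


# Made-up visit counts keyed by local time of day.
visits = {"07:45": 120, "09:00": 40, "10:45": 95, "13:00": 30}
print(post_times(visits))
```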
10. Re:Maybe I'm just a noob, but... by Pojut · 2010-03-04 05:48 · Score: 1
  
  Cool, thank you! I'll definitely have a look at that.
  
  --
  Living With a Nerd
11. Re:Maybe I'm just a noob, but... by wizardforce · 2010-03-04 05:50 · Score: 1
  
  It's like an RSS feed for Google. Just like you'd use an RSS feed to keep up with various blogs instead of visiting constantly.
  
  --
  Sigs are too short to say anything truly profound so read the above post instead.
12. Re:Maybe I'm just a noob, but... by truthsearch · 2010-03-04 06:37 · Score: 1
  
  Google's "webmaster tools" already let you set an RSS feed as the sitemap source.
  
  --
  Developers: We can use your help.
13. Re:Maybe I'm just a noob, but... by vux984 · 2010-03-04 06:43 · Score: 1
  
  I watched visiting trends, by hour, over the last two years in Google Analytics and picked 7:30 AM and 10:30 AM as the times to post material. It seemed as if most people were checking once in the morning when they got to the office and once at breaktime/lunchtime around 11 AM. To account for some of the time variance seen across those two years I went with 15 minutes earlier than the stats showed. Seems to work for me.
  Odd that everyone who reads your content is in your timezone. Do you primarily post articles of local interest? Or is there some other local network effect in play here? Or are the two spikes separated by 3 hours simply morning on the east coast, and morning on the west coast? ....
14. Re:Maybe I'm just a noob, but... by garcia · 2010-03-04 06:57 · Score: 1
  
  95% of my content isn't just local, it's hyperlocal. Thanks for asking about this, as I did limit the analysis to those who I put into an "Advanced Segment" where the visitors' region was Minnesota.
15. Re:Maybe I'm just a noob, but... by Jurily · 2010-03-04 07:07 · Score: 1
  
  I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?
  In rare cases like the swine flu panic, 3 minutes can be the difference between fame and obscurity.
16. Re:Maybe I'm just a noob, but... by zonky · 2010-03-04 07:55 · Score: 1
  
  RSS?
17. Re:Maybe I'm just a noob, but... by Anonymous Coward · 2010-03-04 09:35 · Score: 1, Insightful
  
  Oh, wow:
  http://www.google.com/search?q=NovTest+(909599)+test
kinda done now by hey · 2010-03-04 05:05 · Score: 4, Informative

If Google notices your site/blog updates frequently, the bot will come around more often, especially if it's a high page rank site.
1. Re:kinda done now by seanadams.com · 2010-03-04 05:14 · Score: 1
  
  That is still slower, not to mention far less efficient for both parties, than event-driven updates.
2. Re:kinda done now by shird · 2010-03-04 09:54 · Score: 1
  
  1. Go to 4chan/b and post a unique sentence.
  2. Observe how quickly stuff gets posted to that site.
  3. Search for that sentence through Google
  4. Be amazed that Google actually indexes this site.
  
  --
  I.O.U One Sig.
3. Re:kinda done now by dotancohen · 2010-03-04 20:40 · Score: 1
  
  There is no such thing as a high Page Rank site. The name Page Rank is a play on words: for one, it is the inventor's last name (Larry Page). Two, it is on a per-page basis.
  
  --
  It is dangerous to be right when the government is wrong.
Sitemaps? by PhrostyMcByte · 2010-03-04 05:09 · Score: 1

How is this any different from sitemaps? Sitemaps are supported by major search engines and have been in use for years now.
1. Re:Sitemaps? by djsmiley · 2010-03-04 05:16 · Score: 1
  
  that involves the googlebot hitting the site map, or you submitting it manually...
  this is all automatic.
  However, how is this any different from RSS? (except this is designed to be viewed by a machine rather than a human?)
  
  --
  - http://www.milkme.co.uk
2. Re:Sitemaps? by jalefkowit · 2010-03-04 05:27 · Score: 1
  
  The only way a standard RSS reader can find out if a feed has updated is by "polling" the feed periodically. PuSH and similar systems remove the need for this polling by pinging the client directly when something changes.
  
  --
  Read my blog.
3. Re:Sitemaps? by schlesinm · 2010-03-04 05:30 · Score: 1
  
  However, how is this any different from RSS? (except this is designed to be viewed by a machine rather than a human?)
  RSS is a pull technology. I update my blog, which updates my RSS feed and the googlebot goes out and pulls my sitemap (which is my RSS feed on Blogger) and indexes any new pages. This technology sounds like I can ping Google when my site is updated and they can know there is new data for them to pull.
  
  --
  My company home page
4. Re:Sitemaps? by HarrisonFisk · 2010-03-04 05:30 · Score: 1
  
  RSS is pull technology, so the interested server (ie Google) needs to keep polling you asking if you have new content.
  
  PubSubHubbub is push technology. So when you make a change, you submit it to a hub which in turn knows the interested parties that have asked to know about your site and then distributes it to them.
  
  So it is more efficient since there isn't a constant polling and it is faster since there isn't a poll lag.
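
For the publisher's side of that exchange, a minimal sketch: in the PubSubHubbub draft the publisher notifies the hub with a form-encoded POST carrying `hub.mode=publish` and the updated feed's URL in `hub.url`. The hub and feed URLs here are placeholders, `publish_ping` is a hypothetical helper, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URLs for illustration only.
HUB_URL = "https://hub.example.com/"
FEED_URL = "https://blog.example.com/atom.xml"


def publish_ping(hub_url, topic_url):
    """Build (but do not send) the 'new content' notification for a hub."""
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url})
    return Request(
        hub_url,
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = publish_ping(HUB_URL, FEED_URL)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without a network.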
5. Re:Sitemaps? by ircmaxell · 2010-03-04 06:08 · Score: 1
  
  No, this is not a ping technology. The hub actually sends the new data to the recipient. So basically you publish a feed. The hub subscribes to that feed. When you post new content, you ping the hub. The hub then fetches the new data. It then turns around and sends the new data to anyone who's subscribed to the hub. So it saves on two fronts. First, there's no polling of anything anymore (since you tell the hub when it's updated, and the hub sends out the new data when it has it). Second, the load of sending the content to all the subscribers falls to the hub instead of the main server.
  
  --
  If a man isn't willing to take some risk for his opinions, either his opinions are no good or he's no good
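
The flow described above (publisher pings the hub, the hub fetches once, the hub fans the content out) can be sketched as a toy in-memory hub. This is not the real protocol: there is no HTTP here, and `Hub`, `fetch`, and the callback signature are all hypothetical names chosen for the example.

```python
class Hub:
    """Toy hub: one fetch per ping, then delivery to every subscriber."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable: topic URL -> current content
        self._subscribers = {}       # topic URL -> list of delivery callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def ping(self, topic):
        """Called by the publisher when the topic has new content."""
        content = self._fetch(topic)
        callbacks = self._subscribers.get(topic, [])
        for deliver in callbacks:
            deliver(topic, content)
        return len(callbacks)


fetch_log = []
feeds = {"https://blog.example.com/atom.xml": "<entry>new post</entry>"}


def fetch(topic):
    fetch_log.append(topic)          # record how often the origin is hit
    return feeds[topic]


inbox_a, inbox_b = [], []
hub = Hub(fetch)
hub.subscribe("https://blog.example.com/atom.xml", lambda t, c: inbox_a.append(c))
hub.subscribe("https://blog.example.com/atom.xml", lambda t, c: inbox_b.append(c))
delivered = hub.ping("https://blog.example.com/atom.xml")
print(delivered, len(fetch_log))
```

Two subscribers receive the update while the origin server is fetched only once, which is the load-shifting point made above.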
6. Re:Sitemaps? by physburn · 2010-03-04 07:58 · Score: 1
  
  Yes, a good sitemap lists the last-changed date for each page, and Google reads the sitemap for each site first. So the above author is right: the PuSH system is already integrated into sitemaps in the last-modified and change-frequency attributes, and no new protocols or hub systems are needed.
  ---
  Internet Protocols Feed @
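
To make the sitemap approach concrete, here is a small sketch that reads `<lastmod>` dates from a sitemap and lists the pages changed after a cutoff. The sitemap content is invented and `changed_since` is a hypothetical helper; it also assumes date-only `lastmod` values. Note that the crawler still has to fetch the sitemap to see these dates, which is the polling step the earlier replies contrast with PuSH.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap with placeholder URLs.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.example.com/old-post</loc>
    <lastmod>2010-02-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>https://blog.example.com/new-post</loc>
    <lastmod>2010-03-04</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>"""


def changed_since(sitemap_xml, cutoff):
    """Return the <loc> of every page whose <lastmod> is after cutoff."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        lastmod = url.findtext(SITEMAP_NS + "lastmod")
        if loc and lastmod and date.fromisoformat(lastmod) > cutoff:
            changed.append(loc)
    return changed


print(changed_since(SAMPLE_SITEMAP, date(2010, 3, 1)))
```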
Submit, check your page rank, edit by Rogerborg · 2010-03-04 05:10 · Score: 2, Interesting

GOTO Subject

--
If you were blocking sigs, you wouldn't have to read this.
Assume Google makes a new site queue. by bobs666 · 2010-03-04 05:27 · Score: 1

If Google makes a new site queue, then you could request that your URL be put on that queue. Then the Googlebot would not have to find your content from links on old URLs.

The result, your content scanned in seconds not hours or days.
Google indexing in near realtime by Mantis8 · 2010-03-04 05:28 · Score: 1

Yahoo! BINGo!
twitter by hey · 2010-03-04 05:31 · Score: 1

This sounds a bit like Twitter. Put your content in one hole and it comes out lots of places.
1. Re:twitter by loconet · 2010-03-04 05:52 · Score: 1
  
  or like...
  
  --
  [alk]
2. Re:twitter by glwtta · 2010-03-04 09:54 · Score: 1
  
  Yes, exactly, because publish-subscribe did not exist before Twitter.
  
  --
  sic transit gloria mundi
zen saying: by circletimessquare · 2010-03-04 05:34 · Score: 3, Funny

"If a tree falls in the forest and no one is around to hear it, does it make a noise?"
internet era update:
"If a webpage is published on the web and no google spider notices it, does it exist?"
near future update:
"If a thought enters your mind that is not already indexed by google, is it real?"

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
1. Re:zen saying: by Virak · 2010-03-04 06:44 · Score: 1
  
  "If a tree falls in the forest and no one is around to hear it, does it make a noise?"
  Yes.
  
  "If a webpage is published on the web and no google spider notices it, does it exist?"
  Very yes. There are many other channels of communication you can use to give the link to someone else that Google doesn't index. IRC, IM, email, paper (remember that stuff?), and so on. Even if you don't give the link to anyone, it's still not even close to analogous to the original, as a person is still around to see it, namely the site's creator.
  
  "If a thought enters your mind that is not already indexed by google, is it real?"
  Now you're just trying way too hard to be clever, to the point of not making the slightest bit of sense.
2. Re:zen saying: by Deisatru · 2010-03-04 09:00 · Score: 1
  
  "If a tree falls in the forest and no one is around to hear it, does it make a noise?"
  Yes.
  Actually no, a noise is something heard by a person or animal. It makes a sound, but not a noise.
3. Re:zen saying: by Virak · 2010-03-04 10:30 · Score: 1
  
  And it's only censorship if a government does it, right? Excessive pedantry is bad enough, but excessive pedantry with absolutely no basis in reality is particularly annoying.
I just noticed it yesterday. by 140Mandak262Jamuna · 2010-03-04 05:34 · Score: 3, Interesting

Funny, I just posted this yesterday in Panda's Thumb

As usual I tried to make a tongue in cheek remark and ended up chewing my tongue. I meant Google’s indexer is so fast. Original posting was made at March 3, 2010 2:09 PM. It was in the index by March 3, 2010 5:08 PM. And it was not even from news.google.com, it is the general web search. Pretty soon Google will tell me that I’m out of milk even before I open the fridge door.

--
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
1. Re:I just noticed it yesterday. by girlintraining · 2010-03-04 05:40 · Score: 2, Funny
  
  Pretty soon Google will tell me that I'm out of milk even before I open the fridge door.
  It also knows what you did last summer. *ominous look towards the laptop in the corner*
  
  --
  #fuckbeta #iamslashdot #dicemustdie
2. Re:I just noticed it yesterday. by Splab · 2010-03-04 05:45 · Score: 1
  
  Hope it isn't too far away, having my google apps account telling me what I need to restock in the fridge (or even the apartment) would be friggin awesome. Then when cookingwithgoogle.com starts up, just writing the recipe I want could give me a grocery list, instant win.
3. Re:I just noticed it yesterday. by FlyingBishop · 2010-03-04 06:27 · Score: 1
  
  I'd like to put together a kitchen computer with a camera/barcode reader to keep track of what's in my fridge.
  If food came RFID tagged, it would work even better. Of course RFID & food don't mix too well.
4. Re:I just noticed it yesterday. by 140Mandak262Jamuna · 2010-03-04 07:36 · Score: 1
  
  Almost all the food is bar coded. And bar code readers are cheap. Barcode readers with some local memory could be built. Or wi-fi enabled to transmit the bar code to a local computer.
  We should be able to build contraptions where you scan every empty carton you throw in the garbage, and it updates the inventory and emails a shopping list, sorted by the aisle for my local grocery store, thank you, to your cell phone.
  Yeah, if I can think about it, I am sure someone has already done it. I am not exactly the sharpest knife in the drawer you see.
  
  --
  sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
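
The scan-and-sort contraption described above could start as something like the sketch below. Everything here is hypothetical: the UPC codes, the `CATALOG` lookup table, and the aisle numbers are invented, and a real version would need a product database and a store map.

```python
from collections import Counter

# Invented catalog: UPC -> (product name, aisle number in the local store).
CATALOG = {
    "0001": ("milk", 3),
    "0002": ("eggs", 3),
    "0003": ("coffee", 7),
}


def shopping_list(scanned_upcs, catalog):
    """Turn scanned empty cartons into (aisle, name, quantity), sorted by aisle."""
    items = []
    for upc, quantity in Counter(scanned_upcs).items():
        # Unknown codes are kept and placed in aisle 0 so nothing is lost.
        name, aisle = catalog.get(upc, ("unknown item " + upc, 0))
        items.append((aisle, name, quantity))
    return sorted(items)


print(shopping_list(["0003", "0001", "0001"], CATALOG))
```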
5. Re:I just noticed it yesterday. by D Ninja · 2010-03-04 07:57 · Score: 1
  
  This is a very fantastic idea. I would love to have something like this as, when I typically go to the grocery store, I find myself buying the same stinking food again and again (it's tough to have a good imagination when you're in a rush).
  Any Google engineers out there with a penchant for cooking - this would be a great 20% time project.
6. Re:I just noticed it yesterday. by Virtual_Raider · 2010-03-04 12:47 · Score: 1
  
  Hope it isn't too far away, having my google apps account telling me what I need to restock in the fridge (or even the apartment) would be friggin awesome. Then when cookingwithgoogle.com starts up, just writing the recipe I want could give me a grocery list, instant win.
  Some of these services annoy me because I don't want to be a creature of habit in everything I do. I personally want some variety from time to time, and being able to predict individual whims is so far out in the future it's not even scifi, it's plain fantasy. Or maybe there is an overall pattern there, something that says routine for 4 weeks, then 75% chance of a random choice of ingredients from Wed to Fri and 95% on weekends. But if there is, I don't want to know about it and, more importantly, I don't want somebody else to find out either.
  Besides, no recommendation algorithm I've seen so far, including Netflix's, Google's or Amazon's, can propose something new that I would really like. All they do is cater to the fat section of the bell curve, and I can perfectly well find that stuff on my own. For example, whenever I exhaust my reading (I do use the Goog's reader) and I see their "recommendations", it's full of crap, pap and bland. Just because they have classified one site I read as 'humour' they throw in every low-brow knuckledragging lolcat and dailyfailing site they index. Or because I have a Slashdot RSS feed they throw in every cnet and Tech for Illiterates blog (what does that say about /. I wonder...)
  Er, but I digress. My point was... personally I wouldn't want it to nag me about running out of milk, much less pre-order for me. Sometimes I want soy. Or sometimes I just don't want anything but water. And above all, most of the time I hate the feeling of being told what to do :)
  But yeah, I do see the appeal for some, so I'm not saying it's an evil invasion of privacy or anything, just that it wouldn't be for everybody (although I suspect they would still want to profile everyone)
  
  --
  +Raider of the lost BBS
7. Re:I just noticed it yesterday. by cybernanga · 2010-03-04 13:09 · Score: 1
  
  I seriously thought about this once, and realised that the supermarkets will NOT cooperate.
  Ever noticed how supermarkets are forever changing the location of your favourite product? They want you to walk through the whole store because that way you are likely to make additional/unplanned purchases. Having a shopping list sorted by store aisle would defeat their nefarious marketing plans.
  I thought of using user-generated data to create the store maps, (i.e scan the barcode when you grab an item off the shelf) but then realised that GPS is generally not accurate enough when used inside the store.
  If any one has any ideas, I'd love to hear them.
  
  --
  www.Buy-Proxy.com - A "buyer-driven" global marketplace.
8. Re:I just noticed it yesterday. by 140Mandak262Jamuna · 2010-03-05 12:38 · Score: 1
  
  Aisle-wise sorting is just the icing on the cake. Simply having a battery-operated bar code scanner next to the garbage can, so that we can scan what we toss (things that we want to restock), is enough. When you plug the scanner into a smart phone, it dumps the data and the phone has an app that looks up the UPC on the web and converts it to a real shopping list. That is basically the important functionality. You can jazz it up by making the scanner really small and portable so you can carry it to the store and scan items to be added to a wish list, etc. If it takes off, you can get grocery stores to print coupons that could be scanned by these scanners, etc.
  
  --
  sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
I can suz google? by mr.witherspoone · 2010-03-04 06:18 · Score: 1

I remember when I worked on this back at the turn of the century; it was called GridIR back then. http://www.gir-wg.org/index.html A subscription-based indexing/search/collection engine.
It's a pull, not a push by Animats · 2010-03-04 06:38 · Score: 1

Amusingly, since this is based on Atom, the client still has to poll. It just has to poll fewer sources. The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.
Also, the "pushsubhub" caches and redistributes the feeds, which means the feed operator no longer sees their own clients.
They don't seem to have addressed the general RSS problem of "server timestamp/ID changed, but content did not". Some RSS feeds get this right; some don't. Reuters is good, but not perfect. Other sites vary; there's a common problem where the RSS feed is provided from multiple servers on a load balancer, and the servers don't coordinate on timestamps and IDs. Twitter is awful. An RSS feed from Twitter appears to change on each poll even when the content has not changed.
Actually determining that RSS content really hasn't changed currently requires computing a message digest on the content. If you're going to aggregate RSS feeds, it's probably necessary to do that.
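
A minimal sketch of that digest check, assuming entries have already been parsed into dictionaries. The field names and the `content_digest` and `new_entries` helpers are hypothetical; the point is only that the hash covers what a reader sees and ignores the server-assigned ID and timestamp.

```python
import hashlib


def content_digest(entry):
    """Hash only the reader-visible fields, ignoring id and timestamp."""
    digest = hashlib.sha256()
    for field in ("title", "link", "summary"):
        digest.update(entry.get(field, "").encode("utf-8"))
        digest.update(b"\x00")  # separator keeps field boundaries unambiguous
    return digest.hexdigest()


def new_entries(entries, seen):
    """Return entries whose content digest has not been seen; update seen."""
    fresh = []
    for entry in entries:
        digest = content_digest(entry)
        if digest not in seen:
            seen.add(digest)
            fresh.append(entry)
    return fresh


seen = set()
first_poll = [{"id": "a1", "updated": "10:00", "title": "Hello", "link": "/hello", "summary": "Hi"}]
# Same content served by another backend with a different id and timestamp.
second_poll = [{"id": "b7", "updated": "10:05", "title": "Hello", "link": "/hello", "summary": "Hi"}]
print(len(new_entries(first_poll, seen)), len(new_entries(second_poll, seen)))
```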
Re:This makes... by dark_15 · 2010-03-04 06:46 · Score: 1

This was a triumph... I'm making a note here: HUGE SUCCESS!
(For the uninitiated read the letters of the start of each sentence downwards.)

--
Unto the upright there arises light in the darkness...
not that fast for me by vacarul · 2010-03-04 07:04 · Score: 1

I updated my site on 15th Feb and today, 4th March, I can see the old links in Google and none of the new ones. It seems that they also need more servers...
dear Virak: by circletimessquare · 2010-03-04 07:10 · Score: 1

please drink more vodak
k thx

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
1. Re:dear Virak: by just_another_sean · 2010-03-04 07:29 · Score: 1
  
  vodak? Is that like Zima?
  
  --
  Creationist Textbook Stickers Declared Unconstitutional by CowboyNeal
2. Re:dear Virak: by Virak · 2010-03-04 10:37 · Score: 1
  
  I fail to see the merits of doing so, or the relevance to the topic at hand.
3. Re:dear Virak: by logixoul · 2010-03-07 08:50 · Score: 1
  
  Umm... I had deja vu.
Spammers delight! by SlappyBastard · 2010-03-04 07:32 · Score: 1

It would be unreal how fast spammers would exploit this.

--
I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
KNOW YOUR RETARDED INTERNET MEMES by circletimessquare · 2010-03-04 07:37 · Score: 1

http://www.urbandictionary.com/define.php?term=vodak

--
intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
1. Re:KNOW YOUR RETARDED INTERNET MEMES by just_another_sean · 2010-03-04 08:13 · Score: 1
  
  So it's even worse than Zima! Thanks. :-)
  
  --
  Creationist Textbook Stickers Declared Unconstitutional by CowboyNeal
No really, it's push by Wesley Felter · 2010-03-04 07:52 · Score: 1

The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.
This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.
1. Re:No really, it's push by Animats · 2010-03-04 18:17 · Score: 1
  
  This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.
  You're right. Which implies that the subscriber has to have a web server. Somebody will probably try a "web server in the browser" thing for browser-type subscribers.
  To some extent, they've re-invented Usenet.
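
To show what "the subscriber has to have a web server" might look like, here is a bare-bones callback endpoint using Python's standard library. In the PubSubHubbub draft the hub confirms a subscription with a GET carrying a `hub.challenge` value that the subscriber echoes back, and then delivers updates by POST. The port, the `received` list, and the function names are assumptions for this sketch, and a real subscriber should also check that `hub.topic` matches a feed it actually asked for.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

received = []  # raw feed bodies pushed by the hub


def verify(query_string):
    """Answer a subscription check: echo hub.challenge, or refuse with 404."""
    params = parse_qs(query_string)
    challenge = params.get("hub.challenge", [""])[0]
    return (200, challenge) if challenge else (404, "")


class Callback(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = verify(urlparse(self.path).query)
        self._reply(status, body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(self.rfile.read(length))
        self._reply(204, "")

    def _reply(self, status, body):
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Run the callback endpoint until interrupted (not called here)."""
    HTTPServer(("localhost", port), Callback).serve_forever()


print(verify("hub.mode=subscribe&hub.challenge=abc123"))
```

Calling `serve()` starts the listener; a browser-only client cannot do this, which is the limitation noted above.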
Blog Ping by bjourne · 2010-03-04 12:19 · Score: 1

Seems to me that push-publishing already is implemented on the web via services like Ping-o-Matic and such. I can't see why a new push-publishing method would be needed since the blog ping works elegantly. Obviously, the system is abused by spammers, but Google's solution would suffer from the same problem too.

--
Football Odds