Slashdot Mirror


Google Indexing In Near-Realtime

krou writes "ReadWriteWeb is covering Google's embrace of a system that would enable any Web publisher to 'automatically submit new content to Google for indexing within seconds of that content being published.' Google's Brett Slatkin is lead developer of PuSH, or PubSubHubbub, a real-time syndication protocol based on ATOM, where 'a publisher tells the world about a Hub that it will notify every time new content is published.' Subscribers then wait for the hub to notify them of the new content. Says RWW: 'If Google can implement an Indexing by PuSH program, it would ask every website to implement the technology and declare which Hub they push to at the top of each document, just like they declare where the RSS feeds they publish can be found. Then Google would subscribe to those PuSH feeds to discover new content when it's published. PuSH wouldn't likely replace crawling, in fact a crawl would be needed to discover PuSH feeds to subscribe to, but the real-time format would be used to augment Google's existing index.' PuSH is an open protocol, and Slatkin says that 'I am being told by my engineering bosses to openly promote this open approach even to our competitors.'"
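
A minimal sketch of the "declare which Hub they push to" step, assuming the hub is advertised as a `link` element with `rel="hub"` in an Atom feed. The feed content and both URLs below are hypothetical placeholders, not real endpoints.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: the URLs are placeholders for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link rel="self" href="https://blog.example.com/atom.xml"/>
  <link rel="hub" href="https://hub.example.com/"/>
  <entry>
    <title>First post</title>
    <id>https://blog.example.com/posts/1</id>
    <updated>2010-03-04T10:30:00Z</updated>
  </entry>
</feed>
"""


def find_hubs(feed_xml: str) -> list:
    """Return the href of every top-level <link rel="hub"> in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [
        link.get("href")
        for link in root.findall(f"{ATOM_NS}link")
        if link.get("rel") == "hub" and link.get("href")
    ]


print(find_hubs(SAMPLE_FEED))
```

A crawler that finds a hub this way would still need one ordinary crawl to discover the feed, which matches the summary's point that PuSH augments crawling rather than replacing it.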

79 comments

  1. Maybe I'm just a noob, but... by Pojut · · Score: 3, Interesting

    ...someone help me out here. People can still find my articles through google before I see the googlebot hit any new articles I post...how is that possible? How would my pages show up on google before the bot actually crawls them?

    1. Re:Maybe I'm just a noob, but... by NovTest · · Score: 2

      Test

      --
      This is a temporary sig
    2. Re:Maybe I'm just a noob, but... by garcia · · Score: 3, Interesting

      My site is by no means something high traffic but Googlebot indexes my pages (and shows them in search results) within three minutes:

      crawl-66-249-65-232.googlebot.com - - [04/Mar/2010:10:33:34 -0600] "GET /current-crime-decline-to-cause-public-safety-cuts HTTP/1.1" 200 47330 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

      I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?
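
      For anyone who wants to check their own logs the same way, here is a rough sketch that parses a line in the Apache "combined" format and flags Googlebot hits. The function names are made up for illustration, and matching on the user-agent string is only a heuristic, since any client can claim to be Googlebot.

```python
import re
from datetime import datetime

# Apache "combined" format: host, identity, user, [time], "request",
# status, size, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The log line quoted in the comment above.
SAMPLE_LINE = (
    'crawl-66-249-65-232.googlebot.com - - [04/Mar/2010:10:33:34 -0600] '
    '"GET /current-crime-decline-to-cause-public-safety-cuts HTTP/1.1" '
    '200 47330 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_googlebot(entry):
    """Heuristic only: trusts the user-agent string the client sent."""
    return "Googlebot" in entry["agent"]


def crawl_time(entry):
    """Parse the bracketed timestamp, including its UTC offset."""
    return datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")


entry = parse_line(SAMPLE_LINE)
print(entry["path"], is_googlebot(entry), crawl_time(entry).isoformat())
```

      Subtracting the post's publish time from `crawl_time(entry)` gives the crawl delay being discussed here.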

    3. Re:Maybe I'm just a noob, but... by Anonymous+Monkey · · Score: 1

      Google can see the future. Didn't I tell you about that tomorrow?

      --
      We are the Borg...
    4. Re:Maybe I'm just a noob, but... by Anonymous Coward · · Score: 0

      It could be hitting your RSS, or if you use Blogger as a front end, then it's already in Google's database.

      I don't understand why anyone would voluntarily give yet more information to Google without them even asking to steal it from you.

    5. Re:Maybe I'm just a noob, but... by Entrope · · Score: 1

      Absolutely. With this breakthrough technology, a cutting-edge new media purveyor can ensure that their reportage, opinions and commentary are easily accessible to the general public with a minimal delay. In today's fast-paced Internet, a few minutes' delay can make the difference between being on the breaking edge of news and being a Johnny-come-lately.

      (To be more succinct, PuSH lets bloggers make sure they have the first post.)

    6. Re:Maybe I'm just a noob, but... by K.+S.+Kyosuke · · Score: 4, Funny

      I have just found your test comment using Google.

      --
      Ezekiel 23:20
    7. Re:Maybe I'm just a noob, but... by mmkkbb · · Score: 1

      Some blog engines will automatically notify search engines of an updated site map upon publishing new content.

      --
      -mkb
    8. Re:Maybe I'm just a noob, but... by Pojut · · Score: 1

      I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?

      Not really...I generally update my site between 4-6 times per week, but when I update it I'm only posting one article a day with the odd site announcement every so often...maybe I just suck, I don't know, but it seems like it takes a week or two before people start really reading what I write, they always seem to read what I wrote a week or so ago instead of the new content. This happens even if they land on my main page (linked in my sig) rather than on an actual article. ::shrug:: whatever. I average between 100-300 people per day, and that is fine by me :-)

    9. Re:Maybe I'm just a noob, but... by sfraggle · · Score: 1

      I've noticed that when I post a new blog entry on Livejournal, it appears in Google's results within 2-3 minutes. I know that Livejournal has a public feed for all new blog entries across the site, so I assume Google must be indexing this (and presumably others).

      --
      were you expecting to see a sig here? perhaps you'd rather see the inside of an ambulance!
    10. Re:Maybe I'm just a noob, but... by garcia · · Score: 2, Interesting

      maybe I just suck, I don't know, but it seems like it takes a week or two before people start really reading what I write, they always seem to read what I wrote a week or so ago instead of the new content.

      As you write more often (say on a specific time schedule and daily) the people who don't read via RSS (which in my case is the majority of my readers) will learn to make going to your site a part of their daily routine and thus your visits on new material will go up.

      I watched visiting trends, by hour, over the last two years in Google Analytics and picked 7:30 AM and 10:30 AM as the times to post material. It seemed as if most people were checking once in the morning when they got to the office and once at breaktime/lunchtime around 11 AM. To account for some of the time variance seen across those two years I went with 15 minutes earlier than the stats showed. Seems to work for me.

      Good luck.

    11. Re:Maybe I'm just a noob, but... by Pojut · · Score: 1

      Cool, thank you! I'll definitely have a look at that.

    12. Re:Maybe I'm just a noob, but... by wizardforce · · Score: 1

      It's like an RSS feed for Google. Just like you'd use an RSS feed to keep up with various blogs instead of visiting constantly.

      --
      Sigs are too short to say anything truly profound so read the above post instead.
    13. Re:Maybe I'm just a noob, but... by Anonymous Coward · · Score: 0

      Well, isn't it obvious?

      TACHYONS!

      They are working with D-Wave remember.

    14. Re:Maybe I'm just a noob, but... by truthsearch · · Score: 1

      Google's "webmaster tools" already let you set an RSS feed as the sitemap source.

    15. Re:Maybe I'm just a noob, but... by vux984 · · Score: 1

      I watched visiting trends, by hour, over the last two years in Google Analytics and picked 7:30 AM and 10:30 AM as the times to post material. It seemed as if most people were checking once in the morning when they got to the office and once at breaktime/lunchtime around 11 AM. To account for some of the time variance seen across those two years I went with 15 minutes earlier than the stats showed. Seems to work for me.

      Odd that everyone who reads your content is in your timezone. Do you primarily post articles of local interest? Or is there some other local network effect in play here? Or are the two spikes separated by 3 hours simply morning on the east coast, and morning on the west coast? ....

    16. Re:Maybe I'm just a noob, but... by garcia · · Score: 1

      95% of my content isn't just local, it's hyperlocal. Thanks for asking about this, as I did limit the analysis to those who I put into an "Advanced Segment" where the visitors' region was Minnesota.

    17. Re:Maybe I'm just a noob, but... by Jurily · · Score: 1

      I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?

      In rare cases like the swine flu panic, 3 minutes can be the difference between fame and obscurity.

    18. Re:Maybe I'm just a noob, but... by zonky · · Score: 1

      RSS?

    19. Re:Maybe I'm just a noob, but... by Anonymous Coward · · Score: 1, Insightful

      Oh, wow:

      http://www.google.com/search?q=NovTest+(909599)+test

    20. Re:Maybe I'm just a noob, but... by Anonymous Coward · · Score: 0

      Wtf, this is no test?? Gimme a REAL test...!!!11oneone

    21. Re:Maybe I'm just a noob, but... by Anonymous Coward · · Score: 0

      I really don't see a need for something to be any more "real time" than that for someone's blog. Do you?

      Three minutes should be enough for everyone...

    22. Re:Maybe I'm just a noob, but... by Anonymous Coward · · Score: 0

      One advantage, I guess, is that with PuSH your site sends out a notification when there's new content, so they wouldn't have to keep crawling your site. Less bandwidth for you, less cost for them, everybody wins.

  2. Eh... by jo42 · · Score: 0

    What's there to stop spammers from stuffing their crap to Google?

    Google's #1 problem, and which they have done nothing about it appears, is that over 99.99% of their search results are absolutely useless.

    I would actually pay money to have that fixed.

    1. Re:Eh... by Anonymous Coward · · Score: 0

      google is already stuffed with spammers.

  3. kinda done now by hey · · Score: 4, Informative

    If Google notices your site/blog updates frequently, the bot will come around more often, especially if it's a high page rank site.

    1. Re:kinda done now by seanadams.com · · Score: 1

      That is still slower, not to mention far less efficient for both parties, than event-driven updates.

    2. Re:kinda done now by shird · · Score: 1

      1. Go to 4chan/b and post a unique sentence.
      2. Observe how quickly stuff gets posted to that site.
      3. Search for that sentence through Google
      4. Be amazed that Google actually indexes this site.

      --
      I.O.U One Sig.
    3. Re:kinda done now by dotancohen · · Score: 1

      There is no such thing as a high Page Rank site. The name Page Rank is a play on words: for one, it is the inventor's last name (Larry Page). Two, it is on a per-page basis.

      --
      It is dangerous to be right when the government is wrong.
  4. Sitemaps? by PhrostyMcByte · · Score: 1

    How is this any different from sitemaps? Sitemaps are supported by major search engines and have been in use for years now.

    1. Re:Sitemaps? by djsmiley · · Score: 1

      that involves the googlebot hitting the site map, or you submitting it manually...

      this is all automatic.

      However, How is this any different from RSS? (except this is designed to be viewed by a machine rather than a human?)

      --
      - http://www.milkme.co.uk
    2. Re:Sitemaps? by jalefkowit · · Score: 1

      The only way a standard RSS reader can find out if a feed has updated is by "polling" the feed periodically. PuSH and similar systems remove the need for this polling by pinging the client directly when something changes.

    3. Re:Sitemaps? by schlesinm · · Score: 1

      However, How is this any different from RSS? (except this is designed to be viewed by a machine rather than a human?)

      RSS is a pull technology. I update my blog, which updates my RSS feed and the googlebot goes out and pulls my sitemap (which is my RSS feed on Blogger) and indexes any new pages. This technology sounds like I can ping Google when my site is updated and they can know there is new data for them to pull.

    4. Re:Sitemaps? by HarrisonFisk · · Score: 1

      RSS is pull technology, so the interested server (i.e., Google) needs to keep polling you, asking if you have new content.

      PubSubHubbub is push technology. So when you make a change, you submit it to a hub which in turn knows the interested parties that have asked to know about your site and then distributes it to them.

      So it is more efficient since there isn't a constant polling and it is faster since there isn't a poll lag.
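
      To put rough numbers on the "no constant polling, no poll lag" point, here is a small sketch. The 15-minute interval and the update times are invented for illustration and do not come from any real crawler.

```python
import math


def polling_requests(window_minutes, interval_minutes):
    """Requests a poller makes in the window, whether or not anything changed."""
    return window_minutes // interval_minutes


def poll_lags(update_minutes, interval_minutes):
    """Minutes each update waits until the next scheduled poll picks it up."""
    return [
        math.ceil(t / interval_minutes) * interval_minutes - t
        for t in update_minutes
    ]


# Hypothetical day: three posts, published at these minutes after midnight.
updates = [7, 452, 631]
interval = 15

print("poll requests per day:", polling_requests(24 * 60, interval))
print("push notifications per day:", len(updates))
print("lag per update when polling (minutes):", poll_lags(updates, interval))
```

      With push, the number of messages tracks the number of updates, and the lag is whatever the hub takes to fetch and forward.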

    5. Re:Sitemaps? by Anonymous Coward · · Score: 0

      a decent sitemap generator will ping the search engine when the sitemap updates, all the major engines have a 'push' page for this purpose... Any decent sitemap is already automatic. ( google publishes one for people to use here: http://code.google.com/p/googlesitemapgenerator/ )

    6. Re:Sitemaps? by ircmaxell · · Score: 1

      No, this is not a ping technology. The hub actually sends the new data to the recipient. So basically you publish a feed. The hub subscribes to that feed. When you post new content, you ping the hub. The hub then fetches the new data. It then turns around and sends the new data to anyone who's subscribed to the hub. So it saves on two fronts. First, there's no polling of anything anymore (since you tell the hub when it's updated, and the hub sends out the new data when it has it). Second, the load of sending the content to all the subscribers falls to the hub instead of the main server.
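
      The flow described above (publisher pings the hub, hub fetches the feed, hub fans the new data out to subscribers) can be sketched in memory like this. The `Hub` class, the callback signature, and the feed URL are all hypothetical, and real hubs talk HTTP rather than calling Python functions.

```python
class Hub:
    """Toy in-memory hub: no HTTP, no verification, no retries."""

    def __init__(self, fetch):
        self._fetch = fetch        # stands in for an HTTP GET of the feed
        self._subscribers = {}     # topic -> list of callbacks
        self._delivered = {}       # topic -> set of entries already sent

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic):
        """Publisher's ping: the hub fetches the feed and pushes only new entries."""
        entries = self._fetch(topic)
        delivered = self._delivered.setdefault(topic, set())
        new_entries = [e for e in entries if e not in delivered]
        delivered.update(new_entries)
        if new_entries:
            for callback in self._subscribers.get(topic, []):
                callback(topic, new_entries)
        return new_entries


TOPIC = "https://blog.example.com/atom.xml"  # hypothetical feed URL
feed_store = {TOPIC: ["post-1"]}
fetches = []


def fetch_feed(topic):
    fetches.append(topic)  # record how often the origin server is hit
    return list(feed_store[topic])


hub = Hub(fetch_feed)
inbox_a, inbox_b = [], []
hub.subscribe(TOPIC, lambda topic, entries: inbox_a.extend(entries))
hub.subscribe(TOPIC, lambda topic, entries: inbox_b.extend(entries))

hub.publish(TOPIC)                    # first ping delivers post-1
feed_store[TOPIC].append("post-2")
hub.publish(TOPIC)                    # second ping delivers only post-2

print(inbox_a, inbox_b, len(fetches))
```

      Note that the origin feed is fetched once per ping, no matter how many subscribers there are, which is the load-shifting benefit mentioned above.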

      --
      If a man isn't willing to take some risk for his opinions, either his opinions are no good or he's no good
    7. Re:Sitemaps? by physburn · · Score: 1

      Yes, a good sitemap lists the last-changed date for each page, and Google reads the sitemap for each site first. So the author above is right: the push system is already integrated into sitemaps through the last-modified and changed attributes, and no new protocols or hub systems are needed.
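
      For reference, a sketch of a sitemap that carries those per-page dates, assuming the "last modified and changed attributes" mean the `lastmod` and `changefreq` elements of the sitemaps.org format. The page URLs and dates are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (loc, lastmod, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages on a placeholder domain.
sitemap_xml = build_sitemap([
    ("https://blog.example.com/", "2010-03-04", "daily"),
    ("https://blog.example.com/posts/1", "2010-03-03", "monthly"),
])
print(sitemap_xml)
```

      The search engine still has to fetch this file (or be pinged about it) before it learns anything, which is the difference other replies in this thread are pointing at.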

      --
      Internet Protocols Feed @

  5. Submit, check your page rank, edit by Rogerborg · · Score: 2, Interesting

    GOTO Subject

    --
    If you were blocking sigs, you wouldn't have to read this.
  6. give me my google back! by Anonymous Coward · · Score: 0

    I typed 'google' into the google box on the page and I ended up at this readwriteweb place. GIVE ME MY GOGLLE BKCA!!11!

  7. But is it useful? by tpstigers · · Score: 0

    This strikes me as another solution looking for a problem. Those who would use this don't need it, and those who would need it won't use it.

  8. Assume Google makes a new site queue. by bobs666 · · Score: 1

    If Google makes a new site queue, you could request that your URL be put on that queue. Then the Googlebot would not have to find your content from links on old URLs.

    The result, your content scanned in seconds not hours or days.

  9. Google indexing in near realtime by Mantis8 · · Score: 1

    Yahoo! BINGo!

  10. twitter by hey · · Score: 1

    This sounds a bit like Twitter. Put your content in one hole and it comes out lots of places.

    1. Re:twitter by loconet · · Score: 1

      or like...

      --
      [alk]
    2. Re:twitter by glwtta · · Score: 1

      Yes, exactly, because publish-subscribe did not exist before Twitter.

      --
      sic transit gloria mundi
    3. Re:twitter by Anonymous Coward · · Score: 0

      This sounds a bit like Tentacle Porn. Put your content in one hole and it comes out lots of places.

  11. zen saying: by circletimessquare · · Score: 3, Funny

    "If a tree falls in the forest and no one is around to hear it, does it make a noise?"

    internet era update:

    "If a webpage is published on the web and no google spider notices it, does it exist?"

    near future update:

    "If a thought enters your mind that is not already indexed by google, is it real?"

    --
    intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
    1. Re:zen saying: by Virak · · Score: 1

      "If a tree falls in the forest and no one is around to hear it, does it make a noise?"

      Yes.

      "If a webpage is published on the web and no google spider notices it, does it exist?"

      Very yes. There are many other channels of communication you can use to give the link to someone else that Google doesn't index. IRC, IM, email, paper (remember that stuff?), and so on. Even if you don't give the link to anyone, it's still not even close to analogous to the original, as a person is still around to see it, namely the site's creator.

      "If a thought enters your mind that is not already indexed by google, is it real?"

      Now you're just trying way too hard to be clever, to the point of not making the slightest bit of sense.

    2. Re:zen saying: by Deisatru · · Score: 1

      "If a tree falls in the forest and no one is around to hear it, does it make a noise?"

      Yes.

      Actually no, a noise is something heard by a person or animal. it makes a sound, but not a noise.

    3. Re:zen saying: by Virak · · Score: 1

      And it's only censorship if a government does it, right? Excessive pedantry is bad enough, but excessive pedantry with absolutely no basis in reality is particularly annoying.

  12. I just noticed it yesterday. by 140Mandak262Jamuna · · Score: 3, Interesting

    Funny I just posted this yesterday in Pandas Thumb

    As usual I tried to make a tongue in cheek remark and ended up chewing my tongue. I meant Google’s indexer is so fast. Original posting was made at March 3, 2010 2:09 PM. It was in the index by March 3, 2010 5:08 PM. And it was not even from news.google.com, it is the general web search. Pretty soon Google will tell me that I’m out of milk even before I open the fridge door.

    --
    sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
    1. Re:I just noticed it yesterday. by girlintraining · · Score: 2, Funny

      Pretty soon Google will tell me that I'm out of milk even before I open the fridge door.

      It also knows what you did last summer. *ominous look towards the laptop in the corner*

      --
      #fuckbeta #iamslashdot #dicemustdie
    2. Re:I just noticed it yesterday. by Splab · · Score: 1

      Hope it isn't too far away, having my google apps account telling me what I need to restock in the fridge (or even the apartment) would be friggin awesome. Then when cookingwithgoogle.com starts up, just writing the recipe I want could give me a grocery list, instant win.

    3. Re:I just noticed it yesterday. by FlyingBishop · · Score: 1

      I'd like to put together a kitchen computer with a camera/barcode reader to keep track of what's in my fridge.

      If food came RFID tagged, it would work even better. Of course RFID & food don't mix too well.

    4. Re:I just noticed it yesterday. by 140Mandak262Jamuna · · Score: 1

      Almost all the food is bar coded. And bar code readers are cheap. Barcode readers with some local memory could be built. Or wi-fi enabled to transmit the bar code to a local computer.

      We should be able to build contraptions where you scan every empty carton you throw in the garbage, and it updates the inventory and emails a shopping list, sorted by the aisle for my local grocery store, thank you, to your cell phone.

      Yeah, if I can think about it, I am sure someone has already done it. I am not exactly the sharpest knife in the drawer you see.

      --
      sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
    5. Re:I just noticed it yesterday. by D+Ninja · · Score: 1

      This is a very fantastic idea. I would love to have something like this as, when I typically go to the grocery store, I find myself buying the same stinking food again and again (it's tough to have a good imagination when you're in a rush).

      Any Google engineers out there with a penchant for cooking - this would be a great 20% time project.

    6. Re:I just noticed it yesterday. by Virtual_Raider · · Score: 1

      Hope it isn't too far away, having my google apps account telling me what I need to restock in the fridge (or even the apartment) would be friggin awesome. Then when cookingwithgoogle.com starts up, just writing the recipe I want could give me a grocery list, instant win.

      Some of these services annoy me because I don't want to be a creature of habit in everything I do. I personally want some variety from time to time, and being able to predict individual whims is so far out in the future it's not even scifi, it's plain fantasy. Or maybe there is an overall pattern there, something that says routine for 4 weeks, then 75% chance of a random choice of ingredients from Wed to Fri and 95% on weekends. But if there is, I don't want to know about it and more importantly, I don't want somebody else to find out either.

      Besides, no recommendation algorithm I've seen so far, including Netflix, Google or Amazon, can propose to me something new I would really like. All they do is cater to the fat section of the bell curve, and I can perfectly find stuff inside it on my own. For example, whenever I exhaust my reading (I do use the Goog's reader) and I see their "recommendations", it's full of crap, pap and bland. Just because they have classified one site I read as 'humour' they throw in every low-brow knuckledragging lolcat and dailyfailing site they index. Or because I have a Slashdot RSS feed they throw in every cnet and Tech for Illiterates blog (what does that say about /. I wonder...)

      Er, but I digress. My point was... personally I wouldn't want it to nag me about running out of milk, or much less, pre-order for me. Some times I want soy. Or some times I just don't want anything but water. And above all, most of the time I hate the feeling of being told what to do :)

      But yeah, I do see the appeal for some, so I'm not saying it's an evil invasion of privacy or anything, just that it wouldn't be for everybody (although I suspect they would still want to profile everyone)

      --
      +Raider of the lost BBS
    7. Re:I just noticed it yesterday. by cybernanga · · Score: 1

      I seriously thought about this once, and realised that the supermarkets will NOT cooperate.

      Ever notice how supermarkets are forever changing the location of your favourite product? They want you to walk through the whole store because that way you are likely to make additional/unplanned purchases. Having a shopping list sorted by store aisle would defeat their nefarious marketing plans.

      I thought of using user-generated data to create the store maps (i.e., scan the barcode when you grab an item off the shelf), but then realised that GPS is generally not accurate enough when used inside the store.

      If anyone has any ideas, I'd love to hear them.

      --
      www.Buy-Proxy.com - A "buyer-driven" global marketplace.
    8. Re:I just noticed it yesterday. by 140Mandak262Jamuna · · Score: 1

      Aisle-wise sorting is just the icing on the cake. Simply having a battery-operated bar code scanner next to the garbage can, so that we can scan what we toss (things that we want to restock), is enough. When you plug the scanner into a smart phone, it dumps the data and the phone has an app that looks up the UPC on the web and converts it to a real shopping list. That is basically the important functionality. You can jazz it up by making the scanner really small and portable so you can carry it to the store and scan items to be added to a wish list, etc. If it takes off, you can get grocery stores to print coupons that could be scanned by these scanners, etc.
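
      A toy version of that "scan what you toss, get a shopping list" app might look like the sketch below. The product table, the codes in it, and the aisle numbers are invented. A real app would look codes up in a UPC database and would need per-store aisle data from somewhere.

```python
from collections import Counter

# Hypothetical lookup table: code -> (product name, aisle number).
PRODUCTS = {
    "000000000001": ("Milk, 1 gal", 12),
    "000000000002": ("Eggs, dozen", 3),
}


def shopping_list(scanned_codes):
    """Turn scanned codes into (aisle, name, quantity) rows sorted by aisle.

    Codes missing from the table are kept and listed last with aisle None.
    """
    rows = []
    for code, quantity in Counter(scanned_codes).items():
        name, aisle = PRODUCTS.get(code, (f"Unknown item {code}", None))
        rows.append((aisle, name, quantity))
    rows.sort(key=lambda row: (row[0] is None, row[0] or 0, row[1]))
    return rows


scans = ["000000000002", "000000000001", "000000000002", "999999999999"]
for aisle, name, quantity in shopping_list(scans):
    print(aisle, name, quantity)
```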

      --
      sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
  13. This makes... by kenp2002 · · Score: 0, Troll

    Keep in mind Google is quickly becoming an all controlling entity.
    I have concerns that this technology could expose users to additional threats.
    Likely I see it as one more way for Google to corner the search market.
    Lastly I ponder the legal implications of a direct tying to a web site's content. What if there is a copyright violation.

    Generally I find this to be a dud tech.
    Long ago we had to publish to search engines then the crawlers came and life was good.
    Again automation is what made things better.
    Diving into this without thinking would be suicide.
    Often I fear technology more now than I did before.
    Simply put I think this is a cryptic way to pillage and trick web developers into forking over additional metrics for page ranking since more frequent updates I would assume gets you a higher page rank.

    --
    -=[ Who Is John Galt? ]=-
    1. Re:This makes... by dark_15 · · Score: 1

      This was a triumph... I'm making a note here: HUGE SUCCESS!

      (For the uninitiated read the letters of the start of each sentence downwards.)

      --
      Unto the upright there arises light in the darkness...
  14. I can suz google? by mr.witherspoone · · Score: 1

    I remember when I worked on this back in the turn of the century, it was called GridIR back then. http://www.gir-wg.org/index.html A subscription based indexing/search/collection engine.

  15. It's a pull, not a push by Animats · · Score: 1

    Amusingly, since this is based on Atom, the client still has to poll. It just has to poll fewer sources. The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.

    Also, the "pushsubhub" caches and redistributes the feeds, which means the feed operator no longer sees their own clients.

    They don't seem to have addressed the general RSS problem of "server timestamp/ID changed, but content did not". Some RSS feeds get this right; some don't. Reuters is good, but not perfect. Other sites vary; there's a common problem where the RSS feed is provided from multiple servers on a load balancer, and the servers don't coordinate on timestamps and IDs. Twitter is awful. An RSS feed from Twitter appears to change on each poll even when the content has not changed.

    Actually determining that RSS content really hasn't changed currently requires computing a message digest on the content. If you're going to aggregate RSS feeds, it's probably necessary to do that.
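
    A minimal sketch of that digest check, assuming the title, link, and summary are the fields that define "the content". Which fields to hash is a judgment call, and the class and function names here are illustrative.

```python
import hashlib


def entry_digest(title, link, summary):
    """Hash the fields that matter, ignoring server timestamps and IDs."""
    # Collapse whitespace so trivial reformatting does not look like a change.
    parts = [" ".join(part.split()) for part in (title, link, summary)]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class ChangeDetector:
    """Remembers digests of entries already seen by the aggregator."""

    def __init__(self):
        self._seen = set()

    def is_new(self, title, link, summary):
        digest = entry_digest(title, link, summary)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


detector = ChangeDetector()
# Hypothetical entry on a placeholder domain.
first = detector.is_new("Hello", "https://blog.example.com/1", "First post")
# Same content again, as if the feed had only bumped its timestamp.
repeat = detector.is_new("Hello", "https://blog.example.com/1", "First  post ")
edited = detector.is_new("Hello", "https://blog.example.com/1", "First post, edited")
print(first, repeat, edited)
```

    An aggregator would run each polled entry through `is_new` and drop the ones that only changed their timestamp or ID.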

  16. not that fast for me by vacarul · · Score: 1

    I updated my site on 15th Feb and today, 4th March, I can see the old links in Google and none of the new ones. It seems that they also need more servers...

  17. dear Virak: by circletimessquare · · Score: 1

    please drink more vodak

    k thx

    --
    intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
    1. Re:dear Virak: by just_another_sean · · Score: 1

      vodak? Is that like Zima?

      --
      Creationist Textbook Stickers Declared Unconstitutional by CowboyNeal
    2. Re:dear Virak: by Virak · · Score: 1

      I fail to see the merits of doing so, or the relevance to the topic at hand.

    3. Re:dear Virak: by logixoul · · Score: 1

      Umm... I had deja vu.

  18. Spammers delight! by SlappyBastard · · Score: 1

    It would be unreal how fast spammers would exploit this.

    --
    I scream. You scream. I assume that means we're both acquainted with the problem. We proceed.
    1. Re:Spammers delight! by ventmonkey · · Score: 0

      I don't understand how this would be exploited by spammers. This just seems like a way of getting indexed far more quickly (they already do this for a lot of blogs, and hot news topics), I can't see google upping your authority/ranking because you update content a lot.

  19. KNOW YOUR RETARDED INTERNET MEMES by circletimessquare · · Score: 1
    --
    intellectual property law is philosophically incoherent. it is your moral duty to ignore it or sabotage it
    1. Re:KNOW YOUR RETARDED INTERNET MEMES by just_another_sean · · Score: 1

      So it's even worse than Zima! Thanks. :-)

      --
      Creationist Textbook Stickers Declared Unconstitutional by CowboyNeal
  20. No really, it's push by Wesley+Felter · · Score: 1

    The connection between the original source and the "pushsubhub" server really is a "push" connection, but the hub to client connection is not.

    This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.
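
    A rough sketch of what the subscriber side of that POST could look like, using only the Python standard library. It assumes the hub's POST body is an Atom document, and it leaves out everything else a real subscriber needs, such as handling the hub's subscription verification. The handler and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from http.server import BaseHTTPRequestHandler, HTTPServer

ATOM_NS = "{http://www.w3.org/2005/Atom}"
received_titles = []  # filled in as notifications arrive


def titles_from_atom(body):
    """Extract entry titles from an Atom document pushed by the hub."""
    root = ET.fromstring(body)
    return [
        entry.findtext(f"{ATOM_NS}title", default="")
        for entry in root.iter(f"{ATOM_NS}entry")
    ]


class CallbackHandler(BaseHTTPRequestHandler):
    """Receives content notifications; no polling loop anywhere."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            received_titles.extend(titles_from_atom(body))
        except ET.ParseError:
            self.send_response(400)  # body was not well-formed XML
        else:
            self.send_response(204)  # accepted, nothing to return
        self.end_headers()


def make_server(port=0):
    """Bind the callback endpoint locally; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), CallbackHandler)


sample = (
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b"<entry><title>Hello</title></entry></feed>"
)
print(titles_from_atom(sample))
```

    The subscriber has to be reachable over HTTP for this to work, so a plain desktop feed reader cannot take part directly.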

    1. Re:No really, it's push by Animats · · Score: 1

      This isn't right. You can see in section 7.3 of the spec that the hub sends an HTTP POST to each client (subscriber) for each update; there's no polling.

      You're right. Which implies that the subscriber has to have a web server. Somebody will probably try a "web server in the browser" thing for browser-type subscribers.

      To some extent, they've re-invented Usenet.

  21. Crawling, or Indexing? by ventmonkey · · Score: 0

    Right now Google seems to spend a couple of minutes a day crawling my website, but even once it knows what has changed it only updates its index and the SERPs once a month or so. I realize larger sites get indexed far more quickly, and often, but this seems like it is only augmenting the crawling of websites with a client site push. I'm unsure how this all relates to the long lag in indexing that data. Any ideas?

  22. Blog Ping by bjourne · · Score: 1

    Seems to me that push-publishing already is implemented on the web via services like Ping-o-Matic and such. I can't see why a new push-publishing method would be needed since the blog ping works elegantly. Obviously, the system is abused by spammers, but Google's solution would suffer from the same problem too.
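
    For comparison, blog pings of this kind are conventionally XML-RPC calls to a method named `weblogUpdates.ping` that takes the site name and URL. The sketch below only builds and re-parses the request body. The blog name and URL are placeholders, and actually sending it to a ping service is left out.

```python
import xmlrpc.client


def build_ping_payload(site_name, site_url):
    """Serialize a weblogUpdates.ping call without sending it anywhere."""
    return xmlrpc.client.dumps(
        (site_name, site_url), methodname="weblogUpdates.ping"
    )


# Hypothetical blog name and URL.
payload = build_ping_payload("Example Blog", "https://blog.example.com/")
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

    The ping only says "this site changed", so the receiving service still has to come back and fetch the content, unlike the hub model where the new data itself is forwarded.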