Slashdot Mirror


Google Launches Google Sitemaps

Ninwa writes "Google has launched Google Sitemaps. It seems to be a service that allows webmasters to define how often their sites' content is going to change, to give Google a better idea of what to index. It uses some basic XML as the method of submitting a sitemap. More information on the protocol is available in an FAQ. What's most interesting is that Google is licensing the idea under the Attribution/Share Alike Creative Commons license. According to the Google Blog, this is being done '...so that other search engines can do a better job as well. Eventually we hope this will be supported natively in webservers (e.g. Apache, Lotus Notes, IIS).' They even offer an open source client in Python."

223 comments

  1. great interview by professorhojo · · Score: 5, Informative

    for more crunchy detail, here's a great Q&A interview i found with Shiva Shivakumar, engineering director and the technical lead for Google Sitemaps:

    http://blog.searchenginewatch.com/blog/050602-195224

  2. More unabashed Google loving... by sachmet · · Score: 5, Funny

    Everyone else defines a protocol. But apparently Google defines protocools.

    I guess the rest of the world has a long way to go to catch up...

  3. Cool idea by aftk2 · · Score: 4, Interesting

    This is a cool idea, because I've often wondered about being able to "talk" to search engines at a slightly higher level than robots.txt allows.

    For example, a website we launched a couple months ago is primarily images. We played nice - all of the images have legitimate alt tags, and we tried to let the site degrade properly in older browsers (although you really wouldn't get much, in those instances).

    But the biggest problem we had was trying to get the site spidered by Google. It would be, and it would appear in the index, but it would be listed far below sites that linked to it. I don't believe Google likes sites that are primarily images. We populated meta tags with descriptions, but they weren't included; we even tried using hidden text - legitimate, hidden text that would serve as the site's description, but not break the design - but you know how Google feels about those sorts of things. We had to walk a fine line. This'll be nicer.

    --
    concrete5: a cms made for marketing, but strong enough for geeks.
    1. Re:Cool idea by RealityMogul · · Score: 1, Interesting

      I think Google doesn't like NEW sites. I run a high school alumni website, and it was at least 6 months before you could type in the title of the homepage (which was "[Town nobody has heard of] Alumni") into Google and have it listed in the top 10. Once it did start appearing in the top ten, it was still below sites that linked to it. Most of the higher results simply had "Alumni" in them and nothing with the town name. After about 9 months, my site now has the #1 slot for that search string.

    2. Re:Cool idea by Eric+Giguere · · Score: 3, Informative

      Quite right, a new site can be listed in the Google index pretty quickly -- it only took a few days for my latest site to be found by the Googlebot -- but it takes a while before any PageRank gets assigned to its pages, especially if there are no inbound links to the site. No PageRank, no top listing...

      Eric
      Currently at #1 for adsense tips
    3. Re:Cool idea by rehannan · · Score: 4, Informative

      I just put a new site online. About 4 or 5 days after submitting it to google, it was the number one hit when searching for the title of the site.

    4. Re:Cool idea by KillerDeathRobot · · Score: 1

      That's pretty strange, because Google definitely has a sandbox that they keep sites in for 6-8 months.

      Maybe you had few competitors and those competitors (for the search result) were also new.

      --
      Thinkin' Lincoln - a web comic of presidential proportions
    5. Re:Cool idea by hostyle · · Score: 1
      --
      Caesar si viveret, ad remum dareris.
    6. Re:Cool idea by DrSkwid · · Score: 1

      I've had PageRank after 3 weeks on new sites

      --
      There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
    7. Re:Cool idea by Eric+Giguere · · Score: 1

      It depends on how many pages there are that match those keywords. If your title is unique enough, then sure, your site will show up first. But as soon as there's contention for the keywords, don't expect to stay up top.

      Eric
      View your HTTP headers here
    8. Re:Cool idea by brian0918 · · Score: 1

      This isn't going to change your page rank. They'll still be listed where they are currently listed. See their FAQ.

    9. Re:Cool idea by caluml · · Score: 1

      I'm still waiting for my site, Calum to get indexed. The bots come regularly, but nothing in there. If people could just paste the following link onto their pages, Calum, I'm sure everything would be right with Calum and Google. I'm sure Google doesn't hate Calum, and that there is just some misunderstanding.
      :)

    10. Re:Cool idea by singleantler · · Score: 4, Informative

      It's quite common to be high up for matching terms for about a week, then disappear for three months or so. This seems to be normal behaviour for new sites and is nicknamed the Google sandbox and seems to have been confirmed by the patent application recently made public.

      The sandbox is just an artificial lowering, so if you're a match for a rare term you can still be found quite easily.

      --
      "What if they're using IE?" "I've dumbed Mozilla down to cope with it." - BOFH
    11. Re:Cool idea by IGnatius+T+Foobar · · Score: 1

      I just put a new site online. About 4 or 5 days after submitting it to google, it was the number one hit when searching for the title of the site.

      So you're the one who came up with "DISCREET ONLINE PHARMACY" ?? :)

      Seriously though, if there aren't a lot of other sites containing your title, that's easy. If you're one among a dozen or so, not so easy.

      --
      Tired of FB/Google censorship? Visit UNCENSORED!
    12. Re:Cool idea by mgbaron · · Score: 3, Informative

      I think I can shed a little light on this situation as I have had both of the above cases happen to me.

      This is how the system works. Google can index your site very quickly (within a couple of days), if you have an incoming link or submit to their crawler. If your site is well keyword optimized for a fairly rare keyword, it is entirely plausible that it would come up number one fairly quickly.

      What takes a long time is for google to update their pagerank index. This is where your site will sit in the Google Sandbox for a while as Google updates your pagerank.

      In most cases, the site's initial pagerank of 0 will not be enough to take it to the top.

      For a site that we just released about 10 days ago, this was not the case (http://www.jimschlessinger.com/). Since the keywords we were optimizing were fairly rare, it climbed right to the top.

    13. Re:Cool idea by Anonymous Coward · · Score: 0

      For example, a website we launched a couple months ago is primarily images. [...] But the biggest problem we had was trying to get the site spidered by Google.


      Good, you got what you deserve. It's called Hypertext Transfer Protocol for a reason!

    14. Re:Cool idea by daviddennis · · Score: 1

      From what I understand, Google ignores links placed on Slashdot comment pages for exactly this reason :-(.

      Sorry.

      D

    15. Re:Cool idea by caluml · · Score: 1

      Really? (It was intended as a joke btw). Can Google really be that analytical that it has a list of forums that people can post to?

    16. Re:Cool idea by rehannan · · Score: 1

      For those interested... The Site and The Search

    17. Re:Cool idea by Anonymous Coward · · Score: 1, Insightful

      This is a cool idea, because I've often wondered about being able to "talk" to search engines at a slightly higher level than robots.txt allows.

      One of my students wrote his thesis on this a few years ago. He proposed a neat XML based protocol to inform crawlers about (a) directory structure of the web site (b) frequency of change (c) files that had been changed since last visit plus all other things already supported by robots.txt (excludes, time of visit, etc.)

      Alex Lopez-Ortiz

    18. Re:Cool idea by enrico_suave · · Score: 1

      Nowhere in that robots.txt do you see anything about disallowing article.pl for Google... which section do you think is blocking Google from seeing "comments"?

      IIRC, besides excluding them completely, the only other way of taking away the incentive for comment spamming is Google's use of the rel="nofollow" attribute in hyperlinks.

      *Shrug* but I might be off base/wrong about the robot thing.

      e.

      --
      Build Your Own PVR/HTPC news, reviews, &
    19. Re:Cool idea by WeblionX · · Score: 1

      User-agent: *
      Disallow: /article.pl

      --
      (\(\
      (=_=) Bani!
      (")")
    20. Re:Cool idea by enrico_suave · · Score: 1

      doh... I realized that just after I posted it.

      But wouldn't the previous google media partners entry have precedence?

      *Shrug* I need more coffee

      e.

      --
      Build Your Own PVR/HTPC news, reviews, &
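
The precedence question above can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt text below is a hypothetical reconstruction based on the two rules mentioned in this thread (a Mediapartners-Google entry and a catch-all entry), not the actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt modeled on the rules quoted in this thread;
# the real file's full contents are not shown here.
ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /article.pl
"""


def allowed(user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Mediapartners-Google", "Googlebot"):
        print(agent, allowed(agent, "/article.pl"))
```

Under these assumed rules, the named entry applies only to the agent it names, so any other crawler falls through to the `*` entry.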
  4. IIS? by Kewjoe · · Score: 0, Troll

    Good Luck convincing Microsoft to adopt a Google Standard into their enterprise web server product.

    1. Re:IIS? by rpozz · · Score: 1

      It appears that you simply have an xml file on your webserver, and you point google at it. Nothing special and certainly possible with IIS.

      Remember that MS doesn't have a monopoly on web servers, so they can't be dicks about it like they can with everything else.

    2. Re:IIS? by nam37 · · Score: 1

      Quick... Quick... Bash Microsoft! (You know it DOES get old after a while.)

      --
      The two rules for success are:
      1) Never tell them everything you know.
    3. Re:IIS? by Anonymous Coward · · Score: 0

      so are assholes bashing the bashers, and most likely the child of this comment if there ever is one.

    4. Re:IIS? by Kewjoe · · Score: 1

      I'm not bashing Microsoft. Why would Microsoft want Google to gain more market share? If MS is smart they will try to make a competing product and go against this site maps idea. I just was failing to see why Microsoft would jump onboard a product by one of their main competitors in the search engine market.

    5. Re:IIS? by Anonymous Coward · · Score: 0

      Jesus Christ man! It's friggen IIS. What's not to bash?? Seriously.

    6. Re:IIS? by Jarlsberg · · Score: 1

      You mean MSN, don't you? Because this is an indexing thing, not anything that goes on a web server. FYI, Microsoft IIS is quite capable of running Python, so this should work without any problems at all.

  5. Off Topic, Yeah, But I Am So-o-o-o Googled Out by RobotRunAmok · · Score: 0, Offtopic

    Sure, if I don't want to read about Google, don't open the article, I know. But I can't even do a search on the site here without now being reminded it's a "Google Slashdot." (See new button on bottom of this page.)

    The Slashdot promotion of Google is reaching Onion-Level parody status, 'cept it's not a parody, it's real.

    Just... rest it... mebbe a coupla two-three days, but just... rest it.

    1. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by Mant · · Score: 2, Insightful

      Well, maybe if Google stop doing stuff for a while?

      Lots of slashdotters seem interested in what Google does, either because it tends to be neat, or so they can worry about privacy and the info Google potentially has access to.

    2. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by nandhp · · Score: 1

      It's not slashdot's fault that Google does a better job of search than slashdot does itself. Just try to use slashdot search to find that dead voters article(*). (*)Finding said article is an exercise for the reader, therefore the link will not be included.

    3. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by McFadden · · Score: 1
      Get a life dude... If a company does something that's newsworthy, it should be reported. Period.

      The day Slashdot starts editing its news output to appease the petty whining of individuals who don't like seeing anyone get positive press, will be a sad day for us all.

      Google isn't perfect, but they're doing a lot of pretty good things right now. We're all ready to jump on companies when they screw up, so why shouldn't we give them credit when it's due?

    4. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by Anonymous Coward · · Score: 0

      The nice part about having an account on Slashdot is that you can block stories in specific categories (or by specific editors). I went months without realizing Jon Katz was gone thanks to this feature!

      You can also do this for Ask Slashdot, Politics, etc., but I'm still looking for the "Highly Misleading Article Summary" box to uncheck...

    5. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by Anonymous Coward · · Score: 0

      I agree that Google does a ton of cool stuff. But is everything Google does newsworthy?

      It's not about Slashdot "editing its news output." It's about news topic myopia and that editors/submitters/readers are getting comfortable hearing about the same company three times a day.

      Just one observation:
      Google launches something - it's on this site day 1, within the hour.

      Yahoo launches something (Mindset) - it's on this site a month later.

    6. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by Doctor+Crumb · · Score: 1

      New web technology is certainly "Stuff that matters" to webmasters. I am happy that inventions like this are covered on my primary geek news site.

    7. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by Momoru · · Score: 1

      This is hardly new web technology... it's not something like BitTorrent or VOIP or anything like that that is a truly innovative cool technology.

      This is just an idea that allows a corporation to do its job better. It would be like if the Census Bureau asked everyone to mail a list of all of the members of their household to them, and update it when they have kids, so they didn't have to take the time to count door to door.

    8. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by EastCoastSurfer · · Score: 1

      Funny you should say that. I was just doing some research this morning to figure out when I want to short their stock.

    9. Re:Off Topic, Yeah, But I Am So-o-o-o Googled Out by Anonymous Coward · · Score: 0

      Thanks for this. I'm a little bit freaked out that the bitter geeks have not yet turned on Google. A large for-profit company really ought to be the enemy and there is just something really wrong about the open source crowd continuing the standing O.

      Cmon, it is like a brand new Microsoft to attack relentlessly.

  6. fuckedgoogle.com anyone? by Anonymous Coward · · Score: 2, Interesting
    1. Re:fuckedgoogle.com anyone? by lb746 · · Score: 0

      Is this SFW? From the URL I'd assume no, but a simple SFW, or NSFW would be nice...

    2. Re:fuckedgoogle.com anyone? by grandmofftarkin · · Score: 1

      On my initial glance it appears to be safe for work. Apart from the word fucked in the URL there is nothing really bad.

  7. Stop doing that... by SySOvErRiDe · · Score: 1

    ...you're supposed to be evil!

  8. Still in Beta by bogaboga · · Score: 0

    The site is still in Beta! Is it launched while still in beta?

    1. Re:Still in Beta by iolagnm · · Score: 1

      Isn't everything Google still beta?

    2. Re:Still in Beta by Mant · · Score: 1

      Just about all of Google seems to be in beta. While it is nice to get the stuff early, "beta" is a pretty meaningless term as far as Google stuff is concerned.

    3. Re:Still in Beta by timeOday · · Score: 1

      Google never has a final release, they just leave everything in beta, forever - see groups.google.com, maps.google.com, gmail.google.com, froogle.google.com, news.google.com, and who knows what else.

    4. Re:Still in Beta by Anonymous Coward · · Score: 0

      While it is nice to get the stuff early, "beta" is a pretty meaningless term as far as Google stuff is concerned.

      Au contraire. Beta has a very specific meaning to Google. It means that the product stays off the shareholders' balance sheets and is not expected to make money yet. The developers like to keep stuff in beta as long as they can, but when they need a new source of revenue, they figure out how to make money from it and take it out of beta.

  9. Sitemaps abuse? by iolagnm · · Score: 3, Insightful

    It will take a company with enough influence like Google to really promote XML sitemaps, which could lead to a great thing... but what is to stop them from becoming like MetaTags where companies will just flood them with useless keywords and entries in an attempt to get better search rankings?

    1. Re:Sitemaps abuse? by Sancho · · Score: 1

      I'd really like to see a site-influenced system like this that defines areas of news and areas of non-news. I'm tired of searching for multiple terms and getting main articles devoted to one of the terms and sidebar links to one of the others. For example, [insert notebook model] and Linux.. you might get a site like Slashdot where there's an article about the new notebook and many, many sidebar items about Linux.

    2. Re:Sitemaps abuse? by Mant · · Score: 1

      I've not seen anything to suggest sitemaps will improve your ranking, just get you indexed more often.

      If you claim pages update every day, but they don't, it will be pretty easy for the spider to tell. So you could stop the frequent scans if they aren't really needed, if after say a month the supposed daily updates never happened.
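
A minimal sketch of how a spider might notice an exaggerated update claim, assuming it keeps a content hash from each visit. The frequency table, the `slack` factor, and the function names are all hypothetical; nothing here describes how Google actually does it.

```python
import hashlib

# Assumed mapping from a claimed change frequency to the longest gap,
# in days, that an honest page should stay unchanged.
CLAIMED_MAX_DAYS = {"daily": 1, "weekly": 7, "monthly": 31}


def fingerprint(body: bytes) -> str:
    """Hash a page body so two crawls can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def claim_looks_wrong(claimed: str, visits: list, slack: int = 4) -> bool:
    """visits is a list of (day_number, fingerprint) in crawl order.

    Returns True once the page has stayed identical for more than
    slack times the claimed interval.
    """
    limit = CLAIMED_MAX_DAYS[claimed] * slack
    unchanged_since = visits[0][0]
    for (_, prev_fp), (day, fp) in zip(visits, visits[1:]):
        if fp != prev_fp:
            # Content changed, so restart the "unchanged" clock.
            unchanged_since = day
        elif day - unchanged_since > limit:
            return True
    return False
```

A crawler could then fall back to a slower schedule for any URL where `claim_looks_wrong` returns True.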

    3. Re:Sitemaps abuse? by Anonymous Coward · · Score: 1, Informative

      Look at the schema. None of the content of a sitemap file has anything to do with the content of your pages. It is all metadata -- url, last modified time, expected modification frequency, etc -- meant to help crawlers find your pages and be smarter about keeping their index/cache up to date with a minimum expenditure of bandwidth.

    4. Re:Sitemaps abuse? by drnlm · · Score: 2, Interesting
      That's really up to the search engine implementation, isn't it.

      Anyway, a brief look at the proposed format gives very little scope for abuse - you can specify location, change frequency, last modified and a priority, and that's it. The priority is specified as only applying to urls from the same site, so what you can do with it is fairly limited. Overall, it looks written as a set of additional hints to spiders crawling the site.
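
For the curious, the four fields mentioned above (location, last modified, change frequency, priority) can be written out with a few lines of Python. This is a sketch, not Google's generator script: the namespace URI is the one from the later sitemaps.org schema and is an assumption here, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace of the later sitemaps.org 0.9 schema; the 2005 Google
# release used its own namespace URI, so treat this as an assumption.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list) -> str:
    """Serialize page dicts into a sitemap XML string.

    Each dict needs "loc"; "lastmod", "changefreq" and "priority"
    are optional hints and are skipped when absent.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in page:
                ET.SubElement(url, field).text = str(page[field])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://example.com/", "changefreq": "daily", "priority": 0.8},
        {"loc": "http://example.com/about.html", "lastmod": "2005-06-01"},
    ]))
```

Note that nothing in the output describes page content, which is why there is so little room for keyword stuffing.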

    5. Re:Sitemaps abuse? by Anonymous Coward · · Score: 0

      did you even RTFA?

      this is not about rank.

      This allows google (and other search engines if they implement it) to see deeper into your site and for you to communicate to google about how often various pieces of your site are updated.

      robots.txt on steroids.

    6. Re:Sitemaps abuse? by ArbitraryConstant · · Score: 3, Informative

      Well, I noticed two things about it...

      First, the priority is a relative priority, so if you want to set every page to 1.0 (defined as the highest priority) it'll mean nothing.

      Second, if you lie about update frequency or the date of the last update they'll figure it out pretty quick.

      These aren't commands, they're hints.

      --
      I rarely criticize things I don't care about.
    7. Re:Sitemaps abuse? by Jellybob · · Score: 2, Interesting
      Using XHTML this shouldn't be too hard - something along the lines of:
      <goog:index>
      Stuff that actually matters
      </goog:index>
      Advertising crap which people don't care about.
      It's not going to fix the problem on sites which are doing this deliberately, but for those of us who actually care about getting indexed relevantly it would be great.
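
A rough sketch of how an indexer could honor such a marker, assuming the page is well-formed XHTML. The `goog` namespace URI and the `index` element are hypothetical, taken only from the suggestion above; no search engine is known to support them.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, invented for this illustration.
GOOG_NS = "http://example.com/ns/goog"

PAGE = f"""\
<html xmlns:goog="{GOOG_NS}">
  <body>
    <goog:index><p>Stuff that actually matters</p></goog:index>
    <div>Advertising crap which people don't care about.</div>
  </body>
</html>
"""


def indexable_text(xhtml: str) -> str:
    """Collect only the text found inside goog:index elements."""
    root = ET.fromstring(xhtml)
    chunks = []
    for block in root.iter(f"{{{GOOG_NS}}}index"):
        # Join nested text nodes and collapse runs of whitespace.
        chunks.append(" ".join("".join(block.itertext()).split()))
    return " ".join(chunks)


if __name__ == "__main__":
    print(indexable_text(PAGE))
```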
  10. Blog related? by TheOzz · · Score: 1

    My first thought was that this will really help bloggers. But not really, because the blogs updated the most are generally the ones already getting the most traffic anyway?

  11. Can't google sneeze by DrinkingIllini · · Score: 0, Troll

    without /. taking a picture? You'd think that google is Slashdot's first born. Better ease up a bit before little Mac gets jealous.

  12. While they are at it maybe new meta tags? by LWATCDR · · Score: 1

    I would love to see a new meta tag for address to become common. Could make things like Google local even more useful.

    --
    See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    1. Re:While they are at it maybe new meta tags? by Enrico+Pulatzo · · Score: 1

      I'd look into using RSS (Really Simple Syndication) with DC (Dublin Core) metadata.

      I think the "coverage" tag would be probably what you're looking for.

    2. Re:While they are at it maybe new meta tags? by LWATCDR · · Score: 1

      Okay.. Anyone use it?
      Most static sites do not use RSS.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    3. Re:While they are at it maybe new meta tags? by Enrico+Pulatzo · · Score: 1

      "Okay..Anyone use it?"
      Sure. Lots of people use it. Does Google grok it? I dunno.

      "Most static sites..."
      So. Be a trend setter. I encourage you to not think in terms of static and dynamic but in terms of modern and outdated.

      Do modern search engines even care about meta tags anymore? It seems that technorati-style tags may be a more modern equivalent, but they do tend to influence the way you display information on your site. There's a lot of discussion on whether or not these tags are primed to spiral out of control, but I think that being visible they have a much better chance of not being abused like the meta.

    4. Re:While they are at it maybe new meta tags? by LWATCDR · · Score: 1

      If Google local does not use them then they are useless.
      I have to admit that I wonder how a location tag could be abused.
      modern vs static?
      Not every page needs to be dynamic. A page for a restaurant does not tend to need any dynamic content, so RSS seems like overkill.

      --
      See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
    5. Re:While they are at it maybe new meta tags? by Enrico+Pulatzo · · Score: 1

      A location tag could be abused the same way any keywords could: an abuser could map its services to locations that it doesn't belong to. That's why these seemingly new meta tags (the technorati style tags) work: they're visible (which makes it harder to get away with fraudulent location reporting) and they're links, which makes Google consider them important (increases the page rank and such).

      I agree that not every page needs to be dynamic (well, i suppose that really depends on your definition of "dynamic"). Something like a permalink for a blog doesn't _need_ to be generated any other way than a flat file, but that doesn't automatically mean that that url doesn't need syndication.

      That's why I said "modern/outdated" versus "dynamic/static". Modern sites can be a mix while utilizing new technologies to get the word out in a bandwidth-friendly (not to mention machine-interpretation-friendly) manner.

      As RSS content becomes more mainstream (thanks to efforts made by a ton of people, most recently Apple's Safari 2 browser) domains will be expected to have syndication feeds. This will lead to the adoption of those feeds by the search engines (if not Google, then someone wanting to break into an area that's not yet dominated by Google) which will lead to an altogether more efficient and accurate searching experience for all.

  13. Anonymous Whoring by Anonymous Coward · · Score: 0

    About Google Sitemaps

    1. What is Google Sitemaps?

    Google Sitemaps is an experiment in web crawling. Using Sitemaps to inform and direct our crawlers, we hope to expand our coverage of the web and improve the time to inclusion in our index. By placing a Sitemap-formatted file on your webserver, you enable our crawlers to find out what pages are present and which have recently changed, and to crawl your site accordingly.

    Basically, the two steps to participating in Google Sitemaps are:

    1. Generate a Sitemap in the correct format using Sitemap Generator.
    2. Update your Sitemap when you make changes to your site.

    2. Who can use Google Sitemaps?

    Google Sitemaps is intended for all web site owners, from those with a single web page to companies with millions of ever-changing pages. If either of the following are true, then you may be especially interested in Google Sitemaps:

    * You want Google to crawl more of your web pages.
    * You want to be able to tell Google when content on your site changes.

    3. How much does it cost?

    Absolutely nothing. Google has never charged for placement in our search results, and we don't have any plans to do so.

    4. Why is Google doing this?

    In alignment with Google's mission to organize the world's information and make it universally accessible, this collaborative crawling system will allow our crawlers to optimize the usefulness of Google's index for users by improving its coverage and freshness.

    5. How do I get started?

    Read 'How do I create a Sitemap' below to learn about the format for Google Sitemaps. We also have detailed documentation on the Sitemap Protocol and the Sitemap Generator if you'd like to skip straight to the technical details.

    6. Do I need to sign up for a Google Account?

    You don't need an account to generate and submit a Sitemap. However, we encourage you to sign up for an account so that you can track the status of your Sitemaps and view diagnostic information for your submissions. Having an account will not affect your site's ranking within our results. If you already use Gmail, Groups, My Search History, Alerts, or Froogle Shopping List, you already have a Google Account and can sign in with your existing account to use Google Sitemaps.

    7. Will participating in this program change my pages' ranking in Google search results?

    No. Using Google Sitemaps will not influence your PageRank; there will be no change in how we calculate the ranking of your pages.

    Sitemaps

    1. What is the Sitemap Protocol?

    The Sitemap Protocol is a dialect of XML for summarizing sitemap information that is relevant to web crawlers. For each URL, you can include crawl "hints" like the last modified date and approximate change frequency. You can read more about the Sitemap Protocol here.

    2. How do I create a Sitemap?

    There are a number of methods you can use to create a Sitemap. You can use Google's Sitemap Generator, downloadable from Google Code - it's a simple script that generates Sitemaps for basic use cases. You can read more about the Sitemap Generator below. If the Sitemap Generator will not work for your site structure, we encourage you to write your own script for generating Sitemaps and share it with others.

    3. Will Google crawl and index all of the URLs in my Sitemap?

    We don't guarantee that we'll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site's structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future. In most cases, webmasters will benefit from Sitemap submission, and in no case will you be penalized for it.

    4. How do I submit my Sitemap to Google?

    There are a number of ways to submit your Sitemap for inclusion in Google Sitemaps. The Sitemap Generator script can build and submit your Sitemap automatically. If you don't use the Sitemap Generator, you may also submi

  14. Reinventing the wheel? by baadger · · Score: 1

    Ermm this is all well and good and such but isn't a large chunk of this information already made available via Cache-Control and Last-Modified HTTP headers?

    Reminds me of blog pings - what's wrong with using the Referer header? Doing some checking and then fetching the referring page and checking for linkage?

    Has the world gone XML crazy?

    1. Re:Reinventing the wheel? by game+kid · · Score: 1

      <response value="yes" kidding="false" serious="true" />

      --
      You can hold down the "B" button for continuous firing.
    2. Re:Reinventing the wheel? by iabervon · · Score: 1

      Cache-Control only works on a per-request basis, and Last-Modified only works if you decide to check again. They're designed for clients like web browsers, where you only care about whether there have been changes when the user is checking on the site; they're not good for trying to schedule spidering, because many things specify "no-cache" (if the user wants to look at the page, just get a new one) and doing HEAD requests on the whole web for the Last-Modified dates is going to be slow.
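
For reference, the two headers being argued about can be read with the standard library alone. This sketch only parses header values that a crawler has already received; the function names are made up for illustration, and how a real spider would weigh these numbers is left open.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def max_age_seconds(cache_control: str):
    """Return the max-age value from a Cache-Control header, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def days_since_modified(last_modified: str, now: datetime) -> float:
    """Days between a Last-Modified header value and now.

    now must be timezone-aware, since HTTP dates are given in GMT.
    """
    modified = parsedate_to_datetime(last_modified)
    return (now - modified).total_seconds() / 86400


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(max_age_seconds("public, max-age=7776000"))
    print(days_since_modified("Wed, 01 Nov 2000 00:00:00 GMT", now))
```

A header of `no-cache` simply yields None from `max_age_seconds`, which is the case the comment above says makes scheduling hard.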

    3. Re:Reinventing the wheel? by baadger · · Score: 1

      "Cache-Control only works on a per-request basis"

      I believe proxies cache the headers as well, unless must-revalidate is specified, in which case it must do an If-Modified-Since or similar request, which will return fresh headers. How is it not Google's responsibility to remember when to crawl your page anyway? That's exactly what they intend to do.

      "They're designed for clients like web browsers, where you only care about whether there have been changes when the user is checking on the site"

      Why is Google any different to these clients? It still needs to know if content has been updated on a crawl or whether it can safely assume it's unchanged and the index needs to be updated when it does change.

      "they're not good for trying to schedule spidering"

      If an HTML file hasn't been changed since late 2000 and the cache-control says to keep it for 3 months I think we can safely say this page deserves a low priority value and can go to the back of the spider queue.

      "many things specify "no-cache""

      no-cache doesn't affect Google's indexing or acknowledgement of a page's existence, or retrieving these headers.

      "Doing HEAD requests on the whole web for the Last-Modified dates is going to be slow."

      And fully retrieving an XML document is not? Remember that most of the pages Google spiders are already in its index. Only a minority of pages on an established website will be additions to the Googledex on any given crawl and their headers need to be retrieved. For the existing stuff the engine will need to check for modifications anyway, and hence receive headers for these as well. This is nothing new and happens each time the bot visits your website.

      I just feel the current XML Sitemap specification is over engineered and not very human friendly. A changelog (in the form of an XML document) that shows URIs that have been created and modified, and when, would provide much better information for a crawler than a document describing when content may be updated. When Googlebot next visits it would have a central location to find new and updated documents that have appeared since its last crawl. If such a log only listed new documents (not updates), the bot still only has to do a quick scan (If-Modified-Since requests) for pages already in the index. This scan doesn't really need to be prioritised, although it can be, using information from HTTP headers it retrieved on the last crawl.

      IMHO the suggested specification introduces more ways for webmasters to screw with (for better or for worse) the efficiency of indexing on their website. This is the first official way that you've been able to actually screw with Google's interaction with your site, and to me this could make it less certain that casual webmasters, not up to speed with the latest search engine aids, will be found in Google search results.

      Hopefully though, being extendable and under a good license, this draft will be replaced with a format that provides data currently not immediately available to search engines.

  15. Google is IT's Willy Wonka by stlhawkeye · · Score: 5, Funny

    I envision the interior of Google as this huge warehouse full of oversized transistors, data streams with paddleboats, waterfalls of caffeinated beer, chairs contoured like a keyboard key, where diminutive men in green hair sing songs about electrons and logic gates and if you wander into the room where Duke Nukem 3D is being tested you'll be thrown out.

    --
    "I have never won a debate with an ignorant person." -Ali ibn Abi Talib
    1. Re:Google is IT's Willy Wonka by novakreo · · Score: 1

      if you wander into the room where Duke Nukem 3D is being tested you'll be thrown out.

      I think you mean Duke Nukem Forever.
      Duke Nukem 3D is nearly ten years old; I remember playing it at high school on my Pentium-100 laptop.

      --
      O frabjous day! Callooh! Callay!
    2. Re:Google is IT's Willy Wonka by Anonymous Coward · · Score: 1, Informative

      I work at Google.

      I envision the interior of Google as this huge warehouse full of oversized transistors, data streams with paddleboats, waterfalls of caffeinated beer, chairs contoured like a keyboard key, where diminutive men in green hair sing songs about electrons and logic gates and if you wander into the room where Duke Nukem 3D is being tested you'll be thrown out.

      I'm actually 6'2", and my hair is brown, and it's Duke Nukem Forever, but otherwise you're right on.

    3. Re:Google is IT's Willy Wonka by Wee · · Score: 2, Informative
      I never saw any paddleboats, but they did have a keg of beer outside the cafe yesterday. And there's no shortage of caffeinated drinks in the mini-kitchens.

      I can neither confirm nor deny the existence of any secret video game testing rooms.

      -B

      --

      Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.

    4. Re:Google is IT's Willy Wonka by ajs · · Score: 1

      Yeah, you've mostly got the description of the public tour, but if you step off the boat and go searching around, you'll find a room with this 3-story-tall slug, spewing out search results from its back-side! It's a disturbing sight, but I still can't get myself to stop using Google!

    5. Re:Google is IT's Willy Wonka by CrackHappy · · Score: 1

      Do they call the excrement Slurm by chance?

      Woohoo!

      --
      1f u c4n r34d th1s u r34lly n33d t0 g37 l41d Capitalization really works: i helped my uncle jack off a horse
  16. How does this benefit me? by duffer_01 · · Score: 1

    I would be interested to hear what the benefits are to me for doing this. The FAQ indicates that it will not change my Page Rank. Now, I know the page rank does not really mean much to my overall ranking in Google results. However, if I go through the effort of creating and updating this whenever my site changes, how will it benefit me?

    1. Re:How does this benefit me? by Eric+Giguere · · Score: 4, Insightful

      It benefits you because:

      • Google will hopefully crawl your frequently-changing pages more often
      • Conversely, Google won't crawl other pages as often, saving your bandwidth
      • Google will find pages that it wouldn't normally find just by following links

      Also, you wouldn't necessarily have to maintain more than one sitemap. You could use XSLT to create the sitemap.html file for your site from the XML file you create for Google. In fact, wouldn't it be nice for Web authoring tools to do this automatically for you?
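
      The same single-source idea can be sketched without XSLT. The snippet below is a stand-in technique, not the XSLT approach itself: it uses Python's standard XML parser to turn a sitemap into an HTML link list. The sample URLs, the namespace URI, and the function name `sitemap_to_html` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed namespace URI; use whatever your sitemap file actually declares.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical two-page sitemap, for illustration only.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about.html?a=1&amp;b=2</loc></url>
</urlset>
"""


def sitemap_to_html(sitemap_xml: str) -> str:
    """Render every <loc> in the sitemap as an item in an HTML list."""
    root = ET.fromstring(sitemap_xml)
    items = []
    for loc in root.findall("sm:url/sm:loc", NS):
        href = escape(loc.text.strip(), quote=True)
        items.append(f'  <li><a href="{href}">{href}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


html_list = sitemap_to_html(SITEMAP_XML)
print(html_list)
```

      A real sitemap.html would want page titles rather than bare URLs, which the XML format does not carry, so some extra source of titles would be needed.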

      Eric
      Make Easy Money with Google: The Blog (powered by blojsom)
    2. Re:How does this benefit me? by DigitalRaptor · · Score: 2, Interesting

      Because when you launch a new site, or new section of your site, you create the site map and notify Google, rather than hoping some day they'll follow a link somewhere and come spider your site.

      Google immediately knows that the site exists, immediately knows how many pages there are, how often they are supposed to change, AND what priority I place on them, so out of my 150 pages, the 10 I want spidered first are labeled as higher priority.
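
      For anyone wondering what such a file looks like, here is a minimal Python sketch that writes one. The element names (urlset, url, loc, changefreq, priority) follow the published protocol; the namespace URI, the example pages, and the function name `build_sitemap` are assumptions for illustration, and this is not Google's own generator script.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; check the protocol documentation for the current one.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, changefreq, priority)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        # Priority is a hint relative to other pages on the same site.
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages: one to be spidered first, one rarely-changing archive.
pages = [
    ("http://www.example.com/products.html", "daily", 1.0),
    ("http://www.example.com/archive/2004.html", "yearly", 0.2),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

      The values are hints from the webmaster; how much weight a crawler gives them is up to the crawler.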

      This makes total sense to me.

      --
      Lose Weight and Feel Great with Isagenix
    3. Re:How does this benefit me? by Anonymous Coward · · Score: 0

      If you want an automatic generator, here's mine: http://www.elifulkerson.com/projects/gensitemap.php No XSLT done yet; I'm probably going to whip that up next.

  17. Creative Commons Meme by broward · · Score: 2, Informative

    It's not surprising that Google is using a Creative Commons license. The meme has been steadily gaining strength for over a year.

    http://www.realmeme.com/miner/preinflection.php?startup=/miner/preinflection/creativecommonscontentDejanews.png

    1. Re:Creative Commons Meme by Anonymous Coward · · Score: 0

      Geez, the Creative Commons people are pretentious enough without throwing "meme" into the mix...

    2. Re:Creative Commons Meme by Anonymous Coward · · Score: 0

      *Bzzzzzzzzt!!* Oh, so sorry. You said "meme."

      May we have the next latte-drinking buffoon, please?

  18. Hidden jab at Yahoo? by solomonrex · · Score: 1

    "According to the Google Blog, this is being done '...so that other search engines can do a better job as well."

    I love the fact that they're saving us all a lot of time by giving Yahoo! access to this, so we don't have to wait for them to create their own version...

    1. Re:Hidden jab at Yahoo? by m85476585 · · Score: 1

      I thought Yahoo used Google's search (or at least its technology). Maybe not.

  19. How is this a win-win? Here's how.... by doublem · · Score: 1

    This sounds like a really cool idea.

    Livejournal.com has had a number of problems with Google, and often just plain outright bans them from spidering the site. Part of the problem is that all the registered users have their journals at journalname.livejournal.com as well as livejournal.com/users/journalname. This means indexing the journals for registered users doubles the load on their server farm!

    With something like this, livejournal would be able to define exactly how often the indexing process occurs, and could control which version of the URL is indexed.

    I assume issues like this are far from unique.

    This is a win-win. Google doesn't have to have its spiders crawl sites as often, server load on the various sites is reduced, and indexing frequency is in line with how often the webmaster wants the site to be indexed.

    And the licensing means that, hopefully, the same XML file will end up being good for multiple search engines!

    Very cool technology. Hopefully it's also highly abuse proof. I'd hate to see the results of something like this being used by the "Search engine optimization" firms.

    --
    "Live Free or Die." Don't like it? Then keep out of the USA
  20. No more messing around with index hacking... by CardiganKiller · · Score: 1



    Do you n3ed V1agra or Sialis? We have the best and af728 most potent types fo...

    1. Re:No more messing around with index hacking... by CardiganKiller · · Score: 1

      oh dear God, my XML joke above the first post didn't display because I forgot to use the right tags. mod me down into a deep dark hole where I have nothing left but my shreds of shattered humor.

    2. Re:No more messing around with index hacking... by m85476585 · · Score: 1

      ues teh perveiw buuton beofer sbumting!

    3. Re:No more messing around with index hacking... by CardiganKiller · · Score: 1

      Ko!

  21. robots.txt by shmlco · · Score: 1

    It's too bad they couldn't figure out a way to add additional keywords to robots.txt (without breaking it). Now one needs to create both files for a site to index properly.

    --
    Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
    1. Re:robots.txt by Seanasy · · Score: 1

      Google wants this sitemap functionality to make it into the web server itself. So, it looks like they're opting for the long-term solution.

    2. Re:robots.txt by Sjobeck · · Score: 0

      Indeed.

      Touché.

  22. Google Evil Index by yotto · · Score: 5, Funny

    In other news, the Google Evil Index went down 3.2 points today, and is currently at 13.8, the lowest it's been since right before the beta rollout of Google Web Accelerator.

  23. Lotus Notes? by Blakey+Rat · · Score: 1

    Somebody's using Lotus Notes as a webserver? May God have mercy on their souls.

    (The submitter probably meant Lotus Domino, which is still a bad webserver, but not nearly as bad as Notes would be.)

    1. Re:Lotus Notes? by circusboy · · Score: 1

      They may have really meant 'notes,' this has been seen in the wild... the person who I knew who worked there lamented about it quite a lot. but it is done... sad to say.

      --
      -- it's ridiculous how many people misspell ridiculous... (damn, damn, damn...)
    2. Re:Lotus Notes? by Displaced+Cajun · · Score: 1

      Notes/Domino whatever... it's still Lotus Notes. Domino is just a server process that runs on the Notes server.

      Whatever, it's still one of the most secure platforms to run a web server on. No gaping holes like that MS IIS crapola. I think you would be highly surprised at the number of corporate web sites that run a Domino server. While they don't have the HUGE numbers that other platforms do, they also don't have to worry about security or exploits either.

      --
      Executive ability is deciding quickly and getting someone else to do the work. --John G. Pollard
  24. Information already available to Google? by N3Roaster · · Score: 1

    Okay, I've read the article but I guess I don't really get it. Why do we need an XML sitemap to give Google this information? Does this provide enough of an advantage over the unsupported and obsolete revisit-after meta? As for when things were last changed, wget seems to be able to figure that out just fine already. I'm guessing that it can be used to quickly inform a search engine that new pages exist on the site and I can imagine some nice things being possible to end users with the appropriate browser patches, but strictly from a search perspective, why is this needed? (Honest questions, I want to know if this is something I should have on my sites.)

    --
    Remember RFC 873!
    1. Re:Information already available to Google? by __aaqcxr690 · · Score: 1

      To me, this seems more like a way for the google-monster to eat up the web just a bit quicker than the rest. When you update, they update, because now they have an XML feed on your site. I'd like to think this will reach out into blogger land and really increase the pages they cover.

    2. Re:Information already available to Google? by cobrabyte · · Score: 1

      That's because the revisit-after 'meta' is as good as an urban legend. It simply doesn't matter anymore ... if it ever did. -c

    3. Re:Information already available to Google? by ArbitraryConstant · · Score: 1

      It seems there are two goals:

      -Get preferences from admins (priority, approximate update frequency).
      -Get metadata (time of last update)

      They can tell most of that just by downloading the page regularly but with 8 billion pages it's probably pretty hard to do every one of them with any frequency and most of them probably change a few times a year if that.

      Now they can tell if a page has changed and whether it's likely to change in the future with a few kb of gzipped xml instead of megabytes of HTML.
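
      To make that concrete, here is a minimal Python sketch of a crawler using the lastmod values to decide which pages are worth fetching again. It assumes plain YYYY-MM-DD dates (the format also allows full timestamps, which this sketch truncates to the date), and the namespace URI, sample URLs, and the function name `urls_changed_since` are assumptions for illustration rather than anything Google has described.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Assumed namespace URI; use whatever the sitemap file actually declares.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical three-page sitemap, for illustration only.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/old.html</loc><lastmod>2005-01-10</lastmod></url>
  <url><loc>http://www.example.com/new.html</loc><lastmod>2005-06-02</lastmod></url>
  <url><loc>http://www.example.com/unknown.html</loc></url>
</urlset>
"""


def urls_changed_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod is after last_crawl.

    Pages without a lastmod are included too, since nothing is known
    about them and they have to be checked the old-fashioned way.
    """
    changed = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) > last_crawl:
            changed.append(loc)
    return changed


to_fetch = urls_changed_since(SITEMAP_XML, date(2005, 5, 1))
print(to_fetch)
```

      One small XML download replaces a request per page for everything that has not changed, which is the saving described above.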

      They've opened the format, so it seems likely that everyone else will be supporting it soon. It's too much of a competitive disadvantage to anyone that doesn't take advantage of it, and other protocols are less likely to be supported by any given site because this is first, open, and simple.

      --
      I rarely criticize things I don't care about.
    4. Re:Information already available to Google? by Daagar · · Score: 1
      Does this provide enough of an advantage over the unsupported and obsolete revisit-after meta?

      Wouldn't using something that isn't classified as 'unsupported and obsolete' automatically count as enough of an advantage?

    5. Re:Information already available to Google? by N3Roaster · · Score: 1

      In that I meant advantage to Google. There are lots of suckers using that tag, even though nobody supports it. If Google did, it's a lot of information available with no additional burden on people with Web sites.

      Anyhow, thanks to all who provided good information above.

      --
      Remember RFC 873!
  25. Hi by Anonymous Coward · · Score: 0

    Every single time Google is mentioned someone posts how tired they are of Google stories. We know you're tired of Google stories, slashdot editors know you're tired of Google stories, hell's bells, my fucking cat knows you're tired of Google stories.

    Why don't you do everyone a favor and shut the fuck up about it already?

  26. great idea by utexaspunk · · Score: 1

    If people use this, it will likely remove much redundancy from google's indexing processes, possibly freeing up bandwidth and processing power in their datacenters for other projects like more web-based applications...

    1. Re:great idea by seweso · · Score: 0

      I think it gives them more time to spend on pages that actually changed! This is how it works:

      1. You submit new content to your site
      2. Your content-management system updates the site-map and informs Google that something has changed
      3. Your new page is indexed (sooner)
      4. People can find the new page (earlier)
      5. Profit

  27. Or maybe another hidden use... by 823723423 · · Score: 4, Insightful

    Navigation is sometimes the hardest part on the internet. A tree structure is sometimes the second easiest way of searching/browsing for information (1st being keyword searching). So maybe if more web designers set up server-side solutions, it will lower the burden on web designers. More importantly, it would move navigation away from web designers to users, just as Google moved content from web designers onto searchers. So instead of overburdening web servers like this Firefox extension (Firefox extension with screenshot), which automatically generates a sitemap by crawling a site, sites can access a sitemap using a favicon.ico-like convention or a link rel="sitemap.rdf or sitemap.xml" protocol, just as Netscape NAVIGATOR originally proposed a while back. I think web designers should pay attention - at least those that don't use Flash for their whole site. The web is slowly becoming a database of content rather than style. See the Webmonkey (Wired) article on the Netscape sitemap feature (Sitemap RDF) or the sitemap slide (slide from seminar).

  28. Re:How is this a win-win? Here's how.... by Skater · · Score: 1

    I agree. I used to host (publicly accessible) Mailman archives, and once a month, Google would come through and scan every message. My bandwidth usage that day was at least 10 times what it was on other days, but I wanted the messages to be searchable. Using this to set them to "archive" so they'd only be scanned a couple times a year would've been great.

  29. Marketplace of Ideas by Doc+Ruby · · Score: 1

    "Google is licensing the idea under the Attribution/Share Alike Creative Commons license. "

    And I'm willing to license my idea, "better search engines with better user interfaces", to Google, for a modest sum.

    --
    make install -not war

  30. Next thing you know... by Virtual+Karma · · Score: 1, Interesting

    And the next thing you know will be Google launching specs on web design and then content. Who will comply? well.. anybody who wishes to be indexed by Google. That is 100% of the website owners. And thus Google will control the design, content and other things... HELP... they are taking over the internet

    This might be marked as troll... but think about it. Isn't it possible?

    1. Re:Next thing you know... by Seanasy · · Score: 1
      This might be marked as troll...

      Which is a pretty good clue that it is a troll.

      but think about it. Isn't it possible?

      No, no it isn't possible.

    2. Re:Next thing you know... by Jellybob · · Score: 1

      No, it's not really possible.

      If Google starts doing crap like that, designers and developers aren't going to go along with it, at which point Google's usefulness drops dramatically, and all their users go to their competitors.

    3. Re:Next thing you know... by Momoru · · Score: 1

      What motivation would designers have to NOT do it? If Google is the most popular search engine, and you want to keep being searched, you have to keep up with what they want. It's like this: just because I may not like how MS embeds its browser in the OS, or its anticompetitive practices, doesn't mean my company can stop producing software for Windows (and make a profit).

    4. Re:Next thing you know... by TuringTest · · Score: 3, Informative

      And the next thing you know will be Google launching specs on web design and then content.

      As long as everyone can freely and voluntarily use these specs without having to pay anything, how is this a bad thing?

      --
      Singularity: a belief in the "God" idea with the "demiurge" relation inverted.
    5. Re:Next thing you know... by caluml · · Score: 1

      I've often thought that Apache could do similar things. If, after a big security hole was found in Apache, they put out a non-vulnerable version that broke IE browsing but didn't affect Firefox, it could force a lot of people over. Of course, it would have to be done carefully, like other companies do it (the Opera/MSN debacle), so people didn't notice.

      And what is this about?:
      Slashdot requires you to wait 2 minutes between each successful posting of a comment to allow everyone a fair chance at posting a comment.

      It's been 10 minutes since you last successfully posted a comment

    6. Re:Next thing you know... by dema · · Score: 0, Redundant

      This might be marked as troll... but think about it. Isn't it possible?

      Whether or not it is possible has nothing to do with the fact that the post is a troll (:

    7. Re:Next thing you know... by That's+Unpossible! · · Score: 1

      And the next thing you know will be Google launching specs on web design and then content. Who will comply? well.. anybody who wishes to be indexed by Google. That is 100% of the website owners. And thus Google will control the design, content and other things... HELP... they are taking over the internet

      This might be marked as troll... but think about it. Isn't it possible?


      No. Because Google needs our pages indexed by its robots more than we need Google to index us.

      Sonny, let me tell you 'bout a time before Google. There was Yahoo! and AltaVista and tons of others.

      Oh wait, that time is now. Just because Google is the top dog right now doesn't mean they can stop competing and start dictating terms of our surrender. Imagine Google rolls out these "requirements," and 20% of the sites currently indexed comply. Now Google has a fifth of their previously indexed pages, and instantly their search engine becomes useless to most people.

      --
      Ironically, the word ironically is often used incorrectly.
    8. Re:Next thing you know... by Nimey · · Score: 1

      More like paranoid ranting. This was only modded up because of the current "Google is an evil corporation" slashthink.

      --
      Hail Eris, full of mischief...

      E pluribus sanguinem
    9. Re:Next thing you know... by phidipides · · Score: 4, Insightful

      And thus Google will control the design, content and other things... HELP... they are taking over the internet

      Nice. Google proposes a way to help web site administrators have a bit more control over how their site is perceived by a search engine, releases this proposal under an open source license, and at least a few people on slashdot accuse them of (*pinky to corner of mouth*) taking over the internet.

      Most of Google's recent actions have been good things -- sponsoring open source developers for the summer, proposing ways for site administrators to provide additional info about their site, and implementing a "nofollow" option to prevent spammers trying to increase their page ranking. However, if they constantly get criticized and second-guessed for doing good things, what incentive do they have to continue? If you give a charity $20 and they criticize you for not giving them $30, are you ever going to give anything to that charity again?

      Let's give Google the benefit of the doubt. Just like a person, they'll probably make some mistakes, but like a person I'll give them the benefit of the doubt until they prove me wrong. Some corporations do actually do good things and still manage to be successful, and in those cases they should be supported, not attacked.

    10. Re:Next thing you know... by utexaspunk · · Score: 1

      Perhaps they created this with the hope that it would become an open de-facto standard before microsoft made a proprietary one.

    11. Re:Next thing you know... by Darkman,+Walkin+Dude · · Score: 1

      So long as everyone can freely and voluntarily use internet explorer without having to pay anything, how is this a bad thing?

      Okay, here's how it goes, the majority of searches on the internet are done via google, therefore there is a massive incentive to comply with whatever google comes up with next, for anyone that ever wants their site to be seen. This is called leveraging a strong market position, and could border on a monopoly style abuse. And just like MS, there will never be any public outcry, because the viewing public neither knows nor cares. All any web designer can do is dance to whatever tune google plays, just like they still have to dance to the tunes that IE plays. Mmmkay?

      For what it's worth, I don't see how this system is effectively any better than what already exists; I mean, Google already uses pagerank algorithms within sites, the number of links to and from a page defines its importance, and the revisit-after tag does something similar, although it is deprecated.

    12. Re:Next thing you know... by Darkman,+Walkin+Dude · · Score: 1

      The point being that web site admins won't have any choice but to implement this system. This is not a good thing.

    13. Re:Next thing you know... by Anonymous Coward · · Score: 0

      Excellent point. In fact, this proposal strikes me as being very much like robots.txt, which has single-handedly brought down the internet.

      In seriousness, web site admins will always have a choice, Google has released this specification under an open source license, and as a web admin I can't see any harm in it. If other search engines like this proposal they will probably adopt it as well, making it vastly easier for sites to indicate what are their most important pages. If you or any other web admin think it's evil, don't use it, but don't attack a company who is implementing it in good faith and has a history of doing the right thing unless you have a legitimate reason for doing so. And no, "web site admins won't have any choice but to implement this" is not a good reason.

    14. Re:Next thing you know... by avail4one · · Score: 1

      hmmm. sounds like internet is going more like tv to me. you have to push to get in the search engine, it isn't being pulled anymore :-)

      of course there will be good jobs for internet programming directors. hmmm like WKRP.

      btw,

      I have a FREE Google Sitemap Validation service at http://nodemap.com/

      I know there are excellent web based XML validators available, but I built this software specifically for validating your sitemap.xml *or* compressed sitemap.xml.gz file. Hopefully it is straightforward and easy.

      This service allows you to validate your Google Sitemap XML files. Your file may optionally be gzip compressed. Each report you generate may be stored in your account. You have the option to send each report via email to the recipient you specify. There is also a quick-help feature that allows you to ask a technical question, or make a comment, about your report.

      + works with text/xml content-type
      + checks and reports UTF-8 Byte Order Mark
      + converts the file to unix line terminators if necessary
      + re-encodes the xml file to UTF-8 if the file isn't UTF-8 (*see note)
      + gzips xml files
      + better error handling on web server redirects.
      + shows line numbers against your xml file if the file doesn't validate.

      Take care,

      Waitman

  31. One line by Varun+Soundararajan · · Score: 1

    Python Rocks!!!!!!!

  32. Re:How is this a win-win? Here's how.... by Sancho · · Score: 1

    Seems to me that a better solution would be EITHER disallowing indexing of the registered users ljname.livejournal.com pages OR disallowing everything BUT ljname.livejournal.com, granting more benefit for registration.

  33. ermm by Anonymous Coward · · Score: 0

    Not gonna /. google, dude.

    1. Re:ermm by mgbaron · · Score: 1

      Well it appears this is not the case. I'm getting a 502 error.

      (Now I'm waiting for my 19 seconds to be up)

  34. what's the basis of the license? by cahiha · · Score: 1

    I'm wondering: why do you need a license to implement this? Did Google patent this?

    In any case, patented or not, the CC license that this falls under seems acceptable for an open standard, even if it is patented, because it is transferable and because its requirements are minimal. Contrast this with the Microsoft Office XML license, which is royalty-free (for now...), but non-transferable.

    1. Re:what's the basis of the license? by GoogleGuy · · Score: 1

      There's python2.2 code to generate Sitemaps for people. I believe that's what was released under Creative Commons. The intent is to make this open and widely available to anyone that wants to use it.

    2. Re:what's the basis of the license? by Anonymous Coward · · Score: 0

      Nevermind that the Creative Commons people tell you not to license software with it...

  35. Darn it by David+Horn · · Score: 1

    It needs Python 2.2, and I only have 1.5 running. Unfortunately, so many things depend on it (*cough* Ensim *cough*) that attempting to upgrade is a death wish.

    Will wait until I get my new server. :)

    --
    PocketGamer.org - For the gamer on the go!
    1. Re:Darn it by ArbitraryConstant · · Score: 1

      Multiple versions of Python can coexist on a machine...

      --
      I rarely criticize things I don't care about.
    2. Re:Darn it by Anonymous Coward · · Score: 0

      Ensim, my condolences. It's a pain in the arse to administer, but hey, customers like it.

    3. Re:Darn it by David+Horn · · Score: 1

      Tell me more! I looked into it briefly this afternoon, but most forum threads started about it were left unanswered.

      Not knowing Python (I got taught Java at university) I never really paid much attention to it.

      --
      PocketGamer.org - For the gamer on the go!
    4. Re:Darn it by ArbitraryConstant · · Score: 1

      " Tell me more! I looked into it briefly this afternoon, but most forum threads started about it were left unanswered."

      I don't know how it works on Windows, but on Linux you can install as many versions as you want. "python" is a symlink to the default version, and if you want a specific one you can say "python2.4" or whatever.

      "Not knowing Python (I got taught Java at university) I never really paid much attention to it."

      Knowing any one language is a very dangerous specialization. Java isn't going to be in vogue forever.

      The ability to learn languages trumps specialization in any one language in about 99.999% of cases.

      --
      I rarely criticize things I don't care about.
  36. What does Creative Commons mean here? by Wesley+Felter · · Score: 1

    An idea cannot be copyrighted, and thus cannot be licensed under a copyright license like Creative Commons. File formats, being facts, shouldn't be copyrightable either. If the text of the spec is licensed as Attribution-ShareAlike, then all this allows is people to fork the spec, causing confusion.

    1. Re:What does Creative Commons mean here? by tinytim · · Score: 1

      A file format isn't a fact any more than the words that make up a best-selling novel are a fact.

      You could try to argue that it's a collection of facts, but collections of facts _ARE_ copyrightable. Either way, it's a tough argument to begin with.

      File formats are really procedures, not facts, and procedures are patentable. Take Unisys's (now expired) .gif patent for example.

    2. Re:What does Creative Commons mean here? by GoogleGuy · · Score: 1

      We're also offering a python2.2 program that will run on your computer and generate a Sitemap for you. I think that's what has the Creative Commons license. Google wants Sitemaps to be open/available to anyone that's interested in creating or using them (including other search engines, if they're interested).

    3. Re:What does Creative Commons mean here? by Wesley+Felter · · Score: 1

      ...Unisys's (now expired) .gif patent...

      There was never a patent on GIF; Unisys had a patent on the LZW algorithm which could be used in GIFs. Uncompressed GIFs were not covered by the patent.

      But if you look at a simple XML format like Google Sitemaps, there is no novel algorithm involved in reading or writing the format and thus no basis for patent.

  37. Re:Google Bitching by Rollie+Hawk · · Score: 1
    I know this is slightly off topic, but bitching about Google seems to be part of all google topics now.

    Why should Google be different than any other topic?

    --
    Before any liberals are tempted to mod up one of my comments, a word of warning: I'm actually making fun of you.
  38. Mapquest already has better sitemaps! by Anonymous Coward · · Score: 0

    Check it out. It can even tell you how many restaurants exist between Preferences and YRO.

  39. Re:SHUT UP SHUT UP SHUT UP by Momoru · · Score: 1

    And your contributions to the discussion are so helpful too.

  40. Re:How is this a win-win? Here's how.... by jrumney · · Score: 1

    Me too. I host a couple of sites on my home ADSL line, and my usage is about 6GB/month, mostly MSN, Google and Yahoo's crawlers indexing and reindexing the same pages over and over. MSN especially I would like to slow down.

  41. More proof that Google isn't Netscape by ShatteredDream · · Score: 1

    The thing that seems so cool about this sort of thing is that it opens up the search service to the rest of us to help us make our content easier to find when it is updated. One thing that I have come to really respect about Google is that they don't rely on the government to beat Microsoft back down the way Netscape did. Google has managed to make a product that 47% of the US Internet users want to use, even though MSN is the default in IE. Remember Netscape 4? There's a reason that bloated POS failed, anyone who remembers the releases of it for the first six months that it went public knows EXACTLY why that was.

    The only thing that Google can do at this point is continue to let some of their more biased employees run wild. They've been causing Google's Adsense and Adwords to take extremely partisan stances between the Dems and Reps, and that's gotten the ire of many on the right. My concern is primarily that Google will end up pissing off so many of these users that they will end up switching to MSN and helping Microsoft take Google down. Google is certainly not perfect, and I'm still wondering why Google News had the National Vanguard, a neo-nazi publication in their news feed list, but says that some of the bigger blogs like Michelle Malkin are not up to editorial snuff. Go figure, like the neo-nazis aren't biased or anything. Then there's their tendency to run ads for Hamas on their Arabic pages.

    Oh well, in many respects they still have a lot farther to go before they have tried as much evil as Microsoft and they are still more innovative, so time will tell.

  42. Has the world gone XML crazy? by aug24 · · Score: 1
    Actually, no, I don't think it has. Precisely as you observe, only a large chunk is available. Now the fact that the vanilla aspects you mention can already be achieved is not a good enough reason to avoid implementing some kind of value-added extensible version of anything that is useful. This is the net evolving to serve humans better, right in front of us.

    Just think of this sort of thing as inter-linking web services sitting on top of HTTP.

    Justin.

    --
    You're only jealous cos the little penguins are talking to me.
    1. Re:Has the world gone XML crazy? by baadger · · Score: 1

      But as it stands this XML Sitemap index doesn't provide any new information that HTTP headers don't (assuming dynamic pages handle them well) except for the priority weighting...which should be derived from update frequency.

      I don't see how centralising all this header information serves webmasters better. Only Google.

    2. Re:Has the world gone XML crazy? by tilk · · Score: 1

      But as it stands this XML Sitemap index doesn't provide any new information that HTTP headers don't (assuming dynamic pages handle them well) except for the priority weighting...which should be derived from update frequency

      No, it shouldn't. My page is an informational one, so the most important pages - those with the information - pretty much don't change. And I want to get them indexed quickly. A Sitemap seems to be the perfect way to tell that to the crawler.
    3. Re:Has the world gone XML crazy? by aug24 · · Score: 1
      priority weighting...which should be derived from update frequency.

      Well there's a big fat assumption, right there.

      J.

      --
      You're only jealous cos the little penguins are talking to me.
    4. Re:Has the world gone XML crazy? by baadger · · Score: 1

      I don't see how update frequency has anything to do with getting your content indexed initially. Google should crawl stuff not in its index as soon as possible regardless of its update frequency.

    5. Re:Has the world gone XML crazy? by baadger · · Score: 1

      The priority refers to Google's indexing of your content.

      What other reason would Googlebot have to revisit your page other than you having made updates?

    6. Re:Has the world gone XML crazy? by aug24 · · Score: 1

      Because Google, rather sensibly, doesn't want to have to visit every page every day to see if it has changed. It wants to have a hint.

      J.

      --
      You're only jealous cos the little penguins are talking to me.
    7. Re:Has the world gone XML crazy? by baadger · · Score: 1

      And isn't that hint provided by the headers from when it initially discovered or indexed the document?

    8. Re:Has the world gone XML crazy? by aug24 · · Score: 1

      Now I think more about it, yes, fair point! I apologise. I'd still say though, that a one-stop shop for more information is not a bad thing, although it would be better done by the W3C than Google.

      J.

      --
      You're only jealous cos the little penguins are talking to me.
    9. Re:Has the world gone XML crazy? by tilk · · Score: 1

      It doesn't do it. Crawling new medium to large websites takes months for Googlebot. The sitemap allows Googlebot to get the most important content indexed first - within days, not months.

    10. Re:Has the world gone XML crazy? by baadger · · Score: 1

      It means a foreseeable end to Googlebot as a crawler. It's not a good thing and I don't think it's been well thought out.

  43. Re:Google Bitching by ColdGrits · · Score: 1

    "Like it or not Google is an inovative company."

    You are right.

    No other company has ever launched an Internet Search function.
    No other company has ever launched web-based email.
    No other company has ever provided online maps.
    No other company has ever offered the contents of usenet via the web.
    No other company has ever offered navigable satellite photos of the planet.
    No other company has ever offered realtime webcaching and compression to "speed up" one's access.
    No company has ever cached websites for access when they are down or no longer available.
    No company has ever offered a price-checking website.

    Oh, hang on, wait a minute...

    When you look at it, I mean actually take a step back and LOOK, Google is a highly derivative company, with not much in the way of true innovation.

    They take existing ideas and functions, and tweak them. Coupled with their "geek coolness" and hero-worship, they are simply riding the hype wave.

    --
    People should not be afraid of their governments - Governments should be afraid of their people.
  44. search forms by RealProgrammer · · Score: 1
    any site where certain pages are only accessible via a search form would benefit from creating a Sitemap and submitting it to search engines.

    If you have a bunch of data in a MySQL database, ordinarily Google can't find it. You have to create a static link somewhere with a URL for the search you want to make googlable. Those take maintenance.

    There may be some sites that want certain areas crawled, but not others, and those areas aren't maintained by the webmaster or only the top-level part should be hidden from search (which is awkward or impossible to handle with robots.txt). There are always user pages, maverick corporate departments, or whatever.

    This offers a way to do all of that in a systematic way. Very nice way to solve several seemingly unrelated problems at once.
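
    A minimal sketch of that idea, assuming the rows come from a database query (the `ROWS` list, the `item?id=` URL scheme and the `build_sitemap` helper are all made up for illustration). It uses the `urlset`/`url`/`loc`/`lastmod` element names and leaves out the namespace attribute a real Sitemap needs, so check the FAQ before submitting anything like it:

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for the result of a database query.
ROWS = [
    {"id": 17, "updated": "2005-06-01"},
    {"id": 42, "updated": "2005-05-20"},
]


def build_sitemap(rows, base="http://www.example.com/item?id="):
    """Return sitemap-style XML listing one URL per database row."""
    urlset = ET.Element("urlset")
    for row in rows:
        url = ET.SubElement(urlset, "url")
        # Each entry names the page and when it last changed.
        ET.SubElement(url, "loc").text = base + str(row["id"])
        ET.SubElement(url, "lastmod").text = row["updated"]
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap(ROWS))
```

    Regenerating the file from the table on a schedule would replace the hand-maintained static links.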

    --
    sigs, as if you care.
  45. Cool idea: Browser utilization of this data! by swinte · · Score: 1

    Now it's just a matter of time until some enterprising developer creates a browser extension that allows this data to be used by the end user during a surfing session. A consistent, complete, trustworthy, and easily-parsed site map definition could allow for some really interesting new paradigms in navigation around a site. Just off the top of my head I imagine a simple tree view of where you are in relation to the rest of the site could be very handy when navigating some of the gigantic maze-like corporate sites.

    1. Re:Cool idea: Browser utilization of this data! by shmlco · · Score: 1

      That would be nice, if google had made the map hierarchical, which they didn't, and if they allowed for directories, which they don't seem to have done, and if people included every page in their site in the map, which they also don't have to do.

      --
      Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
    2. Re:Cool idea: Browser utilization of this data! by belg4mit · · Score: 1

      ...and if you knew what the filename was.
      Oh, haven't RTFA? :-D It's a decent spec except
      that the XML is kinda verbose and uses tags
      instead of attributes. The other big things being
      the hardcoded limits on number of files, the
      security of only indexing directories below that
      where the map is, and the ability to name maps
      whatever you want.

      --
      Were that I say, pancakes?
    3. Re:Cool idea: Browser utilization of this data! by shmlco · · Score: 1
      ...and if you knew what the filename was.

      Which filename? Every entry must have a URL/LOC, which already has the... oh. You meant if you knew the page TITLE you could generate a TOC. Why didn't you say so? ;)

      And actually, I did RTFA.

      --
      Any sect, cult, or religion will legislate its creed into law if it acquires the political power to do so.
  46. Re:Google Bitching by Momoru · · Score: 1

    Every single product they put out is slashdot worthy.

    This is such bullshit. Some of the stuff they put out is very cool and newsworthy like Google Maps, Gmail, etc... But so much crap that is either not ready yet, not unique, or just plain boring gets posted here. It's literally a direct feed of the Google blog half of the time.

    It also annoys people like me how people on slashdot treat Google like it's the second coming of Christ. If anyone says anything negative they get bombarded with posts saying they suck or their ideas are crazy, and any criticism of google is given either a "ITS IN BETA!!!" or "YOUR A MICROSOFT ASTROTURFER!!!" and people also waste space giving google a handjob for an idea that has already existed for years. The personalized google portal and the google satellite pictures are two examples that come to the top of my head. Both of these things had been done by other sites for years and then google comes out with them and in the case of satellite pics did not improve upon the existing sites out there and in the case of the portal made an inferior product. Yet when Yahoo or MSN come out with a product that is an attempt at improving something existing, those same people say "COPY CATS!". It also doesn't help that the customized Google Portal only allowed you to add two news sites and one of them was slashdot. Furthermore many Google people post here, and they mention slashdot in the Google blog, so I think another thing that annoys me is there is a large amount of suspected Google astroturfing here.

  47. Re:How is this a win-win? Here's how.... by doublem · · Score: 1

    Last I heard they had blocked ALL Google indexing. robots.txt is somewhat restrictive.

    --
    "Live Free or Die." Don't like it? Then keep out of the USA
  48. Search Engine by Pac · · Score: 2, Funny

    [To ELP's "Lucky Man"]

    They had white pages
    And hits by the score
    All the people's queries
    Waiting by the door

    Ooooh, what a search engine it was
    Ooooh, what a search engine it was

    Many geeks and hackers
    They made up its core
    Everybody's dearest
    A daily stop for more

    Ooooh, what a search engine it was
    Ooooh, what a search engine it was

    It went to the market
    Of the engines it was king
    Of his honor and his glory
    Slashdot would sing

    Ooooh, what a search engine it was
    Ooooh, what a search engine it was

    A burst had found it
    Its money dried as it sank
    No praise could save it
    So it vanished and it died

    Ooooh, what a search engine it was
    Ooooh, what a search engine it was

    1. Re:Search Engine by Dr+Tall · · Score: 1

      I find your lack of faith disturbing.

    2. Re:Search Engine by daviddennis · · Score: 1

      You know, that was pretty clever, but a hint as to how this burst happened would be helpful.

      Especially since Google really does have a great idea here. I know Slashdotters on the whole love Google, and I know there's a bit of a backlash, but for the sake of the integrity of the argument, let's have that backlash be for some legitimate reason, not just because Google's too popular because, well, it really is great.

      D

    3. Re:Search Engine by Pac · · Score: 1

      Please, I was just making a quick joke - no predictions, nothing so serious. I love Google dearly (I have even installed the Accelerator for a day or two until it bothered me with that "23 minutes saved" message) and I think it is popular because of its merits and the hard work of its people, not because they got lucky or something. But even great companies can eventually disappear for one reason or another.

  49. Eh? by baadger · · Score: 1
    • Google should index my website as often as possible. It should use algorithms to detect update frequency and content type and assign its own indexing priorities to meet the needs of people who are actually searching for information and present them with a fresh result set. Caching mechanisms do this all the time - the Last-Modified and Cache-Control headers tell you how likely content is to be updated and how often.
    • It won't save bandwidth if the algorithms mentioned above work correctly. HEAD requests were also made for this reason.
    • That suggests to me bad web design and structuring. If you want something to be found, by robot or by human - you link to it.


    Overall this is offloading Google's workload onto webmasters.
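
    To make the header argument concrete, here is a rough sketch, assuming the crawler has already fetched the response headers into a plain dict (the `changed_since` helper is invented for this example, not anything Googlebot is known to do):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(headers, last_crawl):
    """Return True if Last-Modified is newer than last_crawl, or is missing."""
    stamp = headers.get("Last-Modified")
    if stamp is None:
        # No hint from the server, so the page has to be fetched again.
        return True
    return parsedate_to_datetime(stamp) > last_crawl


last_crawl = datetime(2005, 6, 1, tzinfo=timezone.utc)
newer = {"Last-Modified": "Fri, 03 Jun 2005 10:00:00 GMT"}
older = {"Last-Modified": "Mon, 30 May 2005 10:00:00 GMT"}
print(changed_since(newer, last_crawl))  # True
print(changed_since(older, last_crawl))  # False
print(changed_since({}, last_crawl))     # True
```

    The last case is the weak spot: a server that sends no Last-Modified header gives the crawler nothing to go on.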
    1. Re:Eh? by frizop · · Score: 1

      Google should index my website as often as possible. Why? Because lots of web servers don't send the proper headers that say stuff like last updated, which leaves people like myself scripting our own checks and makes things much more difficult. I once md5'd the entire contents of a website to check for updates.
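
      The md5 trick looks roughly like this; a sketch only, with page bodies passed in as bytes and the `fingerprint`/`has_changed` names invented for the example:

```python
import hashlib


def fingerprint(body):
    """Return the MD5 hex digest of a page body given as bytes."""
    return hashlib.md5(body).hexdigest()


def has_changed(body, previous_digest):
    """Compare a freshly fetched body against the digest stored last time."""
    return fingerprint(body) != previous_digest


old = fingerprint(b"<html>version 1</html>")
print(has_changed(b"<html>version 1</html>", old))  # False
print(has_changed(b"<html>version 2</html>", old))  # True
```

      It works without any cooperation from the server, but every page has to be downloaded in full just to learn that nothing changed.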

    2. Re:Eh? by Eric+Giguere · · Score: 1

      You're right, but I bet a lot of (most?) sites don't do it right and Google figures this is the next best way... Easier to ask people to put up sitemaps than tell them to fix their pages/servers.

      Eric
      Why the Vioxx recall reduced spam (humor)
  50. Feeling the heat from google-watch and critics? by javaxman · · Score: 1
    Someone alerted me to google watch the other day. It's definitely an interesting take on the company, I have to say.

    You do have to wonder how much of the 'do no evil' philosophy is cover for the "let us store and index all information about everything, including you" philosophy. Not that I'm going to stop using Google until their results become less usable than Yahoo's results...

    1. Re:Feeling the heat from google-watch and critics? by Anonymous Coward · · Score: 0

      But who watches the watchers?
      http://www.google-watch-watch.org/

  51. Slashdotted? by Datrio · · Score: 1

    I think we slashdotted Google! I get 502 Server Error all the time and can't connect with any of the Google pages in the article.

    1. Re:Slashdotted? by lux55 · · Score: 1

      Is there a Google cache of it? Ahhh, circular references!!!!!

      Their blog is holding up at least.

    2. Re:Slashdotted? by Anonymous Coward · · Score: 0

      Interesting that all of the pages that 502 are https. Probably killed the encryption layer rather than the server itself. Weakest link you know.

  52. It's possible, but then... by ShatteredDream · · Score: 1

    It's also possible that Google's CEO could go on a murderous rampage tomorrow at Microsoft's Redmond campus. 0.000000000001% is still a possibility you know. Do you realize what would happen to Google if they did that? They'd be dumped by most website owners faster than they could count the drop in their search and ad hits.

    Then again, Google coming up with detailed design guidelines for their pages for public consumption would be incredibly useful for designers. They use a lot of cutting edge JS tricks like AJAX and their layout is great. It's very clean and the kind of thing I wish I had the skills right now to emulate, but I have too much to learn about web design right now to do that.

  53. Google is mightier than slashdot by Varun+Soundararajan · · Score: 1

    You can google slashdot
    But you can't slashdot google!!!

    1. Re:Google is mightier than slashdot by Gnascher · · Score: 2, Informative

      Au contraire ... Google is returning a 502 on the provided link. Slashdot killed Google. Too bad too ... I wanted to read about this stuff...

      --
      It's not my fault! It was this way when I got here.
  54. Re:Cool idea (they stole my idea?) by neanderlander · · Score: 2, Interesting

    On February 16th I sent Google the following email to suggestions@google.com:
    Hi,
    This is a suggestion for the people who take care of indexing web sites.
    Because Google is the first search engine of choice it has enough influence to point noses in the same direction.
    So, I propose a new element to be added to websites: a sitemap file. Similar to the favicon file, every site could have an (xml?) file containing information about the info and the info-topography on the site.
    Google already has a 'similar pages' link added to search results. What about adding a 'show context' link? If clicked, a page is shown that provides info on where the search result is located on the site: the context of the information.
    The sitemap file could also be used in Google's core indexing process: providing extra context to evaluate the validity of the indexed page.
    Some other related advantages: Google could release a sitemap/browser plugin for users. For example: open a site and if the website contains the special sitemap file, a browser plugin is activated allowing the user to browse the website using their preferred navigational tool (instead of, or together with, any normal website menus).
    I hope to hear from you
    Kind regards,
    mynamehere
    The Netherlands

    They even used the term 'sitemaps'.

  55. 502 Server Error! by md17 · · Score: 3, Funny

    OMG!!! We finally /.'d Google!

    1. Re:502 Server Error! by chrisblore · · Score: 1

      I don't know about Slashdotting it but whatever it is, that site is taking a hell of a long time to load!

    2. Re:502 Server Error! by GoogleGuy · · Score: 1

      I think the Sitemaps links to a "normal" webserver, as opposed to our custom setup. Plus the Sitemaps stuff is using https. Looks like a higher amount of interest than a typical Slashdotting too. I alerted the Sitemaps team, but you may have to wait for the techie stampede to subside. :)

    3. Re:502 Server Error! by roxtar · · Score: 1

      Ya one could always google /. but to /. google is a totally different thing.

  56. don't flatter yourself by Anonymous Coward · · Score: 0

    Um. It's the most obvious name, isn't it?

    And I somehow suspect this has been in the works since before Feb of this year.

  57. Re:How is this a win-win? Here's how.... by spinfire · · Score: 1

    According to this Yahoo's bot is the most aggressive on my site. GoogleBot is really quite tame.

  58. Could be better by belg4mit · · Score: 1

    Instead of having to notify search engines (blech)
    What about a robots.txt extension to define the
    location of the sitemap index?
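
    A sketch of how a crawler might read such an extension, assuming a made-up `Sitemap:` line in robots.txt (the directive name and the `sitemap_locations` helper are hypothetical, nothing in the announcement defines them):

```python
def sitemap_locations(robots_txt):
    """Return the URLs named by (hypothetical) 'Sitemap:' lines."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


SAMPLE = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap_index.xml
"""

print(sitemap_locations(SAMPLE))
```

    Every crawler already fetches robots.txt, so no separate notification step would be needed.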

    --
    Were that I say, pancakes?
    1. Re:Could be better by m85476585 · · Score: 1

      Good idea! We won't need a copy for every search engine that uses this technology, but specifies a different filename.

  59. Slashdotting... by Anonymous Coward · · Score: 0

    Now even google can't withstand the power of slashdotting...

    Google: Error

    Server Error
    The server encountered a temporary error and could not complete your request.

    Please try again in 30 seconds.

  60. Google Smugness Index up 25 points by msbmsb · · Score: 1

    Reaching out to help the less fortunate search engines, how philanthropic.

  61. Why not just use rss/atom? by neves · · Score: 2, Insightful

    My RSS feeds already publish my newest/freshest pages. Why didn't they just extend it with some additional attributes/tags instead of forcing me to implement another XML format?

    1. Re:Why not just use rss/atom? by v3xt0r · · Score: 0

      exactly.

      It's kind of funny to watch google attempt to innovate, when in actuality, they are simply reinventing the wheel, again, and again, and again...

      --
      the only permanence in existence, is the impermanence of existence.
    2. Re:Why not just use rss/atom? by neves · · Score: 2, Interesting

      Silly me! Just found in their FAQ: you can use RSS/atom as your sitemap format!
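
      For the curious, pulling the page URLs out of a feed is only a few lines. This is a sketch against a made-up RSS 2.0 document (the `FEED` sample and the `item_links` helper are invented here), not a description of how Google reads feeds:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed used only for this example.
FEED = """<rss version="2.0"><channel>
  <title>Example</title>
  <link>http://www.example.com/</link>
  <item><title>New post</title><link>http://www.example.com/new</link></item>
  <item><title>Old post</title><link>http://www.example.com/old</link></item>
</channel></rss>"""


def item_links(feed_xml):
    """Return the link of every item in an RSS 2.0 feed, in feed order."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]


print(item_links(FEED))
```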

    3. Re:Why not just use rss/atom? by v3xt0r · · Score: 0

      *doh* =p

      I'm so over google anyhow, that site is all hype.

      Aside from the Satellite Maps, I haven't seen anything that they do better than other search engines like yahoo, or clusty.

      Although I wouldn't go as far as saying MSN or yahoo should be used, especially using IE, with all those wonderful exploits those sites serve daily. =/

      --
      the only permanence in existence, is the impermanence of existence.
  62. WE BROKE GOOGLE! Woohoo! by Anonymous Coward · · Score: 0

    Google
    Error

    Server Error
    The server encountered a temporary error and could not complete your request.

    Please try again in 30 seconds.

    1. Re:WE BROKE GOOGLE! Woohoo! by BillsPetMonkey · · Score: 1

      What's more interesting is that if you do just that - try again 30 seconds later ... it's still broken.

      --
      "It's not your information. It's information about you" - John Ford, Vice President, Equifax
  63. Yahoo Paid Inclusion? by MenThal · · Score: 1
    by giving Yahoo! access to this, so we don't have to wait for them to create their own version...

    I'm not sure how alive and well Paid Inclusion (or whatever it is called nowadays... is it "Search Submit" now?) is at the moment, but they have had solutions for ensuring timely updates and ensured inclusion in their index for some time now. Commercial solutions, but still.

    So I have a hard time imagining Yahoo touching this particular piece of Google technology any time soon... Unless they can assimilate it into the commercial offering.

  64. What About Us Python-Free Zones? by Zastrossi · · Score: 1

    I see an execution gap here, though. My blog is, what, 2600 pages? I'm obviously not going to build that XML file manually (with one node for each page). Google does provide a Sitemap Generator, but it's Python code meant to be run on my web server. My Python skills are nil, so that route isn't viable for me either. I expect that there's a good many 'webmasters' (as in, people who design and run websites) who don't know Python from perl. Given the CC license, though, maybe somebody will grab the code and build an idiot-proof solution for the Sitemap Generator.

    1. Re:What About Us Python-Free Zones? by Jarlsberg · · Score: 1

      It's incredibly easy, even if you don't know any Python at all. All you have to do is edit the config.xml file, upload the python script and the config file (plus an url list if you want) and type python /your_path/sitemap_gen.py --config=/your_path/config.xml, and you're done.

  65. Offloading Google's work... by Anonymous Coward · · Score: 0

    Bingo! You hit the nail on the head. This is just like the big-box stores and grocers making you scan and box your own purchases, and not even giving you a discount for it.

    You are exactly right. The HTTP head can give info about whether the page has been changed or not and thus decisions can be made about whether to respider or not.

    While there may be epi-phenomenal benefits to this scheme at some later point, I agree with you that this is simply google offloading what should be their own workload onto the web-sites.

  66. google watch watch ? wow. by javaxman · · Score: 1
    But who watches the watchers?

    That's a thing of beauty. Well, not really, it's a damn shame to waste a domain name on a nearly plain-text page, but it's still pretty funny. Does anyone really love google enough to host a page like that on their own? Wow, if so. I mean, I've always liked google, but would I rent out a domain to host an anti-anti-google website? I doubt it. Thanks for that, though. Definitely a +1 interesting from an AC.

  67. SiteMaps Generator crashed our server! by Un-Thesis · · Score: 1, Informative

    I had a very bad experience with the Python sitemap generator from SourceForge using the 'accesslog' option. I plugged in a 10MB sitelog from our corporate site Great Seats to Sold-out Events, which has ~22,000 pages.

    Within five minutes it crashed my development server, a 3200 MHz Pentium 4 with 2GB of RAM running Debian Linux. Just imagine if this had been the production server...the costs for over-utilizing the webserver.

    For the details, see http://www.incendiary.ws/node/94 Please syndicate my content if you want :-)

    --
    Promote freedom; fight fascism.
    1. Re:SiteMaps Generator crashed our server! by Darkman,+Walkin+Dude · · Score: 1

      Funny, I just got this very story in a newsletter from webpronews... you've been dropping this anywhere you could, haven't you? Nyeh whatever, for what it's worth, google for me is like a clown, many coloured and jolly, but it gives me the creeps...

  68. Re:Google Bitching by Fareq · · Score: 1

    Google creates innovative technologies and technological implementations of other people's models.

    Frequently the google implementation is vastly superior in some purely technological manner. In this way they are innovative.

    Frequently, google is able to turn the superior technology into a superior user experience as well. In this way, too, they are innovative.

    Frequently, google's cool creations are hyped beyond all belief. In this manner, they are... erm... the recipients of a geek love-affair, and not at all innovative.

  69. Python is Cool by codepunk · · Score: 1

    nuff said

    --


    Got Code?
  70. Google Bitching-"./" Statistics. by Anonymous Coward · · Score: 0

    Well now. Google is the Messiah, and MS is the anti-christ. Don't you just love the fact that we're 80,000 completely random people. Where's that silent majority that every "Slashdot has no Hypocrisy" defender mentions?

  71. Google site map by jlerner · · Score: 1

    A great idea - much better than waiting for the deep crawl. So far I've only seen a 502 error, but they are no doubt experiencing a deluge.
    http://www.myrealtalk.com/

  72. Re:How is this a win-win? Here's how.... by kneel · · Score: 1

    i see the problem here. no one is reading livejournal postings except google's spider.

    if they blocked google, they could probably reduce their bandwidth enough to run all of those sites on a cable modem!

    --

    indierock / punkrock band photos and more... http://www.digitaldefection.net

  73. insight into unlinked directories by e**(i+pi)-1 · · Score: 2, Informative

    I had been writing a primitive sitemap generator myself using shellscripts
    essentially using "find" and "grep" alone, but this tool is much better,
    faster and easy to configure. Cool.

    Note that this tool will allow google to reach files which never would be
    found by spidering a site, because the files are not linked. If you
    include something like

    <directory path="/var/www/html" url="http://www.example.com/" />

    in your config.xml and run "sitemap_gen.py" on it, you will give the world
    access to a large amount of material
    (like test versions of your website or source code you did not want to
    make accessible). We might see a lot more material which had been
    'hidden'.
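
    A quick way to preview what a directory entry like that would expose is to walk the tree yourself. This is only a sketch (the `urls_for_directory` helper is invented here and is not how sitemap_gen.py is implemented); it builds a throwaway docroot so it can run anywhere:

```python
import os
import tempfile


def urls_for_directory(path, url):
    """List the URL of every file under path, whether anything links to it or not."""
    urls = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), path)
            urls.append(url + rel.replace(os.sep, "/"))
    return sorted(urls)


# Build a small temporary docroot with one public page and one forgotten draft.
with tempfile.TemporaryDirectory() as docroot:
    os.mkdir(os.path.join(docroot, "test"))
    for rel in ("index.html", os.path.join("test", "draft.html")):
        with open(os.path.join(docroot, rel), "w") as handle:
            handle.write("placeholder")
    listing = urls_for_directory(docroot, "http://www.example.com/")

print(listing)
```

    The draft under test/ shows up in the listing even though no page links to it, which is exactly the exposure described above.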

    1. Re:insight into unlinked directories by David+Off · · Score: 1

      but because it has no inbound links from the wider web it will never acquire any page rank and will be as good as invisible to Google search

    2. Re:insight into unlinked directories by Jarlsberg · · Score: 1

      Definitely a hazard - so it's extremely important to run it in test mode and see what's being indexed before you go live. I had to edit my config file *extensively* to weed out stuff that didn't need to be there. :)

  74. Google Adds New Content to "Google Information fo by Anonymous Coward · · Score: 0

    These guys are turning the world around every day.

  75. Re:How is this a win-win? Here's how.... by elemental23 · · Score: 1

    Nope. LiveJournal users have an option to allow indexing or not. It's off by default but can be enabled by simply checking a box. My LJ is spidered by Google no problem.

    Good thing, too, because Livejournal doesn't provide any way to search journals, even your own. If Google wasn't indexing it I'd never be able to find any old posts of mine. I wish other LJ users would enable this, as there's nothing more frustrating than being unable to find something you know someone posted six months or a year ago.

    --
    I like my women like my coffee... pale and bitter.
  76. Re:Google Bitching by emurphy42 · · Score: 1
    people also waste space giving google a handjob for an idea that has already existed for years. The personalized google portal and the google satellite pictures are two examples that come to the top of my head. Both of these things had been done by other sites for years and then google comes out with them and in the case of satellite pics did not improve upon the existing sites out there
    Sure, the pictures themselves are of equal quality, but IMO Google's "drag/click/cursor and move/reload just the images" interface is a vast improvement over the "click and wait a second for the whole page to reload" interface used by the existing sites.
    and in the case of the portal made an inferior product.
    Again, the interface (drag-and-drop in this case) is the impressive part. Certainly there's room for offering additional content.
  77. Re:SHUT UP SHUT UP SHUT UP by Anonymous Coward · · Score: 0

    You shut up. The GP actually has a valid point. It almost seems like the only remaining difference between Google and Microsoft is that slashdot loves Google and hates Microsoft. I'm also surprised there isn't a google.slashdot.org subdomain yet...

  78. Ignore section for WordPress by martijnd · · Score: 1

    I just ran the tool as a test on my pathetic little WordPress blog (it has a total of 3 pages). It happily chugged away and reported 842 files by combining the complete blog directory, awstats sub directory and the urls it found in the Apache access_log.

    That would be a lot of non-essential crud for Google to spider.

    Using filters is very simple, so the following filters removed most rubbish:

    <filter action="drop" type="wildcard" pattern="*index.htm*" />
    <filter action="drop" type="wildcard" pattern="*awstats*" />
    <filter action="drop" type="wildcard" pattern="*wp-admin*" />
    <filter action="drop" type="wildcard" pattern="*wp-includes*" />
    <filter action="drop" type="wildcard" pattern="*wp-content*" />
    <filter action="drop" type="wildcard" pattern="*wp-images*" />
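
    To check what a set of drop patterns would remove before running the real tool, something like the following works as a dry run. It is a sketch that assumes the generator's wildcard matching behaves like Python's `fnmatch` (an assumption, not something verified against sitemap_gen.py); the URL list and the `keep` helper are made up:

```python
from fnmatch import fnmatchcase

# Drop patterns copied from the filter list; matching rules are assumed.
DROP_PATTERNS = ["*index.htm*", "*awstats*", "*wp-admin*", "*wp-includes*"]


def keep(url, patterns=DROP_PATTERNS):
    """Return True if no drop pattern matches the URL."""
    return not any(fnmatchcase(url, pattern) for pattern in patterns)


urls = [
    "http://www.example.com/blog/?p=1",
    "http://www.example.com/blog/wp-admin/post.php",
    "http://www.example.com/awstats/awstats.pl",
    "http://www.example.com/blog/index.html",
]
kept = [url for url in urls if keep(url)]
print(kept)
```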

  79. Failure to find sites by hhawk · · Score: 1

    I've entered a few sites I run and 2 of the 4 gave a "not found" error and then when retried, it worked. Perhaps a local DNS problem but probably something not working well enough on the google side...

    --
    http://www.hawknest.com/
  80. Editorial Discretion by Un-Thesis · · Score: 1

    Just demonstrates the editorial discretion of Slashdot. My news -- the first negative report of sitemaps -- was published by all the major search engine-related websites (over 20) yet here remains only a mod +1.

    --
    Promote freedom; fight fascism.