Googlebot and Document.Write

Posted by kdawson on Sunday March 11, 2007 @05:06PM from the ajax-the-foaming-indexer dept.

With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.
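
For anyone wanting to repeat the test, here is a rough sketch of how the three cases could be set up. It is not the article's actual markup: the function names, the placeholder words, and the script URL are all made up for illustration.

```javascript
// Hypothetical helper: builds the HTML for a three-way indexing test page.
// Words and script URL are placeholders; use plain nonsense words only.
function buildTestPage(htmlWords, inlineWords, externalScriptUrl) {
  return [
    "<!DOCTYPE html>",
    "<html>",
    "<head><title>document.write indexing test</title></head>",
    "<body>",
    // 1. Words in plain HTML: visible to anything that parses the markup.
    "<p>" + htmlWords.join(" ") + "</p>",
    // 2. Words written by a script embedded in the page itself.
    "<script>document.write(" +
      JSON.stringify("<p>" + inlineWords.join(" ") + "</p>") +
      ");</script>",
    // 3. Words written by a script fetched from a different server.
    '<script src="' + externalScriptUrl + '"></script>',
    "</body>",
    "</html>"
  ].join("\n");
}

// Hypothetical helper: body of the external file served from the other host.
function buildExternalScript(externalWords) {
  return "document.write(" +
    JSON.stringify("<p>" + externalWords.join(" ") + "</p>") + ");";
}
```

Searching for each pair of words afterwards shows which of the three sources made it into the index.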

How did this make the front page? by Anonymous Coward · 2007-03-11 17:20 · Score: 2, Insightful

It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript. I was actually hoping this guy would prove me wrong and demonstrate otherwise, but to my disappointment this was just another mostly pointless blog post.
Google request external JavaScript file? by JAB+Creations · 2007-03-11 17:22 · Score: 4, Insightful

Check your access log to see if Google actually requested the external JavaScript file. If it didn't there would be no reason to assume Google is interested in non-(X)HTML based content.
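
A rough sketch of that check, assuming the server writes Combined Log Format and that the crawler identifies itself with "Googlebot" in the user-agent string (which anyone can fake, so treat a hit as a hint rather than proof). The function name and the script path are hypothetical.

```javascript
// Pulls the request path and the user-agent out of a Combined Log Format line.
const COMBINED_LOG = /"(?:GET|HEAD) (\S+) HTTP\/[\d.]+" \d{3} \S+ "[^"]*" "([^"]*)"/;

// Returns the log lines where a Googlebot user-agent requested scriptPath.
function googlebotRequestsFor(logText, scriptPath) {
  return logText.split("\n").filter(function (line) {
    const match = COMBINED_LOG.exec(line);
    if (!match) return false;              // not a line we can parse
    const path = match[1].split("?")[0];   // ignore any query string
    const userAgent = match[2];
    return path === scriptPath && userAgent.indexOf("Googlebot") !== -1;
  });
}
```

An empty result for the external script's path would mean the file was never fetched, at least not under that user-agent.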

--
- John
http://www.jabcreations.com/
Re:Nonsense words? by Anonymous Coward · 2007-03-11 17:24 · Score: 1, Insightful

zonkdogfology ibbytopknot pignoklot zimpogrit fimptopo biggytink

Seriously, he shouldn't have posted these words until he was done with the test.
Doesn't work; Good (kind of) by The+Amazing+Fish+Boy · 2007-03-11 17:29 · Score: 5, Insightful
FTFA:
Why was I interested? Well, with all the "Web 2.0" technologies that rely on JavaScript (in the form of AJAX) to populate a page with content, it's important to know how it's treated to determine if the content is searchable.
Good. I am glad it doesn't work. Google's crawler should never support Javascript.

The model for websites is supposed to work something like this:
- (X)HTML holds the content
- CSS styles that content
- Javascript enhances that content (e.g. provides auto-fill for a textbox)
In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.

Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.

The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?
1. Re:Doesn't work; Good (kind of) by Rakishi · 2007-03-11 18:00 · Score: 2, Insightful
  
  Huh? He's talking about browser-generated content; most dynamic content is generated server-side (like Slashdot, though I think Slashdot may use flat files as a cache for speed reasons). No one said that nice XML file can't be generated by the server when the page is called.
2. Re:Doesn't work; Good (kind of) by cgenman · 2007-03-11 18:12 · Score: 2, Insightful
  
  In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.
  
  Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer. To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.
  
  Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry. With so much happening dynamically based upon end-user created pages, along with the somewhat annoying usage of Flash, Powerpoint, or PDF for important information, you really can't create a comprehensive index without being a little flexible.
  
  Saying that Google shouldn't take scripting into account when scanning pages is like saying they shouldn't index the PDFs that are online. Sure, it may not conform to what you believe is "good web coding standards," but the reality is that they're out there.
  
  --
  The ______ Agenda
3. Re:Doesn't work; Good (kind of) by WNight · 2007-03-11 18:59 · Score: 2, Insightful
  
  I don't know about you, but I write my webpages so that when the style goes away, the page still views in a basic 1996 kind of style. Put the content first and your index bars and ads last then use CSS to position them first, visibly. This way if a blind user or someone without style sheets sees the site it at least reads in order.
4. Re:Doesn't work; Good (kind of) by The+Amazing+Fish+Boy · 2007-03-11 19:11 · Score: 2, Insightful
  
  Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer.
  How's this? Disable CSS on Slashdot. First you get the top menu, then some options to skip to the menu, the content, etc. Then you get the menu, then the content. It's very easy to use it that way.
  To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.
  Well, for one thing, we are talking about a search engine here, which isn't a human being. So, there's one client that can "use" the information better in XHTML format. Then there's the visually impaired (who use screen readers as their clients), and those using a non-graphical client. Additionally, I would imagine it would be easier to screen scrape XHTML to get just the part you want (since a lot of content would be assigned an ID and/or a class.)
  Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry.
  Your first sentence is true, the second isn't. Web pages are dynamic, yes. I outlined how dynamic pages should be designed. That is, they should be made to work as static (X)HTML, then dynamically updated with Javascript. I don't see how your second sentence follows from the first at all. Web pages are dynamic... so we shouldn't follow standards? We shouldn't accommodate search engine crawlers, the blind, those using older browsers, or those who have Javascript support disabled?
  
  Notice I keep putting the X in (X)HTML in brackets. That's because I'm not convinced strict XHTML is the only viable method (though I'm not convinced it's not -- I'm on the fence).
5. Re:Doesn't work; Good (kind of) by Animats · 2007-03-11 20:03 · Score: 2, Insightful
  The model for websites is supposed to work something like this:
  If only. Turn off JavaScript and try these sites:
  
  Ford Motor Company
  
  Jeep
  
  Credit Suisse
google.com/?q=slashdotting+in+google+dollars by kale77in · 2007-03-11 18:30 · Score: 5, Insightful
I think the actual experiment here is:
- Create a 6-odd-paragraph page saying what everybody already knows.
- Slashdot it, by suggesting something newsworthy is there.
- Pack the page with Google ads.
- Profit.
I look forward to the follow-up piece which details the financial results.
1. Re:google.com/?q=slashdotting+in+google+dollars by Scarblac · 2007-03-11 19:46 · Score: 4, Insightful
  
  Exactly, this is the typical sort of fluff that Digg seems to love. As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.
  
  --
  I believe posters are recognized by their sig. So I made one.
2. Re:google.com/?q=slashdotting+in+google+dollars by dr.badass · 2007-03-12 02:12 · Score: 2, Insightful
  
  As far as I know, Slashdot had avoided this particular type of adword blog post crap until now
  
  It used to be that the web as a whole avoided this crap. Now, it's so easy to make stupid amounts of money from stupid content that a huge percentage of what gets submitted only even exists for the money -- it's like socially-acceptable spam. Digg is by far the worst confluence of this kind of crap, but the problem is web-wide, and damn near impossible to avoid.
  
  --
  Don't become a regular here -- you will become retarded.
Re:How does document.write mess up your DOM tree? by jesser · 2007-03-11 18:40 · Score: 4, Insightful

Perhaps more importantly, document.write can't be used to modify a page that has already loaded, limiting its usefulness for AJAX-style features.
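
To spell that out: calling document.write once the page has finished loading reopens the document and wipes what was there, so late updates have to go through the DOM instead. A minimal sketch, where the function name and the container id are made up:

```javascript
// `doc` is passed in so the function can be exercised outside a browser;
// in a real page you would pass the global `document`.
function appendParagraph(doc, containerId, text) {
  const container = doc.getElementById(containerId);
  if (!container) return null;          // nothing to attach to
  const paragraph = doc.createElement("p");
  paragraph.textContent = text;         // inserted as text, not parsed as HTML
  container.appendChild(paragraph);     // existing page content stays intact
  return paragraph;
}
```

In a page this would be called as appendParagraph(document, "content", "some text") from whatever callback receives the AJAX response.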

--
The shareholder is always right.
Re:If they weren't, then they're trying by gregmac · 2007-03-11 19:16 · Score: 4, Insightful

You also have to remember, though, that content generated dynamically is often going to be of no use to a search engine. It will often be user-specific; there's obviously some reason it's being generated that way.

And if pages are designed using AJAX and dynamic rendering just for the sake of using AJAX and dynamic rendering... well, they deserve what they get :)

--
Speak before you think
I would make normal links, then use JS on top by The+Amazing+Fish+Boy · 2007-03-11 19:33 · Score: 3, Insightful

So, what do you have to say about websites that have their entire user-interfaces built with content that gets filled by javascript asynchronously from a single html page?
If I understand you, you mean something like this: The site has two parts, a menu and content. When you click a menu item, rather than being taken to a new URL, it executes Javascript which fetches only the new content from the web server, then replaces the content section. So the URL doesn't change.

It's a nice improvement. Less bandwidth used, and a quicker interface.

Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this:
    // menuLink is the DOM element for each menu link.
    // (i.e. get it from document.getElementById(), etc.)
    menuLink.onclick = function() {
        getNewContent();
        return false;
    };
(FYI, this is how I do pop-up windows, too.)
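
A slightly fuller sketch of the same idea, for several links at once. The names here are hypothetical: `links` is any array-like of link elements, and `loadContent(url)` stands in for whatever fetches and swaps in the new content, returning true when it succeeded.

```javascript
// Hypothetical helper: layers the Javascript behaviour on top of normal links.
function enhanceMenuLinks(links, loadContent) {
  for (let i = 0; i < links.length; i++) {
    const link = links[i];
    link.onclick = function () {
      let handled = false;
      try {
        handled = loadContent(link.href) === true;
      } catch (err) {
        handled = false;   // loader blew up: fall back to the plain link
      }
      // Returning false cancels navigation; returning true lets the
      // browser follow the href as if no script were attached.
      return !handled;
    };
  }
}
```

Because every link keeps a real href, crawlers, screen readers, and browsers with Javascript disabled still get a working menu; the script only takes over when the dynamic load succeeds.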

Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.

I think Gmail actually offers two versions: one for older browsers that uses no (or little?) Javascript, and another which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version", though it may encourage them to update to a newer browser, I guess.

(Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)
Google holds back the web! by mumblestheclown · 2007-03-11 22:10 · Score: 1, Insightful

This is a pretty straightforward example of how Google holds back the web. This is not Google's fault, per se, but it definitely is true. We routinely resort to older, inefficient technologies for our websites simply to please Google. It works well for us from an advertising standpoint, but is often incredibly stupid technologically.
Luckily blind people don't drive! by Anonymous Coward · 2007-03-11 23:15 · Score: 1, Insightful
1. Javascript redirects are a trait of the incompetent; I bet Ford paid some cowboy a whole lot of money for a site that doesn't work.
2. On the Jeep site I can get to a few 'pages' that are actually just images with an image map and empty alt attributes for the HTML links. The HTML URLs are clean but not informative, and the others don't work (unsupported URL scheme in lynx).
3. The Credit Suisse site is reachable via a mislabeled link, "If you are a PALM, PSION, WINDOWS CE or NOKIA user click here". They even offer a sitemap for navigation. Tell-tale signs indicate this site was valid, accessible XHTML before some monkey was set loose on it.
Those selling professional web services should be liable under ADA and similar laws, that's how we fix the web.
Re:How does document.write mess up your DOM tree? by ultranova · 2007-03-12 03:49 · Score: 2, Insightful

Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)

Based on all the segfaults, blue screens of death, X-Window crashes, Firefox crashes, code insertion bugs et cetera I've seen, I'd say that no, in general programmers don't know what they're doing, and certainly shouldn't be trusted to not fuck it up. The less raw access to any resource - be it memory or document stream - they are given, the better.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.