Slashdot Mirror


Googlebot and Document.Write

With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.

34 of 180 comments

  1. Nonsense words? by Whiney+Mac+Fanboy · · Score: 5, Funny
    An alert came in in the late evening of March 10th for "zonkdogfology", one of the words in the first pair

    zonkdogfology is a real word:

    zonk-dog-fol-o-gy zohnk-dog--ful-uh-jee
    noun, plural -gies.

    1. the name given to articles from zonk where the summary makes no sense whatsoever.
    Serious question now - is the author of the article worried that the ensuing slashdot discussion will mention all his other nonsense words? I've no doubt slashdotters will find & mention the other words here, polluting google's index....
    --
    There are shills on slashdot. Apparently, I'm one of them.
    1. Re:Nonsense words? by Anonymous Coward · · Score: 4, Funny

      zonkdogfology is a real word:

      It's a perfectly cromulent word, and its use embiggens all of us.

  2. The Results: by XanC · · Score: 5, Informative

    Save a click: No, Google does not "see" text inserted by Javascript.

    1. Re:The Results: by temojen · · Score: 4, Informative

      And rightly so. You should be hiding & un-hiding or inserting elements using the DOM, never using document.write (which F's up your DOM tree).

  3. Google Pigeon technolog by sdugoten2 · · Score: 3, Funny

    The Google Pigeon is smart enough to read through Document.write. Duh!

  4. If they weren't, then they're trying by AnonymousCactus · · Score: 4, Interesting

    Google needs to consider script if they want high-quality results. Besides the obvious fact that they'll miss content supplied by dynamic page elements, they could also sacrifice page quality. Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser. It's interesting to know the extent to which they correct for this.

    Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead. Who knew web search would be restricted by the halting problem? I wonder how far Google goes...

    1. Re:If they weren't, then they're trying by gregmac · · Score: 4, Insightful

      You also have to remember, though, that dynamically generated content is often going to be of no use to a search engine. It will often be user-specific; there's obviously some reason it's being generated that way.

      And if pages are designed using AJAX and dynamic rendering just for the sake of using AJAX and dynamic rendering.. well, they deserve what they get :)

      --
      Speak before you think
  5. How did this make the front page? by Anonymous Coward · · Score: 2, Insightful

    It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript. I was actually hoping this guy would prove me wrong and demonstrate otherwise, but to my disappointment this was just another mostly pointless blog post.

    1. Re:How did this make the front page? by VGPowerlord · · Score: 2, Informative

      Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.

      Did you know that 99% of all statistics are made up?

      I can source some Javascript statistics: W3Schools reports that, as of January 2007, 94% of their audience has Javascript turned on, a significantly lower statistic than you are reporting. Not only that, but it is actually the highest percentage since they started recording them biannually in late 2002.

      It's a moot point, though. As the W3Schools stats page states: "You cannot - as a web developer - rely only on statistics. Statistics can often be misleading." Meaning that you should always code things so that they work with HTML/CSS, then use javascript to make it look/act nicer.
      --
      GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
    2. Re:How did this make the front page? by Bitsy+Boffin · · Score: 2, Informative

      From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream. You will not get two threads in the same javascript code pile running simultaneously; the timeout will not fire until the execution stream is idle.
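
      A minimal sketch of that behaviour, assuming a standard single-threaded JavaScript runtime (a browser or Node.js). The names `order` and `busyUntil` are purely illustrative. A zero-delay timer is registered first, yet its callback cannot run until the synchronous loop has finished:

```javascript
// Records the order in which the two pieces of code actually run.
const order = [];

// Ask for the callback "as soon as possible" (0 ms delay).
setTimeout(() => order.push("timeout fired"), 0);

// Keep the execution stream busy for about 50 ms.
const busyUntil = Date.now() + 50;
while (Date.now() < busyUntil) {
  // Spin: the timer is already due, but it cannot interrupt this loop.
}

// Still running synchronously, so the timer callback has not run yet.
order.push("synchronous code finished");
```

      At this point `order` holds only the synchronous entry; the timer callback is appended later, once the runtime goes idle.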

      --
      NZ Electronics Enthusiasts: Check out my Trade Me Listings
    3. Re:How did this make the front page? by vidarh · · Score: 2, Interesting
      Because doing so without massive limitations would involve the halting problem. A search engine simply CAN'T determine whether a certain piece of javascript will terminate in the general case. In lots of special cases it can (such as when there are no control constructs, or the control constructs can't possibly cause loops or recursion etc.), and it could use timeouts or only execute the first "n" steps of an interpreter. But all of it would mean essentially crippling the feature.

      And for what? So that some lazy web developer won't have to put the content they want indexed in a div and make it invisible and have their JS pick it up from there instead if they want to do more stuff with it?

      It would also pick up a lot of stuff that people have put in javascript because they don't want the search engines to index it.
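
      The "first n steps" idea can be sketched as below. This is only an illustration, not how any real crawler works: the script is modelled as a generator that yields once per step, and `runWithBudget`, `maxSteps` and both sample programs are made-up names.

```javascript
// Run a "program" (a generator function that yields once per step) for at
// most maxSteps steps. Returns whether it finished and any text it produced.
function runWithBudget(program, maxSteps) {
  const output = [];
  const steps = program(output);
  for (let used = 1; used <= maxSteps; used++) {
    if (steps.next().done) {
      return { finished: true, output: output.join("") };
    }
  }
  // Budget exhausted: we cannot tell whether the program would ever halt.
  return { finished: false, output: output.join("") };
}

// Terminates after writing two words.
function* wellBehaved(out) {
  out.push("hello ");
  yield;
  out.push("world");
}

// Never terminates, but keeps producing output.
function* runaway(out) {
  while (true) {
    out.push("x");
    yield;
  }
}
```

      The budget guarantees the indexer gets control back, but whatever a slow or looping script would have written after the cutoff is simply lost, which is the crippling part.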

  6. Google request external JavaScript file? by JAB+Creations · · Score: 4, Insightful

    Check your access log to see if Google actually requested the external JavaScript file. If it didn't there would be no reason to assume Google is interested in non-(X)HTML based content.
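
    One way to do that check, sketched under assumptions: the server writes Apache-style access log lines with the request in double quotes, and the crawler identifies itself with "Googlebot" in its user-agent string. The function name is invented for illustration.

```javascript
// Return the paths of .js files requested by anything calling itself
// Googlebot, given the raw text of an access log (one request per line).
function googlebotScriptRequests(logText) {
  return logText
    .split("\n")
    .filter((line) => /Googlebot/i.test(line))
    .map((line) => {
      // Pull the path out of the quoted request, e.g. "GET /a.js HTTP/1.1".
      const match = line.match(/"(?:GET|HEAD) (\S+) HTTP\/[\d.]+"/);
      return match ? match[1] : null;
    })
    .filter((path) => path !== null && /\.js(\?|$)/.test(path));
}
```

    An empty result means no logged Googlebot request fetched a script. Keep in mind that the user-agent string can be spoofed, so a non-empty result is a hint rather than proof.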

    1. Re:Google request external JavaScript file? by The+Amazing+Fish+Boy · · Score: 2, Informative

      I have actually seen some reports of a "new" Googlebot requesting the CSS and Javascript. The rumour I heard was that it was using the Gecko rendering engine or something along those lines. This was some time ago. I'm not sure what ever became of this.

  7. Doesn't work; Good (kind of) by The+Amazing+Fish+Boy · · Score: 5, Insightful
    FTFA:

    Why was I interested? Well, with all the "Web 2.0" technologies that rely on JavaScript (in the form of AJAX) to populate a page with content, it's important to know how it's treated to determine if the content is searchable.
    Good. I am glad it doesn't work. Google's crawler should never support Javascript.

    The model for websites is supposed to work something like this:
    • (X)HTML holds the content
    • CSS styles that content
    • Javascript enhances that content (e.g. provides auto-fill for a textbox)

    In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

    So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.

    Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.

    The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?
    1. Re:Doesn't work; Good (kind of) by doormat · · Score: 2, Informative

      I thought I remembered reading a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course the easy workaround for that is to use an image for your background, and that may fool the bot, but who knows, they could code to accommodate that too.

      Regardless, I'm pretty sure you'd get banned from the search engines for using such tactics.

      --
      The Doormat

      If you're not outraged, then you're not paying attention.
    2. Re:Doesn't work; Good (kind of) by Rakishi · · Score: 2, Insightful

      Huh? He's talking about browser-generated content; most dynamic content is generated server-side (like Slashdot, though I think Slashdot may keep flat files as a cache for speed reasons). No one said that a nice XML file can't be generated by the server when the page is called.

    3. Re:Doesn't work; Good (kind of) by cgenman · · Score: 2, Insightful

      In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

      Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer. To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.

      Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry. With so much happening dynamically based upon end-user created pages, along with the somewhat annoying usage of Flash, Powerpoint, or PDF for important information, you really can't create a comprehensive index without being a little flexible.

      Saying that Google shouldn't take into account scripting when scanning pages is like saying they shouldn't index the PDFs that are online. Sure, it may not conform to what you believe is "good web coding standards," but the reality is that they're out there.

    4. Re:Doesn't work; Good (kind of) by WNight · · Score: 2, Insightful

      I don't know about you, but I write my webpages so that when the style goes away, the page still views in a basic 1996 kind of style. Put the content first and your index bars and ads last then use CSS to position them first, visibly. This way if a blind user or someone without style sheets sees the site it at least reads in order.

    5. Re:Doesn't work; Good (kind of) by The+Amazing+Fish+Boy · · Score: 2, Insightful

      Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer.
      How's this? Disable CSS on Slashdot. First you get the top menu, then some options to skip to the menu, the content, etc. Then you get the menu, then the content. It's very easy to use it that way.

      To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.
      Well, for one thing, we are talking about a search engine here, which isn't a human being. So, there's one client that can "use" the information better in XHTML format. Then there's the visually impaired (who use screen readers as their clients), and those using a non-graphical client. Additionally, I would imagine it would be easier to screen scrape XHTML to get just the part you want (since a lot of content would be assigned an ID and/or a class.)

      Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry.
      Your first sentence is true, the second isn't. Web pages are dynamic, yes. I outlined how dynamic pages should be designed. That is, they should be made to work as static (X)HTML, then dynamically updated with Javascript. I don't see how your second sentence follows from the first at all. Web pages are dynamic... so we shouldn't follow standards? We shouldn't accommodate search engine crawlers, the blind, those using older browsers, or those who have Javascript support disabled?

      Notice I keep putting the X in (X)HTML in brackets. That's because I'm not convinced strict XHTML is the only viable method (though I'm not convinced it's not -- I'm on the fence).
    6. Re:Doesn't work; Good (kind of) by Animats · · Score: 2, Insightful

      The model for websites is supposed to work something like this:

      If only. Turn off JavaScript and try these sites:

    7. Re:Doesn't work; Good (kind of) by VGPowerlord · · Score: 3, Informative

      In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." – Webmaster Guidelines, Technical Guidelines section, bullet point 1.

      --
      GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
  8. Accessibility? by BladeMelbourne · · Score: 2, Informative

    The bottom line is your web sites should degrade nicely enough when JavaScript is not enabled. It might not flow as nicely, and the user may have to submit more forms, but the core functionality should still work and the core content should still be available.

    DDA / Section 508 / WCAG - the no JavaScript clause makes for a lot of extra work - but it is one that can't be avoided on the (commercial) web application I architect. (Friggin sharks with laser beams for eyes making lawsuits and all.)

  9. Re:How does document.write mess up your DOM tree? by XanC · · Score: 4, Informative

    If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.

    In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
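
    For comparison, a small sketch of the DOM approach. `appendParagraph` is a hypothetical helper, and the document object is passed in as `doc` only so the function does not depend on a global; in a browser you would pass `document`.

```javascript
// Build a <p> containing plain text and attach it to a parent element,
// instead of calling document.write("<p>" + text + "</p>").
function appendParagraph(doc, parent, text) {
  const paragraph = doc.createElement("p");
  // createTextNode treats the string as text, never as markup.
  paragraph.appendChild(doc.createTextNode(text));
  parent.appendChild(paragraph);
  return paragraph;
}
```

    In a page this would be called as `appendParagraph(document, document.body, "some text")`. Unlike document.write, it can be called at any time after the page has loaded.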

  10. google.com/?q=slashdotting+in+google+dollars by kale77in · · Score: 5, Insightful

    I think the actual experiment here is:

    • Create a 6-odd-paragraph page saying what everybody already knows.
    • Slashdot it, by suggesting something newsworthy is there.
    • Pack the page with Google ads.
    • Profit.

    I look forward to the follow-up piece which details the financial results.

    1. Re:google.com/?q=slashdotting+in+google+dollars by Scarblac · · Score: 4, Insightful

      Exactly, this is the typical sort of fluff that Digg seems to love. As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.

      --
      I believe posters are recognized by their sig. So I made one.
    2. Re:google.com/?q=slashdotting+in+google+dollars by dr.badass · · Score: 2, Insightful

      As far as I know, Slashdot had avoided this particular type of adword blog post crap until now

      It used to be that the web as a whole avoided this crap. Now, it's so easy to make stupid amounts of money from stupid content that a huge percentage of what gets submitted only even exists for the money -- it's like socially-acceptable spam. Digg is by far the worst confluence of this kind of crap, but the problem is web-wide, and damn near impossible to avoid.

      --
      Don't become a regular here -- you will become retarded.
  11. Re:How does document.write mess up your DOM tree? by jesser · · Score: 4, Insightful

    Perhaps more importantly, document.write can't be used to modify a page that has already loaded, limiting its usefulness for AJAX-style features.

    --
    The shareholder is always right.
  12. I would make normal links, then use JS on top by The+Amazing+Fish+Boy · · Score: 3, Insightful

    So, what do you have to say about websites that have their entire user-interfaces built with content that gets filled by javascript asynchronously from a single html page?
    If I understand you, you mean something like this: The site has two parts, a menu and content. When you click a menu item, rather than being taken to a new URL, it executes Javascript which fetches only the new content from the web server, then replaces the content section. So the URL doesn't change.

    It's a nice improvement. Less bandwidth used, and a quicker interface.

    Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this:

    // menuLink is the DOM element for each menu link.
    // (i.e. get it from document.getElementById(), etc.)
    menuLink.onclick = function() { getNewContent(); return false; }
    (FYI, this is how I do pop-up windows, too.)

    Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.

    I think Gmail actually offers two versions. One for older browsers that uses no (or little?) Javascript, and the other which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version", though it may encourage them to update to a newer browser, I guess.

    (Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)
  13. Google doesn't, but it's possible by Animats · · Score: 2, Informative

    I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth) and extract data, and I've been considering how to deal with JavaScript effectively.

    Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages and run Javascript like a browser, builds the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
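
    A toy version of that idea, limited to the document.write case from the article: run the script against a stand-in `document` that records what is written rather than rendering it. `captureDocumentWrite` and the fake object are hypothetical, there is no real DOM behind them, and `new Function` offers no sandboxing, so this must not be pointed at untrusted scripts as-is.

```javascript
// Execute scriptSource with a fake `document` whose write() only records
// its arguments, and return everything the script tried to write.
function captureDocumentWrite(scriptSource) {
  const written = [];
  const fakeDocument = {
    // Like the real method, concatenate all arguments into one string.
    write: (...parts) => written.push(parts.join("")),
  };
  // The script sees our stand-in under the name `document`.
  const run = new Function("document", scriptSource);
  run(fakeDocument);
  return written.join("");
}
```

    A real skeleton browser would also need the rest of the DOM, timers, network fetches for external scripts and some limit on run time, which is where the heavy machinery comes in.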

    It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.

    Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.

    OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.

    Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.

    Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

  14. If you want to see by BrynM · · Score: 3, Funny

    If you want to see through a search engine's eyes, open the page in Lynx. The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.

    (method does not account for image crawlers)

    --
    US Democracy:The best person for the job (among These pre-selected choices...)
  15. AJAX is for writing applications not Documents by e-Trolley · · Score: 2, Interesting

    AJAX is for writing applications not Documents. Why and how should an application be indexed?

  16. Re:How does document.write mess up your DOM tree? by hackstraw · · Score: 3, Funny


    One of the most clever uses of document.write I've seen was something like: document.write("<!--") YOU NEED JAVASCRIPT FOR THIS PAGE document.write("-->")

  17. Re:How does document.write mess up your DOM tree? by ultranova · · Score: 2, Insightful

    Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)

    Based on all the segfaults, blue screens of death, X-Window crashes, Firefox crashes, code insertion bugs et cetera I've seen, I'd say that no, in general programmers don't know what they're doing, and certainly shouldn't be trusted to not fuck it up. The less raw access to any resource - be it memory or document stream - they are given, the better.

    --

    Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

  18. Re:How does document.write mess up your DOM tree? by Sancho · · Score: 2, Interesting

    How should 'major' be defined, though? When most Firefox-borked sites were coded, Firefox probably had less than 5% (around what Safari had, last I heard). Is 5% enough to overlook? What about 3%? 1%?

    If you code to the standard, at least you can blame browsers for their broken implementation.