Googlebot and Document.Write
With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.
zonkdogfology is a real word: Serious question now - is the author of the article worried that the ensuing slashdot discussion will mention all his other nonsense words? I've no doubt slashdotters will find & mention the other words here, polluting google's index....
There are shills on slashdot. Apparently, I'm one of them.
Save a click: No, Google does not "see" text inserted by Javascript.
The Google Pigeon is smart enough to read through Document.write. Duh!
Google needs to consider script if they want high-quality results. Besides the obvious fact that they'll miss content supplied by dynamic page elements, they could also sacrifice page quality. Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser. It's interesting to know the extent to which they correct for this.
Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead. Who knew web search would be restricted by the halting problem? I wonder how far Google goes...
It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript. I was actually hoping this guy would prove me wrong and demonstrate otherwise, but to my disappointment this was just another mostly pointless blog post.
Check your access log to see if Google actually requested the external JavaScript file. If it didn't there would be no reason to assume Google is interested in non-(X)HTML based content.
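A minimal sketch of that log check, assuming the server writes the common "combined" log format. The function name, the script path, and the sample log lines are all hypothetical; adjust the pattern to whatever your server actually records.

```javascript
// Sketch: scan access-log lines for Googlebot requests to a given script path.
// Assumes the "combined" log format, where the request line and the user
// agent both appear in double quotes.
function findGooglebotScriptHits(logText, scriptPath) {
  const hits = [];
  for (const line of logText.split(/\r?\n/)) {
    // The request line looks like "GET /path HTTP/1.1".
    const request = line.match(/"(?:GET|HEAD) (\S+) HTTP\/[\d.]+"/);
    if (!request) continue;
    const path = request[1].split("?")[0]; // ignore any query string
    if (path === scriptPath && /Googlebot/i.test(line)) {
      hits.push(line);
    }
  }
  return hits;
}

// Hypothetical sample log; the addresses, paths, and timestamps are made up.
const sampleLog = [
  '66.249.66.1 - - [12/Mar/2007:22:14:03 +0000] "GET /test.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
  '192.0.2.7 - - [12/Mar/2007:22:15:10 +0000] "GET /remote-words.js HTTP/1.1" 200 512 "http://example.com/test.html" "Mozilla/5.0 (Windows; U) Firefox/2.0"',
].join("\n");

// In this sample only a regular browser fetched the script, so no hits are found.
console.log(findGooglebotScriptHits(sampleLog, "/remote-words.js").length);
```

An empty result only shows that nothing claiming to be Googlebot fetched the file during the logged period; it says nothing about what a crawler would do with the script if it had fetched it.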
- John
http://www.jabcreations.com/
The model for websites is supposed to work something like this:
In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.
So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.
Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.
The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?
I don't believe you.
I doubt Google will notice DOM-created elements, either. But the author should re-test with that. And I would suggest that he post the result only if it turns out Google can see that, because we all assume it can't.
The bottom line is your web sites should probably degrade nicely enough when JavaScript is not enabled. It might not flow as nicely, the user may have to submit more forms, but the core functionality should still work and the core content should still be available.
DDA / Section 508 / WCAG - the no JavaScript clause makes for a lot of extra work - but it is one that can't be avoided on the (commercial) web application I architect. (Friggin sharks with laser beams for eyes making lawsuits and all.)
document.write() is executed as the page loads. Most AJAX-style implementations rely on either the innerHTML property or creating nodes through the DOM. Testing these would tell us much more than testing document.write().
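For anyone who wants to extend the test, the three insertion styles look roughly like this. It is only a sketch: the function names and the container id parameter are made up for illustration, and `doc` stands for the browser's global `document`.

```javascript
// Three ways a script can put text into a page. `doc` is any object with the
// relevant DOM methods; in a browser you would pass the global `document`.

function insertWithDocumentWrite(doc, text) {
  // Only appropriate while the page is still being parsed; calling
  // document.write after the page has loaded replaces the whole document.
  doc.write("<p>" + text + "</p>");
}

function insertWithInnerHTML(doc, containerId, text) {
  // Replaces whatever markup the container currently holds.
  doc.getElementById(containerId).innerHTML = "<p>" + text + "</p>";
}

function insertWithDomNodes(doc, containerId, text) {
  // Builds the paragraph node by node and appends it to the container.
  const paragraph = doc.createElement("p");
  paragraph.appendChild(doc.createTextNode(text));
  doc.getElementById(containerId).appendChild(paragraph);
}
```

A follow-up test page could put a different pair of nonsense words through each function and see which pairs, if any, become searchable.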
.: Max Romantschuk
So, some friends and I have been bantering back and forth about how Google treats content that has been inserted into a page using Javascript. So I decided to do an experiment. This page has six nonsense words. Two are hardcoded into the page via straight HTML. Two are inserted via Javascript, but the script is part of the page HTML. The last two are inserted via Javascript, but the script is on a remote server. The purpose of the test is to see three things... * The time lapse between when the words appear in a Google alert and when they're searchable on the main Google site. * Which words return search results. * If the words from the remotely sourced script return search results, do they point to this page, the .js file on the remote server, or both?
Here are a couple of nonsense words that turn up no hits in Google. They are hardcoded into the HTML. zonkdogfology and ibbytopknot
I'll repeat them for emphasis... zonkdogfology and ibbytopknot
Here are two words inserted into the page via a javascript hardcoded into the page...
test words are pignoklot and zimpogrit - these have been inserted via javascript
repetition: pignoklot and zimpogrit - these have been inserted via javascript
And now a couple of nonsense words inserted with a remotely-sourced javascript...
test words are fimptopo and biggytink - these have been inserted via javascript
repetition: fimptopo and biggytink - these have been inserted via javascript
And that constitutes the test. I should know within a few weeks how well it worked.
The CB App. What's your 20?
I think the actual experiment here is:
I look forward to the follow-up piece which details the financial results.
...an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.
That's a fantastic idea!
I predict that from now on, zonkdogfology will be a common tag for all articles that relate to google search...
This message brought to you by Jack Schitt's Previously Shat Shit
It's a nice improvement. Less bandwidth used, and a quicker interface.
Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this: (FYI, this is how I do pop-up windows, too.)
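A rough sketch of the approach described (not the poster's original code): the menu items are ordinary links, and a script upgrades them when it runs. `enhanceMenuLinks` and the `loadInPlace` callback are hypothetical names; the callback stands for whatever loads the content without a full page reload.

```javascript
// Sketch: the menu is made of ordinary links that work without scripting.
// When scripting is available, each link is upgraded to load its target in
// place. `loadInPlace` is a hypothetical callback (for example, an AJAX
// request that fills a content area); it returns true if it handled the link.
function enhanceMenuLinks(links, loadInPlace) {
  for (const link of links) {
    link.onclick = function (event) {
      const handled = loadInPlace(link.href);
      if (handled && event && event.preventDefault) {
        event.preventDefault(); // skip the normal page load
      }
      // Returning false from an onclick handler also cancels the default
      // action; returning true lets the browser follow the link as usual.
      return !handled;
    };
  }
}
```

In a browser this might be called as `enhanceMenuLinks(document.querySelectorAll("#menu a"), loadInPlace)`, where the `#menu` id is an assumption. For pop-up windows the callback would call `window.open(href)` and return true. A crawler or a browser without scripting never runs any of this and simply follows the plain links.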
Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.
I think Gmail actually offers two versions. One for older browsers that uses no (or little?) Javascript, and the other which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version", though it may encourage them to update to a newer browser, I guess.
(Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)
... and also by SPAM spiders sneaking around for Email addresses!
I didn't want to replace my contact information with a FORM submission and a visual challenge; I wanted to keep a direct email link in plain view on the page for quick contact and feedback.
Since I started building the mailto: link with some tricky, sliced-up JavaScript document.write() calls, I haven't received a single spam message at the semi-hidden address, which still looks like the classic "Contact us" email link to regular human visitors.
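A sketch of what such a sliced-up link could look like. The address parts and the function name are placeholders, not the poster's actual script.

```javascript
// Sketch: assemble a mailto link from fragments so the full address never
// appears as a single string in the page source. The user name, domain, and
// label passed in below are placeholders.
function buildMailtoLink(user, domain, label) {
  const address = user + "@" + domain;
  // The scheme is split as well, so a naive scan for "mailto:" finds nothing.
  return '<a href="mai' + 'lto:' + address + '">' + label + "</a>";
}

// In a browser, write the link in place while the page is being parsed.
if (typeof document !== "undefined") {
  document.write(buildMailtoLink("contact", "example.com", "Contact us"));
}
```

This only works against harvesters that read the raw source without running scripts, which is the same limitation the Googlebot test exposes.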
I certainly hope this won't change in the future!
Rgds,
Julien
I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth) and extract data, and I've been considering how to deal with JavaScript effectively.
Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages, run Javascript like a browser, and build the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
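To make the idea concrete, here is a toy version of such a skeleton. It is a sketch under heavy assumptions: it handles only inline scripts that call document.write, it finds them with a regular expression instead of a real parser, and it runs them with no sandboxing. The function name and the sample page are made up for illustration.

```javascript
// Toy "skeleton browser": runs a page's inline scripts against a stub
// document and returns the markup as it stands after they execute.
// Assumptions: scripts are inline, call only document.write, and are trusted
// (new Function provides no sandboxing). A real tool needs a full DOM,
// external script fetching, event handling, and a time limit.
function renderWithInlineScripts(html) {
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  return html.replace(scriptPattern, function (whole, code) {
    let written = "";
    const stubDocument = {
      write: function (...parts) {
        written += parts.join("");
      },
    };
    try {
      // Run the script body with `document` bound to the stub.
      new Function("document", code)(stubDocument);
    } catch (err) {
      // A failing script contributes no output in this sketch.
    }
    // The written markup takes the script's place, approximating what
    // document.write does during parsing. Scripts loaded via src have an
    // empty body here, so they are dropped, not fetched.
    return written;
  });
}

const samplePage =
  "<p>zonkdogfology and ibbytopknot</p>" +
  '<script>document.write("<p>pignoklot and zimpogrit</p>");<\/script>';

console.log(renderWithInlineScripts(samplePage));
```

Even this toy shows where the cost lies: every page means running someone else's code, which is why a real implementation needs isolation and a cutoff for scripts that never finish.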
It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.
Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.
OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.
Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.
Still, with a different approach my AJAX generated site:
http://dutchpipe.org/
is indexed perfectly:
http://66.102.9.104/search?q=cache:kvnpKdmDxwUJ:dutchpipe.org/+dutchpipe&hl=en&ct=clnk&cd=1
You can tell there's nothing interesting in the link from the fact that not even a summary of the results is given in the story. It looks like the average pay-to-get-diggs story, except you don't have to pay anything to be on Slashdot. Well done, and enjoy your Google Ads revenues!
God, root, what is difference ?
If you want to see through a search engine's eyes, open the page in Lynx. The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.
(method does not account for image crawlers)
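If Lynx isn't handy, a crude approximation of the same view is to throw away the script and style elements and strip the remaining tags. This is a sketch only: the function name is made up, regex-based tag stripping mishandles unusual markup, and character entities are not decoded.

```javascript
// Rough approximation of what a text-only client shows: drop script and
// style elements, strip the remaining tags, and collapse whitespace.
function staticTextView(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // remove code and CSS
    .replace(/<[^>]+>/g, " ") // strip the remaining tags
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

const testPage =
  "<p>zonkdogfology and ibbytopknot</p>" +
  '<script>document.write("pignoklot and zimpogrit");<\/script>';

console.log(staticTextView(testPage)); // prints "zonkdogfology and ibbytopknot"
```

Only the words hardcoded in the HTML survive, which matches what the experiment reported for Google.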
US Democracy:The best person for the job (among These pre-selected choices...)
The moderator meant to mod this +1 Funny (I would!) but forgot to actually try to understand the post.
Or perhaps this is a mod of a new experimental viral moderation system, but the viruses haven't evolved enough yet?
Web apps these days consist nearly entirely of dynamic content invisible to googlebot. If you try to make your page visible on the web, this is really a problem. But think twice before adding invisible divs or the like in order to achieve proper search results: Google might as well ban you (since they don't check whether or not the keywords you name in your invisible divs do in fact relate to the page's purpose or contents).
-- The Online Photo Editor - http://www.phixr.com
this is a pretty straightforward example of how google holds back the web. this is not google's fault, per se, but it definitely is true. We routinely resort to older, inefficient technologies for our websites simply to please google. it works well for us from an advertising standpoint, but is often incredibly stupid technologically.
funny as hell, but cruel!
Those selling professional web services should be liable under ADA and similar laws, that's how we fix the web.
This is an example of content that shouldn't be indexed by a search engine. I suppose you'd also like Google to auto compile, link and offer for download other forms of program source code? Don't blame Google because a small percentage of web developers are incompetent.
Friggin' vandals. Grow up, slashdot.
Can anyone here recommend a good place to download a current port of Lynx for Windows/XP? I'd like to be able to get formatted text out of a web page. I'm thinking along the lines of:
(I'd prefer a version that DOES NOT require Cygwin, as I use the GNU file/text/etc. utilities.) With all the web exploits that are out there, I'm reluctant to download an out-of-date, vulnerable, and/or poorly-ported port.
I'd appreciate knowing what YOU have found that works well, is up-to-date, and actively developed.
AJAX is for writing applications not Documents. Why and how should an application be indexed?
Hey I didn't think that after read-skipping the first paragraphs the article would actually state the obvious, that javascript generated content is not indexed... I would have expected an article to appear in case he found out it WAS INDEXED...
In other news: Most plants are green!
Copyright infringement is "piracy" in the same way DRM is "consumer rape"
Google doesn't hide its identity when it crawls. Just check for user agent Googlebot and serve different pages. This won't help with Yahoo, ask or Windows Live, but neither will targeting specific capabilities.
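The check itself is a one-liner, sketched below. The function name and the variant labels are hypothetical. The User-Agent header is supplied by the client and is easy to spoof, so this detects only what a client claims to be.

```javascript
// Sketch: pick a page variant from the User-Agent header. The header is
// supplied by the client and is trivially spoofed, so this identifies only
// what the client claims to be. The variant names are hypothetical.
function choosePageVariant(userAgent) {
  const claimsGooglebot = /Googlebot/i.test(userAgent || "");
  return claimsGooglebot ? "static-html" : "ajax";
}

console.log(
  choosePageVariant("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
);
```

A server would call this with the request's User-Agent header and send the plain HTML version whenever it returns "static-html".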
This was the first thing I noticed about Google when the first big buzz about this search engine came: No, they do NOT index any JavaScript either!
The W3C's technical architecture group (TAG) has published a document exploring the tradeoffs in creating Web content using imperative languages, such as JavaScript, vs. declarative languages such as HTML. See The Rule of Least Power, edited by Tim Berners-Lee and Noah Mendelsohn (yours truly).
From that document:
"There is an important tradeoff between the computational power of a language and the ability to determine what a program in that language is doing."
While it's not impossible that Google or other search engines would use some heuristic to extract information from the JavaScript source, actually running the program would involve many complexities, some of which have been mentioned by other commenters. Not the least of these relates to the famous halting problem. As paraphrased for the purposes of the TAG finding:
"The tradeoff for such power is that you typically cannot determine what a program in a Turing-complete language [I.e. such as JavaScript] will do without actually running it. Indeed, you often cannot tell in advance whether such a program will even reach the point of producing useful output. Of course, you can easily tell what a simple program such as print "2+2" will do, but given an arbitrary program you'd likely have to run it, and possibly for a very long time. Conversely, if you capture information in a simple declarative form, anyone can write a program to analyze it in many ways."
Comment removed based on user account deletion
If the background is so cluttered as to make the OCR difficult, then chances are the human will have trouble reading it too.
Web site images with logos against faint but busy backgrounds are moderately common. I'm talking about stuff like this. Commercial OCR programs interpret that as "a picture". Because we're working to automatically extract business identities from uncooperative websites, we sometimes need heavier technology than the search engines.
Given that search-driven traffic is so important to sites, indexers such as Google have a huge amount of power. If the search engines choose to not index Javascript-generated content (or other dynamic content), will content creators avoid putting real content within such elements?
Dinomite.net
Take a look at the Web Accessibility roadmap from the W3C, and in particular the section on intent-based markup.
Your comment on Digg would have been:
"blogspam, no digg"
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
There is at least one site I know of which is entirely AJAX based for its contents: http://www.spotplex.com/ . Presumably as a bandwidth limiting measure they populate all their content using AJAX calls. When you load the page in Lynx it's basically empty.
"Zonkdogfology" is a GooglePig. That is, zonkdogfology is a Google probe, designed to plumb the depths of the plumbing and report back its results. Just like the pigs used in pipelines. You know, the real world pipes.