Googlebot and Document.Write

Nonsense words? by Whiney+Mac+Fanboy · 2007-03-11 17:09 · Score: 5, Funny

An alert came in in the late evening of March 10th for "zonkdogfology", one of the words in the first pair

zonkdogfology is a real word:

zonk-dog-fol-o-gy zohnk-dog--ful-uh-jee
noun, plural -gies.

1. the name given to articles from zonk where the summary makes no sense whatsoever.

Serious question now - is the author of the article worried that the ensuing slashdot discussion will mention all his other nonsense words? I've no doubt slashdotters will find & mention the other words here, polluting google's index....

--
There are shills on slashdot. Apparently, I'm one of them.

Re:Nonsense words? by Anonymous Coward · 2007-03-11 17:24 · Score: 1, Insightful

zonkdogfology ibbytopknot pignoklot zimpogrit fimptopo biggytink

Seriously, he shouldn't have posted these words until he was done with the test.
Re:Nonsense words? by Anonymous Coward · 2007-03-11 17:26 · Score: 4, Funny

zonkdogfology is a real word:

It's a perfectly cromulent word, and its use embiggens all of us.
Re:Nonsense words? by Dogtanian · 2007-03-11 23:52 · Score: 1

Seriously, he shouldn't have posted these words until he was done with the test.

Absolutely, and a major problem I have with taking it seriously now is that Google uses words in links *to* a particular site. This means that there is now a very high risk of false positives if *anyone* has done so with the words that are only dynamically-written on the original page.

If he'd been serious that "Over the next two weeks, I'll be watching to see two things", he should have kept his mouth shut for those two weeks.

--
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
Re:Nonsense words? by LordEd · 2007-03-12 02:38 · Score: 1

so is this article 'zonkedbydesign'?
Re:Nonsense words? by tdmf · 2007-03-12 22:44 · Score: 1

This test is found on trillions of SEO-related platforms. It is old and well known if you're interested in optimizing a webpage for search engines. A discussion on slashdot is just not necessary. It seems more like the author wanted to generate some traffic for his adsense-page.

--
Aussenwerbung

The Results: by XanC · 2007-03-11 17:12 · Score: 5, Informative

Save a click: No, Google does not "see" text inserted by Javascript.
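
To make that result concrete, here is a minimal sketch of what a crawler that never executes scripts could read from a page. The function name, the regexes, and the sample page are assumptions for illustration (regexes are not a real HTML parser, and this is not Googlebot's actual pipeline).

```javascript
// Rough approximation of what a crawler that does not execute scripts can read.
// The regexes are a simplification, not a real HTML parser.
function textWithoutScripts(html) {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ") // drop script elements and their source
    .replace(/<[^>]+>/g, " ")                            // drop the remaining tags
    .replace(/\s+/g, " ")                                // collapse whitespace
    .trim();
}

// Hypothetical test page: one word in static markup, one written by script.
const page = [
  "<html><body>",
  "<p>staticword</p>",
  '<script type="text/javascript">document.write("<p>scriptedword</p>");</script>',
  "</body></html>",
].join("\n");

const crawlerView = textWithoutScripts(page);
console.log(crawlerView); // prints "staticword"
```

The scripted word only exists after a browser runs the script, so a reader that skips script execution never sees it.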

Re:The Results: by temojen · 2007-03-11 17:27 · Score: 4, Informative

And rightly so. You should be hiding & un-hiding or inserting elements using the DOM, never using document.write (which F's up your DOM tree).
Re:The Results: by Darius__ · 2007-03-12 01:39 · Score: 1

> You should be hiding & un-hiding or inserting elements using the DOM

Absolutely! One idea I embrace when coding dynamic/AJAX pages is that of extending functionality that's already there. This means that instead of relying on javascript to generate the 'base page', I have that pre-built in html. Then I use javascript to remove elements and replace them with dynamic sections. This ensures not only that people who have javascript disabled can use the site, but also that Google and other search engines will be able to index its content.

One other item to mention: you can still use meta tags to keyword your page and provide a limited sort of access to the data that would otherwise be dynamically generated in it.
Re:The Results: by Anonymous Coward · 2007-03-12 07:52 · Score: 0

Amazing... Google's spiders don't stop and run all your javascript on their server???!!!

Say it ain't so!

Google Pigeon technolog by sdugoten2 · 2007-03-11 17:14 · Score: 3, Funny

The Google Pigeon is smart enough to read through Document.write. Duh!

If they weren't, then they're trying by AnonymousCactus · 2007-03-11 17:18 · Score: 4, Interesting

Google needs to consider script if they want high-quality results. Besides the obvious fact that they'll miss content supplied by dynamic page elements, they could also sacrifice page quality. Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser. It's interesting to know the extent to which they correct for this.

Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead. Who knew web search would be restricted by the halting problem? I wonder how far Google goes...

Re:If they weren't, then they're trying by TubeSteak · 2007-03-11 18:27 · Score: 1

Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.

Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead.
Google tends to nuke those sites from orbit once it discovers they're gaming the system.

And by "nuke from orbit" i mean "delists them with no warning"

--
[Fuck Beta]
o0t!
Re:If they weren't, then they're trying by wumpus188 · 2007-03-11 18:28 · Score: 1

Yes, but supporting javascript won't fix the problem.
...that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.
They'll just go the other way around - show the static viagra content in browser and rewrite it for google bot.
Re:If they weren't, then they're trying by gregmac · 2007-03-11 19:16 · Score: 4, Insightful

You have to also remember though, that often the content generated dynamically is going to be of no use to a search engine, it will often be user-specific - there's obviously some reason it's being generated that way.

And if pages are designed using AJAX and dynamic rendering just for the sake of using AJAX and dynamic rendering.. well, they deserve what they get :)

--
Speak before you think
Re:If they weren't, then they're trying by hauntingthunder · 2007-03-11 21:10 · Score: 1

Yeah, right. We have to do what Google wants. Someone joked that the ideal site for Google is a university site circa 1997. Don't forget that the Googlebot is a dumb user agent: it doesn't have JavaScript, so JavaScript navigation, AJAX and Flash are the kiss of death for search engines.

--
You will never get to heaven with an Ak 47... But A Zu 30 is good for Low Flying Cherubim
Re:If they weren't, then they're trying by Anonymous Coward · 2007-03-11 22:19 · Score: 0

You mean you have to create standards compliant sites that don't rely heavily on whiz-bang scripts and totally 100% non-standard Flash? Boo hoo, my heart bleeds for you, you poor web "developer" you etc. etc. ad nauseam.
Re:If they weren't, then they're trying by jrumney · 2007-03-11 22:41 · Score: 1

Google should index the static content, but run/analyse the Javascript and throw out any pages where the user-visible content changes drastically. To be 100% effective though, they'd have to fake the IE or Firefox User-Agent, and use IP addresses from an ISP's dynamically assigned range for their crawling, which some people might see as evil.
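
One way to read "changes drastically" is a word-overlap score between the static text and the text after scripts have run. The sketch below uses Jaccard similarity; the function names, the 0.5 threshold, and the sample strings are all assumptions, and nothing here is known to be what Google does.

```javascript
// Split text into a set of lowercase words.
function wordSet(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) || []);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, from 0 (disjoint) to 1 (identical).
function jaccard(a, b) {
  const setA = wordSet(a);
  const setB = wordSet(b);
  if (setA.size === 0 && setB.size === 0) return 1;
  let shared = 0;
  for (const w of setA) if (setB.has(w)) shared++;
  return shared / (setA.size + setB.size - shared);
}

// Hypothetical rule: flag the page when fewer than half the words survive.
function changesDrastically(staticText, renderedText, threshold = 0.5) {
  return jaccard(staticText, renderedText) < threshold;
}

const staticText = "reviews of hiking boots and trail maps";
const enhanced = "reviews of hiking boots and trail maps updated today";
const rewritten = "cheap pills order now";

console.log(changesDrastically(staticText, enhanced));  // false
console.log(changesDrastically(staticText, rewritten)); // true
```

A page that merely adds a few words stays above the threshold, while one that swaps its whole content falls to zero overlap.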
Re:If they weren't, then they're trying by Arancaytar · 2007-03-11 23:01 · Score: 1

The point being?

If both Googlebot and users see the rewritten content, the original advertisement is never displayed.

Except for the few users that have Javascript disabled, but that's not the stupid-user target group that clicks on spam anyway...
Re:If they weren't, then they're trying by Arancaytar · 2007-03-11 23:16 · Score: 1

I'd rather suggest they don't look at script content at all.

Part of it is practicality, as already implied: With delays, self-writing code, horrible "quirks" that are not browser-independent, it's nearly impossible to predict what the script is going to do in the user's browser. Besides gobbling insane resources on the spidering server, increased by scripts that cause crashes.

Another part is philosophy and good practice - AJAX is for interactive applications, static HTML/XHTML for content. Applications shouldn't be indexed anyway, since the pages are user-specific and extremely dynamic. If you search the web, you're really looking for documents with content - and there's no reason why those shouldn't be entirely static.

Catering to the trend that anything, even simple text content, is only made accessible through barrier-heavy, browser-dependent AJAX applications is a step in the wrong direction. Google might as well execute flash movies, begin using OCR to read text in pictures or voice recognition to index mp3 files by song lyrics.
Re:If they weren't, then they're trying by CastrTroy · 2007-03-12 01:05 · Score: 1

Rightfully so for google news. Is there any way to configure google news only to show links to articles on certain sites? Or to blacklist certain sites? I really hate those "news" sites that put javascript on every seventh word, so that if you hover over the word, it shows a little pop-up div type ad. It's especially annoying because I like to highlight text as I read it, because I find it easier. I wish google would run all the JS in a page, and lower the ranking if it contained too many ads.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:If they weren't, then they're trying by Skreems · 2007-03-12 03:37 · Score: 1

They already can and do serve different content to GoogleBot than to normal web users. All it takes is checking the client string.

--
Slashdot needs a "-1, Wrong" moderation option.
The Urban Hippie
Re:If they weren't, then they're trying by Mixel · 2007-03-12 06:21 · Score: 1

Which is a PITA when the IEEE does it. Look! Free searchable online papers in pdf format!... NOT*!

*apologies, watched the Borat trailer too many times

How did this make the front page? by Anonymous Coward · 2007-03-11 17:20 · Score: 2, Insightful

It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript. I was actually hoping this guy would show me wrong and demonstrate otherwise, but to my disappointment this was just another mostly pointless blog post.

Re:How did this make the front page? by EvanED · 2007-03-11 17:26 · Score: 1

It should be pretty obvious that no search engine should interpret javascript...

Why's that?

Properly constructed there should be no security issue, and it would give more accurate results.
Re:How did this make the front page? by NoTheory · 2007-03-11 17:31 · Score: 1

Why should it? I mean, isn't the point of javascript content that responds dynamically to the intentions of an agent? The googlebot, although an extremely complicated AI agent, isn't intentional. It doesn't know what it's doing on a site, and so i figure probably shouldn't just be allowed out to wreak havoc. Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?

--
There are lives at stake here!
Re:How did this make the front page? by EvanED · 2007-03-11 17:38 · Score: 1

Why should it?

Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.

I mean, isn't the point of javascript content that responds dynamically to the intentions of an agent?

I probably wouldn't have the indexer run most events, but it seems that those in document.load or other places that are run when a page is loaded should be indexed.

Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?

JavaScript doesn't have fork AFAIK. Besides, the broader question of resource consumption is trivially solvable by setting limits on the process doing the work.
Re:How did this make the front page? by zobier · 2007-03-11 18:10 · Score: 1

Also, wouldn't that allow one an opportunity to fork-bomb the googlebot then as well?

JavaScript doesn't have fork AFAIK.

The setTimeout function can do a similar thing.

--
Me lost me cookie at the disco.
Re:How did this make the front page? by Jake73 · 2007-03-11 18:11 · Score: 1

Yeah, I was kinda shocked, really. I always wondered how people with bad blogs were able to break into the mainstream and gather regular readers. I guess they just try like hell to get picked up on Slashdot/Digg/etc with some worthless blog post.
Re:How did this make the front page? by Hal_Porter · 2007-03-11 19:53 · Score: 1

Actually, the latest nightly builds of the Last Measure can burn even the electronic eyeballs of the google bot, without using Javascript.

It seems to find it, indeed you can find out what Last Measure is by Googling it, but I can see from the logs that it only checks once. Just like a human would. "Hmm, Last Measure what's that? Aiiiieeee!"

Very interesting.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;
Re:How did this make the front page? by VGPowerlord · 2007-03-11 19:54 · Score: 2, Informative

Because JavaScript can create content. Since 99% of people run with it enabled, they will see this content, so it makes sense to index it.

Did you know that 99% of all statistics are made up?

I can source some Javascript statistics: W3Schools reports that, as of January 2007, 94% of their audience has Javascript turned on, a significantly lower statistic than you are reporting. Not only that, but it is actually the highest percentage since they started recording them biannually in late 2002.

It's a moot point, though: As W3Schools stats page states "You cannot - as a web developer - rely only on statistics. Statistics can often be misleading." Meaning that you should always code things so that they work with HTML/CSS, then use javascript to make it look/act nicer.

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:How did this make the front page? by osu-neko · 2007-03-11 20:02 · Score: 1

It should be pretty obvious that no search engine should interpret javascript, let alone remotely sourced javascript.

Granted. It's just that some people like to actually have empirical evidence for something before they conclude it's true, rather than say "that's how it should work" and then pretend they know something that they really don't, based on the way they think the universe should be rather than on any actual evidence of the way it actually is.

--
"Convictions are more dangerous enemies of truth than lies."
Re:How did this make the front page? by Bitsy+Boffin · 2007-03-11 21:04 · Score: 2, Informative

From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream, you will not get two threads in the same javascript code pile running simultaneously, the timeout will not fire until the execution stream is idle.
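
A small sketch of that claim, assuming a standard single-threaded JavaScript event loop (the variable names are made up for illustration): a zero-delay timeout is requested, but its callback cannot run until the busy loop that follows has finished.

```javascript
const events = [];

// Ask for the callback to run as soon as possible (0 ms delay).
setTimeout(() => events.push("timeout callback"), 0);

// Keep the current execution stream busy well past the requested delay.
const start = Date.now();
while (Date.now() - start < 50) {
  // busy-wait for about 50 ms
}
events.push("synchronous code finished");

// Only one entry so far: the callback could not interrupt the loop above.
console.log(events.length); // prints 1

// A later timer reports the order in which things actually ran.
setTimeout(() => console.log(events.join(", ")), 10);
```

If timeouts ran concurrently with the calling code, the callback's entry could appear before the loop ends; in this model it never does.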

--
NZ Electronics Enthusiasts: Check out my Trade Me Listings
Re:How did this make the front page? by kv9 · 2007-03-11 22:38 · Score: 1

Yeah, I was kinda shocked, really. I always wondered how people with bad blogs were able to break into the mainstream and gather regular readers. I guess they just try like hell to get picked up on Slashdot/Digg/etc with some worthless blog post.
well that too, but in general it's even easier. just aim low and hope for the best. it's not very hard to appeal to the mainstream. shit, it's the largest audience out there.

--
Stop Computers/Cars Analogies on S
Re:How did this make the front page? by xoyoyo · 2007-03-11 22:42 · Score: 1

"You cannot - as a web developer - rely only on statistics. "

No, indeed, because doing things based on empirical evidence is foolish behaviour. On the other hand you should take a political position (that data, presentation and behaviour should be kept separate) and behave as though that was in some way more true than a statistical value.

I'm not saying that the separation of data, presentation and behaviour is wrong, just that you have to realise that it's a human engineered best practise, not a law of the universe. So saying that you cannot rely on statistics is wrong. Of course you can. Saying you *shouldn't* rely on statistics is entirely correct.
Re:How did this make the front page? by xoyoyo · 2007-03-11 22:44 · Score: 1

(Slashdot swallowed my sarcasm tag there - the first paragraph should be read in a mildly mocking voice - the second one is the meat of the matter)
Re:How did this make the front page? by vidarh · 2007-03-11 23:57 · Score: 2, Interesting

Because doing so without massive limitations would involve the halting problem. A search engine simply CAN'T determine whether a certain piece of javascript will terminate in the general case. In lots of special cases, yes (such as when there's no control constructs, or the control constructs can't possibly cause loops or recursion etc.) and they could use timeouts etc. or only execute the first "n" steps of an interpreter, yes. But all of it would mean essentially crippling the feature.
And for what? So that some lazy web developer won't have to put the content they want indexed in a div and make it invisible and have their JS pick it up from there instead if they want to do more stuff with it?
It would also pick up a lot of stuff that people have put in javascript because they don't want the search engines to index it.
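
The "only execute the first n steps" idea can be sketched with a toy interpreter. The two-instruction language, the function name, and the budgets below are hypothetical; a real JavaScript engine is far more involved, but the principle of giving up after a fixed step count is the same.

```javascript
// Toy instruction set (hypothetical, for illustration only):
//   { op: "write", text }  appends text to the page output
//   { op: "jump", to }     continues at instruction index `to`
function runWithBudget(program, maxSteps) {
  const output = [];
  let pc = 0;    // index of the next instruction
  let steps = 0; // instructions executed so far
  while (pc < program.length) {
    if (steps >= maxSteps) {
      return { finished: false, output }; // budget exhausted, give up
    }
    const instr = program[pc];
    steps++;
    if (instr.op === "write") {
      output.push(instr.text);
      pc++;
    } else if (instr.op === "jump") {
      pc = instr.to;
    } else {
      throw new Error("unknown op: " + instr.op);
    }
  }
  return { finished: true, output };
}

const terminating = [{ op: "write", text: "hello" }, { op: "write", text: "world" }];
const looping = [{ op: "write", text: "spam" }, { op: "jump", to: 0 }];

const ok = runWithBudget(terminating, 100);
const cut = runWithBudget(looping, 5);
console.log(ok.finished, ok.output.length);   // true 2
console.log(cut.finished, cut.output.length); // false 3
```

The budget sidesteps the halting problem by never asking whether the script terminates: it indexes whatever was produced before the cutoff and moves on.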
Re:How did this make the front page? by EvanED · 2007-03-12 02:18 · Score: 1

Because doing so without massive limitations would involve the halting problem.

So is all of the work my research group is doing. In fact, so is a lot of the work groups are doing around the country.

A search engine simply CAN'T determine whether a certain piece of javascript will terminate in the general case.

So? That's a bit of a strawman, don't you think?

But all of it would mean essentially crippling the feature

Why do you think it would be crippled? First, most of the time it would do the author of a page no good to have a script that runs for more than a few seconds because then the PERSON visiting the page would probably not see it. Second, I have a strong suspicion that if you did a survey of pages that had JavaScript running when the page loads, any that were not finished running in half a second would not finish at all.

It would also pick up a lot of stuff that people have put in javascript because they don't want the search engines to index it.

And this is exactly why they SHOULD do it. The page should be indexed based upon how it looks to the visitor, which means with JavaScript's side effects and all.
Re:How did this make the front page? by EvanED · 2007-03-12 02:25 · Score: 1

Did you know that 99% of all statistics are made up?

Yes, I made up the 99%. I was using it as a synonym for "almost all". And 94% is still plenty close enough to "all" to fall into that category.

Meaning that you should always code things so that they work with HTML/CSS, then use javascript to make it look/act nicer.

I don't care what people should do, I care what people actually do. If that means intentionally fooling with search results by manipulating pages with Javascript, then the robot should run Javascript. (I don't know how common this is at all, but Google should at least look at it periodically to see if it's worth it.)
Re:How did this make the front page? by zobier · 2007-03-12 08:50 · Score: 1

From memory, setTimeout forms a time-delayed but synchronous entry into the execution stream, you will not get two threads in the same javascript code pile running simultaneously, the timeout will not fire until the execution stream is idle.

Uh-uh, it has its own execution context. You can absolutely run timed out functions concurrently. Try this:
<script type="text/javascript">  </script>

--
Me lost me cookie at the disco.
Re:How did this make the front page? by heinousjay · 2007-03-12 13:11 · Score: 1

Without any output it doesn't prove concurrent execution. It could just be a garden variety infinite loop, albeit one with an indirect execution path.

--
Slashdot - where whining about luck is the new way to make the world you want.
Re:How did this make the front page? by zobier · 2007-03-12 16:18 · Score: 1

Without any output it doesn't prove concurrent execution. It could just be a garden variety infinite loop, albeit one with an indirect execution path.

You want output, how's this for output?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title></title> </head> <body> <script type="text/javascript">  </script> </body> </html>

--
Me lost me cookie at the disco.

Google request external JavaScript file? by JAB+Creations · 2007-03-11 17:22 · Score: 4, Insightful

Check your access log to see if Google actually requested the external JavaScript file. If it didn't there would be no reason to assume Google is interested in non-(X)HTML based content.
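
A minimal sketch of that check, assuming an Apache-style combined log format. The function names and the sample lines are made up for illustration and say nothing about what Googlebot really fetched. Note that the user agent string can be spoofed, so a match is only a hint.

```javascript
// Pull the request path and user agent out of a combined-log-format line.
function parseLine(line) {
  const m = line.match(/"(?:GET|HEAD|POST) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"/);
  return m ? { path: m[1], userAgent: m[2] } : null;
}

// Paths of .js files requested by something calling itself Googlebot.
function googlebotScriptRequests(lines) {
  return lines
    .map(parseLine)
    .filter((e) => e && /Googlebot/i.test(e.userAgent) && /\.js(\?|$)/.test(e.path))
    .map((e) => e.path);
}

// Made-up sample lines (documentation IP ranges, hypothetical file names).
const sampleLog = [
  '192.0.2.10 - - [10/Mar/2007:22:14:01 +0000] "GET /test.html HTTP/1.1" 200 1043 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
  '192.0.2.10 - - [10/Mar/2007:22:14:02 +0000] "GET /words.js HTTP/1.1" 200 212 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
  '198.51.100.7 - - [10/Mar/2007:22:15:40 +0000] "GET /words.js HTTP/1.1" 200 212 "http://example.com/test.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0"',
];

const hits = googlebotScriptRequests(sampleLog);
console.log(hits); // [ '/words.js' ]
```

An empty result over a real log would suggest the crawler never asked for the script file at all.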

--
- John
http://www.jabcreations.com/

Re:Google request external JavaScript file? by The+Amazing+Fish+Boy · 2007-03-11 22:49 · Score: 2, Informative

I have actually seen some reports of a "new" Googlebot requesting the CSS and Javascript. The rumour I heard was that it was using the Gecko rendering engine or something along those lines. This was some time ago. I'm not sure what ever became of this.
Re:Google request external JavaScript file? by JAB+Creations · 2007-03-15 15:12 · Score: 1

There are plenty of spammers spoofing as Google.

--
- John
http://www.jabcreations.com/

Doesn't work; Good (kind of) by The+Amazing+Fish+Boy · 2007-03-11 17:29 · Score: 5, Insightful

FTFA:

Why was I interested? Well, with all the "Web 2.0" technologies that rely on JavaScript (in the form of AJAX) to populate a page with content, it's important to know how it's treated to determine if the content is searchable.

Good. I am glad it doesn't work. Google's crawler should never support Javascript.

The model for websites is supposed to work something like this:

(X)HTML holds the content
CSS styles that content
Javascript enhances that content (e.g. provides auto-fill for a textbox)

In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.

Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.

The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?

Re:Doesn't work; Good (kind of) by milo317 · 2007-03-11 17:32 · Score: 1

Thought so, as G won't follow java links, as it's stated in their webmaster codex.
Re:Doesn't work; Good (kind of) by catbutt · 2007-03-11 17:37 · Score: 1

Who's talking about Java?
Re:Doesn't work; Good (kind of) by doormat · 2007-03-11 17:46 · Score: 2, Informative

I thought I remember a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course the easy work around for that is to use an image for your background and then that may fool the bot, but who knows, they could code to accommodate that too.

Regardless, I'm pretty sure you'd get banned from the search engines for using such tactics.
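
A sketch of the "same or a similar color" test, assuming colors are given as #rrggbb strings. The function names and the threshold of 30 are hypothetical, and as noted above a background image would defeat a check this simple.

```javascript
// Parse "#rrggbb" into [r, g, b].
function parseHex(color) {
  const m = /^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(color);
  if (!m) throw new Error("expected #rrggbb, got " + color);
  return m.slice(1).map((h) => parseInt(h, 16));
}

// Euclidean distance in RGB space: 0 for identical colors, about 441 for black vs white.
function colorDistance(a, b) {
  const [r1, g1, b1] = parseHex(a);
  const [r2, g2, b2] = parseHex(b);
  return Math.hypot(r1 - r2, g1 - g2, b1 - b2);
}

// Hypothetical rule: text this close to its background is probably hidden.
function looksHidden(textColor, backgroundColor, threshold = 30) {
  return colorDistance(textColor, backgroundColor) < threshold;
}

console.log(looksHidden("#fefefe", "#ffffff")); // true
console.log(looksHidden("#000000", "#ffffff")); // false
```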

--
The Doormat

If you're not outraged, then you're not paying attention.
Re:Doesn't work; Good (kind of) by Tablizer · 2007-03-11 17:54 · Score: 1

You are basically saying that "dynamic content should go to hell". Dynamic content is the result of automation. Do you propose we stick with outmoded "flat" technologies like flat files? Databases and cross-server-content-grabbing be damned? I find this disturbing.

--
Table-ized A.I.
Re:Doesn't work; Good (kind of) by Anonymous Coward · 2007-03-11 17:54 · Score: 0

Well, we're talking more generally about dynamically created content, and as Java is used for such, I think the point is valid.
Re:Doesn't work; Good (kind of) by Rakishi · 2007-03-11 18:00 · Score: 2, Insightful

Huh? He's talking about browser generated content, most dynamic content is server side generated (like slashdot but I think slashdot may have flat files as cache for speed reasons). No one said that nice xml file can't be generated by the server when the page is called.
Re:Doesn't work; Good (kind of) by fbartho · 2007-03-11 18:10 · Score: 1

So, what do you have to say about websites that have their entire user-interfaces built with content that gets filled by javascript asynchronously from a single html page? Now the only ones I have made or seen that are like that require logins to protect the data they are providing via an active interface; live examples including gmail and others, but I don't think that means that providing the searchable data only via javascript is necessarily inappropriate.

--
Gravity Sucks
Re:Doesn't work; Good (kind of) by cgenman · 2007-03-11 18:12 · Score: 2, Insightful

In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.

Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer. To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.

Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry. With so much happening dynamically based upon end-user created pages, along with the somewhat annoying usage of Flash, Powerpoint, or PDF for important information, you really can't create a comprehensive index without being a little flexible.

Saying that Google shouldn't take into account scripting when scanning pages is like saying they shouldn't index the PDFs that are online. Sure, it may not conform to what you believe is "good web coding standards," but the reality is that they're out there.

--
The ______ Agenda
Re:Doesn't work; Good (kind of) by zobier · 2007-03-11 18:15 · Score: 1

I thought I remember a while ago about some search engine using intelligence to ignore hidden text (text with the same or a similar color as the background). Of course the easy work around for that is to use an image for your background and then that may fool the bot, but who knows, they could code to accommodate that too.

You could use OCR to detect that (and to index images used for text content).

--
Me lost me cookie at the disco.
Re:Doesn't work; Good (kind of) by WNight · 2007-03-11 18:59 · Score: 2, Insightful

I don't know about you, but I write my webpages so that when the style goes away, the page still views in a basic 1996 kind of style. Put the content first and your index bars and ads last then use CSS to position them first, visibly. This way if a blind user or someone without style sheets sees the site it at least reads in order.
Re:Doesn't work; Good (kind of) by The+Amazing+Fish+Boy · 2007-03-11 19:11 · Score: 2, Insightful

Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web developer.
How's this? Disable CSS on Slashdot. First you get the top menu, then some options to skip to the menu, the content, etc. Then you get the menu, then the content. It's very easy to use it that way.
To them, it's just going to be one big, messy looking freak out... akin to a television show whose cable descrambler broke. Sure all the "information" is there, somewhere, but in such a horrible format that a human being can't use it.
Well, for one thing, we are talking about a search engine here, which isn't a human being. So, there's one client that can "use" the information better in XHTML format. Then there's the visually impaired (who use screen readers as their clients), and those using a non-graphical client. Additionally, I would imagine it would be easier to screen scrape XHTML to get just the part you want (since a lot of content would be assigned an ID and/or a class.)
Web pages are dynamic these days. Saying that the only acceptable model is statically defined strict XHTML mixed with an additional layer of tableless CSS is foolish zealotry.
Your first sentence is true, the second isn't. Web pages are dynamic, yes. I outlined how dynamic pages should be designed. That is, they should be made to work as static (X)HTML, then dynamically updated with Javascript. I don't see how your second sentence follows from the first at all. Web pages are dynamic... so we shouldn't follow standards? We shouldn't accommodate search engine crawlers, the blind, those using older browsers, or those who have Javascript support disabled?

Notice I keep putting the X in (X)HTML in brackets. That's because I'm not convinced strict XHTML is the only viable method (though I'm not convinced it's not -- I'm on the fence).
Re:Doesn't work; Good (kind of) by Animats · 2007-03-11 20:03 · Score: 2, Insightful
The model for websites is supposed to work something like this:
If only. Turn off JavaScript and try these sites:
Re:Doesn't work; Good (kind of) by VGPowerlord · 2007-03-11 20:50 · Score: 3, Informative

In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." – Webmaster Guidelines, Technical Guidelines section, bullet point 1.

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:Doesn't work; Good (kind of) by caluml · 2007-03-11 21:03 · Score: 1

View, Page Style, No Style in Firefox will show you what your page looks like to browsers/spiders.

--
Get your own free personal location tracker
Re:Doesn't work; Good (kind of) by Stooshie · 2007-03-11 21:47 · Score: 1

You are basically saying that "dynamic content should go to hell"
Huh? He's talking about browser generated content, most dynamic content is server side generated

He is talking about AJAX sites that use JavaScript to dynamically load content from the server based on user actions (such as Google suggest)

Interesting that Google themselves use AJAX but don't index it.

--
America, Home of the Brave. ... .and the Squaw.
Re:Doesn't work; Good (kind of) by maxwell+demon · 2007-03-11 21:49 · Score: 1

And how do you bookmark a certain view of that page (which, to you as page user, is a separate page after all)?

--
The Tao of math: The numbers you can count are not the real numbers.
Re:Doesn't work; Good (kind of) by Rakishi · 2007-03-11 22:18 · Score: 1

Yes, I'm well aware of that but he's more specifically talking about certain uses of AJAX. The poster I replied to said dynamic content which is a whole lot more than either the original poster meant or even what AJAX encompasses. Databases and non-flat things are not AJAX and not in any way what the original poster meant. I was simply pointing out that the person I replied to can't read.

AJAX should degrade gracefully, if you don't have javascript things should still work which means that search spiders should have no trouble on AJAX websites.
Re:Doesn't work; Good (kind of) by Lord+Ender · 2007-03-12 02:56 · Score: 1

The old model is dying. Simple web pages are on the way out. Web applications are the future.

A search engine that indexes web applications is more useful to me than one that can not.

Google realizes that, and you don't.

--
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
Re:Doesn't work; Good (kind of) by foniksonik · 2007-03-12 03:07 · Score: 1

The model should really be

DOM holds the content (whether HTML/XHTML/XML or plain text; static/dynamic or mixed)
CSS styles that content
Javascript enhances that content (e.g. provides auto-fill for a textbox)

Google should be indexing the DOM and its contents, not the code in the file. That's like indexing the English dictionary and saying you've indexed the English language.

Websites are going to be more and more dynamic. Content is going to be added directly to the page from an amalgamation of sources with the basic structure defined by static code in the file output of the webserver itself.

Those making screenreaders need to learn this as well. It's called progress and legislation has never been able to stop it (though they keep trying).

--
A fool throws a stone into a well and a thousand sages can not remove it.
Re:Doesn't work; Good (kind of) by suv4x4 · 2007-03-12 03:34 · Score: 1

The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?

Google has a bot that understands CSS and JavaScript, based roughly on the Mozilla source code (wondered why they hire so many Firefox developers?). They won't use it for indexing your site's content, but they run it in parallel to their old "Lynx"-style bot, to detect black hat SEO and scams which abuse CSS/JS.
Re:Doesn't work; Good (kind of) by fbartho · 2007-03-12 04:33 · Score: 1

basically what you do is parse the anchor extension:

http://sub.site.dom/file.html#BookMarkableTag

you take that tag and have a function that decides how to react to it... and tahdah! bookmarked!
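
A rough sketch of that idea, with the URL passed in as a string so it also runs outside a browser. The `views` table and the `fragmentOf` and `restoreView` names are made up for illustration; a real page would swap DOM content rather than return strings.

```javascript
// Hypothetical table of views; each entry stands in for "show this part of the app".
const views = {
  inbox: () => "showing inbox",
  settings: () => "showing settings",
};

// Return the part of the URL after '#', or "" when there is no fragment.
function fragmentOf(url) {
  const hash = new URL(url).hash; // e.g. "#BookMarkableTag"
  return hash.startsWith("#") ? hash.slice(1) : "";
}

// Decide how to react to the fragment; unknown tags fall back to a default view.
function restoreView(url, fallback = "inbox") {
  const tag = fragmentOf(url);
  const known = Object.prototype.hasOwnProperty.call(views, tag);
  return views[known ? tag : fallback]();
}
```

In a browser you would presumably call restoreView(window.location.href) once the page has loaded, and update the fragment whenever the user switches views.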

--
Gravity Sucks

How does document.write mess up your DOM tree? by catbutt · 2007-03-11 17:33 · Score: 1

I don't believe you.

Re:How does document.write mess up your DOM tree? by XanC · 2007-03-11 17:58 · Score: 4, Informative

If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.

In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
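
For comparison, a minimal sketch of the DOM-call style. The `appendParagraph` helper is only an illustrative name, and the document is passed in as `doc` (an assumption made so the function can be exercised outside a browser too).

```javascript
// Build <p>text</p> through DOM calls instead of
// document.write("<p>" + text + "</p>").
function appendParagraph(doc, parent, text) {
  const p = doc.createElement("p");
  // A text node holds plain character data, so "<" or "&" in `text`
  // never gets parsed as markup.
  p.appendChild(doc.createTextNode(text));
  parent.appendChild(p);
  return p;
}
```

In a page this would be called as appendParagraph(document, document.body, "pignoklot and zimpogrit"), and unlike document.write it can also be used after the page has finished loading.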
Re:How does document.write mess up your DOM tree? by jesser · 2007-03-11 18:40 · Score: 4, Insightful

Perhaps more importantly, document.write can't be used to modify a page that has already loaded, limiting its usefulness for AJAX-style features.

--
The shareholder is always right.
Re:How does document.write mess up your DOM tree? by CastrTroy · 2007-03-12 00:57 · Score: 1

What if you need to insert large amounts of HTML into a page? What if you don't have all your HTML that you want to insert laid into a perfect, XML compliant document? I realize that in most cases document.createElement is the better of the 2 methods, but it isn't always possible to not use document.write. There are some instances where it is unavoidable.

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:How does document.write mess up your DOM tree? by CarpetShark · 2007-03-12 00:59 · Score: 1

because there's no way to guarantee the document will continue to be valid.

Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)

Not that it's wrong to have failsafes in place, and not that XHTML isn't fine without document.write, but this "validity guarantee" argument is a little worrying.
Re:How does document.write mess up your DOM tree? by hackstraw · 2007-03-12 01:14 · Score: 3, Funny

One of the most clever uses of document.write I've seen was something like: document.write("<!--") YOU NEED JAVASCRIPT FOR THIS PAGE document.write("-->")
Re:How does document.write mess up your DOM tree? by jcuervo · 2007-03-12 01:14 · Score: 1

Hmm. I just do or and document.getElementById('whatever').innerHTML = "...";

Am I wrong?

--
Assume I was drunk when I posted this.
Re:How does document.write mess up your DOM tree? by thePowerOfGrayskull · 2007-03-12 01:47 · Score: 1

Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)

Not that it's wrong to have failsafes in place, and not that XHTML isn't fine without document.write, but this "validity guarantee" argument is a little worrying.

That's right up there with saying it's a little worrisome that an invalid cast in a strictly typed language generates a compiler error. After all, can't we trust humans to know what they're doing?
Re:How does document.write mess up your DOM tree? by Anonymous Coward · 2007-03-12 01:53 · Score: 0

wrong.
Re:How does document.write mess up your DOM tree? by General+Wesc · 2007-03-12 02:13 · Score: 1

innerHTML is nonstandard, but supported in most modern browsers, so I often do it anyway. There's node.appendChild(document.createRange.setStartBefore(node).createContextualFragment(HTML)) but last I tried that in IE, it didn't work. Maybe IE7 gets it. Maybe I should use support-sniffing. But innerHTML is just so simple and convenient, so I don't.
Re:How does document.write mess up your DOM tree? by Sancho · 2007-03-12 02:48 · Score: 1

innerHTML is nonstandard, but supported in most modern browsers, so I often do it anyway.

Ah, the mentality that makes so many web pages crap out in Firefox or Opera...

Code to the standard and let your client decide what obscure browser to use.
Re:How does document.write mess up your DOM tree? by aymanh · 2007-03-12 02:55 · Score: 1

Whatever happened to the <noscript> element?

--
python>>> q="'";s='q="%c";s=%c%s%c;print s%%(q,q,s,q)';print s%(q,q,s,q)
Re:How does document.write mess up your DOM tree? by suv4x4 · 2007-03-12 03:26 · Score: 1

If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.

In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.

element.innerHTML works even on XHTML MIME documents however (Firefox, Opera etc), and there's no significant hurdle to support document.write either.

To support progressive rendering (which some don't do yet, though future releases are planned to), browsers *must* start processing a document before they know it's valid. So they hit this issue one way or another. If they find a document is not valid in the middle of displaying a page, they can always stop processing, flush all buffers and declare the XML not well formed. Simple as that.

And yes, that means they won't use a typical XML DOM parser, which validates and preparses the whole tree in advance.
Re:How does document.write mess up your DOM tree? by ultranova · 2007-03-12 03:49 · Score: 2, Insightful

Except that the programmer might know what they're doing. But I guess we're getting past the point of trusting people more than machines ;)

Based on all the segfaults, blue screens of death, X-Window crashes, Firefox crashes, code insertion bugs et cetera I've seen, I'd say that no, in general programmers don't know what they're doing, and certainly shouldn't be trusted to not fuck it up. The less raw access to any resource - be it memory or document stream - they are given, the better.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:How does document.write mess up your DOM tree? by XanC · 2007-03-12 04:15 · Score: 1

There are some instances where it is unavoidable.

Can you give an example?
Re:How does document.write mess up your DOM tree? by mha · 2007-03-12 06:10 · Score: 1

innerHTML is non standard - but the standard sucks. As Douglas Crockford says in one of his GREAT videos about Javascript and the DOM on Yahoo (http://developer.yahoo.com/yui/theater/), why does the W3C want you to write an HTML parser in Javascript (to create all those DOM nodes manually), when the webbrowser in which this code is running has a very good HTML parser optimized to do just that already?
Re:How does document.write mess up your DOM tree? by Raenex · 2007-03-12 07:02 · Score: 1

Code to the standard and let your client decide what obscure browser to use.

Actually you should code to what works in all the major browsers. If that's the standard, great. Even better, as much as possible use a library that takes care of it for you.
Re:How does document.write mess up your DOM tree? by Sancho · 2007-03-12 11:55 · Score: 2, Interesting

How should we define 'major', though? When most Firefox-borked sites were coded, Firefox probably had less than 5% (around what Safari had, last I heard). Is 5% enough to overlook? What about 3%? 1%?

If you code to the standard, at least you can blame browsers for their broken implementation.
Re:How does document.write mess up your DOM tree? by General+Wesc · 2007-03-12 15:29 · Score: 1

Code to the standard

I do with important projects. But when coding to the standard is harder and means it only works in one browser while using non-standard tags is easier and works all over, I'll go to the easy-and-works method when I'm not doing something of great importance. If I am doing something of great importance, I'll code to the standard and add some needed non-standard hacks. Right now (and, yes, this will change), if a browser doesn't support innerHTML, it probably won't support much ECMAScript at all, so it'll be using the plain HTML page anyway.
Re:How does document.write mess up your DOM tree? by Raenex · 2007-03-12 16:39 · Score: 1

Back before there was Firefox there was Netscape/Mozilla. Firefox inherited from that. So if you were coding for the "major" browsers you would probably have been ok when Firefox came out.

There are also standards which neither browser supports. Bottom line is you have to code to what works, and preferably the standards if the browsers support them. I recommend watching the lecture: An Inconvenient API: The Theory of the DOM (three parts, downloadable here).

I wish it was just as simple as "follow the standards", but if you look at the history and reality of the situation, you see why that isn't possible. The best advice is to code to a well maintained library that shields you as much as possible from browser incompatibilities.
Re:How does document.write mess up your DOM tree? by CarpetShark · 2007-03-13 01:20 · Score: 1

That's fine up to a point, but there should be a way around these limitations. In C, it's all too easy to screw things up with null pointers etc., but if we didn't have those low-level features, a lot of important software would be impossible to write.

I'm not saying that Javascript should ENCOURAGE low-level access to the document, but to flatly deny those things is to falsely limit a language. Languages, after all, are supposed to allow you to express ANYTHING.
Re:How does document.write mess up your DOM tree? by ultranova · 2007-03-13 02:53 · Score: 1

That's fine up to a point, but there should be a way around these limitations.

No. If there's a way around these limitations, then most programmers will simply turn them off because they are experts and know what they're doing. And then the user has to suffer the consequences of the expert's ego and laziness.

No, bounds checking needs to be mandatory, not voluntary; otherwise it goes unused and the problems continue.

In C, it's all too easy to screw things up with null pointers etc., but if we didn't have those low-level features, a lot of important software would be impossible to write.

Null pointers aren't really a problem, since trying to dereference one will fail cleanly and immediately with a segmentation fault, although I personally like Java-style "exception propagates up through the stack" error handling better. It's the ability to reference arbitrary memory locations with pointers, as well as to use tables without automatic bounds checking, which causes C to be such a source of problems.

I'm not saying that Javascript should ENCOURAGE low-level access to the document, but to flatly deny those things is to falsely limit a language. Languages, after all, are supposed to allow you to express ANYTHING.

Any valid document state can be reached by inserting and removing tags with appropriate functions; the only thing that can't be done that way is creating invalid documents.

There is no reason why a programming language should be able to express a way to an invalid state. Being unable to fuck up the document is not a limitation by any sensible definition of the word.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:How does document.write mess up your DOM tree? by jesser · 2007-03-13 10:41 · Score: 1

That's horrible. How is the HTML parser supposed to know that the second [script] is not just part of the comment started by the first [script]?

--
The shareholder is always right.
Re:How does document.write mess up your DOM tree? by jcuervo · 2007-03-16 08:33 · Score: 1

I feel like I should mention that I started doing innerHTML because that's what I saw Google doing.

--
Assume I was drunk when I posted this.

True by XanC · 2007-03-11 17:34 · Score: 1

I doubt Google will notice DOM-created elements, either. But the author should re-test with that. And I would suggest that he post the result only if it turns out Google can see that, because we all assume it can't.

Accessibility? by BladeMelbourne · 2007-03-11 17:38 · Score: 2, Informative

The bottom line is your web sites should probably degrade nicely enough when JavaScript is not enabled. It might not flow as nicely, the user may have to submit more forms, but the core functionality should still work and the core content should still be available.

DDA / Section 508 / WCAG - the no JavaScript clause makes for a lot of extra work - but it is one that can't be avoided on the (commercial) web application I architect. (Friggin sharks with laser beams for eyes making lawsuits and all.)

Document.write() is not the way to go by Max+Romantschuk · 2007-03-11 17:50 · Score: 1

Document.write() is executed as the page loads. Most AJAX-style implementations rely on either the innerHTML property or creating nodes through the DOM. Testing these would tell us much more than testing Document.write().

--
.: Max Romantschuk :: http://max.romantschuk.fi/

Re:Document.write() is not the way to go by palinurus · 2007-03-11 18:08 · Score: 1

well -- document.write() gets executed when it's called, not just during page load. you can call document.write() as a side-effect of an AJAX request and it will work (though i think you're right -- DOM manipulation is the idiom for dynamic web programming these days).

but really i don't think google should index either. there's a difference between a document and an application. the guy mentions the annoying buzzword of the year, 'Web 2.0', which (i think) is really about how web browsers now give you applications in addition to documents. it's useful to have an index of documents; not so useful to have an index of every state reachable by an application (see e.g. MS Word help).

it would be interesting if at some point crawlers could distinguish between the two.
Re:Document.write() is not the way to go by Anonymous Coward · 2007-03-11 23:23 · Score: 0

They already do by not indexing AJAX-generated content...

From TFA: by bennomatic · 2007-03-11 17:54 · Score: 1

So, some friends and I have been bantering back and forth about how Google treats content that has been inserted into a page using Javascript. So I decided to do an experiment.

This page has six nonsense words. Two are hardcoded into the page via straight HTML. Two are inserted via Javascript, but the script is part of the page HTML. The last two are inserted via Javascript, but the script is on a remote server.

The purpose of the test is to see three things...

* The time lapse between when the words appear in a Google alert and when they're searchable on the main Google site.
* Which words return search results.
* If the words from the remotely sourced script return search results, do they point to this page, the .js file on the remote server, or both?

Here are a couple of nonsense words that turn up no hits in Google. They are hardcoded into the HTML.

zonkdogfology and ibbytopknot

I'll repeat them for emphasis...

zonkdogfology and ibbytopknot

Here are two words inserted into the page via a javascript hardcoded into the page...

test words are pignoklot and zimpogrit - these have been inserted via javascript
repetition: pignoklot and zimpogrit - these have been inserted via javascript

And now a couple of nonsense words inserted with a remotely-sourced javascript...

test words are fimptopo and biggytink - these have been inserted via javascript
repetition: fimptopo and biggytink - these have been inserted via javascript

And that constitutes the test. I should know within a few weeks how well it worked.

--
The CB App. What's your 20?

This leads to a decidability problem by NittanyTuring · 2007-03-11 18:18 · Score: 1

It's not easy for Google to determine how to treat text inside of Document.write(). In some cases, that line of script will never be executed. In other cases, it may be executed multiple times. What is Google to do with something like this:

if (1 == 0) { document.write("pignoklot zimpogrit"); }

Obviously, "pignoklot zimpogrit" will never be emitted, but Google's crawler might not get that. In general, you run into decidability issues, like the Halting problem. The best approach would be for Google's crawler to fully emulate the session of a user, and execute all scripts like a browser would... but that may be difficult or impossible to automate on a large scale. It is true that Google can do a better job than it does now. It can at least search for common cases where write() definitely gets called.

Re:This leads to a decidability problem by poopdeville · 2007-03-11 20:04 · Score: 1

Decidability is a non-issue in this context. Your example falls flat, because JavaScript is an interpreted language. All Google would have to do is run an interpreter and data mine the results.
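
A toy version of "run it and mine the results", to make the idea concrete. The `collectWrites` name and the stand-in `document` object are assumptions for illustration only; nothing here is sandboxed, so this says nothing about what a real crawler would do with untrusted scripts.

```javascript
// Run a script against a stand-in `document` and return whatever it
// passed to document.write, concatenated in call order.
function collectWrites(scriptSource) {
  const written = [];
  const fakeDocument = {
    // document.write accepts several arguments and concatenates them.
    write: (...parts) => { written.push(parts.join("")); },
  };
  // The script sees `document` as a parameter, so its writes land in `written`.
  new Function("document", scriptSource)(fakeDocument);
  return written.join("");
}

// The dead branch from the parent post produces no text at all:
collectWrites('if (1 == 0) { document.write("pignoklot zimpogrit"); }'); // ""
// Writes that do execute are captured:
collectWrites('document.write("fimptopo"); document.write(" biggytink");'); // "fimptopo biggytink"
```

Executing the script answers the question for this one run without any static analysis, which is the point being made; it does nothing about scripts that never finish or that depend on user events.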

--
After all, I am strangely colored.
Re:This leads to a decidability problem by maxwell+demon · 2007-03-11 21:27 · Score: 1
Except that
- content may depend on user actions in a non-trivial way (i.e. if the page contains things like onclick or onmouseover, the dependence on the sequence of events occurring may be quite complex),
- content may be requested by callbacks to the server (after all, that's what AJAX is all about), in which case I'm not sure it's a good idea for the search engine to execute it,
- running a certain script might be expensive in time and/or memory,
- by processing JavaScript, the search engine might open up itself to exploits.
--
The Tao of math: The numbers you can count are not the real numbers.
Re:This leads to a decidability problem by poopdeville · 2007-03-11 22:57 · Score: 1

All true. However, I'd say that your first and second points are essentially a non-issue. Google already scans the "dark web" if it's accessible to the public, even if only accessible in non-trivial ways. Which is to say, they're already requesting massive numbers of documents and generating enormous data structures to mine. Dealing with JavaScript would be more of the same in this respect.

Your other points are better. Malicious JavaScript could easily tie the GoogleBot up, if Google's hypothetical JavaScript interpreter didn't have built-in runtime limits. Time limits are one option. Disallowing certain JavaScript constructs is another possibility.

--
After all, I am strangely colored.
Re:This leads to a decidability problem by EvanED · 2007-03-12 02:32 · Score: 1

Decidability is a non-issue in this context. Your example falls flat, because JavaScript is an interpreted language.

What? And you said that decidability is a non-issue?

Decidability doesn't depend on whether JavaScript is interpreted. You could imagine a browser that would compile the incoming JavaScript and execute it directly. Boom, not interpreted. Or you could imagine an interpreter for C. Here is a Turing machine simulator -- it's more or less acting as an interpreter. Does that suddenly make it decidable?

So what does the interpreted bit have to do with this discussion?
Re:This leads to a decidability problem by poopdeville · 2007-03-12 05:21 · Score: 1

So what does the interpreted bit have to do with this discussion?

I was unclear. I hoped that the relation would be clear, but it's my fault it wasn't. There are two issues: the one I was originally responding to, and my smart one. First, the code the GGP posted is trivially "decidable" in the sense the GGP meant. The inner block is not going to run, and we can know that very quickly just by running/reading it. Obviously, Rice's theorem stops us from being able to always rely on being able to decide if a program has a non-trivial property. But Google doesn't have to decide that.

JavaScript is an interpreted language with full, open specifications (barring browser silliness). Decidability is irrelevant in this context because Google is in a position where they can implement a JavaScript interpreter with hard limits on the resources available to the JavaScript script. After all, if a .js fails to satisfy Google, it's the author's loss and not Google's. So it shouldn't matter if there's code out there with while (1) { sleep 30; } (or whatever the JavaScript equivalent is).
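
To illustrate the hard-limit idea, here is a sketch using Node.js's built-in vm module, whose timeout option aborts a script that runs too long. The `runWithTimeLimit` helper and the stand-in `document` are hypothetical, and a vm context is not a security boundary, so treat this as a demonstration of the time limit only.

```javascript
const vm = require("vm");

// Run a script with a wall-clock budget. Returns the text it managed to
// write and whether it finished; scripts that time out or throw are
// reported as not completed.
function runWithTimeLimit(scriptSource, timeoutMs) {
  const written = [];
  const sandbox = {
    document: { write: (text) => { written.push(String(text)); } },
  };
  try {
    vm.runInNewContext(scriptSource, sandbox, { timeout: timeoutMs });
    return { completed: true, text: written.join("") };
  } catch (err) {
    // Timeout or script error: keep whatever was written before the failure.
    return { completed: false, text: written.join("") };
  }
}
```

With this, runWithTimeLimit('while (1) {}', 50) comes back after roughly 50 ms with completed set to false instead of hanging, which is the "author's loss, not Google's" outcome described above.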

Granted, Google could do this even if JavaScript was a compiled language. But the solution would either be brittle (making changes to the compiler to build hard limits into each executable, write files to create a file system based data structure, etc) or messy and slow (using some kind of scripting and IPC to control parallel tasks) or both. Perhaps other schemes are possible. Basically, the "interpreted bit" is relevant to the discussion because it facilitates Google's hypothetical implementation.

--
After all, I am strangely colored.
Re:This leads to a decidability problem by NittanyTuring · 2007-03-12 15:17 · Score: 1

After all, if a .js fails to satisfy Google, it's the author's loss and not Google's.
It can also be a grave loss to Google's user, if the ultimate result leads to a gross misrepresentation of the site's content.
Decidability is irrelevant in this context because Google is in a position where they can implement a JavaScript interpreter with hard limits on the resources available to the JavaScript script.
We both agree that this approach will not turn up the best possible results. However, if we assume that it does fairly well (and I'm not sure), look at the cost. This is an inherently stateful process... and that could mean a massive reduction in performance. I think crawling right now is probably stateless, leading to very fine-grained load balancing and redundancy. That's why, as I said in my post, they should look at common cases that are easy to analyze.

google.com/?q=slashdotting+in+google+dollars by kale77in · 2007-03-11 18:30 · Score: 5, Insightful

I think the actual experiment here is:

Create a 6-odd-paragraph page saying what everybody already knows.
Slashdot it, by suggesting something newsworthy is there.
Pack the page with Google ads.
Profit.

I look forward to the follow-up piece which details the financial results.

Re:google.com/?q=slashdotting+in+google+dollars by Scarblac · 2007-03-11 19:46 · Score: 4, Insightful

Exactly, this is the typical sort of fluff that Digg seems to love. As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.

--
I believe posters are recognized by their sig. So I made one.
Re:google.com/?q=slashdotting+in+google+dollars by tijmentiming · 2007-03-11 20:54 · Score: 1

Mod parent up, there are indeed too many Digg-like posts here.
Re:google.com/?q=slashdotting+in+google+dollars by caluml · 2007-03-11 21:01 · Score: 1

But with the Firehose, Slashdot will now start using the "wisdom" of crowds to produce the same pap that Digg does.
Shall we all migrate to Technocrat, anyone? It has decent stories.

--
Get your own free personal location tracker
Re:google.com/?q=slashdotting+in+google+dollars by Anonymous Coward · 2007-03-11 23:14 · Score: 0

You must be new to Slashdot then.
Re:google.com/?q=slashdotting+in+google+dollars by cmorriss · 2007-03-12 00:08 · Score: 1

There's only one problem. While it is a blog, there are no google ads or any other revenue generating advertisement. I agree that the content of the article is pretty common sense, but at least it isn't blog spam.

--
10 minutes working on a sig. What a waste.
Re:google.com/?q=slashdotting+in+google+dollars by geoffspear · 2007-03-12 01:51 · Score: 1

I'm seeing ads right after the first paragraph in the blog story. You must be blocking them.

I'm not sure if 2 of the 3 ads are for brain injury and brain tumor treatment because the name of his blog is "Brain handles" or because you'd need to have a brain injury to do the "research" he's doing.

--
Don't blame me; I'm never given mod points.
Re:google.com/?q=slashdotting+in+google+dollars by larry+bagina · 2007-03-12 02:01 · Score: 1

The wisdom of crowds beats the wisdom of zonk.

--
Do you even lift?
These aren't the 'roids you're looking for.
Re:google.com/?q=slashdotting+in+google+dollars by dr.badass · 2007-03-12 02:12 · Score: 2, Insightful

As far as I know, Slashdot had avoided this particular type of adword blog post crap until now

It used to be that the web as a whole avoided this crap. Now, it's so easy to make stupid amounts of money from stupid content that a huge percentage of what gets submitted only even exists for the money -- it's like socially-acceptable spam. Digg is by far the worst confluence of this kind of crap, but the problem is web-wide, and damn near impossible to avoid.

--
Don't become a regular here -- you will become retarded.
Re:google.com/?q=slashdotting+in+google+dollars by ColaMan · 2007-03-12 02:57 · Score: 1

As far as I know, Slashdot had avoided this particular type of adword blog post crap until now.

Two words:
Roland Piquepaille.

--

You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.
Re:google.com/?q=slashdotting+in+google+dollars by cmorriss · 2007-03-12 06:22 · Score: 1

You're right. I forgot I have adblock running.

--
10 minutes working on a sig. What a waste.
Re:google.com/?q=slashdotting+in+google+dollars by Raenex · 2007-03-12 06:44 · Score: 1

The sad thing is I bet the vast majority of crap like this earns enough to buy lunch or something. There's a lot of people running around trying to get rich doing this, but Google is the real winner.
Re:google.com/?q=slashdotting+in+google+dollars by Restil · 2007-03-12 16:17 · Score: 1

I've noticed something with regards to my own site and the few google ads I have placed on the back pages. On those occasions when I get heavy traffic from a link on a popular tech site, my average click ratio goes way down. Slashdot users aren't going to pages to search for products to buy, so it's highly unlikely more than a very few will ever click on any ads, if any at all. Now if the article was promoting a product that the average geek would be interested in, and there were ads on the page for that exact product, there might be a few ad hits, but for an article like this, the results would be negligible.

-Restil

--
Play with my webcams and lights here

Woot! by Anonymous Coward · 2007-03-11 19:11 · Score: 0

...an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser.

That's a fantastic idea!

(tagging beta) by Jack+Schitt · 2007-03-11 19:20 · Score: 1

I predict that from now on, zonkdogfology will be a common tag for all articles that relate to google search...

--
This message brought to you by Jack Schitt's Previously Shat Shit

Re:(tagging beta) by dotgain · 2007-03-11 20:39 · Score: 1

I predict in five years it'll be in the Oxford English Dictionary.
Re:(tagging beta) by maxwell+demon · 2007-03-11 21:29 · Score: 1

I predict in five days it will be in Wikipedia.

--
The Tao of math: The numbers you can count are not the real numbers.

I would make normal links, then use JS on top by The+Amazing+Fish+Boy · 2007-03-11 19:33 · Score: 3, Insightful

So, what do you have to say about websites that have their entire user-interfaces built with content that gets filled by javascript asynchronously from a single html page?

If I understand you, you mean something like this: The site has two parts, a menu and content. When you click a menu item, rather than being taken to a new URL, it executes Javascript which fetches only the new content from the web server, then replaces the content section. So the URL doesn't change.

It's a nice improvement. Less bandwidth used, and a quicker interface.

Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this:

    // menuLink is the DOM element for each menu link.
    // (i.e. get it from document.getElementById(), etc.)
    menuLink.onclick = function() {
        getNewContent();
        return false;
    };

(FYI, this is how I do pop-up windows, too.)

Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.

I think Gmail actually offers two versions. One for older browsers that uses no (or little?) Javascript, and the other which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version", though it may encourage them to update to a newer browser, I guess.

(Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)

Re:I would make normal links, then use JS on top by fbartho · 2007-03-11 20:01 · Score: 1

You do pretty much understand me. One example site involves data with a certain set of fields stored in the database. Each set of fields is a package of information, and these packages of information number around 400. There are 20 primary users for the site who are college students. They each update several of these packages, ideally on a daily basis, and leaders of subgroups track which packages have been updated every 3 days or so. These packages are used for the management of a very large student project, scheduling, directions, subtasks, contact points, costs, and progress. Thankfully for me, I can standardize on requiring Firefox, Javascript, and Cookies (for php sessions) because all of the computer labs on campus are equipped with that no matter what OS comes installed, and it's trivial for them to get those on their personal computers (if they do not already have them). My primary role on the project however is not web developer; the site was just a solution to a problem that I had a certain amount of time to allot to. I definitely don't have the time to maintain a flat html form of the site, and it would be much less functional for the average user's workflow.

Now, given the user count, I know that site is small peanuts in the real world, but what am I supposed to do? For other projects: At what point am I required to duplicate my efforts so that systems without javascript can functionally access my site? What if I want to provide a view to the world that is only mediated through the javascript interface I have developed? Am I required to make my site accessible without javascript?

--
Gravity Sucks
Re:I would make normal links, then use JS on top by The+Amazing+Fish+Boy · 2007-03-11 21:11 · Score: 1

I agree with you that with a small enough user base (or one that is adequately controlled), you can cut some corners, especially if time is a constraint. Generally I would say whenever your users could reasonably demand their browser work. That is, if the site is going to be publicly accessible, I would not make Javascript a requirement. I'm not sure what the actual "limit" I would put on the number of users would be; I think that would vary from project to project.

It's a matter of project requirements, really. I didn't mean to come off as though what I was suggesting were absolute rules. They are best practices, and especially on the web I think they should be followed.

But I don't think it's always a duplication of efforts. You're right, sometimes it is a duplication of efforts, like if you can only present the interface you want in Javascript (e.g. if you used drag and drop or something). But a lot of the time Javascript is only used to enhance a form, so it would only be adding functionality, not replicating it.

Also, have you considered what would happen if someone who is blind sues the school (is it for a school?) because the site is inaccessible?
Re:I would make normal links, then use JS on top by fbartho · 2007-03-12 04:38 · Score: 1

It's not directly for the school, it's by me for the student project, and all the student labor is volunteer.

Now, that last question is exactly what I'm getting at. Am I required to make publicly accessible sites of a certain size accessible to blind or otherwise visually impaired users? Auto manufacturers don't have to make their cars that way. What if my javascript user interface only presented things to people who could finish a virtual Formula 1 race in first place? In that case I'd be simulating a driving experience... the driving experience is exempt, so is my site?

--
Gravity Sucks

Document.Write() not interpreted by Google by generikz · 2007-03-11 19:45 · Score: 1

... and also by SPAM spiders sneaking around for Email addresses!

I didn't want to change my contact information with additional FORM submit with visual challenge, but still wanted to leave a direct Email link obviously placed on page for quick contact/feedback.

Since I modified the mailto: with some tricky/sliced javascript Document.Write() I don't have a single SPAM coming from a semi-hidden address, which is still looking -- for the regular human visitors -- like the classic "Contact us" Email link.

I certainly hope this won't change in the future!

Rgds,
Julien

Re:Document.Write() not interpreted by Google by DrXym · 2007-03-11 21:27 · Score: 1

I do this also on my pages. I also mangle certain things like affiliate links, AdSense ids etc., not for any particular reason except that I don't like the idea of any search engine inadvertently indexing them.
Re:Document.Write() not interpreted by Google by n00kie · 2007-03-11 22:26 · Score: 1

// That's cute. Someone spam him please.
Re:Document.Write() not interpreted by Google by DrSkwid · 2007-03-11 22:27 · Score: 1

If you want people to contact you, you should provide the contact details properly and suck up the spam yourself.

Choose a non-default email i.e. not webmaster but web-master and deal with the consequences.

In my eyes, a customer/client/new friend not being able to contact you is far more expensive than dealing with some *more* spam.

--
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Re:Document.Write() not interpreted by Google by Megane · 2007-03-12 01:20 · Score: 1

Meh. I have mine double-escaped, using two unescape() calls. The first hides the e-mail address, and the second hides the HTML for the mailto link. It even has a <noscript> condition to point out that the user does not have Javascript enabled. I've been scrupulous about noscript ever since one web site that just displayed a blank black page with JS disabled.
If I was really paranoid, I'd probably come up with some sort of while loop to decode the mail address, and a skip-over condition to change the index inside the loop, to put fear into anyone trying to write a halting-problem detector/avoider.
But I'm not stupid enough to link to my page, even though the spammers have the address anyhow. It was a reaction to the time that some evil spider found my resume on my web site and immediately shot off resume-related spam (maybe one of those work at home scams, I don't remember). Just because "the spammers" already have your address doesn't mean that others aren't trying.
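
For anyone curious what the double-escaping described above amounts to, here is a rough sketch in Python, with percent-encoding standing in for JavaScript's escape()/unescape(). The helper name `encode_all`, the sample address, and the exact layering are assumptions made for illustration; the real page may be arranged differently.

```python
from urllib.parse import unquote


def encode_all(text):
    """Percent-encode every byte, so no readable character survives."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


# Hypothetical contact link; the address is a placeholder.
link_html = '<a href="mailto:someone@example.com">Contact us</a>'

# Two layers of encoding: a scraper looking for "@" or "mailto:" in the
# page source finds neither.
hidden = encode_all(encode_all(link_html))

# What the page script would do before writing the link into the document:
# decode twice to get the original HTML back.
recovered = unquote(unquote(hidden))
```

A spider that actually runs scripts would still recover the address, so this only filters out the scrapers that read raw page source.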

--
#naabhaprzrag, #sverubfr-000, #agi-fcbafberq, negvpyr[pynff*=' negvpyr-ary-'] { qvfcynl: abar !vzcbegnag; }

Google doesn't, but it's possible by Animats · 2007-03-11 19:49 · Score: 2, Informative

I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth) and extract data, and I've been considering how to deal with JavaScript effectively.

Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages and run Javascript like a browser, builds the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.

It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.

Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.

Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.

Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

Re:Google doesn't, but it's possible by Anonymous Coward · 2007-03-11 20:26 · Score: 0

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Shouldn't they have an ALT= attribute in their header image(s) anyway?
Re:Google doesn't, but it's possible by dargaud · 2007-03-11 20:28 · Score: 1

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Yes, and it should work like that too. If the background is so cluttered as to make the OCR difficult, then chances are the human will have trouble reading it too. I suggested that during a job interview with a *cough* serious search engine: use a secondary crawler reporting as a normal IE/Firefox, load a page using the usual IE/Firefox rendering engine, OCR the text (this way all white on white, display:none and size:1 goes away) with some color tolerance (make sure violet on red goes away too!) and compare with the normal crawler. If the result is too different, flag it as a potential spamming site.
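
The final comparison step could look something like the Python sketch below. It assumes the crawler text and the OCR'd text are already available as plain strings; the function names, the word-overlap (Jaccard) measure, and the 0.5 threshold are all arbitrary choices for illustration, not anything a search engine is known to use.

```python
import re


def word_set(text):
    """Lower-cased set of the words found in a block of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def looks_suspicious(crawler_text, rendered_text, threshold=0.5):
    """Flag a page when the two views share too few words."""
    crawled = word_set(crawler_text)
    rendered = word_set(rendered_text)
    if not crawled and not rendered:
        return False  # both views are empty, nothing to compare
    # Jaccard similarity: shared words divided by all distinct words.
    similarity = len(crawled & rendered) / len(crawled | rendered)
    return similarity < threshold


honest = looks_suspicious("Cheap flights to Paris", "cheap flights to paris")
stuffed = looks_suspicious("casino pills loans poker", "Welcome to my blog")
```

A real system would need a more forgiving measure, since OCR errors and legitimately hidden text (alt attributes, for instance) would both lower the score.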

--
Non-Linux Penguins ?
Re:Google doesn't, but it's possible by VGPowerlord · 2007-03-11 21:13 · Score: 1

Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.

Does Google run macros in Word documents? No? Then why are you even comparing this? I can parse a PDF document or a Word document without having to have a script interpreter running.

I imagine that the Googlebot crawler is a rather simplistic program that only knows how to:
1. Read robots.txt
2. Read meta tags (robot tags in particular)
3. Find text and web addresses in web pages
4. Send the text back to a larger analytical program
5. Add the web addresses it finds to its own queue
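
Steps 2, 3 and 5 of that guess can be sketched in a few lines of Python using only the standard library. This is purely an illustration of a text-only crawler, not Googlebot's actual design; the class name `SimpleCrawlParser` and the sample page are made up, and robots.txt handling and the hand-off to an analysis program are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SimpleCrawlParser(HTMLParser):
    """Hypothetical text-only crawl step: text, links, robots meta tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []   # visible text fragments (step 3)
        self.links = []        # absolute URLs for the queue (step 5)
        self.robots = ""       # content of <meta name="robots"> (step 2)
        self._skip_depth = 0   # greater than 0 inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = (attrs.get("content") or "").lower()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script bodies are never executed, only skipped.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


page = """<html><head><meta name="robots" content="index,follow"></head>
<body><p>Static words</p>
<script>document.write("zonkdogfology");</script>
<a href="/about.html">About</a></body></html>"""

parser = SimpleCrawlParser("http://example.com/")
parser.feed(page)
parser.close()
text = " ".join(parser.text_parts)
```

Because the script body is skipped rather than run, the word written by document.write() never reaches `text`, which is the behaviour the article reports.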

What you're proposing would require an actual DOM tree be built up by Googlebot as well as a Javascript interpreter to be run. Also, if you use <input type="button"> or <button> controls anywhere, it would still fall flat on its face, as Googlebot doesn't activate these elements. So, you'd either need Googlebot to press every button it encounters (a VERY bad idea) or have some sort of AI to figure out what it should do.

If you're willing to write such an AI, go ahead. I think I'll stand behind Google's method.

--
GLaDOS for President 2016! "Well here we are again. It's always such a pleasure." -- GLaDOS, 2011
Re:Google doesn't, but it's possible by maxwell+demon · 2007-03-11 21:34 · Score: 1

Sure. But since when do people do everything they should do?

--
The Tao of math: The numbers you can count are not the real numbers.
Re:Google doesn't, but it's possible by imroy · 2007-03-11 22:37 · Score: 1

Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language.

Ghostscript had to deal with what problem? Yes, PostScript is a programming language with built-in graphics primitives. What does that have to do with search engines? It doesn't have to recognise certain outlines as being text (i.e. text drawn without using the PostScript primitive for drawing text), it just draws it. Ghostscript is just another implementation of a language otherwise.

OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images.

Is OCR really necessary? Odds are the business name is also in the domain name and at least the front page as text, if not included in the title and/or copyright footer of every page. Except for damn all-flash web sites, the business name is unlikely to be hidden away from a search engine.
Re:Google doesn't, but it's possible by aaronwormus · 2007-03-11 23:12 · Score: 1

If google were to index javascript, they would probably create their own interpreter which only interpreted content that was meaningful to a search engine.

If googlebot had to interpret every fade-in menu and every roll-over effect it would take substantially more resources for google to crawl the web. Googlebot would also be vulnerable to malicious scripts - or scripts built to waste its time.
Re:Google doesn't, but it's possible by shish · 2007-03-12 00:18 · Score: 1

Ghostscript had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language.
Ghostscript had to deal with what problem? Yes, PostScript is a programming language with built-in graphics primitives. What does that have to do with search engines?

Postscript is a programming language, not a page description language; you need to write a language interpreter, not just a data parser, to get the most from it. HTML + Javascript also requires an interpreter, not just a parser.

--
I mod down anyone who says "I will be modded down for this", regardless of the rest of their comment
Re:Google doesn't, but it's possible by Anonymous Coward · 2007-03-12 02:27 · Score: 0

Problem is, google indexes alt attributes, and may also index title attributes as well.

Some purely graphical sites are coming to mind (adult TGP sites for example) that have absolutely no text content but they have important descriptions in alt and title attributes. They can also display useful information to the surfer in tooltips using onmouseover (obviously that wouldn't get indexed though).

I'm not really sure how that would affect your idea. The main issue is that the OCR output would have little to no text while the regular index would have tons simply because of the text in the alt and title attributes.

Another potential problem is where a web site has all of their content in xhtml but for modern web browsers it uses CSS to set background images behind each text and hide the text to make the site entirely/mostly graphical and javascript to do whatever with the content (make it clickable, use onmouseovers etc.) The idea is that for older browsers or mobile devices etc. the site content will be accessible but for modern browsers the site is graphical and interactive. The OCR would pull up absolutely no text but the "regular" spider will pull up tons. Site gets falsely flagged as spam.

Working example by PietjeJantje · 2007-03-11 20:16 · Score: 0

Still, with a different approach my AJAX generated site:
http://dutchpipe.org/
is indexed perfectly:
http://66.102.9.104/search?q=cache:kvnpKdmDxwUJ:dutchpipe.org/+dutchpipe&hl=en&ct=clnk&cd=1

Re:Working example by Anonymous Coward · 2007-03-11 22:12 · Score: 0

Dude, I'm not clicking anything named "Dutch Pipe"... Dutch people are fucking scary as hell

Most contentless story for ages by Sam+H · 2007-03-11 20:58 · Score: 1

You can tell there's nothing interesting in the link from the fact that not even a summary of the results is given in the story. It looks like the average pay-to-get-diggs story, except you don't have to pay anything to be on Slashdot. Well done, and enjoy your Google Ads revenues!

--
God, root, what is difference ?

Re:Most contentless story for ages by jackv · 2007-03-11 21:35 · Score: 1

I agree, the results are extremely obvious, apart from the fact they've been documented plenty of times before. You only have to spend 20 minutes reading an SEO handbook to glean this basic info.

--
Jack V All the IT vacancies in one place
Re:Most contentless story for ages by Anonymous Coward · 2007-03-11 22:00 · Score: 0

Besides, he published the results MUCH too quickly!
It can take several weeks for Google to get into a stable state indexing a site.

If you want to see by BrynM · 2007-03-11 21:28 · Score: 3, Funny

If you want to see through a search engine's eyes, open the page in Lynx. The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.

(method does not account for image crawlers)

--
US Democracy:The best person for the job (among These pre-selected choices...)

Re:If you want to see by Anonymous Coward · 2007-03-11 23:47 · Score: 0
The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.

Been there and done that, it's not funny. No matter how many times you explain it, the excuses keep coming.
- The browser sucks...
- Hardly anybody uses that browser...
- Dynamic, all-singing, all-dancing documents are the future...?
For some reason the obvious conclusion (that they don't understand web technology) always escapes them.

Pitiful by Mathinker · 2007-03-11 21:32 · Score: 1

The moderator meant to mod this +1 Funny (I would!) but forgot to actually try to understand the post.

Or perhaps this is a mod of a new experimental viral moderation system, but the viruses haven't evolved enough yet?

Problem for web apps by Wienaren · 2007-03-11 21:48 · Score: 1

Web apps these days consist nearly entirely of dynamic content invisible to Googlebot. If you try to make your page visible on the web, this is really a problem. But think twice before adding invisible divs or the like in order to achieve proper search results: Google might well ban you (since they don't check whether or not the keywords you name in your invisible divs do in fact relate to the page's purpose or contents).

--
-- The Online Photo Editor - http://www.phixr.com

Re:Problem for web apps by julesh · 2007-03-12 00:02 · Score: 1

Web apps these days consist nearly entirely of dynamic content invisible to Googlebot. If you try to make your page visible on the web, this is really a problem. But think twice before adding invisible divs or the like in order to achieve proper search results: Google might well ban you (since they don't check whether or not the keywords you name in your invisible divs do in fact relate to the page's purpose or contents).

OTOH, there's nothing wrong at all with having static content that is only displayed to people who do not have javascript active. In fact, this is positively encouraged. AFAIK, google does index the content of tags.
Re:Problem for web apps by julesh · 2007-03-12 00:05 · Score: 1

AFAIK, google does index the content of tags.

Erm. "of <NOSCRIPT> tags". Sorry.

Google holds back the web! by mumblestheclown · 2007-03-11 22:10 · Score: 1, Insightful

this is a pretty straightforward example of how google holds back the web. this is not google's fault, per se, but it definitely is true. We routinely resort to older, inefficient technologies for our websites simply to please google. it works well for us from an advertising standpoint, but is often incredibly stupid technologically.

Re:Google holds back the web! by Nappa48 · 2007-03-12 03:14 · Score: 0

No, it's technically the web developers' fault for using Javascript in this way.
The easiest way to get around this "problem" is by initially having content on the page (keywords etc.), then hiding it using Javascript and re-populating/injecting content and so on.
It's not that hard a thing to do and takes like 5-10 lines of extra code at least (for most simpler Javascript-based sites).

This method is used by spam sites currently, so I don't know why web developers don't learn from the bad boys of the web, since they usually come up with the smartest ways to do/get around things (same goes for crackers/phishers and loads of other similar kinds of people... kind of a sad truth in some cases).
Re:Google holds back the web! by mumblestheclown · 2007-03-12 05:59 · Score: 1

As always, I'm underwhelmed by idiot slashdot moderators who mark my comment as 'flamebait.' You may not agree with it, but 'flamebait?' It's not even CLOSE to flamebait. Most of the comments I make on slashdot that get moderated get BOTH {troll/flamebait} AND {insightful/interesting}. If that doesn't suggest that slashdot's moderation system is heavily broken, then I don't know what does. I'd understand {overrated}+{interesting} or something like that, but my comments show rather clearly that moderators are not 'moderating' but rather too often using the system to simply squelch alternative opinions they don't like at the expense of furthering intelligent discussion.
Re:Google holds back the web! by mumblestheclown · 2007-03-12 06:05 · Score: 1

Your whole argument rests on the belief that the developers are doing something wrong. If you start from that premise, of course you are going to come to that conclusion. Just like we do things that we wouldn't otherwise do because of Google, you encourage others to. The point is that a better search engine would not force people to change their technologies or add 5-10 lines of "extra" code. Extra means extra. A better search engine would do this for you and allow people to use the full extent of tools to create web content and not need to try to manipulate themselves into some format that the dominant search engine happens to like. Your proposal adds meaningless work to millions of webmasters' jobs rather than simply having Google do a better job actually indexing the content of the web.

mod parent up. morning chuckle. by Anonymous Coward · 2007-03-11 22:42 · Score: 0

Cruel, dude.

funny as hell, but cruel!

Luckily blind people don't drive! by Anonymous Coward · 2007-03-11 23:15 · Score: 1, Insightful

Javascript redirects are a trait of the incompetent; I bet Ford paid some cowboy a whole lot of money for a site that doesn't work.
On the Jeep site I can get to a few 'pages' that are actually just images with an image map and empty alt attributes for the html links. The HTML URLs are clean but not informative and the others don't work (unsupported URL scheme in lynx).
The Credit Suisse site is reachable via a mislabeled link, "If you are a PALM, PSION, WINDOWS CE or NOKIA user click here". They even offer a sitemap for navigation. Tell-tale signs indicate this site was valid, accessible XHTML before some monkey was set loose on it.

Those selling professional web services should be liable under ADA and similar laws, that's how we fix the web.

Wrong by Anonymous Coward · 2007-03-11 23:52 · Score: 0

This is an example of content that shouldn't be indexed by a search engine. I suppose you'd also like Google to auto compile, link and offer for download other forms of program source code? Don't blame Google because a small percentage of web developers are incompetent.

grow up, slashdot by Anonymous Coward · 2007-03-12 00:23 · Score: 0

Friggin' vandals. Grow up, slashdot.

Looking for good/current Lynx for Windows/XP by martyb · 2007-03-12 00:25 · Score: 1

In actuality, it says "Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site." - Webmaster Guidelines, Technical Guidelines section, bullet point 1. (emphasis added)

Can anyone here recommend a good place to download a current port of Lynx for Windows/XP? I'd like to be able to get formatted text out of a web page. I'm thinking along the lines of:

lynx -dump foo.htm > foo.txt

(I'd prefer a version that DOES NOT require Cygwin, as I use the GNU file/text/etc. utilities.) With all the web exploits that are out there, I'm reluctant to download an out-of-date, vulnerable, and/or poorly-ported port.

I'd appreciate knowing what YOU have found that works well, is up-to-date, and actively developed.

Re:Looking for good/current Lynx for Windows/XP by Anonymous Coward · 2007-03-12 02:13 · Score: 0

I have used this one in the past:

http://www.fdisk.com/doslynx/wlynx/lynx_w32.2.8.2rel.1.zip
from this page:
http://www.fdisk.com/doslynx/lynxport.htm

No Cygwin libraries required. It worked fine for me, though it has not been updated in some time. I doubt you need to worry much about vulnerabilities in a text browser, especially if you only use it to examine your own pages. If you simply intend to scrape text from other people's web pages using a windows box, might I recommend using the QueryTables.Add method in an Excel macro, which has worked fine for scraping sites for me in the past and allows for relatively easy manipulation of the results.

Alternative Lynx windows binaries are posted here. The current release will compile with Borland C or Visual C++ 6 (with some tweaks), though I imagine it would take some major edits to get it to compile properly with the newer Visual C++ compilers.

Though for the command line usage you desire, Netcat would probably get the job done with a little fiddling. The official page is here, though the latest source release is no newer than Vulnwatch's WinNT binary.

Another alternative is simply to turn off images, javascript, java and css in Firefox, though I don't think there is any command line option for non interactive operation, but scripting acquisition of text from it wouldn't be that hard to do.

But being a GnuWin32 guy then a scripted combination of Wget and Sed or Gawk might be the best solution for you.

You could also just write a PHP or Perl script to do the job just fine, which might be the most sensible approach.

Anyway, the version of Lynx I mentioned above worked fine for me and did not result in any attacks, though I have only visited totally legit sites with it.

AJAX is for writing applications not Documents by e-Trolley · 2007-03-12 00:44 · Score: 2, Interesting

AJAX is for writing applications not Documents. Why and how should an application be indexed?

Re:AJAX is for writing applications not Documents by Raenex · 2007-03-12 07:49 · Score: 1

The line between application vs document gets blurry, fast. Consider a site like Try Ruby!. There's definitely content hidden inside the tutorial, yet a search engine will never see it.
Re:AJAX is for writing applications not Documents by e-Trolley · 2007-03-13 03:30 · Score: 0

Interestingly Google just crawled my *.js files, did they read this thread?

err news? by Vexorian · 2007-03-12 00:59 · Score: 1

Hey I didn't think that after read-skipping the first paragraphs the article would actually state the obvious, that javascript generated content is not indexed... I would have expected an article to appear in case he found out it WAS INDEXED...

In other news: Most plants are green!

--

Copyright infringement is "piracy" in the same way DRM is "consumer rape"

Re:err news? by Nappa48 · 2007-03-12 03:27 · Score: 0

Are you sure?
I mean, shouldn't we, like, do a massive test involving a lot of people to really make sure that most plants are green?

Why bother - just check user agent by Isochrome · 2007-03-12 02:25 · Score: 1

Google doesn't hide its identity when it crawls. Just check for user agent Googlebot and serve different pages. This won't help with Yahoo, ask or Windows Live, but neither will targeting specific capabilities.
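
As a rough sketch of what that check might look like on the server side (in Python, with made-up function names; the header is trivially spoofed, so treat it as a hint rather than proof of identity):

```python
def is_googlebot(user_agent):
    """True if the User-Agent header claims to be Googlebot."""
    return "googlebot" in (user_agent or "").lower()


def choose_page(user_agent, static_html, script_html):
    """Serve the plain HTML version to Googlebot, the scripted one otherwise."""
    return static_html if is_googlebot(user_agent) else script_html


crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Gecko/20070309 Firefox/2.0.0.3"

for_crawler = choose_page(crawler_ua, "<p>plain content</p>", "<script>render()</script>")
for_browser = choose_page(browser_ua, "<p>plain content</p>", "<script>render()</script>")
```

The two versions would need to carry the same content, otherwise this shades into showing the crawler one thing and visitors another.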

The answer has always been no by Anonymous Coward · 2007-03-12 03:31 · Score: 0

This was the first thing I noticed about Google when the first big buzz about this search engine came: no, they also do NOT index any JavaScript!

HTML vs. Javascript: W3C analysis by TwobyTwo · 2007-03-12 03:50 · Score: 1

The W3C's technical architecture group (TAG) has published a document exploring the tradeoffs in creating Web content using imperative languages, such as JavaScript, vs. declarative languages such as HTML. See The Rule of Least Power, edited by Tim Berners-Lee and Noah Mendelsohn (yours truly).

From that document:

"There is an important tradeoff between the computational power of a language and the ability to determine what a program in that language is doing ... Good Practice: Use the least powerful language suitable for expressing information, constraints or programs on the World Wide Web." If you're interested, I suggest you read the whole document, which is quite short, and which discusses a number of related issues.

While it's not impossible that Google or other search engines would use some heuristic to extract information from the JavaScript source, actually running the program would involve many complexities, some of which have been mentioned by other commenters. Not the least of these relates to the famous halting problem. As paraphrased for the purposes of the TAG finding:

"The tradeoff for such power is that you typically cannot determine what a program in a Turing-complete language [I.e. such as JavaScript] will do without actually running it. Indeed, you often cannot tell in advance whether such a program will even reach the point of producing useful output. Of course, you can easily tell what a simple program such as print "2+2" will do, but given an arbitrary program you'd likely have to run it, and possibly for a very long time. Conversely, if you capture information in a simple declarative form, anyone can write a program to analyze it in many ways."

Re:HTML vs. Javascript: W3C analysis by leighklotz · 2007-03-12 09:22 · Score: 1

This TAG finding is all the more reason for the W3C to support declarative approaches to markup, which allow you to express intent in markup, and leave another level to convert that intent to presentation. This approach starts at the top with technologies such as CSS, but the need for dynamic pages is better addressed by recent additions such as XBL (here's an example in mozilla -- think of it as like CSS but binding to script instead of to a fixed set of attributes) and XForms (think of it as a 3-layer model for the web page -- data, logic, and presentation).

OCR and web sites by Animats · 2007-03-12 05:32 · Score: 1

If the background is so cluttered as to make the OCR difficult, then chances are the human will have trouble reading it too.

Web site images with logos against faint but busy backgrounds are moderately common. I'm talking about stuff like this. Commercial OCR programs interpret that as "a picture". Because we're working to automatically extract business identities from uncooperative websites, we sometimes need heavier technology than the search engines.

Content creators vs. indexers by dinomite · 2007-03-12 05:51 · Score: 1

Given that search-driven traffic is so important to sites, indexers such as Google have a huge amount of power. If the search engines choose to not index Javascript-generated content (or other dynamic content), will content creators avoid putting real content within such elements?

--
Dinomite.net

Intent-based markup -- a look ahead by leighklotz · 2007-03-12 06:18 · Score: 1

Take a look at the Web Accessibility roadmap from the W3C, and in particular the section on intent-based markup.

Translation for the Digg Crowd by JiveBay · 2007-03-12 12:09 · Score: 1

Your comment on Digg would have been:

"blogspam, no digg"

What a dumb idea by Sloppy · 2007-03-12 13:07 · Score: 1

Now, it's too early to say conclusively that Google will never index the JavaScript-generated content..

..but still, we can hope Google doesn't completely cave in to useless trendy bullshit.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.

Modern AJAX Driven Sites Lose by Siker · 2007-03-12 13:35 · Score: 1

There is at least one site I know of which is entirely AJAX based for its contents: http://www.spotplex.com/ . Presumably as a bandwidth limiting measure they populate all their content using AJAX calls. When you load the page in Lynx it's basically empty.

zonkdogfology is a GooglePig by jetcityorange · 2007-03-14 10:54 · Score: 1

"Zonkdogfology" is a GooglePig. That is zonkdogfology is a Google probe, designed to plumb the depths of the plumbing and report back its results. Just like the pigs used in pipelines. You know, the real world pipes.

180 comments