Google Now Searches JavaScript
mikejuk writes "Google has been improving the way that its Googlebot searches dynamic web pages for some time, but it seems to be attracting added interest just at the moment. In the past Google has encouraged developers to avoid using JavaScript to deliver content or links to content because of the difficulty of indexing dynamic content. Over time, however, the Googlebot has incorporated ways of searching content that is provided via JavaScript. Now it seems that it has got so good at the task that Google is asking us to allow the Googlebot to scan the JavaScript used by our sites. Working with JavaScript means that the Googlebot has to actually download and run the scripts, and this is more complicated than you might think. This has led to speculation about whether it might be possible to include JavaScript on a site that could use the Google cloud to compute something. For example, imagine that you set up a JavaScript program to compute n digits of pi, or a Bitcoin miner, and had the result formed into a custom URL, which the Googlebot would then try to access as part of its crawl. By looking at, say, the query part of the URL in the log you might be able to get back a useful result."
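To make the last step of the summary concrete, here is a minimal Python sketch of reading a value back out of the query part of logged URLs. It assumes a combined-format access log; the parameter name `result`, the helper name `results_from_log`, and the sample log lines are all invented for illustration, and a user-agent string can be spoofed, so this is not a reliable way to identify the real Googlebot.

```python
from urllib.parse import urlsplit, parse_qs

def results_from_log(lines, param="result"):
    """Collect values of `param` from requests whose log line mentions Googlebot."""
    found = []
    for line in lines:
        if "Googlebot" not in line:
            continue
        # Combined log format keeps the request in the first quoted field,
        # e.g. "GET /path?query HTTP/1.1".
        try:
            request = line.split('"')[1]
            target = request.split()[1]
        except IndexError:
            continue  # malformed line, skip it
        query = parse_qs(urlsplit(target).query)
        found.extend(query.get(param, []))
    return found

# Invented sample lines (documentation-range IP addresses).
sample = [
    '192.0.2.1 - - [01/Jan/2012:10:00:00 +0000] "GET /r?result=3.14159 HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.9 - - [01/Jan/2012:10:00:05 +0000] "GET /r?result=other HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0"',
]

print(results_from_log(sample))
```

Only the first sample line is picked up, because the second one does not carry a Googlebot user agent.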
Googlebot will have a very quick timeout on scripts and probably won't be more powerful than a standard home computer. How would that be useful for calculating digits of pi or bitcoin mining? It would take far longer than doing it the conventional way.
You can always cut the whole process into smaller steps, each providing a URL that will initiate the next step. Or you can provide several URLs and have the Google cloud compute a problem for you in parallel...
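The "several URLs" variant could be sketched roughly as follows: split the work into slices and emit one URL per slice, so that each fetch only has to handle a small piece. The base URL, the `start`/`end` parameter names, and the helper `chunk_urls` are hypothetical; nothing in the thread says how such a page would actually be laid out.

```python
from urllib.parse import urlencode

def chunk_urls(base, total, step):
    """Yield one URL per work slice; each slice covers the range [start, end)."""
    for start in range(0, total, step):
        end = min(start + step, total)
        yield f"{base}?{urlencode({'start': start, 'end': end})}"

# Ten units of work in slices of four: 0-4, 4-8, 8-10.
urls = list(chunk_urls("http://example.com/work.html", 10, 4))
for url in urls:
    print(url)
```

For the chained variant, each page would instead link only to the URL of the slice that follows it.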
why having other parties fetch your arbitrary code and execute it is such a wonderful idea.
Send Google JavaScript which generates different results for Google than for normal visitors, in order to rank up the site.
The Tao of math: The numbers you can count are not the real numbers.
As a programmer, this still sounds like an extremely dirty hack to me. For the time being, I'll stick to creating gracefully degrading sites, thank you very much.
When I was looking at the page previews (in google) of my JavaScript network scanner, I noticed it listed some IPs, indicating that it was running the script. Just google "http://bwns.be/jim/scanning_printing/detect_range.html" and look at the preview. (Also, most of those IPs probably exist, as my script indicates it is sure about them).
I remember once I was going to try to use Google Cache to see if I could store backups on it.
I still haven't actually bothered doing it to be honest.
And caching in general doesn't seem to show up as often as it used to on websites.
I feel they are only caching larger or active websites, or social.
It would probably require a bit of trial and error.
using javascript to hide or obfuscate email addresses to help protect them from spammers, scammers and bots.
thanks fer nuttin, google.
Now that Google controls the client, the search engine and the analytics, it should not be too difficult for them to see how traffic is flowing between sites. Pages need not even be physically linked for Google to see a connection. E.g. reading an article on the BBC may cause people to search for a company. With people signing into Chrome, Google must have some very rich logs.
Although maybe not quite in the same context. Google used to display javascript-munged email addresses in their search results until some of the larger sites involved, such as Rootsweb, complained.
I really hope website developers and web application developers know the difference between GET and POST requests.
Else, this could turn ugly.
That's a silly idea anyway.
What I expect from Google is to basically download the page and process it, as if it was Chrome, and then diff it against the unprocessed page to figure out which section of content is changed.
Secondly, to avoid stupidity, keep a whitelist and blacklist of scripts that should, and should not, be processed. For example, whitelist scripts on Twitter, Facebook, and Disqus to read comments, but blacklist login pages, local storage, and advertisements. This would let google figure out which content is part of the page and which content is dynamically added to the page that's of contextual value. Google can do its own OAuth and log in to sites that allow authentication from G+ as "The GoogleBot", which would also allow users to ban it from accessing any data they don't want it to see.
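The "diff it against the unprocessed page" idea above can be illustrated with a small Python sketch. It assumes the rendered HTML has already been obtained by some browser-like step that is not shown here, and it is not a description of what Googlebot actually does. The helper name `added_lines` and the sample markup are made up.

```python
import difflib

def added_lines(raw_html, rendered_html):
    """Return the lines present in the rendered page but not in the raw one."""
    diff = difflib.ndiff(raw_html.splitlines(), rendered_html.splitlines())
    # ndiff prefixes inserted lines with "+ "; strip that two-character marker.
    return [line[2:] for line in diff if line.startswith("+ ")]

raw = "<div id='comments'>\n</div>"
rendered = "<div id='comments'>\n<p>First post</p>\n</div>"

print(added_lines(raw, rendered))
```

A line-based diff is crude for HTML. Comparing the parsed trees would be more robust, but the principle is the same.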
I can already picture hackers drooling at the idea of turning Google's cloud into the ultimate zombie network.
If you check out some of the thumbnails, it looks like Googlebot is using a customized version of Chrome now. You can see it blocking plugins.
It's inevitable. Someone will figure out a way to abuse the system that google hasn't thought to make contingencies for yet. I'm on the fence as to whether this is a good idea. I just hope they know what they're doing.
This signature intentionally left blank.
I thought that every (yet unknown) url that is visited by a user from inside Google Chrome is reported back to Google. I guess that could also be used for crawling javascript by using the client's computer for that.
You don't need to actually run the scripts; most of the time it's enough to just scrape the strings and links out of them.
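A rough Python sketch of that scraping approach: pull the quoted string literals out of the script source and keep the ones that look like links. The regular expressions and the helper names are illustrative assumptions. The approach is deliberately naive and misses anything assembled at run time, such as an address built by concatenating fragments.

```python
import re

# A quoted literal: opening quote, then escaped characters or anything that is
# not the same quote, then the matching closing quote.
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")
# Treat absolute http(s) URLs and root-relative paths as links.
URL_LIKE = re.compile(r"^(?:https?://|/)\S+$")

def strings_in_script(source):
    """Return the contents of every quoted string literal in the source text."""
    return [m.group(2) for m in STRING_LITERAL.finditer(source)]

def links_in_script(source):
    """Return only the string literals that look like URLs or paths."""
    return [s for s in strings_in_script(source) if URL_LIKE.match(s)]

script = 'var u = "http://example.com/a"; load(\'/b.html\'); var t = "hello";'
print(strings_in_script(script))
print(links_in_script(script))
```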
Oh yeah, fuck accessibility. Fuck the web in general. "It's better for everybody". That's literally all you need to know. "Just go ahead and remove that from your robots.txt".
I'm not saying there may not be good reasons (e.g. having the CSS and Javascript actually makes it possible to detect invisible text and whatnot, without that search engines may not even have a chance), but I really would appreciate some good reasoning, not being talked to like a fucking 5 year old.
Or hey, how about adding that "of course, not having a unique URL for relevant content is a noob fucking mistake, and generally a cancer everybody is looking forward to eradicate, and irrelevant gimmick content is hardly interesting for search, so if you just went ahead and made a site that doesn't suck butt, that would be fine, too." --- *something* to indicate he isn't fucking clueless.
From the preceding link: "Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler. Visit http://code.google.com/web/controlcrawlindex/docs/faq.html to learn how to instruct robots when they visit your site. You can test your robots.txt file to make sure you're using it correctly with the robots.txt analysis tool available in Google Webmaster Tools."
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
Right. That is exactly what I said. The standard for the internet is well defined. You should read about it. If you make a web page available to the internet without a password, captcha or firewall, etc. you are making it available to all. You have already purposely accepted the condition ahead of time. This is opting in. The robots.txt allows you to opt out instead. If you opt in by placing it on the internet available to web crawlers and not opting out with a robots.txt entry, you opt in to having that data accessible to all, including, but by no means limited to, Googlebot.
Guns don't kill people; Physics kills people! - John Lithgow as Dick Solomon on Third Rock From The Sun
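The robots.txt opt-out described above can be checked locally with Python's standard `urllib.robotparser`. The rules and URLs below are an invented example, not taken from any real site, and the helper `allowed` is just a thin wrapper for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of /private/, allow everything else.
RULES = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def allowed(agent, url):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)

print(allowed("Googlebot", "http://example.com/private/notes.html"))
print(allowed("Googlebot", "http://example.com/index.html"))
```

Note that robots.txt only asks well-behaved crawlers to stay away; it does not protect the content itself.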
I wonder if they're using a standard browser to load pages now or if they have incorporated V8 and WebKit (or something similar) into Googlebot?
Imagine if you could get this on your local machine as a web crawler app, but with filtering capabilities. Traditional web crawlers only work with static content for the very reason that they're not advanced enough to load the entire page, including running javascript, plus there is the overhead of that additional processing, which can be a real kill to your crawling time.
I really hope more details are released on how they're doing this (but not on how they're ranking anything since Google is protective of that).
They've been testing this for a while - We've already had the first complaints against someone spamming an email address that only exists in exactly one place: online, as the result of some (trivial) javascript. Turned out that if you Googled the page, the result snapshot included the javascript-generated address... In other words - it's already there and this will effectively kill javascript as a way of hiding functioning mailto links. Okay, it would be fairly simple to add a condition based on the User Agent, as Googlebot is easily identified, but it will make things a bit more complicated for the average user.
"For every complex problem, there is a solution that is simple, neat, and wrong." -- H.L. Mencken (1880-1956) --
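The User Agent condition mentioned in the comment above might look something like this on the server side. The comment presumably has a client-side JavaScript check in mind, so this Python version is a substitute technique, and the function name, the placeholder text, and the address are all hypothetical. As noted earlier in the thread, showing a crawler different content from what visitors see is itself questionable.

```python
def render_contact(user_agent, address="someone@example.com"):
    """Return a mailto link, unless the request claims to come from Googlebot."""
    if "googlebot" in user_agent.lower():
        # Crawler gets a placeholder instead of the working link.
        return "Contact address hidden from crawlers."
    return f'<a href="mailto:{address}">{address}</a>'

print(render_contact("Mozilla/5.0 (compatible; Googlebot/2.1)"))
print(render_contact("Mozilla/5.0 (X11; Linux x86_64)"))
```

Any bot that sends an ordinary browser user agent still gets the address, so this only hides it from crawlers that identify themselves.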