Trending Low-Volume Google Searches with Gootrude
michaelrash writes "The Google Trends project provides some visibility into how the popularity of search terms like 'Myspace' or '2008 Election' changes over time, and points out relevant news articles that create jumps in search volume. This is a handy tool, but there are many search terms that Google Trends does not display any results for. According to the error message that Google Trends generates, such terms (for example 'Linux Firewalls', with the quotes) have insufficient search volume to display graphs. Fair enough. Google sets an internal threshold on search volume, and the reasons for this threshold could range anywhere from Google Trends still being experimental to Google not wanting to provide data on how it builds its massive search index for emerging search terms. Either way, I would like a way to see search term trends that Google doesn't currently make available to me. So, I've released an open source project called 'Gootrude' to do just this. For the past year Gootrude has collected data on a set of low-volume search terms and interfaced with Gnuplot to visualize them."
wow, um...congrats I think? I mean, after you get over your pat on the back, can anyone explain why this matters?
Modding Trolls +1 inciteful since 1999
I took the time to look through the work - looks impressive for a "hobby project".
The only thing I feel is missing is more options to narrow the searches and statistics on geographical information.
Does anybody have some thoughts on how reliable this tool is? And what are the terms for using (read: distributing) the data and results?
- Jesper
My security clearance is so high I have to kill myself if I remember I have it...
... or does the author of this tool seemingly not realize that Google Trends reports volume of searches, while what he's tracking is amount of documents indexed for a search term, and that there's no basis for assuming the two are correlated in a meaningful way?
Google Trends plots how popular a search phrase is. This mashup of Google results is not that at all. It is nothing more than a count of the pages in Google's database that match a phrase. It has nothing to do with how often the phrase is searched for.
Google Trends plots the frequency of queries, i.e. the number of times information is asked for about a subject. Gootrude plots the number of pages found, i.e. the quantity of information Google can retrieve on that subject. These are completely different.
512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
Have you noticed how "spore demo" is the 77th top search? On the WHOLE INTERNET! :)
Google Trends measures what people are searching for, while Gootrude measures how many results are in the Google database for a given term. These are not even remotely the same thing.
-molo
Using your sig line to advertise for friends is lame.
Besides not being the same as Google Trends, this tool is not allowed by Google's TOS. Automated querying of their services without prior permission is forbidden. But since it probably won't put any noticeable load on their network, they most likely won't care.
to do something similar with my parody of google where search terms can be looked at in real time (empty or spammy search terms are replaced with fake words on display, but not in the history).
Vote green: crap in the ballot box!
So, nobody really likes the amount of data that Google collects on everybody, and there's a constant trickle of scandal about "anonymized" search results not being anonymous enough. I myself have stopped using Google as much as possible due to these shenanigans....
But then stuff like this gets written AND slashdotted. What's the deal? I'd much rather know NOTHING about Google's web search trends than inch even one micrometer closer to living in a panopticon.
In a funny coincidence, my CAPTCHA was "lynched", so, flame on!
Everyone has already noted that this only tracks hits, not searches. I'd like to suggest a few code improvements.
At a high level, use RRD (http://search.cpan.org/~nicolaw/RRD-Simple-1.43/lib/RRD/Simple.pm) for the underlying database. RRD is used by MRTG to track time-varying data over multiple time scales, keeping details for recent data and summaries for historical data. RRD also comes with its own plotting module, although you could keep using Gnuplot if you wish.
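For anyone who hasn't met the idea, here is a rough sketch of what "details for recent data and summaries for historical data" means. It is plain Python, not RRD's actual API or file format; the class name `RoundRobinSeries` and its parameters are made up for illustration.

```python
from collections import deque


class RoundRobinSeries:
    """Toy round-robin store: recent samples in full, older ones as averages."""

    def __init__(self, recent_size=7, summary_size=52, bucket=7):
        self.recent = deque(maxlen=recent_size)    # full-detail samples
        self.summary = deque(maxlen=summary_size)  # one average per bucket
        self.bucket = bucket                       # samples per summary point
        self._pending = []                         # samples awaiting consolidation

    def add(self, value):
        """Record one sample; consolidate once a full bucket has accumulated."""
        self.recent.append(value)
        self._pending.append(value)
        if len(self._pending) == self.bucket:
            self.summary.append(sum(self._pending) / self.bucket)
            self._pending = []


# Hypothetical result counts for one search term, one per collection run.
series = RoundRobinSeries(recent_size=3, summary_size=2, bucket=2)
for count in [10, 20, 30, 40, 50, 60]:
    series.add(count)
```

Both deques have a fixed maximum length, so storage never grows: old detail falls off the end while its averages live on a while longer in the summary.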
In the code itself, there are places where there are "elsif" clauses without an "else" clause. One seems alright, but should have a null "else" to document that fact. The other, however, is testing keywords from the config file, and should flag any that are unrecognized.
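To show the shape of the second case, here is a minimal sketch in Python rather than the project's own code. The keywords `SEARCH_TERM` and `PLOT_TITLE` are hypothetical, not taken from Gootrude's config format; the point is only the final `else`.

```python
def parse_config_line(line, config):
    """Store one 'KEYWORD value' line, rejecting unknown keywords."""
    keyword, _, value = line.strip().partition(" ")
    if keyword == "SEARCH_TERM":        # hypothetical keyword
        config.setdefault("terms", []).append(value.strip())
    elif keyword == "PLOT_TITLE":       # hypothetical keyword
        config["title"] = value.strip()
    else:
        # The branch being asked for: a typo in the config file is
        # reported instead of being silently ignored.
        raise ValueError(f"unrecognized config keyword: {keyword!r}")
    return config


config = {}
parse_config_line("SEARCH_TERM linux firewalls", config)
parse_config_line("PLOT_TITLE Low-volume terms", config)
```

With this in place, a misspelled line such as "SERCH_TERM foo" stops the run with an error rather than quietly dropping a term from the graphs.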
Finally (and this is probably nit-picking), instead of this:
    return unless EXPR;
    do something;
    return;
I'd use this:
    if (EXPR) {
        do something;
    }
    return;
Nothing for 6-digit uids?
This article has been on /. for almost 3 hours and "Linux Firewalls" still isn't a significant enough search query for Google Trends? Well THAT is surprising.
Just did searches on all of the terms the author mentions and got a few different numbers:
... anyone know why? I thought caps were supposed to be ignored?
1. "iptables attack visualization" -- 19 results (~35) (close)
2. "single packet authentication" -- 93 (1,300) -- off by more than an order of magnitude
3. "linux firewalls attack detection" -- 9,290
3a. "Linux Firewalls Attack Detection" -- 9,240 (~9,000) (close)
4. cipherdyne -- 85,200 (~70,000) -- off a bit
4a. Cipherdyne -- 84,500 (~70,000)
5. gpgdir (same)
6. fwsnort (same)
-------
Note...caps vs. no caps made no difference on 1, 2 and 5. But for terms 3 & 4, caps made a slight difference.
Most were close. cipherdyne had about a 15% difference, and the worst was "single packet authentication": that one was off by more than 10x! Wonder what's up with that.
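A quick way to put numbers on "close" versus "off by 10x" is to compute the ratio between the two counts. The sketch below assumes that the first figure on each line is the count from my own search and the figure in parentheses is the value read off the author's graph; the function name `discrepancy` is just for illustration.

```python
import math


def discrepancy(count_a, count_b):
    """Return (ratio, orders_of_magnitude) between two result counts."""
    ratio = max(count_a, count_b) / min(count_a, count_b)
    return ratio, math.log10(ratio)


# (my search, value read off the graph); the graph values are approximate.
counts = {
    "iptables attack visualization": (19, 35),
    "single packet authentication": (93, 1300),
    "cipherdyne": (85200, 70000),
}

for term, (searched, graphed) in counts.items():
    ratio, magnitude = discrepancy(searched, graphed)
    print(f"{term}: {ratio:.1f}x apart ({magnitude:.2f} orders of magnitude)")
```

By this measure "single packet authentication" is roughly 14x apart, so a bit over one order of magnitude, while cipherdyne is only about 1.2x apart. The exact percentage for cipherdyne depends on which count you take as the baseline and on how the graph value is rounded.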
Interesting curiosities...
Every time I see graphs with a moving average, be it in TFA or some stock market graph, it makes me cringe. OK, the moving average isn't the best filter out there: there's a whole range of finite impulse response filters with a more desirable frequency response than a moving average (which is convolution with a rectangle, which means its frequency response is essentially a sinc function, which means a shitload of ripples). But why on Earth don't they compensate for the delay induced by the convolution?
Why do they let it have half the rectangle's width in delay when they could just compensate for it, so that the curve wouldn't look offset compared to the original data? And most mind-bogglingly, why on Earth do the same sort of people add another curve that is the difference between the original data and the delayed moving average? Why oh why? It's senseless: if the moving average were compensated, you could call the difference a high-pass filter and look directly at the high-frequency components of the original data, without adding any parasitic low-frequency component that doesn't correspond to anything desirable.
Someone enlighten me please.
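To make the delay concrete, here is a small sketch in plain Python, with made-up data and nothing taken from Gootrude or Gnuplot. It computes the same moving average twice: once plotted at the newest sample of each window (the usual trailing form), and once shifted back by half the window so that it lines up with the data. An odd window width is assumed so the shift is a whole number of samples.

```python
def moving_average(data, width, centered=False):
    """Return (x, mean) points for a moving average of odd `width`.

    With centered=False each mean is placed at the newest sample of its
    window, which delays the curve by (width - 1) / 2 samples. With
    centered=True the mean is placed at the middle of the window instead.
    """
    half = (width - 1) // 2
    points = []
    for end in range(width - 1, len(data)):
        mean = sum(data[end - width + 1:end + 1]) / width
        x = end - half if centered else end
        points.append((x, mean))
    return points


# A single spike at index 5 makes the offset easy to see.
data = [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0]
trailing = moving_average(data, 3)
centered = moving_average(data, 3, centered=True)

# Subtracting the centered average from the data leaves the fast-moving part.
residual = [(x, data[x] - mean) for x, mean in centered]
```

In the trailing version the bump sits over indices 5 to 7, one sample late; in the centered version it sits over 4 to 6, symmetric around the spike. The price of centering is that the newest (width - 1) / 2 points have no average yet, which may be why live charts tend to use the trailing form.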
You just got troll'd!
How can this be Redundant? It's the first post and it damn well makes some sense. My vision is fine and the red/green color scheme burns my retinas.
But I'm sure some fanboys of this project/Google will mod me down because the project is ignoring little things like aesthetics or making the data viewable to someone without sunglasses.
Random Thoughts From A Diseased Mind (Not For Dummies)
I have updated my original post to address some of the comments made here on Slashdot. Peer review is always good, and thank you all for the insights.