Trending Low-Volume Google Searches with Gootrude

Posted by ryuzaki0 on Monday June 16, 2008 @02:01AM from the stuff-to-play-with dept.

michaelrash writes "The Google Trends project provides some visibility into how popular search terms like 'Myspace' or '2008 Election' change over time and points out relevant news articles that create jumps in search volume. This is a handy tool, but there are many search terms that Google Trends does not display any results for. Such terms (such as 'Linux Firewalls' — with the quotes) have insufficient search volumes to display graphs according to the error message that Google Trends generates. Fair enough. Google sets an internal threshold on search volume, and this threshold could be set for reasons that range anywhere from Google Trends is still experimental to Google not wanting to provide data on how it builds its massive search index for emerging search terms. Either way, I would like a way to see search term trends that Google doesn't currently make available to me. So, I've released an open source project called 'Gootrude' to do just this. For the past year Gootrude has collected a set of low-volume search terms and interfaced with Gnuplot to visualize them."

37 comments

wow by Gewalt · 2008-06-16 02:09 · Score: 2, Insightful

wow, um...congrats I think? I mean, after you get over your pat on the back, can anyone explain why this matters?

--
Modding Trolls +1 inciteful since 1999
1. Re:wow by Gewalt · 2008-06-16 02:21 · Score: 0, Redundant
  
  It's not a troll. His data is not what Google Trends reports, and isn't even remotely comparable to what Google Trends reports. In short, his results do not have any use at all. So really, can anyone explain why this matters?
  
  --
  Modding Trolls +1 inciteful since 1999
2. Re:wow by Anonymous Coward · 2008-06-16 06:59 · Score: 0
  
  I guess the work is informative in that it shows that the number of results reported by Google is rather inaccurate. My guess is that there is a deliberate part of the number coming from /dev/random as well.
  
  Why the author of TFA fails to reach this conclusion is strange. Maybe somebody should tell him.
  
  HB.
Impressive by SplatMan_DK · 2008-06-16 02:11 · Score: 0

I took the time to look through the work - looks impressive for a "hobby project".

The only thing I feel is missing is more options to narrow the searches and statistics on geographical information.

Does anybody have some thoughts on how reliable this tool is? And what are the terms for using (read: distributing) the data/results?

- Jesper

--
My security clearance is so high I have to kill myself if I remember I have it...
It it only me.... by vidarh · 2008-06-16 02:11 · Score: 4, Insightful

... or does the author of this tool seemingly not realize that Google Trends reports volume of searches, while what he's tracking is amount of documents indexed for a search term, and that there's no basis for assuming the two are correlated in a meaningful way?
1. Re:It it only me.... by Gewalt · 2008-06-16 02:15 · Score: 5, Interesting
  
  I find it highly unlikely that someone who can make the page in question would not be smart enough to also understand what it is that google/trend is really doing, and as such, I choose to believe instead that the author is being intentionally deceptive.
  
  --
  Modding Trolls +1 inciteful since 1999
2. Re:It it only me.... by aleph42 · 2008-06-16 02:27 · Score: 3, Insightful
  
  Agreed, the summary is misleading, as is the comparison (from TFA) to Google Trends.
  
  This aside, the interest of "gootrude" is that it's not provided by Google, and so it's part of the many efforts to reverse engineer how Google comes up with its numbers.
  
  Specifically, it appears from TFA that the "number of results" stated by Google is a wild guess for low numbers (1,000-10,000), with very sharp variations which hint at an iterative process.
  
  So as I get it, it's not a tool for you and me, rather for google specialists.
  
  --
  Don't take my posts literally; it's just code to control my botnet.
3. Re:It it only me.... by Idimmu+Xul · 2008-06-16 02:42 · Score: 1
  
  The perspective he seems to be taking is not so much 'what users search for' but more 'what users post about or publish' with a view to studying the correlation of a large site publishing something and then the number of other websites or pages picking it up and running with it.
  
  I'm pretty sure he understands what he's doing, the article summary is just a bit twisted.
  
  --
  Free Playstation 3, XBox 360 and Nintendo Wii
  
  --
  The problem with slashdot is that most of its users were bullied and stuffed into lockers as kids!
4. Re:It it only me.... by kestasjk · 2008-06-16 04:12 · Score: 1
  
  "I find it highly unlikely that someone who can make the page in question would not be smart enough to also understand what it is that google/trend is really doing, and as such, I choose to believe instead that the author is being intentionally deceptive."
  
  It's a trap!
  
  --
  // MD_Update(&m,buf,j);
But wait... that's not it at all by Gewalt · 2008-06-16 02:13 · Score: 0, Redundant

Google Trends plots how popular a search phrase is. This mashup of Google results is not that at all. It is nothing more than a mashup of the count of pages in Google's database. It has nothing to do with how often a phrase is searched for.

--
Modding Trolls +1 inciteful since 1999
Different data by UnHolier+than+ever · 2008-06-16 02:17 · Score: 2, Informative

Google Trends plots the frequency of queries, i.e. the number of times information is asked about a subject. Gootrude plots the number of pages found, or the quantity of information google can retrieve on this subject. These are completely different.
1. Re:Different data by alnicodon · 2008-06-16 07:03 · Score: 1
  
  Many thanks for making this clear : this is also what I had fathomed from the very clear summary, but wasn't too sure.
  Well.. we might actually be the two wrong ones :)
  Al.
Singular works okay. by palegray.net · 2008-06-16 02:17 · Score: 1, Informative

"Such terms (such as 'Linux Firewalls' - with the quotes) have insufficient search volumes to display graphs according to the error message that Google Trends generates."

Try "Linux Firewall" in quotes as the search term for some results.

--
512 MB RAM, 20 GB disk, 200 GB transfer, five datacenters. $19.95/month.
Spore by Chemisor · 2008-06-16 02:18 · Score: 0, Offtopic

Have you noticed how "spore demo" is the 77th top search? On the WHOLE INTERNET! :)
1. Re:Spore by Anonymous Coward · 2008-06-16 02:59 · Score: 0
  
  have you noticed that "hot asian luv" is 99th? with 99% of the searches coming out of Mcallen, TX?
Not at all the same! by molo · 2008-06-16 02:19 · Score: 0, Redundant

Google Trends measures what people are searching for, while Gootrude measures how many results are in the Google database for a given term. These are not even remotely the same thing.

-molo

--
Using your sig line to advertise for friends is lame.
1. Re:Not at all the same! by Anonymous Coward · 2008-06-16 03:23 · Score: 0
  
  Right. Not only is it not the same, the data is clearly teaching us very little except the quirks of Google. For example, he shows us a graph of number of search results for a particular query, going up and down in a periodic fashion. This makes absolutely no sense as real data, because on today's Internet, pages mostly get added and very few are deleted. The one example with the huge spike makes even less sense (and he admits to it) - what, 50,000 pages got created on some topic and then got deleted? No - this just probably shows us Google quirks: the already famous "Google Dance" (Google switching between indices), crawler bugs, and so on. Maybe it's an interesting topic to discuss, but it has nothing to do with Google Trends, and doesn't tell me much about the trend of a given word (well, except being able to tell us when a hype on a given word *started*).
Not allowed by google by swarsron · 2008-06-16 02:39 · Score: 3, Informative

Besides not being the same as Google Trends, this tool is not allowed by Google's TOS. Automatic querying of their services without prior permission is forbidden by Google. But since it probably won't put any noticeable load on their network, they most likely won't care.
1. Re:Not allowed by google by Vectronic · 2008-06-16 03:20 · Score: 4, Insightful
  
  Until there was an article posted on Slashdot that is.
2. Re:Not allowed by google by vrmlguy · 2008-06-16 03:56 · Score: 0, Offtopic
  
  Since I'm always forgetting to log my business driving, I've got a program that uses Google Maps to figure out the driving distance between various pairs of points. It uses two files, one consisting of about 250 lines like this:
  
    home, office, client-a, restaurant-x, client-b, home
    home, client-b, restaurant-y, client-b, home
  
  and the other listing street addresses for everyone. I'm sure it's a big violation of Google's ToS, but it tries to play fair: it caches the distances that it discovers (e.g. so that the distance from client-b to home is only requested once), it waits one-to-two minutes between queries, and I only use it once a year at tax time when I'm calculating my business expenses.
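
The caching idea described above can be sketched in a few lines of Python. This is only an illustration, not the poster's actual program: `total_distance` and the `lookup_distance` callback are hypothetical names, and the callback stands in for whatever performs the real (rate-limited) map query.

```python
def total_distance(trips, lookup_distance):
    """Sum the length of every leg in every trip, querying each leg once.

    trips is a list of trips, each a list of place names in visiting order.
    lookup_distance(origin, destination) is a hypothetical callback that
    performs the slow external query (including any polite waiting).
    """
    cache = {}  # (origin, destination) -> distance; direction matters
    total = 0.0
    for trip in trips:
        # Pair each stop with the next one to get the legs of the trip.
        for origin, destination in zip(trip, trip[1:]):
            leg = (origin, destination)
            if leg not in cache:
                cache[leg] = lookup_distance(origin, destination)
            total += cache[leg]
    return total, cache
```

With the two sample lines from the comment, there are nine legs but only eight distinct ones, because client-b to home appears in both trips and is looked up once.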
  
  --
  Nothing for 6-digit uids?
3. Re:Not allowed by google by icyslush · 2008-06-16 05:26 · Score: 1
  
  Google has a relatively simple API you can apply for to allow for a fixed number of automated queries of their system. It doesn't actually give you new functionality but does make automated queries of their databases "authorized". Without the API license key, you run the risk of getting noticed by them and ban-hammered if they think you're just a bot scraping their data, something they do NOT like. I think this article just got in because it had both Google and Open Source as subjects. If they have figured a clean way to find SEARCH volume (which is hard) as opposed to RESULTS volume (which is stupidly easy), get back to me. :)
4. Re:Not allowed by google by swarsron · 2008-06-16 06:32 · Score: 2, Informative
  
  Google doesn't give out any more keys for this API; only old keys continue to work. So if you don't already have a key, you're out of luck.
5. Re:Not allowed by google by icyslush · 2008-06-16 08:00 · Score: 1
  
  Really? Whoops! [hides google key in lead lined safe]
6. Re:Not allowed by google by bobbozzo · 2008-06-17 14:31 · Score: 1
  
  That sounds really useful; got the code posted anywhere?
  
  thanks
  
  --
  Nothing to see here; Move along.
7. Re:Not allowed by google by vrmlguy · 2008-06-18 02:27 · Score: 1
  
  I'll try to post it when I get home tonight. Ironically, I'll probably post it on my Googlepage.
  
  --
  Nothing for 6-digit uids?
Time for me... by jalet · 2008-06-16 03:19 · Score: 1

to do something similar with my parody of google where search terms can be looked at in real time (empty or spammy search terms are replaced with fake words on display, but not in the history).

--
Votez ecolo : Chiez dans l'urne !
Privacy anyone? by Anonymous Coward · 2008-06-16 03:30 · Score: 0

So, nobody really likes the amount of data that Google collects on everybody, and there's a constant trickle of scandal about "anonymized" search results not being anonymous enough. I myself have stopped using Google as much as possible due to these shenanigans....

But then stuff like this gets written AND slashdotted. What's the deal? I'd much rather know NOTHING about Google's web search trends than inch even one micrometer closer to living in a panopticon.

In a funny coincidence, my CAPTCHA was "lynched", so, flame on!
Suggestions for improvements by vrmlguy · 2008-06-16 03:32 · Score: 0

Everyone has already noted that this only tracks hits, not searches. I'd like to suggest a few code improvements.

At a high level, use RRD (http://search.cpan.org/~nicolaw/RRD-Simple-1.43/lib/RRD/Simple.pm) for the underlying database. RRD is used by MRTG to track time-varying data over multiple time scales, keeping details for recent data and summaries for historical data. RRD also comes with its own plotting module, although you could keep using Gnuplot if you wish.

In the code itself, there are places where there are "elsif" clauses without an "else" clause. One seems alright, but should have a null "else" to document that fact. The other, however, is testing keywords from the config file, and should flag any that are unrecognized.
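
To illustrate the second point, here is a minimal Python sketch of a config reader that collects unrecognized keywords instead of silently skipping them. It is not Gootrude's code (which is Perl); the function name, the "KEYWORD value" line format, and the keyword names in the usage note are all assumptions made for the example.

```python
def parse_config(lines, known_keywords):
    """Split 'KEYWORD value' lines into settings and a list of unknown keywords.

    Blank lines and lines starting with '#' are ignored. Anything whose
    keyword is not in known_keywords is reported with its line number,
    which plays the role of the final 'else' branch.
    """
    settings = {}
    unrecognized = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, value = line.partition(" ")
        if keyword in known_keywords:
            settings[keyword] = value.strip()
        else:
            # The catch-all: never let a misspelled keyword pass unnoticed.
            unrecognized.append((number, keyword))
    return settings, unrecognized
```

Given a known set of `{"SEARCH_TERM", "PLOT_WIDTH"}`, a line such as `PLOT_WIDHT 800` ends up in the unrecognized list with its line number, so the caller can warn or abort.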

Finally (and this is probably nit-picking), instead of this:

  return unless EXPR;
  do something;
  return;

I'd use this:

  if EXPR {
    do something;
  }
  return;

--
Nothing for 6-digit uids?
Over 2 hours by TheCycoONE · 2008-06-16 04:41 · Score: 1

This article has been on /. for almost 3 hours and "Linux Firewalls" still isn't a significant enough search query for Google Trends? Well THAT is surprising.
1. Re:Over 2 hours by El_Oscuro · 2008-06-16 12:41 · Score: 1
  
  Just did it a few times. If everyone on /. does it, maybe we can hose their statistics...
  
  --
  "Be grateful for what you have. You may never know when you may lose it."
Re:a few different results... by lpq · 2008-06-16 06:37 · Score: 2, Informative

Just did searches on all of the terms the author mentions and got a few different numbers:

1. "iptables attack visualization" -- 19 results (~35) (close)
2. "single packet authentication" -- 93 (1,300) -- off by more than 1 magnitude
3. "linux firewalls attack detection" - 9290
3a. "Linux Firewalls Attack Detection" - 9240 (~9000) (close)
4. cipherdyne -- 85,200 (~70,000) ~off a bit
4a.Cipherdyne -- 84,500 (~70,000)
5. gpgdir (same)
6. fwsnort (same)
-------
Note...caps vs. no caps made no difference on 1, 2 and 5. But for terms 3 & 4, caps made a slight difference ... anyone know why? I thought caps were supposed to be ignored?

Most were close, but cipherdyne had about a 15% difference, but the worst was "single packet authentication" -- That one was off by more than 10x! Wonder what's up with that.
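
For anyone who wants to check the arithmetic, here is a small Python sketch using the figures listed above. The function names are made up for the example, and which number counts as the baseline is a choice the comment does not specify.

```python
def relative_difference(value, baseline):
    """Absolute gap between two counts, as a fraction of the baseline."""
    return abs(value - baseline) / baseline


def ratio(a, b):
    """How many times larger the bigger count is than the smaller one."""
    return max(a, b) / min(a, b)


# cipherdyne: 85,200 results seen here vs. roughly 70,000 in the article
gap_vs_article = relative_difference(85200, 70000)
gap_vs_search = relative_difference(70000, 85200)

# "single packet authentication": 93 results seen here vs. 1,300 in the article
spa_ratio = ratio(93, 1300)
```

Depending on the baseline, the cipherdyne gap comes out at roughly 18% to 22%, and the "single packet authentication" counts differ by a factor of about 14.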

Interesting curiosities...
OT : Moving average and graphs by 4D6963 · 2008-06-16 06:38 · Score: 1

Every time I see graphs with a moving average, be it in TFA or some stock market graph, it makes me cringe. OK, the moving average isn't the best filter out there; there's a whole range of finite impulse response filters that have a more desirable frequency response than a moving average (which is convolution with a rectangle, which means its frequency response is essentially a sinc function, which means a shitload of ripples), but why on Earth don't they compensate for the delay induced by the convolution?
Why do they let it have half the rectangle's width in delay when they could just compensate it so that the curve wouldn't look offset compared to the original data? And most mind-bogglingly, why on Earth do the same sort of people add another curve that is the difference between the original data and the delayed moving average?? Why oh why? It's senseless, because if the moving average were compensated, you could call the difference a high-pass filter and directly look at the high frequency components of the original data without adding any parasitic low frequency component which doesn't correspond to anything desirable.
Someone enlighten me please.
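
The delay being described can be shown with a short Python sketch. It assumes a plain unweighted average over an odd-length window; the function names are invented for the example and nothing here is taken from Gootrude's plotting code.

```python
def trailing_moving_average(values, window):
    """Average of each sample and the (window - 1) samples before it.

    Element k of the result covers samples k .. k + window - 1.
    """
    out = []
    for i in range(window - 1, len(values)):
        out.append(sum(values[i - window + 1 : i + 1]) / window)
    return out


def centered_moving_average(values, window):
    """Same averages, each reported at the middle of its window.

    Returns (index, average) pairs. The window must be odd so that the
    middle of the window falls exactly on a sample.
    """
    if window % 2 == 0:
        raise ValueError("use an odd window so the centre lands on a sample")
    half = window // 2  # this is the delay a trailing plot introduces
    averages = trailing_moving_average(values, window)
    return [(k + half, avg) for k, avg in enumerate(averages)]


ramp = list(range(10))                       # 0, 1, 2, ..., 9
trailing = trailing_moving_average(ramp, 5)  # first value 2.0, usually plotted at index 4
centered = centered_moving_average(ramp, 5)  # first value 2.0, plotted at index 2
```

On a straight ramp the centered version lands exactly on the original data, while the usual trailing plot shows the same curve shifted right by two samples, i.e. (window - 1) / 2.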

--
You just got troll'd!
1. Re:OT : Moving average and graphs by Anonymous Coward · 2008-06-16 08:19 · Score: 0
  
  I was told there would be no math.
Re:Graph colors by Ihmhi · 2008-06-16 09:20 · Score: 0, Flamebait

How can this be Redundant? It's the first post and it damn well makes some sense. My vision is fine and the red/green color scheme burns my retinas.

But I'm sure some fanboys of this project/Google will mod me down because the project is ignoring little things like aesthetics or making the data viewable to someone without sunglasses.

--
Random Thoughts From A Diseased Mind (Not For Dummies)
Privacy? by Temporal · 2008-06-16 11:11 · Score: 2, Insightful

"Google sets an internal threshold on search volume, and this threshold could be set for reasons that range anywhere from Google Trends is still experimental to Google not wanting to provide data on how it builds its massive search index for emerging search terms."
Or maybe for privacy reasons? Some search queries implicitly reveal the identity of the person making them. Such queries are naturally low-volume, so refusing to show low-volume queries is an effective way to protect the privacy of the searchers.
michaelrash by michaelrash · 2008-06-16 16:24 · Score: 1

I have updated my original post to address some of the comments made here on Slashdot. Peer review is always good, and thank you all for the insights.