Why Good Data Can Be Hard to Find Online

Posted by ScuttleMonkey on Friday April 18, 2008 @01:14PM from the still-don't-trust-alexa dept.

WSJdpatton writes to mention that Carl Bialik has an interesting look at why good data can be hard to find, much less understand, online. He cites a couple of examples, both Google's first-quarter performance numbers and Alexa's revamp of their number-tracking process. "Now Alexa is incorporating other sources of data -- though it says the prior ranking 'wasn't wrong before, but it was different.' Some sites saw big changes in their rankings following Alexa's move: The tech blog TechCrunch said it fell far from its prior position in Drudge Report territory (rarefied air in Web-traffic terms). On Friday afternoon, Drudge Report ranked 545th, compared with TechCrunch's ranking of 1,784th, according to Alexa's new math."

39 comments

Alexa? No. by Slashdot+Suxxors · 2008-04-18 13:17 · Score: 4, Informative

This isn't exactly on topic, but I think you should give it a read before you form a final opinion on what the article is trying to say.
1. Re:Alexa? No. by jd · 2008-04-18 13:24 · Score: 5, Insightful
  
  The article and the slashdot story seem to say the same thing - the numbers produced are just numbers out of a hat. They don't represent anything meaningful and indeed can't because the participants are self-selecting and therefore not a random sample of the population. This is obvious and always has been. The popularity of a site (or a TV show or anything else) cannot be measured by any simple means, if it can be measured at all.
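
To put rough numbers on the self-selection point, here is a minimal sketch. Everything in it is hypothetical: the two user groups, their sizes, the toolbar install rates, and the visit rates are invented for illustration and are not Alexa's data or method. It compares a site's true reach across a whole population with the reach a self-selected toolbar panel would report.

```python
# Hypothetical population: two groups that differ in how likely they are
# to install a tracking toolbar and in how likely they are to visit a site.
GROUPS = [
    {"name": "casual", "size": 9000, "install_rate": 0.10, "visit_rate": 0.01},
    {"name": "savvy", "size": 1000, "install_rate": 0.01, "visit_rate": 0.50},
]


def true_reach(groups):
    """Fraction of the whole population that visits the site."""
    total = sum(g["size"] for g in groups)
    visitors = sum(g["size"] * g["visit_rate"] for g in groups)
    return visitors / total


def panel_reach(groups):
    """Fraction of toolbar users that visits the site.

    Each group is weighted by how many of its members chose to install
    the toolbar, so the panel is self-selected rather than random.
    """
    panel = sum(g["size"] * g["install_rate"] for g in groups)
    visitors = sum(
        g["size"] * g["install_rate"] * g["visit_rate"] for g in groups
    )
    return visitors / panel


if __name__ == "__main__":
    print(f"true reach:  {true_reach(GROUPS):.3f}")
    print(f"panel reach: {panel_reach(GROUPS):.3f}")
```

With these made-up inputs the panel reports roughly a quarter of the site's true reach, simply because the group that visits the site most is the group least likely to install the toolbar. Flip the install rates and the panel overstates the reach instead.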
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
2. Re:Alexa? No. by TubeSteak · 2008-04-18 13:33 · Score: 4, Interesting
  
  "The article and the slashdot story seem to say the same thing - the numbers produced are just numbers out of a hat. They don't represent anything meaningful and indeed can't because the participants are self-selecting and therefore not a random sample of the population."
  
  Even with a random, statistically relevant sample size... the saying "lies, damn lies, and statistics" still applies.
  
  "The popularity of a site (or a TV show or anything else) cannot be measured by any simple means, if it can be measured at all."
  
  Tivo & other DVRs would suggest otherwise.
  
  --
  [Fuck Beta]
  o0t!
3. Re:Alexa? No. by Firehed · 2008-04-18 14:33 · Score: 4, Informative
  
  Maybe relative tracking can't be done by simple means since it requires participation on everyone's part, but absolute local tracking is trivially easy on any server that supports server-side scripting and has some sort of database access. A couple lines of code at the bottom of your page to insert a new row on a page load and you've got nearly perfect visitor logs that can easily go beyond your standard server logs.
  
  Again, useless for relative popularity unless you have everyone's data. But it still tells you how popular your site is which is great for ego boosting and advertiser stats if nothing else.
  
  (I'd suggest that Google Analytics is going to be a lot more useful in the long run and at least has the potential to provide relative data in addition to the absolute, but anything that relies on client-side scripting is going to give less accurate numbers since clients can disable or screw around with scripting)
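
The "couple lines of code" idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's actual setup: it uses Python's standard `sqlite3` module, and the table name `page_views`, its columns, and the helper names `open_log`, `log_page_view`, and `views_per_page` are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_log(db_path=":memory:"):
    """Open (or create) the visit log. The schema is a hypothetical example."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_views (
            id INTEGER PRIMARY KEY,
            viewed_at TEXT NOT NULL,
            path TEXT NOT NULL,
            ip TEXT,
            user_agent TEXT
        )
        """
    )
    return conn


def log_page_view(conn, path, ip=None, user_agent=None):
    """Insert one row per page load; call this from the page handler."""
    conn.execute(
        "INSERT INTO page_views (viewed_at, path, ip, user_agent) "
        "VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), path, ip, user_agent),
    )
    conn.commit()


def views_per_page(conn):
    """Return a {path: view_count} mapping for the local site."""
    rows = conn.execute(
        "SELECT path, COUNT(*) FROM page_views GROUP BY path"
    )
    return dict(rows)
```

A page handler would call `log_page_view(conn, request_path, client_ip, agent)` on every load. As the comment says, this only measures one site's own traffic; it gives no ranking against anyone else's.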
  
  --
  How are sites slashdotted when nobody reads TFAs?
4. Re:Alexa? No. by Anonymous Coward · 2008-04-18 18:24 · Score: 0
  
  Tivos and DVRs are still comparatively rare. The stats garnered from such sources would be highly influenced by the demographic with access to, and the smarts to use, high-end electronics.
  
  You'd get no less skewed results than you would through Alexa, which generally gets its results from the technically unsavvy.
70% of good data...? by JadeAuto · 2008-04-18 13:43 · Score: 5, Funny

I read online somewhere that 70% of statistics online are made up. This article seems to prove the point. 4 out of 5 slashdotters agree! ;)
1. Re:70% of good data...? by evanbd · 2008-04-18 14:45 · Score: 4, Funny
  
  Four out of Five slashdotters? You must be mad. The only things slashdotters can agree on are that they want to marry CmdrTaco and what concert to see on the honeymoon.
2. Re:70% of good data...? by zotz · 2008-04-19 00:34 · Score: 1
  
  1. Register domain slashdotagrees.com
  2. Get slashdotters to agree.
  3. ???
  4. Profit!
  
  all the best,
  
  drew
  
  --
  FreeMusicPush If you want to see more Free Music made, listen to Free
Losing 1239 positions? That's nothing! by basscomm · 2008-04-18 13:48 · Score: 1

On Friday afternoon, Drudge Report ranked 545th, compared with TechCrunch's ranking of 1,784th, according to Alexa's new math.

I don't know what they're complaining about, mine went down over a million positions!

--
http://crummysocks.com
Another Example: Hitslink by Anonymous Coward · 2008-04-18 14:01 · Score: 1, Informative

Another example besides Alexa of "readjustment" is Hitslink. Last November, they revised their figures for OS share for March through October 2007. Linux went from a reported .81% share in October, to .50%. They made only a brief allusion on their site to filtering out "unrepresentative" hits from their data. Recently, they again revised their Linux share for January 2008, from the original .67% to .64%. Even though Hitslink seems to have trouble deciding how many Linux users there are, that doesn't keep people (like Westlake, who keeps posting Hitslink numbers on Slashdot) from citing them.
It's okay, I found it... by hendridm · 2008-04-18 14:26 · Score: 0, Troll

http://www.gooddata.com/
Data wants to be Free! by Anonymous Coward · 2008-04-18 14:35 · Score: 0

Well, maybe not good data.
Wow, astoundingly obtuse by zappepcs · 2008-04-18 14:45 · Score: 5, Insightful

Just observing the Internet and then reading this ... just wow.

Good data is HARD to find ANY FUCKING WHERE, never mind limiting your search to just online. Seriously!

News online? Read the same story from 8 sources, form your own opinion. MSM sucks worse.

Scientific data? Well, unless it's peer reviewed, you know it's probably suspect and need to verify it with other data. Damn, even peer reviewed scientific data should be compared to other data these days.

How about encyclopedic data? There is Wikipedia, but make sure to corroborate the data, right?

Read it in a blog? Check the data before you make up your mind.

Hmmmm this sounds a lot like trying to find good data before the Internets were active. Damn, all that data is proffered up by humans... Humans are not infallible so I'm guessing that data provided by humans is going to be a bit 'not infallible' also.

Where does the assumption that data online should be good data come from? wtf?

--
Support NYCountryLawyer RIAA vs People
1. Re:Wow, astoundingly obtuse by v(*_*)vvvv · 2008-04-18 15:26 · Score: 1
  
  Absolutely agree. In fact, data is *usually* bad, regardless of medium. Even parents give their children bad advice at times.
  
  On second thought, I suspect they had to do something to make this a story, because no one cares about Alexa really, and this wouldn't have gotten published by the WSJ of all people if it had an honest headline.
No, really. by v(*_*)vvvv · 2008-04-18 15:21 · Score: 4, Funny

"The company tracks the Internet habits of users of its browser toolbar ... These rankings have long been criticized ... because Alexa users may not behave like the Internet as a whole."

Ya, who in the world uses the Alexa toolbar!?
1. Re:No, really. by choongiri · 2008-04-18 15:26 · Score: 4, Insightful
  
  Nobody here. That's the point.
2. Re:No, really. by Sique · 2008-04-18 23:35 · Score: 1
  
  At least nobody visiting my homepage uses Alexa either. "No data". I guess I am somewhat unpopular.
  
  --
  .sig: Sique *sigh*
Good Data can't be found online by (TK2)Dessimat0r · 2008-04-18 15:36 · Score: 2, Funny

Only Bad Lore can.
Drudge by Anonymous Coward · 2008-04-18 15:46 · Score: 0

"good data can be hard to find online..."

Yes, especially if you're reading Drudge.
Good Data Is Not Hard To Find by netdur · 2008-04-18 16:07 · Score: 1

It's just your Google skills that suck... that's what my boss keeps telling me!

--
"Steve Jobs invented the world" -- Bill W. GATES
1. Re:Good Data Is Not Hard To Find by neonmonk · 2008-04-18 22:59 · Score: 1
  
  Jesus H Christ, why do you people insist on changing your font? Especially to a font like that! Don't you realize how fucking annoying it is?
2. Re:Good Data Is Not Hard To Find by netdur · 2008-04-19 10:24 · Score: 1
  
  I did not change any font; this reply form doesn't let you change the font in the first place. That font is there by default, so ask /., not me... click "reply" and see for yourself! Btw, what's H? Jesus Mohammed Christ?
  
  --
  "Steve Jobs invented the world" -- Bill W. GATES
Slashdot is dying. by e9th · 2008-04-18 16:28 · Score: 2, Funny

Alexa confirms it.
Maybe Comcast could provide stats by Animats · 2008-04-18 17:10 · Score: 1

With Comcast's monitoring of user traffic, they could provide reliable stats for their customer base. We ought to get something back from all this Big Brother stuff.
A Good Date by GalacticLordXenu · 2008-04-18 18:31 · Score: 3, Funny

I initially read this as being, "Why a Good Date Can be Hard to Find Online". Hell, I could have told you that! But alas...
And Hard Data... by Anonymous Coward · 2008-04-18 20:04 · Score: 0

... is good to find.
Counterexample via TED.com speaker... by ivi · 2008-04-18 22:35 · Score: 1

A public health expert from Sweden - Hans Rosling, who teaches at Karolinska Institutet - announced (some time ago, already) that he was able to persuade holders of UN-collected population data to publish their data on-line for anyone wanting to analyze it (e.g., using his innovative tools for displaying it: GapMinder).

I would say that the data which he managed to get put on-line for anyone's use might be a counterexample to the poster's claim.

Of course, you can decide for yourself... ;-)

See his 2nd talk at TED.com for URLs and other details regarding access.
Re:Made up statistics by jbengt · 2008-04-19 00:37 · Score: 1

From TFA:
Niall O'Driscoll, vice president of Alexa, declined to tell me the new data sources and formula, calling that "proprietary information."
It's the sooper seekrit way that they make up the statistics that makes it credible.
Numbers Games Are A Loss by Jekler · 2008-04-19 02:21 · Score: 2, Insightful

The reason good data is hard to find online is chiefly a problem with perspective and the models we are using to differentiate good data from bad data. That model primarily relies on the idea that it's all about numbers, or simply that more data is the same as better data. Whenever we come up with bad data, the "quantity model" dictates we just need a larger sample.

This model is directly related to how companies measure TV show quality. The theory is that the more people who watch a show, the better that show must be. This model is obviously faulty; almost everyone can agree that American Idol isn't even in the same qualitative ballpark as The X-Files, Arrested Development, or Star Trek. The model is faulty because the scope of the examination is hugely limited. A number of variable factors aren't being considered: people own more TVs than when Star Trek was on, for example, and the companies are mistaking curious interest for enjoyment. The average person will stop and watch a car wreck for roughly the same amount of time they'll play with a yo-yo. That doesn't mean the entertainment value of each is directly comparable; there's a whole different brain process going on in the observers of each. But the model of measuring quantities assumes that two activities which consume the same amount of time are equivalent in all ways.

Back to internet statistics. All this data mining and gathering is designed to ignore the differences between activities; it only catalogs information for the purpose of finding what's the same. As the article states, Alexa is always checking for biases. Well, the biggest bias in this model is the assumption "in sufficient quantity, all things are interchangeable." It's the assumption that telemarketers and scammers work on, which is why so many people go broke buying into those schemes: they buy into an assumption which is absolutely wrong.

Many internet business models, specifically those of data miners, are designed on the assumption that 1 million hits are the same regardless of where they come from. When you consider real factors, having 1 million people see your hand-made chain pouches at a shopping mall is not going to generate the same level of interest as having 1 million people see them at a renaissance fair.

Of course that introduces a whole different problem with assumptions about targeting (I'm not going to get into that, only state that targeted marketing makes the assumption that timing doesn't matter).

In conclusion, you can't play people as a numbers game. People's behaviors (including their online behaviors) are complex, and any model which treats people's differences as a child might divide up a bag of Skittles by color is going to have a very high error rate.
1. Re:Numbers Games Are A Loss by frsmith · 2008-04-19 23:40 · Score: 1
  
  The problem here is that enough people believe in the 'figures' to make a whole industry out of them.
  
  Marketing seems to work if you have a big enough sample!
  It's like soaps on TV: they churn over the same X million viewers, so they are considered a success.
  The other 48 million (UK) don't watch them, and I would regard that as a failure.
  Word of mouth is now becoming more powerful, as we have the web and can spread the word so much faster.
  This is what film makers are finding out: that crap film would once have done the rounds in the USA/UK and still got an audience, but now mobile phones and forums are killing them at the start.
  
  It's a big time of change for the marketing boys and girls
  Bob
  
  --
  It Seems I've developed an aversion to proprietary software
Rank Page Ranking by nightcats · 2008-04-19 17:31 · Score: 1

This is a problem I've noted before (for example, here). I have an equivalent Google page rank with sites with hundreds of times more traffic. In short, I've yet to see the metrics or analytics tool that is truly reliable.

--
Development is programmable; Discovery is not programmable. (Fuller)