Slashdot Mirror


A Statistical Review of 1 Billion Web Pages

chrisd writes "As part of a recent examination of the most popular HTML authoring techniques, my colleague Ian Hickson parsed a billion web pages from the Google repository to find out which class names, elements, attributes, and related metadata are the most popular. We decided that publishing this would be of significant utility to developers. It's also a fascinating look into how people create web pages. For instance, one thing that surprised me was that <title> is more popular than <br>. The graphs in the report require a browser with SVG and CSS support (like Firefox 1.5!). Enjoy!"

4 of 294 comments

  1. what's the point of a 1 billion page sample? by ecklesweb · · Score: 3, Interesting

    I have to ask, what's the purpose of a 1-BILLION page sample? The beautiful thing about statistics is that if you can say something about the distribution of characteristics within a population, you don't have to survey the entire population to get meaningful results. Are the study authors proposing that no standard distribution can be applied to the entire universe of web pages? If that's the case, then do the statistics they apply to their sample of one billion really say anything predictive about the entire population?

    Aside from the cool factor of saying they sampled a billion pages, I don't see what extra benefits are gained from that extra effort.
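
To put a rough number on that argument, here is a minimal sketch of how the margin of error for an estimated proportion shrinks with sample size. It assumes a simple random sample and the usual normal approximation, neither of which the report is known to claim; the function name `margin_of_error` and the sample sizes are hypothetical and chosen only for illustration.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p estimated from n samples.

    Assumes simple random sampling and the normal approximation;
    p = 0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Compare a few hypothetical sample sizes up to one billion pages.
    for n in (1_000, 1_000_000, 1_000_000_000):
        print(f"n = {n:>13,}: +/- {100 * margin_of_error(n):.4f} percentage points")
```

Under these assumptions the margin shrinks only with the square root of the sample size, so going from a million pages to a billion tightens the estimate by a factor of about 32.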

  2. Re:BR tag? by masklinn · · Score: 3, Interesting

    Small stat? Are you joking?

    This is about the number of sites that use the tag, not the number of tags out in the wild. <br> is used on more pages than <table>, and there are as many pages with at least one <br> as pages with at least one <img> tag.

    That's freaking huge, for a tag that should almost never be used.

    --
    "The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
  3. Heh by Z0mb1eman · · Score: 3, Interesting
    This reminds me of the old joke that there only ever was one 'make' script, and everyone else modified it.

    I wonder how much of what they found is influenced by how people learned to write HTML - which in all likelihood was to copy code from existing pages... might explain parts of what they found, such as:

    "Most people (roughly 98%) include head, html, title and body elements. This is somewhat ironic, since three of those four elements are optional in HTML."
    --
    ClutterMe.com - easiest site creation on the Net. Just click and type.
  4. Re:Benford's Law by EvanED · · Score: 3, Interesting

    I had an interesting run-in with Benford's law a bit ago. I had this typed up already, so here goes (description of the law omitted; read the Wikipedia link in the parent -- it's really cool):

    You see, my hard drive crashed about two weeks ago. It had three partitions on it, and two of them are still perfectly readable. The third is pretty well shot. (Fortunately, it was the most useless partition; its main content was Windows itself. This does mean ANOTHER Windows installation -- after having to do one a few weeks before -- but really that's no biggie compared with my actual data. And while I'm on that subject, I had two hard drives; when I got the newer one, I put all my work stuff on it as well as a new Linux installation specifically because it was less likely to fail, and I look back at that decision now with great happiness, because it is that foresight that has made this no big deal at all.)

    I've been trying to recover data off of the third partition, and if you do a full scan of the partition, the data appears as if it was just deleted. Most of the time the recovery programs are able to recover information, but not always: folder names are often lost. They show up in the recovery programs I tried as just Folder2393, for example. (Numbers ranged from 2 to 5 digits.)

    The folder numbers approximately follow Benford's law.

    Here is the approximate distribution:
    Most significant digit   % of folders   Ideal Benford %
    1                        32             30.1
    2                        15             17.6
    3                        12             12.5
    4                        12              9.7
    5                        19              7.9
    6                         3              6.7
    7                         3              5.8
    8                         2              5.1
    9                         2              4.6
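
The "Ideal Benford %" column comes from the standard formula P(d) = log10(1 + 1/d) for leading digit d. The sketch below recomputes it and prints it beside the observed percentages from the comment. The names `benford_percent`, `leading_digit`, and `observed` are hypothetical helpers introduced here for illustration, not anything the commenter used.

```python
import math


def benford_percent(digit: int) -> float:
    """Expected percentage of values whose leading digit is `digit` under Benford's law."""
    return 100 * math.log10(1 + 1 / digit)


def leading_digit(n: int) -> int:
    """Most significant decimal digit of a nonzero integer, e.g. 2393 -> 2."""
    return int(str(abs(n))[0])


# Observed percentages copied from the table in the comment above.
observed = {1: 32, 2: 15, 3: 12, 4: 12, 5: 19, 6: 3, 7: 3, 8: 2, 9: 2}

if __name__ == "__main__":
    print("digit  observed %  Benford %")
    for d in range(1, 10):
        print(f"{d:>5}  {observed[d]:>10}  {benford_percent(d):>9.1f}")
```

To check a real set of folder numbers, one could apply `leading_digit` to each number and tally the results with `collections.Counter` before comparing against `benford_percent`.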