Slashdot Mirror


A Statistical Review of 1 Billion Web Pages

chrisd writes "As part of a recent examination of the most popular HTML authoring techniques, my colleague Ian Hickson parsed through a billion web pages from the Google repository to find out what are the most popular class names, elements, attributes, and related metadata. We decided that publishing this would be of significant utility to developers. It's also a fascinating look into how people create web pages. For instance, one thing that surprised me was that the <title> is more popular than <br>. The graphs in the report require a browser with SVG and CSS support (like Firefox 1.5!). Enjoy!"

53 of 294 comments

  1. We've come a long way by suso · · Score: 3, Funny

    if the tag isn't on the top elements list.

  2. Blink by suso · · Score: 4, Funny

    the <blink> tag.

    1. Re:Blink by mysqlrocks · · Score: 3, Funny

      the <blink> tag.

      I must have blinked, I didn't see it the first time.

    2. Re:Blink by ReverendLoki · · Score: 4, Funny
      Still, the only good use I ever saw for that tag was the line:

      Schrodinger's cat is <blink>not</blink> dead.

      Every other usage just caused me to browse elsewhere.

      --
      09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
    3. Re:Blink by Repton · · Score: 2, Funny

      All you need to do is blink at the right frequency and you'll never see it at all!

      --
      Repton.
      They say that only an experienced wizard can do the tengu shuffle.
  3. <title> is more popular than <br> by InsideTheAsylum · · Score: 5, Funny

    well when people talk like this and dont bother using punctuation spacekeys or any of the skills that they have been taught in school its no wonder why webpages turn out like this not to mention those long runon sentences and also all that broken code that are the fist attempt at a webpage by a twelve year old kid who tried to steal someone elses layout and replaced the word with his own then you start to look at all of those dynamically generated webpages and the layouts and the style sheets and its no wonder why the good old br tag never get a work out.

    1. Re:<title> is more popular than <br> by aussie_a · · Score: 5, Funny

      Never been scared your girlfriend was pregnant? Oh wait, this is slashdot. Nevermind.

    2. Re:<title> is more popular than <br> by Anonymous Coward · · Score: 5, Funny

      Women and Compilers... miss a period and they go wild.

  4. Finally... by RandoX · · Score: 5, Funny

    An un-slashdottable server.

  5. BR tag? by p0 · · Score: 5, Insightful

    With css power you really do not need to use br, maybe that is the reason for the small stats for the tag's use?

    --
    This is my sig. There are thousands more, but this one is mine.
    1. Re:BR tag? by masklinn · · Score: 3, Interesting

      Small stat? Are you joking?

      This is about the number of sites that use the tag, not the number of tags out in the wild, and <br> is used on more pages than <table>; there are as many pages with at least one <br> as pages with at least one <img> tag.

      That's freaking huge, for a tag that should almost never be used.

      --
      "The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
    2. Re:BR tag? by crumley · · Score: 2, Funny
      But don't we all use br's when we quote other people on slashdot?
      No.
      --
      Preventive War is like committing suicide for fear of death. - Otto Von Bismarck
    3. Re:BR tag? by Bogtha · · Score: 2, Insightful

      The <br> element type is kept around for a few minority uses. Things like poetry, code listings, etc, where dividing something up into lines is necessary. These things are rare, which is why masklinn said "should almost never be used" and not "should never be used".

      What SHOULD never happen, I think, is for BR to be treated as a substitute for proper block-level delineation.

      Yes, and if you take into account the idea that most pages that use the <br> element type do so in precisely this way, you'll end up agreeing with masklinn and myself.

      --
      Bogtha Bogtha Bogtha
    4. Re:BR tag? by Metasquares · · Score: 2, Insightful

      Because I don't know if the user wants to enter a paragraph. What the user entered is a line break (that's what hitting return does), thus br is the tag to use. If the user wants to enter in a paragraph, he can enter his own p tag or skip a line (which is the default p tag behavior anyway) and the p tag will be used.

      My site is XHTML, so the closing tag is required (not that that's stopping me).

    5. Re:BR tag? by Domo-Sun · · Score: 2, Interesting

      How do I accomplish that without using a br tag after each letter?

      I'd guess surrounding the name in a div and then with CSS making that div width equal to 1em and positioning it on the left of your avatar picture.


      Oh, and good luck getting it to work in all browsers. Gee whiz! What is the logic behind this? You have to wrap everything in DIVs and spans, then write a bunch of ridiculous code, and for what reason? So we can hold to an irrationally strict, unintuitive standard.

      This is the opposite of accessibility. It's simply a waste of time for the author...

      Though, now that I think of it, this is not the best example of BR use, since screen readers would spell everyone's name out. Oh, so I'd have to say go with the 1em CSS box, maybe try a monospace font.

      And please, please try to use an existing box and try to avoid using DIV and SPAN, if you can.

      Oh, wouldn't the text just flow out, or under the box? I think you can't do this?

      I simply think some of the extreme concepts about how we should deprecate everything are failing to view the logic behind future potential uses, and forget how long it takes to actually get any tags to work universally, so let's just keep the tags we have, thank you. The same thing happened with italic, they said it should be deprecated and never used, then they came up with a few examples where it's needed.

      If you people want to crusade against something, maybe go after the people who use DIV class=heading. That annoys me when I try to make my user stylesheets. Oh, and since we're on the topic, slashdot is using .block for the slashboxes, and it's on that list of most used classes, so please put a unique ID in the body element, like #slashdot-front #slashdot-games. Something I could use to put it into context: #slashdot .block{display:none;}. If I block .block, then my userCSS messes up the rest of the web. I worked around this by blocking each box individually.

  6. Not complete by Anonymous Coward · · Score: 5, Funny

    It didn't have everything of course. Some elements were censored on behalf of the Chinese government.

  7. what's the point of a 1 billion page sample? by ecklesweb · · Score: 3, Interesting

    I have to ask, what's the purpose of a 1-BILLION page sample? That's the beautiful thing about statistics. If you can say something about the distribution of characteristics within a population, you don't have to survey the entire population to get meaningful results. Are the study authors proposing that no standard distribution can be applied to the entire universe of web pages? If that's the case, then do the statistics they apply to their sample of one billion really say anything predictive about the entire population?

    Aside from the cool factor of saying they sampled a billion pages, I don't see what extra benefits are gained from that extra effort.

    1. Re:what's the point of a 1 billion page sample? by Anonymous Coward · · Score: 5, Informative

      You get a decrease of the variance of the mean.

    2. Re:what's the point of a 1 billion page sample? by Durinthal · · Score: 5, Insightful

      If you can have a larger sample, why not use it? It's more accurate that way.

    3. Re:what's the point of a 1 billion page sample? by shoolz · · Score: 2, Informative

      Because with statistics, increasing the sample size does not result in a uniform increase in accuracy.

      If you start with a sample size of 1000 and add an additional 10000, the accuracy will increase dramatically. But if you start with 1,000,000,000, and increase it by another 1,000,000,000, the accuracy won't go up even by as much as 0.0001%

      Yes, I'm pulling the numbers out of the air, but the point is that there exists a sweet spot where the additional effort does not pay off.
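The diminishing returns described in this subthread can be put into rough numbers. The following is a minimal sketch, assuming simple random sampling of independent pages (an assumption that templated sites may well violate) and using the textbook standard error of a proportion, sqrt(p(1-p)/n). The function name and the sample sizes are illustrative and not taken from the study.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# p = 0.5 is the worst case, because p * (1 - p) is largest there.
p = 0.5
for n in (1_000, 11_000, 1_000_000_000, 2_000_000_000):
    print(f"n={n:>13,}  standard error={standard_error(p, n):.2e}")
```

Because the error shrinks with the square root of n, quadrupling the sample only halves it. Going from 1,000 to 11,000 pages removes most of the error in absolute terms, while doubling a billion changes it by only a few millionths.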

    4. Re:what's the point of a 1 billion page sample? by finelinebob · · Score: 2, Interesting

      A couple of people have pointed out that the larger the sample size, the less chance there is to attribute a meaningful difference to a situation that is actually a random fluctuation. That may be true, but I believe the point the parent is trying to make is that one of the key advantages of statistical modeling is that you can accurately model very large groups by studying very small samples of that group. If there was actually a need for this large a sample, then fine. Otherwise, the sample size is more sensational than informational.

      For example, many medical studies rely on samples of a couple thousand people. If that number is supposed to represent US citizens, then that sample size is roughly 0.001% of the population.

      To answer whether 1 billion cases is overkill or not, it would be helpful to know the size of their entire database -- how many individual web pages have they catalogued? How big was the sample size relative to the population? Another issue that might have influenced choosing such a large sample is the number of pages generated dynamically, using standardized templates. If Google has catalogued a corporate website that has several thousand pages all following the same template, do those pages act as unique, individual entries that should be given the same weight page-by-page as a site that has only 10 pages? How might the entire depository of, say, eBay.com or even Slashdot.org skew results? The large sample size may have been required to render such "cell" sizes irrelevant.

      Of course, seeing some numbers from their study would have been nice. If they reported p values of 0.00000001 then it would have been easy to say this was a case of overkill.

  8. \. shows up in the Web Authoring Statistics by digitaldc · · Score: 4, Funny

    The 'br' element

    The br element is a simple one, yet used on so many pages that it is the 8th most-used element. It is used more than the p element.

    clear, style, class, soft, id, and \.


    Wow! I never knew you guys were that popular.

    --
    He who knows best knows how little he knows. - Thomas Jefferson
    1. Re:\. shows up in the Web Authoring Statistics by shrikel · · Score: 5, Funny
      You're confused. Backslashdot is across the street.

      (sheesh)

      --
      Any sufficiently simple magic can be passed off as mere advanced technology.
  9. Best bash I've seen in a long time: by Benanov · · Score: 4, Funny
    From TFA, the classes page:

    The rest of the top 20 classes are either presentational or otherwise meaningless (msonormal, for example, which is one of the classes that Microsoft Office uses in its "HTML" output).
  10. Some of these results... by Dracos · · Score: 3, Insightful

    Prove that most people (and WYSIWYGs) don't know how to produce valid and accessible markup. The img alt attribute (an accessibility requirement) was found significantly less often than width, height, and border.

    I'm working on a site now where the project owner is continually reducing usability and accessibility of the entire site (Never mind that he secretly had a third party come up with an ugly design and ambushed the dev team with it).

    I keep telling everyone to deconstruct the adage "form follows function". It means function comes first. He doesn't care what anything *is* or how it *works*, only what it looks like. And, of course, that it's ugly.

  11. Re:No GOTOs? by the+computer+guy+nex · · Score: 4, Funny

    How about:

    IF(Post=Old_And_Tired) GOTO Mod_Down

  12. Ad for anti-IE by jamienk · · Score: 4, Insightful

    It looks like a subtle push against IE: many mentions of the HTML 5 spec (which is being written by the WHATWG, a workgroup that includes many browser companies but not MS); use of SVG; written by a major FF developer.

    Way to go Google! Pour on the pressure!

    1. Re:Ad for anti-IE by Bogtha · · Score: 3, Informative

      written by a major FF developer

      I don't believe Ian Hickson has been involved with Firefox; if I remember correctly, he used to hack on Mozilla, but then started work at Opera before Firefox took off.

      I don't think it's a jab at Internet Explorer, it's just that he knows that the target audience is likely to have a decent browser, so he's used the features likely to be available.

      --
      Bogtha Bogtha Bogtha
  13. Bah... by Run4yourlives · · Score: 2, Interesting

    Again, properly formatted this time:

    For example, looking at what HTML ids and classes are most common, and at how many sites validate (and yes, we know that we're not leading the way in terms of validation).

    There are more <o:p> elements (from Microsoft Office) on the Web than there are <h6> elements.

    If someone can explain why so many pages would use a <table> tag and then not put any cells in it, please let us know.

    Web "professionals" (and I am one of that group) have got a long, long, long way to go before we're actually taken seriously, it seems, as coders.

  14. Opera also supports SVG by TheJavaGuy · · Score: 4, Informative

    FYI, Opera also supports SVG. I'm surprised that Ian Hickson didn't have Opera also mentioned on that Google page, after all he worked at Opera until a few months ago.

    --
    Opera Watch - An Opera browser blog.
  15. Heh by Z0mb1eman · · Score: 3, Interesting
    This reminds me of the old joke that there only ever was one 'make' script, and everyone else modified it.

    I wonder how much of what they found is influenced by how people learned to write HTML - which in all likelihood was to copy code from existing pages... might explain parts of what they found, such as:

    Most people (roughly 98%) include head, html, title and body elements. This is somewhat ironic, since three of those four elements are optional in HTML
    --
    ClutterMe.com - easiest site creation on the Net. Just click and type.
    1. Re:Heh by Blink+Tag · · Score: 2, Informative
      Most people (roughly 98%) include head, html, title and body elements. This is somewhat ironic, since three of those four elements are optional in HTML

      Somewhat true. The HEAD tag is technically optional (per spec), but TITLE is required, and must be in the HEAD. Thus HEAD is required in practice.

      From the HTML 4.01 spec:

      Every HTML document must have a TITLE element in the HEAD section.

      Though marked as "start tag optional"/"end tag optional", the BODY and HTML tags do provide useful semantic relevance.

  16. Font still popular by superflippy · · Score: 2, Interesting

    In their list of the 19 most popular elements, the font tag was #16. This element was deprecated when, back in 2000 or so?

    Of course, there may have been a lot of old pages in the sample, or pages built with older versions of HTML. But I've seen first-hand people using font tags to make an error message red, for example, even in a page that's using XHTML 1.0. I try to explain to the developers I work with why they shouldn't use them. I remove the font tags when those same developers add them to pages I've laid out for them. Zombie-like, they refuse to die.

    --
    Your fantasies contain the seeds of important concepts.
    1. Re:Font still popular by WWWWolf · · Score: 2, Insightful
      You know, there's something to be said for the straightforwardness of the "Font. Color. Red. Do it." approach.

      I don't know. I rather prefer the straightforwardness of "This is a title. You know how to format it." approach.

      With FONT tags, you need to specify the font and color on a single passage of text. Then on another. And then another. And then another. And for the good measure, just another. And by the way, one more. And that one too. And that one there, even when you just described that other one back there to have the exact same font and color. Oh, and that one too. And almost forgot that one there.

      After Netscape & IE 4 died, CSS just works.

  17. table with no by saigon_from_europe · · Score: 4, Informative
    From the article:
    If someone can explain why so many pages would use a <table> tag and then not put any cells in it, please let us know.
    I don't know if they counted dynamic pages, but I guess they did. In dynamic pages, an empty table is quite normal.

    Your code usually goes like this:
    <table>
    <% for each element in collection %>
    <tr><td> something </td></tr>
    <% end for %>
    </table>

    So it is quite easy to get the empty table if the collection is empty.
    --
    No sig today.
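The template loop in the comment above can be mimicked outside a template engine. This is a minimal sketch, assuming a plain list as the collection; `render_table` and `render_table_or_nothing` are hypothetical helper names, not something from the report or from any particular framework.

```python
from html import escape


def render_table(rows):
    """Mimic the template loop: one <tr> per element, no rows for an empty collection."""
    lines = ["<table>"]
    for value in rows:
        lines.append(f"<tr><td>{escape(str(value))}</td></tr>")
    lines.append("</table>")
    return "\n".join(lines)


def render_table_or_nothing(rows):
    """Alternative: skip the table entirely when there is nothing to show."""
    rows = list(rows)
    return render_table(rows) if rows else ""


print(render_table([]))            # a <table> with no cells in it
print(render_table(["a", "b"]))    # one row per element
```

With an empty collection the first function still emits the opening and closing tags, which is exactly the cell-less table the report asks about. The second function shows one way a page author could avoid it.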
  18. The reason not to do this by winkydink · · Score: 4, Informative

    Capitalization makes all the difference in the sentence:

    i helped my uncle jack off a horse

    --

    "I'd rather be a lightning rod than a seismometer." -Ken Kesey

  19. Re:Not so fast - I'm pulling up mostly blank pages by stunt_penguin · · Score: 2, Informative

    Try using an SVG-compatible browser. SVG graphics *tend to* work better that way.

    --
    When the posters fear their moderators, there is tyranny; when the moderators fears the posters, there is liberty.
  20. Re:Benford's Law by EvanED · · Score: 3, Interesting

    I had an interesting run-in with Benford's law a bit ago. I had this typed up already, so here goes (description of the law omitted; read the Wikipedia link in the parent -- it's really cool):

    You see, my hard drive crashed about two weeks ago. It had three partitions on it, and two of them are still perfectly readable. The third is pretty well shot. (Fortunately, it was the most useless partition; its main content was Windows itself. This does mean ANOTHER Windows installation -- after having to do one a few weeks before -- but really that's no biggie compared with my actual data. And while I'm on that subject, I had two hard drives; when I got the newer one, I put all my work stuff on it as well as a new Linux installation specifically because it was less likely to fail, and I look back at that decision now with great happiness, because it is that foresight that has made this no big deal at all.)

    I've been trying to recover data off of the third partition, and it seems that if you do a full scan of the partition it appears as if the data was just deleted. Most of the time it's able to recover information, but not always: folder names are often lost. They show up in the recovery programs I tried as just Folder2393 for example. (Numbers ranged from 2 to 5 digits.)

    The folder numbers approximately follow Benford's law.

    Here is the approximate distribution:

    Most significant digit   % of folders   Ideal Benford %
    1                        32             30.1
    2                        15             17.6
    3                        12             12.5
    4                        12              9.7
    5                        19              7.9
    6                         3              6.7
    7                         3              5.8
    8                         2              5.1
    9                         2              4.6
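The "Ideal Benford %" column can be reproduced from the usual statement of Benford's law, P(d) = log10(1 + 1/d) for a leading digit d. The following is a small sketch that prints it next to the observed folder percentages from the comment above. The `observed` dictionary simply restates those numbers and `benford_percent` is an illustrative name.

```python
import math


def benford_percent(digit: int) -> float:
    """Expected share (in percent) of values whose leading digit is `digit`."""
    return 100.0 * math.log10(1.0 + 1.0 / digit)


# Observed folder percentages as reported in the comment above.
observed = {1: 32, 2: 15, 3: 12, 4: 12, 5: 19, 6: 3, 7: 3, 8: 2, 9: 2}

for d in range(1, 10):
    print(f"{d}  observed={observed[d]:>2}%  benford={benford_percent(d):.1f}%")
```

The nine expected shares sum to 100%, since the logarithms telescope to log10(10). Against that baseline, the reported 19% for leading digit 5 is the one figure that sits well above the ideal value.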

  21. What about plugins? by AndrewStephens · · Score: 2, Insightful

    I would be interested in seeing how many web pages use Java applets, Flash, Shockwave, Quicktime, ActiveX controls etc, etc. Sadly the authors did not include this information.

    --
    sheep.horse - does not contain information on sheep or horses.
  22. Re:Pretty crappy page authoring... by Bogtha · · Score: 2, Insightful

    Pretty crappy page authoring...not to tell a poor end user that he/she was missing a required viewer

    It's explicitly mentioned on the very first page ("Note: You will need a browser with SVG and CSS support to view the result graphs correctly. We recommend Firefox 1.5.").

    --
    Bogtha Bogtha Bogtha
  23. Re:Firefox 1.5 by bigbadbuccidaddy · · Score: 2, Interesting

    The black box is caused by them not using type="text/css" on the ?xml-stylesheet declaration. type is a required attribute. If I add that it renders properly on all the svg viewers I tried.

  24. Re:Dumb by Spad · · Score: 4, Insightful

    It's even dumber to state that someone is presenting pictures with Flash when they're actually using SVG.

  25. For folks who do not (want to) run Firefox by Ilgaz · · Score: 3, Informative

    http://www.adobe.com/svg/viewer/install/main.html has suitable plugins for your browser/OS of choice.

    Note that I have had the SVG plugin installed for ages, yet Safari didn't display the graphs. Is it because I am not using "a browser with CSS"? Well, never mind really...

    This is the kind of thing that gives me and others a negative view of Firefox, SVG and even .ogg. Rootless promotion of this kind...

  26. Wisdom by AeroIllini · · Score: 2, Interesting
    They've really hit on some wisdom here.

    There are several statistics they quoted which I have suspected for a long time, but only now can confirm with numbers.

    more than half of pages use the target attribute on the a element somewhere.


    I can't begin to describe the frustration I feel when I'm forced to use Internet Explorer and clicking links causes pages to fire up in a million new windows. Whether or not a link opens in a new window, a new tab, or the current window/tab really should be a client-side choice. Webmasters think they're being helpful by letting you separate your workspace into many windows, but they're really just slowing people down. Thank God for Firefox.

    It seems most pages use presentational attributes: the fourth most used attribute across all elements is the table element's border attribute, followed by the height and width attributes on img, followed by <table width="">, <table cellspacing="">, <img border="">, and <table cellpadding="">. Interestingly, though, the most frequently used attribute on the body element (namely bgcolor) is only used on around half of pages, with all the other presentational attributes on body being used even less. One possible explanation is that on average, colors are mostly done using CSS, while layout is mostly done using HTML tables.


    This makes perfect sense. While colors, fonts and styles are pretty much standard in a cross-browser environment, the many differing interpretations of the CSS Box Model can make coding layout purely in CSS a terrible chore. It's usually much quicker to do a few simple layouts in tables (header, sidebar, content) and use CSS for pretty much everything else.
    --
    For security, the MD5 hash of this message and sig is 09f911029d74e35bd84156c5635688c0.
  27. Re:Worst use of SVG ever by jamesots · · Score: 2, Funny

    Yeah, and what's the point of using HTML? They could have posted an image of the text to the same effect.

    --
    Ho hum for the life of a bear
  28. Re:BR tag? is used in 7 out of 8 pages by TekGoNos · · Score: 3, Informative

    The summary got it wrong: the study states that there are more pages using title than pages using br, NOT that more title tags are used than br tags.

    Approximately 98% of all pages have a title tag and approximately 7 out of 8 pages have (at least one, probably more) br tags.

    --
    I have discovered a truly remarkable proof for my post which this sig is too small to contain.
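The distinction drawn above, pages that use an element versus total occurrences of that element, is easy to see on a toy corpus. This is a minimal sketch using Python's standard `html.parser`; the two sample pages are made up, and nothing here reflects how the actual study parsed its billion pages.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Counts every start tag seen in one document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def count_usage(pages):
    """Return (pages_using, occurrences) counters over an iterable of HTML strings."""
    pages_using = Counter()
    occurrences = Counter()
    for markup in pages:
        collector = TagCollector()
        collector.feed(markup)
        collector.close()
        occurrences.update(collector.tags)         # every occurrence counts
        pages_using.update(collector.tags.keys())  # each page counts once per tag
    return pages_using, occurrences


sample_pages = [
    "<html><head><title>A</title></head><body>x<br>y<br>z</body></html>",
    "<title>B</title><p>no line breaks here</p>",
]
pages_using, occurrences = count_usage(sample_pages)
print("pages using:", pages_using["title"], "title,", pages_using["br"], "br")
print("occurrences:", occurrences["title"], "title,", occurrences["br"], "br")
```

In this toy corpus title appears on more pages than br, even though both elements occur the same number of times overall, which is the kind of gap the summary glossed over.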
  29. Re:Pretty crappy page authoring... by masklinn · · Score: 2, Insightful

    Gecko fascism indeed, I mean what a bunch of bastards, using completely valid SVG files, oooh the nerve of them blokes...

    --
    "The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
  30. Set-Cookie2 insecure? by tedhiltonhead · · Score: 2, Interesting
    The linked site claims the Set-Cookie header is "considered insecure":
    The Set-Cookie header (which is one of the ten most-used headers) is present on about two orders of magnitude more pages than the Set-Cookie2 header (despite the former being considered insecure).
    After glancing over the RFC for Set-Cookie2, I can't see where it says Set-Cookie is "insecure". Google turns up nothing useful. Does anybody know more about this?
    1. Re:Set-Cookie2 insecure? by hixie · · Score: 2, Informative

      Yeah, I misspoke on this. Set-Cookie is insecure (due to domain-crossing problems -- should a cookie sent to a.b.c get sent to z.b.c? Depends on "b" and "c" in ways that depend on month-to-month political changes around the globe), but as far as I can tell, Set-Cookie2 is also insecure. I had thought it fixed this, but apparently not.
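The domain-crossing question above (should a cookie for a.b.c reach z.b.c?) comes down to knowing which suffixes are open to public registration. Here is a rough sketch of that decision, assuming a tiny hand-written suffix set. `PUBLIC_SUFFIXES`, `registrable_domain` and `may_share_cookie` are hypothetical names for illustration, and this is not how any particular browser implements Set-Cookie or Set-Cookie2.

```python
# Hypothetical, deliberately incomplete suffix set. A real list would have to
# track registry policy changes around the world.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk"}


def registrable_domain(host, public_suffixes=PUBLIC_SUFFIXES):
    """Return the longest known public suffix plus one more label, or None."""
    labels = host.lower().rstrip(".").split(".")
    for i in range(len(labels)):          # i = 0 tries the longest suffix first
        if ".".join(labels[i:]) in public_suffixes:
            if i == 0:
                return None               # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None                           # unknown suffix: refuse to guess


def may_share_cookie(host_a, host_b, public_suffixes=PUBLIC_SUFFIXES):
    """True only if both hosts fall under the same registrable domain."""
    domain_a = registrable_domain(host_a, public_suffixes)
    domain_b = registrable_domain(host_b, public_suffixes)
    return domain_a is not None and domain_a == domain_b


print(may_share_cookie("a.example.com", "z.example.com"))  # same site
print(may_share_cookie("a.co.uk", "z.co.uk"))              # different registrants
```

Both pairs have the same a.b.c and z.b.c shape, yet the answer differs, and it differs only because of what the suffix set says about "b" and "c".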

  31. Re:Poor style by Google by pbhj · · Score: 2, Insightful

    >>> "Can anyone tell me what's here that can't be visualized with GIF's?"

    I don't think that's the point ... it's about the creation of the images, not their visualisation. These images can be created on the fly from varying data with only textual manipulation of the code - the processing will be extremely light as will the data load on the servers. Presumably the xml-to-image parsing in the browser incurs a processing penalty though.

    If you view the code of one of the graphs http://code.google.com/webstats/2005-12/charts/unique-classes-per-page.svg you'll see that it is less than 10k. It also has a theoretically infinite resolution, which might be useful if the graphs are to be used for a presentation (like printing them on the moon using lasers!!?).

    Use of FF isn't too surprising as the section code.google.com is for promotion of OSS.

    It looks to be an internal project that we have just happened to be given access to ... assuming the officers of Google that need access have FF1.5 then the web devs have probably met their brief?!

  32. Fix for Firefox 1.5 by bigbadbuccidaddy · · Score: 3, Informative

    If your Firefox 1.5 doesn't display the graphs, or crashes, do the following as suggested by the Google webstats author:

    Apparently there's a problem in Firefox 1.5 regarding SVG images if you
    had SVG in the registry. Try following the steps described here:

          https://bugzilla.mozilla.org/show_bug.cgi?id=303581#c3

  33. Re:Not so fast - I'm pulling up mostly blank pages by hixie · · Score: 2, Informative

    It has nothing to do with "cool"; SVG happens to be easier for us to produce than bitmaps, and anyone who is going to be able to read this report and view graphics will be using an SVG-capable browser. The fact that it found bugs in every SVG browser out there is merely a bonus, it means that SVG support will get better.

    We used standards. It's not our fault if there was only one released browser that supported those standards well enough for you to be able to see the graphics.

  34. I'm feeling violated by Sontas · · Score: 2, Insightful

    1 billion pages! Talk about a violation of privacy! The justice department is only asking for a random sample of 1 million addresses and the search results for any 1 week period. This guy gets access to 1 billion pages via the google repository (whatever that is), conducts detailed analysis of the contents of those pages, and nary a word of dissent from the vast Slashdot audience.