A Statistical Review of 1 Billion Web Pages
chrisd writes "As part of a recent examination of the most popular HTML authoring techniques, my colleague Ian Hickson parsed through a billion web pages from the Google repository to find out what the most popular class names, elements, attributes, and related metadata are. We decided that publishing this would be of significant utility to developers. It's also a fascinating look into how people create web pages. For instance, one thing that surprised me was that <title> is more popular than <br>. The graphs in the report require a browser with SVG and CSS support (like Firefox 1.5!). Enjoy!"
I have to ask, what's the purpose of a 1-BILLION page sample? That's the beautiful thing about statistics. If you can say something about the distribution of characteristics within a population, you don't have to survey the entire population to get meaningful results. Are the study authors proposing that no standard distribution can be applied to the entire universe of web pages? If that's the case, then do the statistics they apply to their sample of one billion really say anything predictive about the entire population?
Aside from the cool factor of saying they sampled a billion pages, I don't see what extra benefits are gained from that extra effort.
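To put rough numbers on the sampling argument above, here is a minimal sketch of the standard normal-approximation margin of error for a proportion. It assumes a simple random sample of pages and the worst case p = 0.5; neither assumption comes from the report, and the function name is made up for illustration.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the ~95% normal-approximation interval for a proportion.

    Assumes a simple random sample of n pages (an assumption, not something
    the report claims about its own sample).
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# Worst case p = 0.5, for three hypothetical sample sizes.
for n in (1_000, 1_000_000, 1_000_000_000):
    moe = margin_of_error(0.5, n)
    print(f"n = {n:>13,}: +/- {moe * 100:.4f} percentage points")
```

This is an absolute margin. For a feature that shows up on only a tiny fraction of pages, a bigger sample is what keeps the relative error small, which may be one practical reason to go large.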
Small stat? Are you joking?
This is about the number of sites that use the tag, not the number of tags out in the wild. <br> is used on more pages than <table>, and there are as many pages with at least one <br> as pages with at least one <img> tag.
That's freaking huge, for a tag that should almost never be used.
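The difference between "pages that use a tag" and "tags out in the wild" can be sketched like this. It is only an illustration using Python's standard html.parser on two made-up pages, not the method the report used; the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records the name of every start tag seen in one page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_stats(pages):
    """Return (total occurrences per tag, number of pages containing each tag)."""
    total_tags = Counter()
    pages_with_tag = Counter()
    for html in pages:
        collector = TagCollector()
        collector.feed(html)
        collector.close()
        total_tags.update(collector.tags)           # every occurrence counts
        pages_with_tag.update(set(collector.tags))  # at most once per page
    return total_tags, pages_with_tag

# Two made-up pages: one leans heavily on <br>, the other has none.
pages = [
    "<html><head><title>a</title></head><body>x<br>y<br>z<br></body></html>",
    "<html><head><title>b</title></head><body><p>no breaks</p></body></html>",
]
total, per_page = tag_stats(pages)
print(total["br"], per_page["br"])        # 3 1
print(total["title"], per_page["title"])  # 2 2
```

Counted per occurrence, <br> beats <title> here; counted per page, <title> wins. The per-page count is the kind of figure being discussed.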
"The way we can tell it's C# instead of Haskell is because it's nine lines instead of two." -- wadler
Again, properly formatted this time:
For example, looking at what HTML ids and classes are most common, and at how many sites validate (and yes, we know that we're not leading the way in terms of validation).
There are more <o:p> elements (from Microsoft Office) on the Web than there are <h6> elements.
If someone can explain why so many pages would use a <table> tag and then not put any cells in it, please let us know.
Web "professionals" (and I am one of that group) have got a long, long, long way to go before we're actually taken seriously, it seems, as coders.
Most of the work on this, in the case of the WWW, concerns the frequency with which pages are hyperlinked. A lot of work has been done on hyperlinking without access to the exhaustive database used by Google. I know that Google's business model started with rank-ordering pages in their results by how often they were href'ed elsewhere, so the data is obviously there, and it wouldn't be a serious imposition on their proprietary information to publish an analysis of the href power law.
Seastead this.
I wonder how much of what they found is influenced by how people learned to write HTML - which in all likelihood was to copy code from existing pages... might explain parts of what they found, such as:
ClutterMe.com - easiest site creation on the Net. Just click and type.
In their list of the 19 most popular elements, the font tag was #16. This element was deprecated when, back in 2000 or so?
Of course, there may have been a lot of old pages in the sample, or pages built with older versions of HTML. But I've seen first-hand people using font tags to make an error message red, for example, even in a page that's using XHTML 1.0. I try to explain to the developers I work with why they shouldn't use them. I remove the font tags when those same developers add them to pages I've laid out for them. Zombie-like, they refuse to die.
Your fantasies contain the seeds of important concepts.
I had an interesting run-in with Benford's law a bit ago. I had this typed up already, so here goes (description of the law omitted; read the Wikipedia link in the parent -- it's really cool):
You see, my hard drive crashed about two weeks ago. It had three partitions on it, and two of them are still perfectly readable. The third is pretty well shot. (Fortunately, it was the most useless partition; its main content was Windows itself. This does mean ANOTHER Windows installation -- after having to do one a few weeks before -- but really that's no biggie compared with my actual data. And while I'm on that subject, I had two hard drives; when I got the newer one, I put all my work stuff on it as well as a new Linux installation specifically because it was less likely to fail, and I look back at that decision now with great happiness, because it is that foresight that has made this no big deal at all.)
I've been trying to recover data off of the third partition, and it seems that if you do a full scan of the partition it appears as if the data was just deleted. Most of the time it's able to recover information, but not always: folder names are often lost. They show up in the recovery programs I tried as just Folder2393 for example. (Numbers ranged from 2 to 5 digits.)
The folder numbers approximately follow Benford's law.
Here is the approximate distribution:
M.S. digit   % of folders   Ideal Benford %
1            32             30.1
2            15             17.6
3            12             12.5
4            12              9.7
5            19              7.9
6             3              6.7
7             3              5.8
8             2              5.1
9             2              4.6
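For anyone who wants to check the "ideal" column: Benford's law gives the probability of leading digit d as log10(1 + 1/d). A minimal sketch that recomputes it and compares it with the folder percentages listed above (the observed numbers are copied from the table; the function name is just for illustration):

```python
import math

# Observed share of folders (%) by most significant digit, copied from the table.
observed = {1: 32, 2: 15, 3: 12, 4: 12, 5: 19, 6: 3, 7: 3, 8: 2, 9: 2}

def benford_percent(d: int) -> float:
    """Expected share (%) of values whose leading digit is d under Benford's law."""
    return 100.0 * math.log10(1.0 + 1.0 / d)

for d in range(1, 10):
    expected = benford_percent(d)
    diff = observed[d] - expected
    print(f"{d}  observed {observed[d]:>2}%  expected {expected:4.1f}%  diff {diff:+5.1f}")
```

The biggest outlier is digit 5, at 19% observed against 7.9% expected; the rest stay within a few points of the ideal curve.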
The black box is caused by them not using type="text/css" on the <?xml-stylesheet?> processing instruction. type is a required pseudo-attribute. If I add it, the page renders properly in all the SVG viewers I tried.
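A quick way to spot this kind of omission is to look at the xml-stylesheet processing instructions in a file and flag the ones without a type. The sketch below uses Python's standard xml.dom.minidom; the sample documents and the graph.css file name are made up, and the regex check is deliberately crude rather than a full pseudo-attribute parser.

```python
import re
from xml.dom import minidom

def stylesheet_pis_missing_type(xml_text: str) -> list[str]:
    """Return the data of each top-level xml-stylesheet PI that has no type=..."""
    doc = minidom.parseString(xml_text)
    missing = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # Pseudo-attributes live in the PI's data string,
            # e.g. href="graph.css" type="text/css"
            if not re.search(r'(?:^|\s)type\s*=', node.data):
                missing.append(node.data)
    return missing

# Hypothetical documents: the first omits type, the second includes it.
bad = '<?xml-stylesheet href="graph.css"?><svg xmlns="http://www.w3.org/2000/svg"/>'
good = ('<?xml-stylesheet href="graph.css" type="text/css"?>'
        '<svg xmlns="http://www.w3.org/2000/svg"/>')

print(stylesheet_pis_missing_type(bad))
print(stylesheet_pis_missing_type(good))
```

Only the first document should be flagged; adding type="text/css" is the whole fix.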
There are several statistics they quoted which I have suspected for a long time, but only now can confirm with numbers.
I can't begin to describe the frustration I feel when I'm forced to use Internet Explorer and clicking links causes pages to fire up in a million new windows. Whether or not a link opens in a new window, a new tab, or the current window/tab really should be a client-side choice. Webmasters think they're being helpful by letting you separate your workspace into many windows, but they're really just slowing people down. Thank God for Firefox.
This makes perfect sense. While colors, fonts and styles are pretty much standard across browsers, the many differing interpretations of the CSS Box Model can make coding layout purely in CSS a terrible chore. It's usually much quicker to do a few simple layouts in tables (header, sidebar, content) and use CSS for pretty much everything else.
For security, the MD5 hash of this message and sig is 09f911029d74e35bd84156c5635688c0.
How do I accomplish that without using a br tag after each letter?
.block for the slashboxes, and it's on that list of most used classes, so please put a unique ID in the body element, like #slashdot-front #slashdot-games. Something I could use to put it into context: #slashdot .block{display:none;}. If I block .block, then my userCSS messes up the rest of the web. I worked around this by blocking each box individually.
I'd guess surrounding the name in a div and then with CSS making that div width equal to 1em and positioning it on the left of your avatar picture.
Oh, and good luck getting it to work in all browsers. Gee whiz! What is the logic behind this? You have to wrap everything in DIVs and spans, then write a bunch of ridiculous code, and for what reason? So we can hold ourselves to an irrationally strict, unintuitive standard.
This is the opposite of accessibility. It's simply a waste of time for the author...
Though, now that I think of it, this is not the best example of BR use, since screen readers would spell everyone's name out. Oh, so I'd have to say go with the 1em CSS box, maybe try a monospace font.
And please, please try to use an existing box and try to avoid using DIV and SPAN, if you can.
Oh, wouldn't the text just flow out, or under the box? I think you can't do this?
I simply think some of the extreme ideas about deprecating everything fail to consider potential future uses, and forget how long it takes to actually get any tag to work universally, so let's just keep the tags we have, thank you. The same thing happened with italic: they said it should be deprecated and never used, then they came up with a few examples where it's needed.
If you people want to crusade against something, maybe go after the people who use DIV class=heading. That annoys me when I try to make my user stylesheets. Oh, and since we're on the topic, slashdot is using