How Much Bandwidth is Required to Aggregate Blogs?
Kevin Burton writes "Technorati recently published that they're seeing 900k new posts per day. PubSub says they're seeing 1.8M. With all these posts per day how much raw bandwidth is required? Due to inefficiencies in RSS aggregation protocols a little math is required to understand this problem." And more importantly, with millions of posts, what percentage of them have any real value, and how do busy people find that .001%?
It would make a lot more sense to have a protocol where you check one file that has a list of links to another XML file, and then the aggregator figures out which of those URLs has NOT been aggregated, then it downloads the other XML file which has the post-specific info, which it proceeds to display. That would save a lot of bandwidth, I'm sure.
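A rough sketch of that index-then-fetch idea, in Python. The index format (a plain list of per-post XML URLs), the example URLs, and the function name are all assumptions made up for illustration; no existing feed format is being described here.

```python
# Hypothetical sketch: the aggregator downloads one small index file listing
# per-post XML URLs, then fetches only the URLs it has not aggregated yet.
# The index format and every name here are assumptions, not a real protocol.

def posts_to_fetch(index_urls, already_seen):
    """Return the URLs from the index that have not been aggregated, in index order."""
    seen = set(already_seen)
    return [url for url in index_urls if url not in seen]


if __name__ == "__main__":
    index = [
        "http://example.org/posts/3.xml",
        "http://example.org/posts/2.xml",
        "http://example.org/posts/1.xml",
    ]
    seen = {"http://example.org/posts/1.xml", "http://example.org/posts/2.xml"}
    # Only post 3 is new, so only one per-post file needs to be downloaded.
    print(posts_to_fetch(index, seen))
```

The saving would come from the index being much smaller than a feed that repeats the full text of every recent post.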
How much bandwidth is required? A lot less if everyone would take the 5 minutes required to implement GZip compression on their Apache servers. It saves you bandwidth, it speeds up your site for users (especially those on dialup), and it saves aggregators bandwidth too (assuming they send an Accept-Encoding header listing gzip, deflate).
So my plea to the internet community today: make sure your web server is configured to send gzipped content. TFA says he doesn't know how many RSS feeds can support gzip. The answer is easy really: any feed being served by Apache (plus a LOT of other webservers; AOLserver even added gzip support recently). Here's how to set up Apache and here's where to check if your site is using GZip and get an idea of the bandwidth savings you should see. If your site isn't gzipping, show your admin (if it's someone else) the 'how-to' above and ask them to implement it -- it's an absolute no-brainer win-win for everyone that takes no time at all to set up really. It's really absurd IMO that it's not enabled in Apache by default.
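To get a feel for the kind of savings being talked about, here is a small Python sketch that gzips a made-up RSS feed in memory and reports the fraction of bytes saved. The sample feed is synthetic and very repetitive, so it will compress better than most real feeds; treat the number as an illustration, not a measurement.

```python
import gzip


def gzip_savings(payload: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing the payload."""
    if not payload:
        return 0.0
    compressed = gzip.compress(payload)
    return 1.0 - len(compressed) / len(payload)


def make_sample_feed(items: int = 15) -> bytes:
    """Build a synthetic RSS-like document (an assumption, not a real feed)."""
    entries = "".join(
        f"<item><title>Post {i}</title>"
        f"<link>http://example.org/posts/{i}</link>"
        f"<description>{'Some text about my day. ' * 40}</description></item>"
        for i in range(items)
    )
    return f'<rss version="2.0"><channel>{entries}</channel></rss>'.encode("utf-8")


if __name__ == "__main__":
    feed = make_sample_feed()
    print(f"raw size: {len(feed)} bytes, saved by gzip: {gzip_savings(feed):.0%}")
```

A client only benefits if it sends `Accept-Encoding: gzip` and the server is configured to honour it.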
How much bandwidth is /. wasting every month by not creating a standard xhtml page, even though someone created one for them already?
"And more importantly, with 9M posts, what percentage of them have any real value, and how do busy people find that .001%?"
On slashdot.... Oh wait....
- http://www.milkme.co.uk
9M*0.001 = 9000...
Visit /.
order of magnitude out there, fella... better try again with this new fangled "math" stuff
I used to have a blog that I recently shut down because no one read it.
No one read it, but I got a ton of hits -- all from indexing services. WordPress pings a service that lets lots of indexing systems know about new posts. Some of them -- Yahoo, for example -- were constantly going through my entire tree of posts, and hitting links for months, subjects, and so on.
It didn't bother me, because the bandwidth wasn't an issue, and it wasn't like they were hammering my vps or anything. It mostly just made it really hard to read the logs, because finding human readers was like looking for a needle in a haystack.
But bandwidth is cheap, and RSS is really useful, so it seems at least as good of a use for the resource as p2p movie exchanges.
Rather than making all these assumptions why not just email Bob Wyman and ask him?
"How much data is this? If we assume that the average HTML post is 150K this will work out to about 135G. Now assuming we're going to average this out over a 24 hour period (which probably isn't realistic) this works out to about 12.5 Mbps sustained bandwidth.
Of course we should assume that about 1/3 of this is going to be coming from servers running gzip content compression. I have no stats WRT the number of deployed feeds which can support gzip (anyone have a clue?). My thinking is that this reduces us down to about 9Mbps which is a bit better.
This of course assumes that you're not fetching the RSS and just fetching the HTML. The RSS protocol is much more bloated in this regard. If you have to fetch 1 article from an RSS feed you're forced to fetch the remaining 14 additional posts that were in the past (assuming you're not using the A-IM encoding method which is even rarer). This floating window can really hurt your traffic. The upside is that you have to fetch less HTML.
Now let's assume you're only fetching pinged blogs and you don't have to poll (polling itself has a network overhead). The average blog post would probably be around 20k I assume. If we assume the average feed has 15 items, only publishes one story, and has a 10% overhead we're talking about 330k per fetch of an individual post.
If we go back to the 900k posts per day figure we're talking a lot of data - 297G most of which is wasted. Assuming gzip compression this works out to 27.5Mbps.
That's a lot of data and a lot of bloat which is unnecessary. This is a difficult choice for smaller aggregator developers as this much data costs a lot of money. The choice comes down to cheap HTML indexing with the inaccuracy that comes from HTML or accurate RSS which costs 2.2x more.
Update: Bob Wyman commented that he's seeing 2k average post size with 1.8M posts per day. If we are to use the same metrics as above this is 54G per day or around 5Mbps sustained bandwidth for RSS items (assuming A-IM differentials aren't used)."
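The arithmetic in the quoted post can be checked with a few lines of Python. This is only a sketch of the calculation: it assumes decimal units (1k = 1,000 bytes, 1G = 10^9 bytes) and traffic spread evenly over 24 hours, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_mbps(posts_per_day, bytes_per_fetch):
    """Average bandwidth in Mbps if the day's traffic is spread evenly over 24 hours."""
    total_bits = posts_per_day * bytes_per_fetch * 8
    return total_bits / SECONDS_PER_DAY / 1e6


def rss_fetch_bytes(post_bytes, items_per_feed=15, overhead=0.10):
    """Bytes pulled to get one new post when the feed carries a window of items."""
    return post_bytes * items_per_feed * (1 + overhead)


if __name__ == "__main__":
    # HTML case: 900k posts/day at 150K each (135G per day).
    html = sustained_mbps(900_000, 150_000)
    # RSS case: 20k posts, 15 items per feed, 10% overhead (330k per fetch, 297G per day).
    rss = sustained_mbps(900_000, rss_fetch_bytes(20_000))
    # Update case: 1.8M posts/day at 2k each, 15 items, no overhead (54G per day).
    update = sustained_mbps(1_800_000, rss_fetch_bytes(2_000, overhead=0.0))
    print(f"HTML: {html:.1f} Mbps, RSS: {rss:.1f} Mbps, ratio: {rss / html:.1f}x")
    print(f"Update: {update:.1f} Mbps")
```

With these inputs the 27.5Mbps figure corresponds to the full 297G per day, i.e. before any gzip saving is applied, and the 54G figure in the update only matches if the 10% overhead is left out.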
Technorati points out there are 900k blog posts per day:
You'd think at least the submitter would read the post.
``How Much Bandwidth is Required to Aggregate Blogs?''
Less than it currently takes, what with pull, HTTP, and XML used instead of more efficient technologies.
``what percentage of them have any real value, and how do busy people find that .001%?''
Using a scoring system, like Slashdot's?
It's not like all of this is rocket science. It's just that people go along with the hyped technology that's "good enough for any conceivable purpose", ignoring the superior technology that had been invented before and wasn't hyped as much. Nothing new here.
Please correct me if I got my facts wrong.
By which I assume you mean 9 million....
the cited article discusses volumes of 900k, i.e.: thousands...
from whence comes this discrepancy ?
It's 900k, not 9m. Please, RTFA before hurrying over here to post.
In actuality, my guess is that there are few blogs you might decide to visit, and of those you do, several may have content you find worthwhile. Remember, worthwhile is all in the perception of the reader - there is no real definition for quality or value. Perhaps through trial and error - in essence digital tinkering - you find and derive your own value.
cheers, --dave
Does anyone else wonder why Slashdot editors seem to have it in for blogs? Is it because in Internet years, Slashdot is as old and sclerotic as the Dinomedia? Is Slashdot the Dinomedia of the new media?
Does anyone else consider it ironic that the Slashdot editorship HATES blogs, but Slashdot is actually a blog?
Anyone else getting tired of these questions?
Yes, it's a blog. Sorry if that offends you.
Sucked up and spit out monthly.
The bandwidth savings from using html+css are hugely exaggerated.
Slashdot is switching to html+css for the front page, but not for any dynamic pages like the one you're on now. Because slashcode was written by totally incompetent programmers, the markup for comment pages is not separated from the logic. Making any changes is therefore a huge undertaking and the people who wrote it are far too busy maintaining the high journalistic standards slashdot is known for to do it.
I run the spiders at Technorati, and it is 0.9 million posts a day, which Kevin Burton had correct in the post cited. Is this the no dot effect?
If a friend is going through cancer treatment, her blog is worthwhile. If you find a youth group leader like yourself and can learn from his posts, his blog is worthwhile. A mother fighting for her health so that she can take care of her two sons and husband can share insights that are worthwhile. Someone fighting depression might have a worthwhile blog. A grandmother might have a view of the world that makes her blog worthwhile, just to get a different view. Perhaps a blog by someone who totally disagrees with you will be worthwhile, just to stretch your mind.
I've just described why I read the blogs on my blog roll. You can choose differently.
Top political blogs? You can find them easily among Technorati's top 100 list. Tags at Technorati will let you pick out specialties like science or "Master Blasters" or diabetes or the Tour de France. Google will turn up blogs if you search right, which is the trick for using Google.
"Worthwhile" is a much more difficult variable to calculate than "bandwidth." Perhaps it's the sheer variety of blogs that makes them interesting, because they are so individual and someone, somewhere will speak to your mind or your heart.
Worthwhile is what's worthwhile to you, and maybe to very few others. Not everyone will agree, and that's not a bad thing.
This sig seemed like a good idea at the time....
with 9M posts, what percentage of them have any real value, and how do busy people find that .001%?
Either I don't understand this question, or it's a completely idiotic question. What the fuck does "real value" mean? The maxim "One man's trash is another man's treasure " is especially important when talking about information--the asymmetry of value from person to person is even bigger than when you're talking about physical goods.
Considering the second half of the question, though, one might re-phrase the whole thing as "How do you find the posts that have value to you, individually?" That IS an important question... but like most econ majors, I figure the market will probably solve it.
what percentage of them have any real value
I had for a while held the view that most blogs out there are pointless. Some can be insightful and some are basically used as company press releases, but most are people talking about their day's activities that few people really care about, and a few of my friends have blogs like these. When I asked one what's the point, she said she just blogs stuff she would normally mention to many people on msn throughout the day. It's not meant to have value to anyone on slashdot, be hugely insightful, or detail some breathtaking new hack; it's simply another way for her to talk to friends (that doesn't involve repeating herself).
Paul
I call BS. Gzip compresses streams in memory. It can't corrupt your hard drive.
This reads like a generic troll. "We actually had been using $PRODUCT_NAME for quite a long time on a server at home..."
You've just woken up in....The Blogosphere! De-de-de-de, de-de-de-de.....
The answer to the article's question is: nothing; there's no point in wading through the output of blogs, so don't bother aggregating it; stick the whole lot in the bin. There, wasn't that easy?
TWW
"Encyclopedia" is to "Wikipedia" what "Library" is to "Some people at a bus stop"
Making any changes is therefore a huge undertaking and the people who wrote it are far too busy maintaining the high journalistic standards slashdot is known for to do it. ...+5, nougat-filled sarcasm.
The days when 9 megabytes or 5 Mbps sustained for a popular server was considered out of line are long gone. People want to communicate, and they will use whatever resources are needed. How many resources do we use so that we can guarantee that Tuan will get his present from grandma? How many resources do we use so that an arbitrary firm can mail a postcard to everyone in the country? How many resources do we use so that everyone can keep up with the every move of their favorite celebrity?
As far as figuring out what is of value to a particular person, whatever judgemental figure one wishes to place on it, one can browse, or take the time-tested method of using a proxy. Find a trusted source and pay them to publish all the content that they think is of interest to a particular group. Obviously the busy person does not have time to look everywhere, so the service is of value. And, if you do not wish to actually pay for the service, perhaps the trusted source can convince others to pay in exchange for the opportunity to have control over the content or a portion of a page to promote their particular interest.
"She's a scientist and a lesbian. She's not going to let it slide." Orphan Black
search query: blog -1337 -teh -kewl -hugz -omg -bored -lol -lmao -"can't wait to get my drivers license"
https://www.eff.org/https-everywhere
http://slashdot.org/metamod.pl is quite useful (if /. = blog)
The link you posted explains how to compress php files, but what if my site's server is 98% static html files? Is it possible then?
thanks
Do it. He's right.
clearly, the answer is the number of posts times 6.25.
This reads like a generic troll. "We actually had been using $PRODUCT_NAME for quite a long time on a server at home..."
No shit, Sherlock.
I call BS. Gzip compresses streams in memory. It can't corrupt your hard drive.
So why did you have to say this?
It would make you very rich. Nobody thinking about crap like that! No sir!
Like most of life, building networks of trust takes time. Aren't issues like this really part of the problem? Charging for bandwidth... My server has something like 100gig of transfer, and unless I get Slashdotted several times a month is this really a problem? And, if I do, why aren't I getting some ads in place to pay for it?
Technorati only claims to process 900,000 new entries per day - not 9 million. Burton has the number correct in his posting. The /. article quotes him incorrectly. On the other hand, the numbers cited for PubSub are correct. We have processed an average of 1,796,574 (1.8 million) new entries per day over the last 30 days. Many of our statistics are available on our site and updated daily. The "new entries per day" data can be found graphed at: http://www.pubsub.com/linkcounts_graphs.php?type=newentries -- for more graphs and tables, see: http://pubsub.com/linkcounts.php
bob wyman
CTO, PubSub.com
mod_gzip is a C program. Like any C program, some types of programming error can cause stack corruption, which can leave unexpected crud on the stack that is then executed by the CPU. It is not impossible that system calls will be part of said crap.
That said, it's not overly likely, and you'd have to be pretty unlucky to have something like that happen even when testing it and having it crash repeatedly.
The more important point is "What? You were running Apache as root?"
Er... what?
I'm not going to make the argument another fellow did (it can't corrupt your disk) 'cos it's not true, though you'd have to be pretty darn unlucky.
The more important point is - you were running Apache as root?! If so, I don't blame your boss. If not, how exactly did it corrupt the disk? I'd be putting my money on an unrelated error (without information to the contrary) personally - early disk failure, etc.
In general, though, firing someone for implementing software they've approved on a production server after testing is stupidity. That's not the sort of place I'd want to work anyway, frankly. He was middle management in a larger company I presume?
You should not have a blog if you do not meet one or more of the following requirements :
- A eyepatch ( both eyes are even better )
- A pegleg ( both is even better )
- A huge scar that has some sort of badass story behind it
- Are a bonafied certified porn star
- Have a 15 inch penis...in a flaccid state
Blogs really are pointless; there is a reason people are called the "average american" or "average joe schmoe".
No one wants to read how your day went if you sit inside a cubicle all day filing reports and creaming over the hot ladies in the office without getting some of it.
Blogs are gay.
Since we're on the subject of blog aggregation, can someone recommend a GOOD way to aggregate?
Every single RSS aggregator I've come across treats my RSS world similar to an e-mail reader, where each blog is a 'folder' and each entry is equivalent to an e-mail.
This is decidedly NOT what I want and I don't understand why everyone's writing the same thing.
My friend is running PLANET, which builds a front page out of the RSS feeds. It looks kind of like the Slashdot front page, where adjacent stories come from different sources and are sorted in chronological order (newest on top).
PLANET seems to be a server-side implementation. My buddy's running Linux and he made a little page for me but it's not right for me to bug him every time I want to add a feed.
Is there anything like what I want that would run on Windows? And if not, why the heck not?
By the same token, why doesn't del.icio.us have any capacity to know when my links have been updated?
For what it's worth, here's my del.icio.us BLOGS area with some blogs I find good.
http://del.icio.us/eduardopcs/BLOG
Ecce Europa - Web Design for Business
If your weblog server implements ETag and Last-Modified, my spider can send a one packet request with the values I last saw from you, and you can send a one packet 304 response if nothing has changed.
Charles Miller explained this well a few years ago.
(I run the spiders at Technorati).
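A minimal Python sketch of the conditional GET exchange described above. The header names (`If-None-Match`, `If-Modified-Since`) and the 304 status are standard HTTP; the two function names and the simplified server-side check are assumptions for illustration. A real server compares dates rather than raw strings and handles weak ETags, which this sketch ignores.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that echo back the validators seen on the last fetch."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def server_status(request_headers, current_etag, current_last_modified):
    """Simplified server check: 304 if the client's validators still match, else 200.

    If-None-Match takes precedence over If-Modified-Since when both are sent.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        return 304 if if_none_match == current_etag else 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None and if_modified_since == current_last_modified:
        return 304
    return 200


if __name__ == "__main__":
    stamp = "Sat, 06 Aug 2005 10:00:00 GMT"
    request = conditional_headers(etag='"abc123"', last_modified=stamp)
    print(server_status(request, '"abc123"', stamp))  # nothing changed
    print(server_status(request, '"def456"', stamp))  # feed was updated
```

When nothing has changed, the 304 response carries no body, which is where the bandwidth saving comes from.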
While there are some great ideas in RSS, one of the worst is polling. As discussed in Burton's post, polling results in a ridiculous waste of bandwidth. A Push approach, like the one defined in "Atom over XMPP" would result in a massively more efficient distribution system like the one that we implement in the PubSub Sidebars. But, if you insist on polling, then the best efficiency can be had by combining Gzip with the A-IM or RFC3229+feed as described on my blog. Using RFC3229+feed, your server would only serve up "unread" entries not everything in your feed. Please read and implement: http://bobwyman.pubsub.com/main/2004/09/implementations.html
bob wyman
CTO, PubSub.com
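Roughly, the server side of the "only serve unread entries" idea could look like the Python sketch below. The `A-IM` request header, the `IM` response header, and the `226 IM Used` status come from RFC 3229; everything else is an assumption for illustration, in particular the choice to use the newest entry's id as the feed's ETag and the function name `serve_feed`. It is not a description of any particular server's implementation.

```python
def serve_feed(request_headers, entries):
    """Return (status, response_headers, entries_to_send).

    `entries` is a newest-first list of (entry_id, text) tuples. As a
    simplifying assumption, the feed's ETag is the id of its newest entry.
    """
    current_etag = entries[0][0] if entries else ""
    seen = request_headers.get("If-None-Match")
    accepted = [token.strip() for token in request_headers.get("A-IM", "").split(",")]
    wants_delta = "feed" in accepted

    if seen == current_etag:
        # Client already has the newest entry: nothing to send.
        return 304, {}, []

    ids = [entry_id for entry_id, _ in entries]
    if wants_delta and seen in ids:
        # Send only the entries that are newer than the one the client last saw.
        unread = entries[: ids.index(seen)]
        return 226, {"IM": "feed", "ETag": current_etag}, unread

    # No delta support requested, or unknown validator: send the whole feed.
    return 200, {"ETag": current_etag}, list(entries)


if __name__ == "__main__":
    feed = [("e3", "newest post"), ("e2", "older post"), ("e1", "oldest post")]
    print(serve_feed({"A-IM": "feed", "If-None-Match": "e1"}, feed))
    print(serve_feed({"If-None-Match": "e1"}, feed)[0])
```

A client that does not send `A-IM: feed` still gets the full feed with a plain 200, so older aggregators keep working.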
There are gzip accelerator PCI cards available for cases where CPU is an issue. Whether they're cheaper in large clusters than just adding some hosts or getting a bigger pipe, I don't know ... but they're another option.
HunbunFunland, AnonDotOrg, Aberfoyle, TGIFF, Gestures, CarbonBasedSoda, and BorgGates are all sock puppet accounts of the same person, who is trying to use the Slashdot comment system as a personal ad agency by constantly making posts that are nothing more than thinly veiled excuses to attract traffic to his blog.
It's going to.
Someone is going to link to the original post on their blog. That article will be recopied a few times until any link to Slashdot is lost.
Some news reporter, hoping to pick up on the "next big thing" will take it to be a legitimate report.
When you watch the cable news and see an over-hyped story about a car that runs on water, ask yourself if it started out as a joke on Slashdot.
This sig seemed like a good idea at the time....
...does it take to get to the center of the blog aggregate? 1... 2... 3.
Totally Life!
ALL replies
"And more importantly, with millions of posts, what percentage of them have any real value, and how do busy people find that .001%?"
Busy people don't waste time on blogs. Blogs are the realm of internet kooks ranting about the latest conspiracy behind secret intelligence memos, not sane people with limited free time.
In theory, the two should work together seamlessly. In practice, they don't.
The Raven
and how do busy people find that .001%?
They don't, they really have better things to do. The media actually does that for us already... what me worry?
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
If they use Gzip, then all their customers are suddenly using much less bandwidth and they make less money.
... not very likely.
Of course this would not be true if their bandwidth charges were the same as their costs
For every expert, there is an equal and opposite expert. - Arthur C. Clarke
We don't use HTML and RSS for blogs because they are efficient, we use them because they are easy. They are low budget replacements for SOAP / .Net / J2EE and don't require the installation of new server software.
We get away with it because it wasn't designed to have one central server running the blogs for millions of people - it was designed so that Joe Blogs (sic) can easily update a website that their /. reading brethren set up for them - for that purpose it is more than adequate.
We have reached an interesting juncture with RSS and Blogs. People like the technologies, they are successful, but they are hacks. Here are some things I'd like to happen:
The pattern of client to server to server to client is a bit like the architecture of email, but it is quite spam-proof because you only ever receive what you asked for.
Additionally, subscribers can instantly "repost" a suggestion to their own channel, which will be read by their subscribers. To avoid reading duplicate posts, servers will optionally filter out duplicates. However, this has a major consequence, which is that subscribers are only ever guaranteed to see the URL, which means that anything you want to say about the content of a new page has to go into the URL. The current system of RSS titles and descriptions will not work under reposting and duplicate filtering.
The combination of real-time pushing and reposting could lead to a speeded up Internet, where exciting new ideas spread from one user to the next in a matter of minutes, without having to go through the bottlenecks of centralised attention and popular websites (such as Slashdot). This could be enough to turn the Internet into a "Global Brain", and perhaps even trigger the Technological Singularity.
I invented Miski to solve the problem of getting people to take notice of new ideas without having to engage in a massive publicity effort, but unfortunately I've failed to get anyone to take any notice of the Miski idea.
Music: a super-stimulus for the perception of musicality. Musicality: a perceived aspect of speech.
This effect is called the long tail effect, and is visible all over the web. For instance, Amazon.com says that every day, it sells more books that didn't sell yesterday than the sum of books sold that *also* sold yesterday. In other words, they sell (in sum) more of the items selling less than one every other day than of items selling (by type) more than that.
Eivind.
Doubting the existence of evolution is like doubting the existence of China: It just shows that you're uninformed.
You got your facts wrong. When feed readers use conditional GET and respect HTTP Last-Modified headers, and when feed publishers use gzip encoding (XML, like most plain text formats, compresses wonderfully), the bandwidth requirement for aggregation is minimal; the technologies themselves, then, are not inefficient; the inefficiency is in how they are being used. And the alternative you hint at, push, is nowhere near being "more efficient" since it would require an overhaul of IP, universally adopted, to implement reliably.
Time to ditch the World Wide Web, right?
Definition of whence: From where.
So, you can say:
Whence comes this discrepancy?
but please don't use
From whence...
because it's redundant.
I guess today is a passable day to die.
Technorati recently published that they're seeing 900k new posts per day. PubSub says they're seeing 1.8M.
PubSub later admitted they may have been double-counting.
If you go to blogger/blogspot and use their feature that allows you to scan random blogs, approximately 50% appear to be machine generated link farms with posts being generated every 15 minutes that all link back to a specific site.
I can't imagine that a lot of people are sitting there reading through sets of keywords, but maybe their jobs are more boring than mine.
If these words were people, I would embrace their genocide.
This is not the greatest sig in the world, no. This is just a tribute.
And more importantly, with millions of posts, what percentage of them have any real value, and how do busy people find that .001%?
Unless you're talking about value in terms of dollars earned per web page/blog post, value is completely subjective.
The most objectively valuable blogs are ones that link to other sites and blogs in meaningful ways, which increases the ability of google searches to find what I'm looking for. The value of the internet is raised by making searching better for me.
I don't really understand the anti-blog sentiment on slashdot. Most of the internet was already irrelevant to me before blogging came around, but google made it easier to not get bogged down in it.
The thing that makes blogs different and harder for google to track is the speed at which they are updated. If something happened yesterday I don't want to wait weeks for google to spider all the new posts, I want to find sites talking about it right now- technorati and other sites do a decent job of that, but it's annoying to have to go to two places to try to find the same thing.
Google does work if people make posts like "I'm going to go to this event next week, and I'll put pictures up, here are links to other people that are also going" and then proceed to do so, then the google will get you to the old post and you can move forward to the more recent post that actually talks about the event.
I would say that the bandwidth used for blog RSS feeds pales in comparison to that used for downloading TV shows these days.
[ReidNews]
You could of course use PSYC http://psyc.pages.de/ to syndicate your blogs... much better distribution strategy than RSS, and the overhead is not anywhere near RSS's. And you can do much more than just distributing newsfeeds... but anyway, it's one of the things it's good at :)
...and the right protocol/system for it is netnews with NNTP.
If you make the NNTP links between servers match the physical topology of the Internet, you can make the guarantee that no message cross an Internet link (in a given direction) more than once. This is because the messages are all tagged with a network-wide unique message-ID, and duplicates (which are a necessary effects of a flood broadcast system) are rejected before they're sent.
You couple that with clean separation of content into enough different news groups, and users who subscribe to just what they're interested in, and voila! Efficient, reliable, fast distribution of information over the Internet, even better than the so-called "P2P" file sharing networks.
I oughta know - I am one of the guys who invented NNTP.
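The message-ID trick can be illustrated with a toy flood simulation in Python. This is only a sketch under stated assumptions: the topology, the server names, and the function name are made up, and the "offer by message-ID, decline if already held" step is modelled as a simple set lookup rather than a real NNTP exchange.

```python
from collections import defaultdict, deque


def flood(links, origin, message_id):
    """Flood one article through a network of servers.

    `links` is a list of undirected (server, server) pairs. Returns a dict
    counting how often the article crossed each directed link, plus the set
    of servers that ended up holding it.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    history = defaultdict(set)  # per-server set of message-IDs already held
    history[origin].add(message_id)
    transfers = defaultdict(int)
    queue = deque([origin])

    while queue:
        server = queue.popleft()
        for peer in sorted(neighbours[server]):
            # The sender offers the article by message-ID; a peer that already
            # holds that ID declines, so the body is never sent again.
            if message_id in history[peer]:
                continue
            transfers[(server, peer)] += 1
            history[peer].add(message_id)
            queue.append(peer)

    holders = {server for server, ids in history.items() if message_id in ids}
    return dict(transfers), holders


if __name__ == "__main__":
    # A triangle of servers with one extra leaf hanging off "c".
    topology = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
    transfers, holders = flood(topology, "a", "<1@example.org>")
    print(transfers)
    print(sorted(holders))
```

Even though the triangle gives the article two possible routes to some servers, every server receives the body exactly once, so no directed link carries it more than once.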
Damn....He is basing his math on an average post size of 150K. From a textual standpoint through gzip compression -- that is closer to a BOOK than a blog entry. I can't remember the last time I read a single article of original content that was that big.
(+1 Funny) only if I laugh out loud.
Instead of regenerating pages upon each request (pull), they should be regenerated upon each change (push). This will save not only bandwidth, but also memory and CPU (and lots of it) on the server and is, actually, easier to implement and debug -- no changes to web-server, which will be dealing with the regular file, for example.
In Soviet Washington the swamp drains you.
Though it seems to me that if you're interested in a particular subject, rather than a specific person, good old message boards like phpBB or VBulletin, Usenet newsgroups, or forums like Slashdot, are much better vehicles for sharing information. In these cases, you care more about the topic and not so much who it is that's doing the writing. Rather than trying to harvest relevant content about a subject from blogs, maybe we're just better off posting on message boards to begin with? You hear a lot nowadays about blogs and how great they are, but not a whole lot of noise about message boards and discussion forums. Why is that? They seem like just different ways of equal value for people to contribute information to the community.
I've been helping out some friends who are putting together a site for people to add comments to web pages. I don't know if it's up yet, but here is an example of another mechanism, sort of in between the two, where you can either follow a particular subject or a particular author without too much difficulty.
eyeout is a search agent tool that monitors RSS. It's also trainable so you can filter at a broad level. I seed a topic with a few keywords and then start giving it feedback; pretty soon I have a feed with boundaries based on my interest. Amazing, check it out at eyeout.com.