When RSS Traffic Looks Like a DDoS
An anonymous reader writes "Infoworld's CTO Chad Dickerson says he has a love/hate relationship with RSS. He loves the changes to his information production and consumption, but he hates the behavior of some RSS feed readers. Every hour, Infoworld "sees a massive surge of RSS newsreader activity" that "has all the characteristics of a distributed DoS attack." So many requests in such a short period of time are creating scaling issues." We've seen similar problems over the years. RSS (or as it should be called, "Speedfeed") is such a useful thing, it's unfortunate that it's ultimately just very stupid.
I don't really care for RSS either, but damn, was that necessary?
Or did the RSS reader authors hope that their applications wouldn't be used by anybody except for a few geeks?
That kind of eliminates the point of having the RSS at all, as the user no longer gets up-to-the-minute information.
Also, I doubt that the major problem here is bandwidth, more the number of requests the server has to deal with. RSS feeds are quite small (just text most of the time). The server would still have to run that PHP script you suggest.
Then their RSS client would barf on the input and the user wouldn't see any of the previously downloaded news feeds, in some cases.
:P
Or rather, anyone that programs an RSS reader so horribly as to make it so that every client downloads information every hour on the hour would probably also barf on the input of a 500 or 404 error.
Most RSS feeders *should* just download every hour from the time they start, making the download intervals between users more or less random and well-dispersed. And if you want it more than every hour, well then edit the source and compile it yourself
01100111 01100101 01110100 00100000 01101111 01110101 01110100 00100000 01101101 01101111 01110010 01100101 00101110
First post, for once finally used in the correct context of a story, and it's modded offtopic, damn. Thought I had a winner.
We use poisson distribution to even out the load our scripts generate.
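The poster doesn't say how they apply it, so this is only one guess at what "use poisson distribution" could look like: draw the wait between polls from an exponential distribution, so the polls form a Poisson process with the wanted average interval. The function name and parameters are made up for illustration.

```python
import random


def poisson_poll_delays(mean_interval_s, count, rng=None):
    """Return `count` waits (in seconds) between successive polls.

    Exponentially distributed waits give a Poisson process that averages
    one poll per `mean_interval_s`, with no preferred clock time.
    Hypothetical helper, not taken from any real newsreader.
    """
    rng = rng or random.Random()
    rate = 1.0 / mean_interval_s  # polls per second
    return [rng.expovariate(rate) for _ in range(count)]
```

Individual waits vary a lot (some short, some long), which is the price of getting clients that never line up on the hour.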
Why not have rss readers that check on startup, then check again at user-specified intervals, after a random amount of time has passed?
user starts program at 3.15 and it checks rss feed.
user sets check interval to 1 hour.
rand()%60 minutes later (let's say 37) it checks feed
every hour after that it checks the feed.
simplistic sure, but isn't rss in general?
on an aside, any of you (few) non-programmers interested in creating rss feeds, i put out some software that facilitates it.
hunterdavis.com/ssrss.html
This "optimization" will not have any long-lasting benefits. There are at least three variables in this equation:
1. Number of users
2. Number of RSS feeds
3. Size of each request
This optimization only addresses #3, which is the least likely to grow as time goes on.
Karma: -2147483648 (Mostly affected by integer overflow)
not needing user intervention is the effing POINT of rss.
it's like saying - "java is great, except let's make it compiled, and platform specific"
... hi bingo
It seems kinda stupid to have the clients basing their updates on clock time. Doing an update on client startup and then every 60min after that would be just as easy as doing it on the clock time & would basically eliminate the whole DDOSesque thing.
my sig's at the bottom of the page.
Leaving thousands upon thousands of connections open on the server is a terrible idea no matter how well-implemented the TCP stack is. The real solution is to use some sort of distributed mirroring facility so everyone could connect to a nearby copy of the feed and spread the load. The even better solution would be to distribute asynchronous update notifications as well as data, because polling always sucks. Each client would then get a message saying "xxx has updated, please fetch a copy from your nearest mirror" only when the content changes, providing darn near optimal network efficiency.
Slashdot - News for Herds. Stuff that Splatters.
There are at least three variables in this equation:
1. Number of users
2. Number of RSS feeds
3. Size of each request
And I'll add:
4. Time at which each request occurs
If RSS requests were evenly distributed throughout the hour, the problems would be minimal. When every single RSS reader assumes that updates should be checked exactly at X o'clock on the hour, you get problems.
Doubly so if I want RSS content on multiple machines behind NAT. One person gets slashdot headlines, another CNN or whatever. Simple port forwarding won't solve that problem.
"Push" is dead. "Push" was stillborn. The very climate w.r.t internet security is not disposed to "hey lets let remote servers push stuff into our network!"
I don't need no instructions to know how to rock!!!!
Yeah, just use a database backend for TCP, good idea. Oh! I know! Lets use XML instead! Jesus christ, if you are this stupid, just shut your hole. Don't propose retarded solutions to problems you don't understand just cause you are bored.
500 or 404 won't work for RSS, since most readers just eat the error and try again later.
What would really, really be effective would be a valid RSS feed that contained an error message in-line describing why your request was rejected. A few big sites doing this would rapidly get the rest of the users and clients to be updated.
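A rough sketch of what such an in-line rejection could look like, assuming RSS 2.0 and Python's standard `xml.etree.ElementTree`. The function name and the wording of the item are invented for illustration; nothing here comes from a real site's policy.

```python
import xml.etree.ElementTree as ET


def rejection_feed(feed_title, feed_link, reason):
    """Build a well-formed RSS 2.0 document whose only item explains
    why the request was refused (hypothetical helper)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Feed temporarily unavailable"
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = "Your newsreader is polling too often"
    ET.SubElement(item, "description").text = reason
    return ET.tostring(rss, encoding="unicode")
```

Because the response is still a valid feed, the message shows up in the reader as a headline instead of being silently swallowed like a 500 or 404.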
Even if every RSS reader used HEAD (or if-modified-since) correctly, servers would still get hammered on the hour when the RSS feed has been updated during the hour. If-modified-since saves you bandwidth over the course of a day or month, but it doesn't reduce peak usage.
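To make that concrete, here is a minimal server-side sketch of the If-Modified-Since decision, using only the standard library. The function name and the return shape are assumptions for illustration, not any real server's API.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(if_modified_since, last_modified, body):
    """Return (status, headers, body) for one feed request.

    `last_modified` must be a timezone-aware UTC datetime.
    Hypothetical helper sketching the conditional GET decision.
    """
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat as unconditional
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers, b""
    return 200, headers, body
```

If the feed changed at any point in the last hour, every client that polls on the hour fails the `<=` check and gets the full 200 response, so the spike is the same size as without conditional GET.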
The shareholder is always right.
Most RSS feeders *should* just download every hour from the time they start
That's also a problem, though, since most people start work at their computer desks on the hour, or very close to it. The better solution would be for the client (1) to check once at startup, then (2) pick a random number between one and sixty (or thirty or whatever) and (3) start checking the feed, hourly, after that many minutes. That's the only way to ensure a decently random distribution of hits.
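The three steps above could be sketched roughly like this, with times kept as plain minutes to stay simple. The function name and defaults are made up for illustration.

```python
import random


def poll_times(start_min, interval_min=60, max_offset_min=60, polls=3, rng=None):
    """Return poll times in minutes: one at startup, then `polls` more,
    spaced `interval_min` apart, beginning after a random offset.

    Hypothetical helper; a real reader would schedule timers instead
    of building a list.
    """
    rng = rng or random.Random()
    offset = rng.randint(1, max_offset_min)  # step 2: random 1..max minutes
    times = [start_min]                      # step 1: check at startup
    for k in range(polls):                   # step 3: regular checks afterwards
        times.append(start_min + offset + k * interval_min)
    return times
```

Even if everyone launches their reader at 9:00 sharp, the offset spreads the recurring polls across the whole hour.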
You make some very good points. The old saying "When all you have is a hammer, everything looks like a nail" seems to ring true time and time again. These days it seems that everyone wants to use HTTP for everything and quite frankly it's not equipped to do that.
RSS over SMTP sounds pretty cool. Heck, just sending a list of subscribers an email of RSS and let their mail clients sort it out would be pretty nice.
Heh, my favorite posts are when someone suggests something that sounds totally novel and then someone else points out "Yeah! Like <insert old and underused technology>. It seems to do that damn well." The internet cannot forget its roots!
100% Crunchier
it seems a few people here don't get it. RSS is the file format, not the transfer via HTTP. The whole pull problem is a problem with HTTP; in theory you could make an irc-like protocol and transmit via that, solving some of the subscription, distribution and pull problems.
The main problem here is that RSS lacks any sort of distributed flow control, much as the Internet did back in the early days with tons of UDP packets flying around everywhere and periodically bringing networks to their knees.
One completely backwards-compatible fashion to add flow-control to RSS would be to use the HTTP 503 response when server load is getting too high for your RSS files. The server simply sends an HTTP 503 response with a Retry-After header indicating how long the requesting client should wait before retrying.
Clients that ignore the retry interval or are overly aggressive could be punished by further 503 responses thus basically denying those aggressive clients access to the RSS feeds. Users of overly aggressive clients would soon find that they actually provide less fresh results and would place pressure on implementors to fix their implementations.
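A rough sketch of both halves of that scheme, standard library only. The load numbers, the 600-second default and both function names are assumptions for illustration; the only thing taken as given is that Retry-After may carry either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def feed_status(current_load, max_load, retry_after_s=600):
    """Server side: answer 503 with Retry-After once load passes a
    threshold (hypothetical policy)."""
    if current_load > max_load:
        return 503, {"Retry-After": str(retry_after_s)}
    return 200, {}


def seconds_to_wait(retry_after, now=None):
    """Client side: turn a Retry-After value (delta-seconds or an
    HTTP date) into a number of seconds to back off."""
    now = now or datetime.now(timezone.utc)
    value = retry_after.strip()
    if value.isdigit():
        return int(value)
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0, int((when - now).total_seconds()))
```

A well-behaved client would sleep for `seconds_to_wait(...)` before its next request; one that doesn't just keeps collecting 503s.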
no i think he's being serious. Since most people's schedules are based on the hour marks, it stands to reason that most people are rushing to get to their destination 20 minutes before the hour, and rushing out of wherever they are 20 minutes after the hour. So, since the schedules are all synched, the traffic volume quickly swells 20 min before/after the hour and bam -- that's when you get the most accidents.
Most major cities I think have traffic reports more often than just on 20/40.
Moo.
How about having the SERVER tell the client when to download next? Sort'a like DHCP, but more intelligent: The server will even out the TTL by some sort of gaussian algorithm, and in that method save itself!
If certain users want news more often (say every 15 minutes, versus every hour), have the client say that it would like news every 15 minutes, and the server will schedule it (almost like a calendar), and will send the client a TTL that is almost 15 minutes (but close enough). In fact, this might be the better route: fundamentally change the way RSS works, so that newsreaders are REQUIRED to RSVP, and the ones that don't get an error message (telling the client about newsreaders that are supported)
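A minimal sketch of the calendar idea, assuming the server tracks how many clients it has already booked into each 10-second slot. Note this swaps the "gaussian algorithm" for a simpler pick-the-emptiest-nearby-slot rule; the class, its parameters and the slot sizes are all invented for illustration.

```python
from collections import Counter


class PollScheduler:
    """Hand each client a TTL close to what it asked for, steering it
    into the least-booked time slot nearby (hypothetical design)."""

    def __init__(self, slack_s=120, slot_s=10):
        self.slack_s = slack_s    # how far from the wish we may stray
        self.slot_s = slot_s      # width of one booking slot
        self.booked = Counter()   # slot start time -> clients booked

    def next_poll(self, now_s, wanted_interval_s):
        """Return the TTL (seconds from now) for this client's next poll."""
        target = now_s + wanted_interval_s
        first = (target - self.slack_s) // self.slot_s * self.slot_s
        candidates = range(first, target + self.slack_s + 1, self.slot_s)
        # Prefer empty slots, then the slot closest to the requested time.
        slot = min(candidates, key=lambda s: (self.booked[s], abs(s - target)))
        self.booked[slot] += 1
        return slot - now_s
```

A client asking for 15 minutes gets something between 13 and 17 minutes back, and 25 clients asking at the same instant end up in 25 different slots.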
Even for a poll at hourly intervals this should get staggered across a given hour according to when the client starts. Also, a client should probably not be polling every 3600 seconds (or whatever interval) but polling with a 3600 second gap between end of one poll and start of the next. In this way a loaded server will smear the clients out simply by having slower response, and the load will even out on its own.
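The gap-between-polls idea fits in a few lines. This is only a sketch: `fetch` stands in for whatever actually downloads the feed, and the `sleep` and `clock` parameters exist just so the loop can be exercised without waiting an hour.

```python
import time


def poll_with_gap(fetch, gap_s, count, sleep=time.sleep, clock=time.monotonic):
    """Call `fetch` `count` times, sleeping `gap_s` seconds after each
    fetch finishes (not after it starts). Returns the start time of
    every poll. Hypothetical helper for illustration.
    """
    starts = []
    for i in range(count):
        starts.append(clock())
        fetch()           # a slow server makes this take longer...
        if i < count - 1:
            sleep(gap_s)  # ...which pushes the next poll later
    return starts
```

With a fetch that takes 5 seconds and a 3600 second gap, successive polls start 3605 seconds apart, so each client drifts off the hour by however long the server made it wait.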
It's always bad to have lots of agents doing things in synchrony when that involves an outside resource. Contact the client authors, give them a clue, let the upgrades push the bugfix out.
Finally, isn't RSS done over HTTP anyway? So why aren't these clients going through their ISP's proxy and doing a conditional GET (If-Modified-Since)? The target server should see only a fraction of the spike even with bad clients. Unless they're very very bad...
None of these things is a direct flaw in RSS, just crap quality of implementation in RSS clients.
Cameron Simpson, DoD#743 cs@cskk.id.au http://www.cskk.ezoshosting.com/cs/
1) The RSS-developer community has a completely irrational fear of MIME. They never completed the registration of the application/rss+xml media type, and they've shown no interest in doing so. Winer and the gang want to use text/xml for everything, which makes it harder to separate RSS out of a newsgroup (or anything else; more on that below).
2) The RSS developer community can't picture themselves using anything except HTTP. I've tried mentioning other protocols to them; they don't respond.
3) NOBODY MAKES RSS READERS THAT WORK IN A PIPE!. Seriously. Is it really so hard to envisage somebody piping an RSS file in from the command line? Apparently, it is for the people who write RSS readers: they make you cut-and-paste URIs into a form before you can do anything with an RSS file.
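For what it's worth, a pipe-friendly reader is not much code. This sketch assumes an un-namespaced RSS 0.9x/2.0 document on standard input (RSS 1.0/RDF feeds use XML namespaces and would need more work); the function names are made up.

```python
import sys
import xml.etree.ElementTree as ET


def headlines(rss_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    found = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link", default="")
        found.append((title, link))
    return found


def main(stream=sys.stdin, out=sys.stdout):
    """Read one RSS document from `stream`, print one headline per line."""
    for title, link in headlines(stream.read()):
        print(f"{title}\t{link}", file=out)
```

Saved as a script with a call to `main()` at the bottom, it can sit at the end of any pipeline that produces RSS, whether that is a download tool, a newsreader or `cat`.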
Seriously, RSS over netnews wouldn't really require any new Big Ideas, just a smart re-application of the Old Ideas:
1) Post RSS files to Usenet with proper "Content-Type" and "Supersedes" headers to an appropriate newsgroup. (Maybe some new RSS-friendly newsgroups; maybe the old ones. We can figure that out later. The important thing is: This wouldn't be any more difficult than posting a FAQ is.)
2) Use newsgroup-capable RSS-readers to poll the newsgroups, and/or use regular newsreaders to pipe RSS files to dedicated RSS-readers.
3) Profit! Or at least, Fewer Accidental DDoS attacks!
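Step 1 could look roughly like this, building the article with Python's standard `email.message` module. It only constructs the article and leaves the actual posting to whatever NNTP tool is at hand. The function name is invented, and application/rss+xml is used as proposed above, registered or not.

```python
from email.message import EmailMessage


def rss_article(rss_bytes, newsgroup, subject, sender, message_id, supersedes=None):
    """Wrap an RSS file in a netnews article with Content-Type and,
    optionally, Supersedes headers (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Newsgroups"] = newsgroup
    msg["Subject"] = subject
    msg["Message-ID"] = message_id
    if supersedes:
        # Ask servers to replace the previous copy of the feed.
        msg["Supersedes"] = supersedes
    # Passing bytes yields Content-Type: application/rss+xml (base64-encoded).
    msg.set_content(rss_bytes, maintype="application", subtype="rss+xml")
    return msg
```

Each new version of the feed would name the previous article's Message-ID in Supersedes, the same way periodic FAQ postings replace their predecessors.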
I could do Step 1 now, without significant effort. (It's no more difficult than posting a newsgroup FAQ.) Step 2 requires a real programmer, which I am not.
(In fact, you know what would be great? A combined newsgroup/RSS reader. It makes more sense than all those RSS readers patterned after e-mail programs. But I digress.)
Maybe I'm getting cynical in my old age, but I'm beginning to think this is the UNIX/Windows divide all over again. A lot of the RSS developer community comes from a Windows/Mac developer background, so they just don't see the potential of the toolbox approach, even while they're rambling about the extensibility of XML and its "user-centric" design.
Take for example, the refusal to get a real media type for RSS: A unique MIME type would help web browsers, too, because browsers can use media types to decide which plug-in gets which file. Instead of making a user cut-and-paste URIs from his browser to his reader (which is a dreadfully Windows-ish way of doing it), the user could just click on the RSS link and the web browser could launch the RSS reader by itself (which presumably would do something smart, like ask the user if they want to subscribe to a new feed). Just like all those other plug-ins and non-HTML formats on the Web!
Makes sense, yes? But it doesn't register with anybody creating RSS readers. Some programmers still advocate the cutting-and-pasting of URIs. Some programmers advocated auto-discovery by reading HTML "link" elements. Some advocate complicated cloud/stream schemes. But nobody wants to talk about re-using basic, functional tools that we've had in the toolbox for 10, 15, or 25 years.
Some days, it's like the "RSS developers" are from another planet. And I want to send them all back.
Proud to be / Smiley-free / Since Nineteen / Ninety-Three