Throttle Apache Bandwidth Based on IP Address?
BigBlockMopar asks: "A friend of mine runs a web site which offers a very large archive of files. He wishes to continue to offer free and unrestricted access to his archive, but his bandwidth consumption has been through the roof because of people using wget (and similar) to download his entire site. Current traffic is around 200 gigabytes per month with over 50% of that being clients who are downloading every document on the site. The server space is donated by a hosting provider who is understandably starting to become impatient with the traffic. I've checked out mod_throttle and mod_bandwidth, neither appears to do exactly what is desired. Does anyone have any suggestions?"
"Eventually, he plans to set up mirrors, but he'd like to get the greedy users under control first. Alternatives are adding a (free) log-in authentication system, or a text-in-image system like Network Solutions uses to weed out automated whois queries. But I think the best solution is to allow a given client IP address full-speed downloading for the first $WHATEVER megabytes, and then automatically reduce the speed of the transfer to that IP address. This would probably deter most leeches but continue to allow legitimate users to transfer more than an arbitrary limit."
Well, you could always zip the whole site up, and put that up as a bittorrent link.
Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
I've been searching for this for over a year now, and so far have come up with nothing. I'm afraid it's not around, but if it is, let me know.
If you just slow down the connection, you will end up with a lot of nearly idle Apache processes, and after a while no more clients will be able to connect.
Either just drop connections, or use a single-process proxy with the required ability, which then forwards the requests to Apache.
But restricting by IP can be dangerous if users are sitting behind an ISP's proxy (very common, at least in Germany).
Set up JavaScript links, which wget can't follow.
Or set up a 'captcha' for each download, so that a human has to confirm each file one at a time.
If you enable referer checking, this will stop most wget-type programs. Wget has a --referer=URL option, but I find that it doesn't work. Also, there are a lot of Windows clients that will spider a website and pull files based on extension, but again these don't usually have an option to set the referer, or if they do, most people aren't smart enough to turn it on.
One exception to this is Pavuk, which does referer spoofing pretty well. This program is about 4 times harder to use than wget, and isn't very popular (you don't see it included in distros too often).
Of course this won't completely fix your problem, but it will probably stop about 90% of the people doing it now. It's an easy fix that you can implement quickly until you get something to throttle bandwidth properly.
-ft
If you want to limit bandwidth by IP address, this might be doable depending on what the server is. If the server is a Linux or "virtual Linux" box, you can probably use 'tc' (Traffic Control) in the kernel to meter bandwidth by subnet or address. Look at the Advanced Routing HOWTO for info. It is a bear to set up, but actually works quite well.
The problem is that if these are bots grabbing your whole site, slowing them down to 10K/sec won't actually reduce the amount of traffic they pull from you. They may take all day to get the pages, but the bytes will still move.
Some options that you have:
* If the user really doesn't need the data, block their address entirely.
* Consider blocking the bots' "client" signature (the User-Agent header). You can do this in Apache. "Respected" bots don't lie about who they are. If a bot does lie, then it is a DoS attack in disguise.
* Contact the users, if you can.
* If you want the user to get a mirror, set up something that actually does the mirroring effectively. I would recommend running rsync.
I don't think a bandwidth limitation is going to be effective for this situation. They're still going to consume the same amount of bandwidth, just over a longer term. It's not like people usually sit and wait for the site sucker to do its thing. Bandwidth limiters like you suggested are usually used to reduce the effects of slashdotting and the like.
What you need is an anti-leech mod to limit the amount of data that can be downloaded from a specific IP. I know they're out there. Just do a bit of googling.
If the hosting service had a problem with excessive bandwidth usage, I don't know why they didn't throttle port 80 at the routers.
Unfortunately, likely the best recommendation you'll receive.
You are being MICROattacked, from various angles, in a SOFT manner.
Does anyone have any suggestions?
Yes, use the frelling BitTorrent, that's exactly what it was written for!
Add to this some way of limiting bandwidth per connection (so people are mainly downloading from other BT clients, not from you) and you have a perfect means of distribution.
Leave the possibility to download via HTTP, but limit it with QoS or some other way to a tiny little stream, plus advertise all over the site that people can achieve unlimited download speeds using BT.
Publishing documents only to limit in every possible way access to them (like all the game files servers do) is unwise, to say the least. Especially if you don't have to.
Robert
Bastard Operator From 193.219.28.162
Compile a CBQ module for your kernel.
Find a way to parse the log and find out (in as close to real time as possible) who is slurping the whole site.
Restrict that IP's bandwidth to the machine entirely, not just to Apache.
The Linux kernel can handle these things very easily. Why bother with Apache? Use the Linux machine itself to do it.
This, of course, assumes you are using Linux.
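The log-parsing step could be sketched roughly like this. It assumes Apache's Common Log Format, where the client address is the first whitespace-separated field and the response size is the tenth. The function name heavy_hitters and the byte threshold are made up for illustration.

```shell
#!/bin/sh
# heavy_hitters LOGFILE LIMIT_BYTES
# Hypothetical helper: print "bytes ip" for every client whose total
# transfer in LOGFILE exceeds LIMIT_BYTES, largest first.
# Assumes Common Log Format: field 1 = client IP, field 10 = bytes sent.
heavy_hitters() {
    awk -v limit="$2" '
        $10 ~ /^[0-9]+$/ { bytes[$1] += $10 }   # skip "-" (no body sent)
        END {
            for (ip in bytes)
                if (bytes[ip] > limit)
                    printf "%.0f %s\n", bytes[ip], ip
        }
    ' "$1" | sort -rn
}
```

Run from cron every few minutes against the live access log, something like heavy_hitters /path/to/access_log 1073741824 would list every address that has pulled more than 1 GB. Those addresses can then be handed to whatever kernel-level throttle you set up.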
Don't bother trying to rate limit downloads; you'll get exactly the same number of people downloading everything, except that instead of doing it quickly they'll leave wget running all week, tying up your server's resources.
Have a page "download.php?filename=foo.txt" that all your links point to, and have that page return <meta http-equiv="Refresh" content="1;URL=files/$filename">
(pseudocode; my php scripting is not great, but you get the idea..)
This totally breaks wget, although it's not too hard to script around. You'll cut spider traffic back by probably 95%, all the casual 'grab everything we can' downloaders, but people who really want to get all your files will still figure out how to.
Or if you totally want to stop automated downloads, put each file behind a 'captcha'.
I strongly second the idea of offering your files via BitTorrent only. If, however, you must continue to offer them via plain HTTP, you should be able to cook up something with a custom Apache module. I suggest having a look at http://www.oreilly.com/catalog/wrapmod/
:wq
I wrote an Apache module which I call mod_curb (for Apache 1.3).
This doesn't do exactly what you want, but I'm sure if you were to ask me or somebody else we could code something for you.
The basic idea I have for your problem is to keep a database of currently active clients, be it MySQL or flat files, so that you can keep track of all data transferred by each address.
Once a threshold has been reached you can either stop everything, or start throttling.
However, throttling alone won't help you out: they'll still mirror you, just slowly.
I had the same problem on my rather popular site (posting AC to avoid /.'ing my site). I found that a combination of mod_bandwidth and mod_limitipconn does the trick, combined with a script that monitors the error log and reconfigures the firewall to block abusers.
Of course, people can pass the -U flag to wget and get around this, but it'd work while you get a real solution in place.
-B
Ash and Hickory, straight-grained and true, make excellent bludgeons, dandy for the cudgeling of vegetarians.
If you're just trying to discourage people from downloading so much from you, you need to set up mirrors, bittorrents, or some other protection of your site. Maybe you could reduce the size of your site? Is it an archive of pictures? If so, maybe your friend could reduce the size of them? I mean, if he's offering 100,000 pictures that are 100k in size each, then if he reduces the size/quality so they're only 30k each, then you'll really reduce your bandwidth.
If it's all text, maybe you could use some kind of compression. If it's video, maybe use a lower bitrate. You get the idea.
But just limiting transfer rates by IP is probably not gonna help.
"Avoid employing unlucky people - throw half of the pile of CVs in the bin without reading them." -- David Brent
Well, some people may not know it, but the firewall (iptables) in Linux is very neat when it comes to doing "tricks" with incoming connections.
:) RTFM (man iptables).
First, start by creating a chain that all incoming SYN packets to port 80 should jump to.
Next, use some sort of PHP script that has sudo permissions to add an IP address (DAMN WELL make sure you know how to properly validate those addresses).
End the new chain with a catch-all REJECT or DROP rule (your call); user-defined chains cannot have a default policy of their own.
On each new session incoming to the server, call a script that calls iptables to add the client's IP to the new chain with an ACCEPT rule, but use rate limiting on this rule, say no more than 10 requests per minute. Anything beyond that will not match the rule and will fall through to the final rule, which rejects it. The trick is to set the time value on the block to something like 30 minutes. This way, anyone who goes over 10 requests/min (or whatever you deem reasonable) will be blocked for 30 minutes. That should at least make it impractical to download huge amounts of data, short of hitting the server from multiple machines.
The last thing is to set up some sort of "cleanup" script that runs every so often and removes IPs from iptables once their connections have gone unused.
Anyways, it's a rough idea, and no way in heck I'm gonna give the commands verbatim for nothing.
Good luck
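For anyone who wants a starting point, here is a minimal sketch of a related setup. Note that it is not the per-IP script approach described above: it swaps in iptables' "recent" match, which keeps the per-address bookkeeping in the kernel, so no PHP helper or cleanup job is needed. It also does not give a fixed 30-minute ban; a client stays blocked for as long as it keeps retrying. The helper name, chain name, and limits are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only. IPTABLES can be overridden (e.g. IPTABLES=echo) to print
# the commands instead of running them; running them needs root.
IPTABLES=${IPTABLES:-iptables}

# limit_new_http CHAIN HITS SECONDS
# Hypothetical helper: drop new port-80 connections from any source that
# already opened HITS or more within the last SECONDS.
limit_new_http() {
    chain=$1 hits=$2 secs=$3
    # User-defined chain that every incoming SYN to port 80 jumps to
    $IPTABLES -N "$chain"
    $IPTABLES -A INPUT -p tcp --dport 80 --syn -j "$chain"
    # Over the limit: drop, and refresh the timestamp so the block
    # persists while the client keeps retrying
    $IPTABLES -A "$chain" -m recent --name "$chain" --update \
        --seconds "$secs" --hitcount "$hits" -j DROP
    # Otherwise: record this connection attempt and let it through
    $IPTABLES -A "$chain" -m recent --name "$chain" --set -j ACCEPT
}
```

Calling limit_new_http HTTP_LIMIT 10 60 allows roughly 10 new connections per minute per address. Keep in mind that this counts TCP connections, not HTTP requests, so a client using keep-alive can fetch several files per connection.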
I don't think he should have a problem with wget users sucking down everything. What's the difference between using wget or doing it manually? Just the speed with which they can get it all, right? So as long as they are considerate in their automated downloading, who cares? I usually set my wget sessions to perform one fetch randomly every 1 to 5 minutes. Yes, it will take a LONG time for it to finish very large sites, but it doesn't cause undue load on someone's server, is probably slower than you could do by hand, and still gets the job done *over time*.
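A considerate wget run along those lines might look like the sketch below. The helper name polite_mirror and the 180-second base wait are assumptions chosen to land near the "1 to 5 minutes" figure; --mirror, --wait and --random-wait are standard wget options.

```shell
#!/bin/sh
# WGET can be overridden (e.g. WGET=echo) to preview the command line.
WGET=${WGET:-wget}

# polite_mirror URL
# Hypothetical helper: mirror URL slowly. With --random-wait, wget waits
# between 0.5 and 1.5 times the --wait value, so 180 gives 90 to 270
# seconds between requests.
polite_mirror() {
    $WGET --mirror --wait=180 --random-wait "$1"
}
```

This keeps the load on the server low, but the total number of bytes transferred stays the same, so it does nothing for a monthly bandwidth bill.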
As several people here have said, if you just slow it down the wget will just take all week, and perhaps use more resources (you will have to keep track of it to slow it down).
Instead, when they pass the bandwidth limit (or more likely a number-of-requests limit) you should deliver a dead-end page from which there are no links to go anywhere else. Then when they wget it you will get a lot of these dead-end pages instead of the data they want. If a normal user hits it, it can tell them to wait a few minutes and then reload the page.
If the owner of the material does not mind, it does sound like a bittorrent download would help a lot too. Have the dead-end page give instructions on how to retrieve the bittorrent.
Anyway these are just my ideas, I really have zero experience in web sites so feel free to dismiss them as stupid.
Could you post a link to your friend's site? :)
Username taken, please choose another one.
What I do is have a hidden link at the top of every page that links to a specially-named missing HTML file in that directory. The missing file handler checks for this special name and, if found, adds the client's IP to the .htaccess deny list. The access denied handler checks the .htaccess list and, if their IP is found, explains the acceptable use policy to them. A cron job expires the .htaccess entries quickly once they stop trying to bulk download.
If they're going to suck down the whole thing, plan for it.
Offer it pre-zipped. This would reduce the bandwidth and download time. A plus for everyone.
Make it easy for people who do this to obtain updates/additions by date.
As part of accessing the zipped version, ask people to mirror it. If they're going to carry it all, offer it all. Arrange dynamic mirror updating with those willing.
Find one or more secondary storage site for the archive. Ask people to use these (put them highest on the list).
If people persist on sucking down the whole thing and don't go for the archive, arrange a throttle with the sysadmin, and advertise it. Let people know that if they try to wget everything, things will start going real slow for them.
Set up a small version without the files, in parallel to the real one, with a note saying "files temporarily unavailable". Allow the system owner to switch to the small version during times of high traffic so as not to bog down his other users, or alternatively, switch it yourself according to the owner's estimates of his traffic and times.
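The pre-zipped archive and the updates-by-date idea could be sketched as follows. The helper names are made up, gzip-compressed tar is an assumption (any archive format would do), and make_update expects the stamp file to be given as an absolute path.

```shell
#!/bin/sh
# make_snapshot DOCROOT OUT.tar.gz
# Hypothetical helper: pack everything under DOCROOT into one archive.
make_snapshot() {
    tar -czf "$2" -C "$1" .
}

# make_update DOCROOT STAMPFILE OUT.tar.gz
# Hypothetical helper: pack only files modified after STAMPFILE, so
# people who already have the full snapshot can fetch just the additions.
# STAMPFILE must be an absolute path because find runs inside DOCROOT.
make_update() {
    ( cd "$1" && find . -type f -newer "$2" -print0 ) |
        tar -czf "$3" -C "$1" --null -T -
}
```

After each snapshot, touch the stamp file so that the next make_update picks up only what was added since. Write the archives somewhere outside the document root being packed, or the snapshot will try to include itself.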
"I may be synthetic, but I'm not stupid." -- Bishop 341-B
A friend of mine runs a web site which offers a very large archive of files.....his bandwidth consumption has been through the roof because of people using wget...50% of that being clients who are downloading every document on the site
That friend wouldn't be SCO, by any chance?
You are in a twisty maze of processor lines, all alike.
There is a lot of hype here.
SpeedLimit works by limiting the request rate of each IP address. If your web site consists of many small files (which sounds like the case), then curbing the request rate is enough to cure the most abusive bots. A determined adversary can still circumvent request rate limits with wget --random-wait, but it will be more frustrating for them, and a large percentage of clients can be expected to give up altogether.
If your web site has lots of large files, then (tooting my own horn here) your best choice is to use my mod_limitipconn module together with mod_bandwidth. The mod_bandwidth sets a total limit on traffic and the mod_limitipconn ensures that any single IP address gets only its fair share of that total traffic. I would also advise you in this case to use mod_bandwidth's built in ability to exempt small files from the bandwidth limits.
If your situation is such that the vast majority of visitors intend to download your whole web site, then your best option is to seed a bittorrent tarball of your whole web site.
That should just about cover all the bases. No single one of these proposals is ideal, but you have to realize that overcapacity has no elegant solution. The goal is to manage the situation as best you can using the available tools.
I still got what I wanted but I'm sure the vast majority of people wouldn't do it. Most wouldn't know how to and some wouldn't feel like going through the trouble.
The JavaScript method mentioned earlier could work in a similar way, but what about people who disable JavaScript or use a browser that doesn't support it?
Just put Squid (http://squidcache.org) in front of your Apache.
Squid's config makes bandwidth throttling by IP very easy.
I would use a round robin anonymous proxy and then I could bust your IP based nonsense in a jiffy.
I already have a module for python to do this that took about half an hour to write.
HTTP is an open protocol, there is no true way to filter one set of users from another.
You could always use passworded accounts and use micropayments for bandwidth.
that's the way to do your usenet porn archive
There are places where the networks are not touching,and there are places where they are-Boeing's Lori Gunter
Don't you hate it when everyone tells you something is impossible? It would be much more useful if they wouldn't, so that people who post solutions are easier to find.
This is absolutely possible and not that hard. It is just that most people don't take the time to learn how. The poster who mentioned Quality of Service (QoS) was correct. You will certainly want to read about traffic control and queueing disciplines.
Under Linux, use the traffic control (tc) command to configure bandwidth limits by adding or chaining queueing disciplines to your network interface. tc may not come pre-installed with your distribution, so you might have to find it.
At the end of this post is a script I wrote to limit bandwidth from my website, which limits anything going out of port 8000 to 2 Mbps, but can "borrow" up to 2 Mbps more when bandwidth is available (almost always on a 100 Mbps connection).
Since you can accidentally limit yourself to near nothing, you'll want a quick way to disable traffic control. The line below removes the "root" queueing discipline from the network interface, which also removes all the queueing disciplines chained from it.
tc qdisc del dev eth0 root
By modifying the parameters of the u32 filter, you can quite easily limit based upon IP addresses or networks.
This should get you started, but you really should read the traffic control documentation and understand how to configure this stuff. Don't just think you can tweak a few parameters in the script and get what you want. I'm not ashamed to admit that it took me a few hours to get a beginning grasp on it.
OK, here is the script...
# Add HTB queueing discipline to root of eth0 with handle 1:
# unclassified traffic goes to class 1:99
tc qdisc add \
    dev eth0 \
    root \
    handle 1: \
    htb \
    default 99
# Add a single class that will limit all bandwidth on this interface
# This is done so that we can borrow between the classes below
tc class add \
    dev eth0 \
    parent 1: \
    classid 1:1 \
    htb \
    rate 100mbit
# Class 1:10 is guaranteed 2mbit/s but can borrow up to 2mbit/s more
# (ceil 4mbit) from its parent class 1:1 when bandwidth is spare;
# in practice the other 2mbit/s should almost always be available
tc class add \
    dev eth0 \
    parent 1:1 \
    classid 1:10 \
    htb \
    rate 2mbit \
    ceil 4mbit
# Class 1:99 is limited to 90mbit/s and can not borrow any more
tc class add \
    dev eth0 \
    parent 1:1 \
    classid 1:99 \
    htb \
    rate 90mbit \
    ceil 90mbit
# Use SFQ to share bandwidth fairly among the connections within class 1:10
tc qdisc add \
    dev eth0 \
    parent 1:10 \
    handle 10: \
    sfq
# Use SFQ to share bandwidth fairly among the connections within class 1:99
tc qdisc add \
    dev eth0 \
    parent 1:99 \
    handle 99: \
    sfq
# This filter selects all traffic from port 8000 as belonging to class 1:10
tc filter add \
    dev eth0 \
    protocol ip \
    parent 1: \
    prio 1 \
    u32 match ip sport 8000 0xffff \
    flowid 1:10
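
To limit by client address instead of by source port, a filter along these lines could be added on top of the script above. This is a sketch, not something from my running setup: the helper name throttle_client is made up, and it assumes the qdisc and classes from the script are already in place.

```shell
#!/bin/sh
# TC can be overridden (e.g. TC=echo) to print the command instead of
# running it; running it needs root and the classes from the script above.
TC=${TC:-tc}

# throttle_client DEV IP
# Hypothetical helper: send all traffic leaving DEV for client IP into
# the rate-limited class 1:10.
throttle_client() {
    $TC filter add dev "$1" protocol ip parent 1: prio 1 \
        u32 match ip dst "$2/32" flowid 1:10
}
```

Because the server is sending the files, the match is on the destination address of outgoing packets. All throttled clients share class 1:10, so together they get the 2 to 4 Mbps configured there, with SFQ dividing it among their connections.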
Ouch! The truth hurts!
Instead of treating the leeches as enemies, treat them as friends.
Perhaps you could have a special section of the site just for leeches. Explain your problems: you have too many people downloading everything.
Ask for solutions. Maybe they'd all be happy with BitTorrent. Maybe they could help set up mirrors. Maybe they'd voluntarily restrict their leeching to lower-traffic times like the middle of the night. If you do some kind of throttling, maybe they'd rather buy a CD containing all your stuff than wait a couple days for everything to download (okay, probably not).
I use a great product called Linux Arbitrator, which I'm sure a lot of you are familiar with. I put it in my DMZ with a farm of machines behind it, and it throttles whatever traffic I want based on port, traffic type, IP address or MAC address. http://www.bandwidtharbitrator.com/ Check it out, I think you'll like what it offers. Rob
Problem solved! ;)
If he does mind people downloading his entire site, why not box the whole lot up and offer it as a bittorrent file? Of course then he has the problem that he may have to run the torrent when no-one else is seeding it, but that's an easy way to limit bandwidth uses.
To stop wget, edit your robots.txt and forbid it. Hopefully people will obey...
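A minimal sketch of such a robots.txt, written out by a made-up helper named write_robots, is below. Wget honors robots.txt during recursive downloads by default, but anyone can switch that off with -e robots=off, so this only stops the casual cases. Note that the helper overwrites any existing robots.txt in the given directory.

```shell
#!/bin/sh
# write_robots DOCROOT
# Hypothetical helper: write a robots.txt that asks wget to stay out of
# the whole site while leaving other crawlers alone.
write_robots() {
    cat > "$1/robots.txt" <<'EOF'
User-agent: wget
Disallow: /

User-agent: *
Disallow:
EOF
}
```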
Combination - fun iPhone puzzling
Consider running thttpd instead of Apache for the static downloadable content. It supports this type of throttling: http://www.acme.com/software/thttpd
I run a message board, and at one time I had a ton of programs, videos and Flash files that people were doing the same thing to. So what I did was write a PHP/MySQL program that makes you authenticate and then creates a link that you can click (or that will automatically download after a set amount of time), and it's only good for one click. No wget-type programs can get these files. If you want to try it out, go to http://downloads.tusclan.com. For instance, all of the Super Bowl commercials are there, so you can download them and see how it works.
Throw in a couple of PHP scripts that generate millions of links to nonexistent pages.
No "um's" please.