Throttle Apache Bandwidth Based on IP Address?
BigBlockMopar asks: "A friend of mine runs a web site which offers a very large archive of files. He wishes to continue to offer free and unrestricted access to his archive, but his bandwidth consumption has been through the roof because of people using wget (and similar) to download his entire site. Current traffic is around 200 gigabytes per month with over 50% of that being clients who are downloading every document on the site. The server space is donated by a hosting provider who is understandably starting to become impatient with the traffic. I've checked out mod_throttle and mod_bandwidth, but neither appears to do exactly what is desired. Does anyone have any suggestions?"
"Eventually, he plans to set up mirrors, but he'd like to get the greedy users under control first. Alternatives are adding a (free) log-in authentication system, or a text-in-image system like Network Solutions uses to weed out automated whois queries. But I think the best solution is to allow a given client IP address full-speed downloading for the first $WHATEVER megabytes, and then automatically reduce the speed of the transfer to that IP address. This would probably deter most leeches but continue to allow legitimate users to transfer more than an arbitrary limit."
If you just slow down the connection, you will end up with a lot of nearly idle Apache processes, and after a while no more clients will be able to connect.
Either just drop the connections, or put a single-process proxy with the required ability in front, which then forwards the requests to Apache.
But restricting by IP can be dangerous if users are sitting behind an ISP's proxy (very common, at least in Germany).
If you enable referer checking, this will stop most wget-type programs. Wget has a --referer=URL option, but I find that it doesn't work. Also, there are a lot of Windows clients that will spider a website and pull files based on extension, but again these don't usually have an option to set the referer, or if they do, most people aren't smart enough to turn it on.
One exception to this is Pavuk, which does referer spoofing pretty well. This program is about 4 times harder to use than wget, and isn't very popular (you don't see it included in distros too often).
Of course this won't completely fix your problem, but it will probably stop about 90% of the people doing it now. It's an easy fix that you can implement quickly until you get something to throttle bandwidth properly.
-ft
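Before turning on referer checking, it may help to estimate how much of the traffic actually arrives without a referer. The following is a minimal sketch, not anything from the post above: it assumes the server writes Apache's "combined" log format, and the function name referer_share is made up for illustration.

```shell
#!/bin/sh
# Sketch: report what share of response bytes went to requests with no Referer.
# ASSUMPTION: log lines are in Apache "combined" format, so splitting a line
# on double quotes gives: $2 = request line, $3 = " status bytes ", $4 = referer.
# referer_share is a hypothetical helper name.
referer_share() {
    awk -F'"' '
        {
            split($3, f, " ")                  # f[1] = status, f[2] = bytes sent
            bytes = (f[2] == "-") ? 0 : f[2]   # "-" means no body was sent
            total += bytes
            if ($4 == "-" || $4 == "") none += bytes
        }
        END {
            if (total > 0)
                printf "%d of %d bytes (%d%%) had no referer\n", none, total, 100 * none / total
        }
    ' "$@"
}
```

Call it as referer_share /path/to/access_log (the path depends on your setup). If the share is small, referer checking alone probably won't save much.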
If you want to limit bandwidth by IP address, this might be doable depending on what the server is. If the server is a Linux or "virtual Linux" box, you can probably use 'tc' (Traffic Control) in the kernel to meter bandwidth by subnet or address. Look at the Advanced Routing HOWTO for info. It is a bear to set up, but it actually works quite well.
The problem is that if these are bots grabbing your whole site, slowing them down to 10K/sec won't actually reduce the amount of traffic they pull from you. They may take all day to get the pages, but the bytes will still move.
Some options you have:
* If the user really doesn't need the data, block their address entirely.
* Consider blocking the bots' "client" (User-Agent) signature. You can do this in Apache. "Respected" bots don't lie about who they are. If a bot does lie, then it is a DoS attack in disguise.
* Contact the users, if you can.
* If you want the user to get a mirror, set up something that actually does the mirroring effectively. I would recommend running rsync.
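To decide whom to block or contact, you first need to know which addresses pull the most data. Here is a rough sketch that totals bytes per client from an access log. It assumes Common or Combined Log Format (client address in field 1, response size in field 10) and request URLs without spaces; top_clients and the TOP variable are names invented for this example.

```shell
#!/bin/sh
# Sketch: total bytes served per client IP, largest first.
# ASSUMPTION: Common/Combined Log Format, so field 1 is the client address
# and field 10 is the response size in bytes ("-" when nothing was sent).
# top_clients and TOP are hypothetical names.
top_clients() {
    awk '$10 ~ /^[0-9]+$/ { bytes[$1] += $10 }    # skip "-" and malformed lines
         END { for (ip in bytes) printf "%.0f %s\n", bytes[ip], ip }' "$@" |
        sort -rn | head -n "${TOP:-10}"           # show the ten largest by default
}
```

Run it as top_clients /path/to/access_log, or with TOP=25 in the environment for a longer list. Remember that one address may be an ISP proxy with many users behind it.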
What I do is have a hidden link at the top of every page that links to a specially-named missing HTML file in that directory. The missing file handler checks for this special name and, if found, adds the client's IP to the .htaccess deny list. The access denied handler checks the .htaccess list and, if their IP is found, explains the acceptable use policy to them. A cron job expires the .htaccess entries quickly once they stop trying to bulk download.
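The poster doesn't show the cron job, so the following is only a sketch of how the expiry step might look. It assumes a made-up file layout in which the missing-file handler writes a marker comment "# trap <unix time>" on its own line directly above each "deny from" line it adds; the function name expire_denies and its arguments are likewise hypothetical.

```shell
#!/bin/sh
# Sketch: drop expired trap entries from an .htaccess deny list.
# ASSUMPTION (hypothetical layout): the trap handler appends two lines per client,
#   # trap 1700000000
#   deny from 192.0.2.7
# where the number is the Unix time at which the entry was added.
expire_denies() {
    file=$1 max_age=$2 now=${3:-$(date +%s)}
    tmp=$(mktemp) || return 1
    awk -v now="$now" -v max="$max_age" '
        /^# trap [0-9]+$/ {
            skip = (now - $3 >= max)    # entry is too old: drop marker and next line
            if (!skip) print
            next
        }
        skip { skip = 0; next }         # the deny line that belongs to an expired marker
        { print }                       # everything else is kept untouched
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back in place so the owner and permissions of the file stay as they were.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

From cron you might call something like expire_denies /path/to/.htaccess 600 every few minutes, which would forget clients ten minutes after they last tripped the trap. The in-place rewrite is not atomic, so a real setup would probably want locking against the handler that appends entries.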
Don't you hate it when everyone tells you something is impossible? It would be much more useful if they wouldn't, so that people who post solutions are easier to find.
This is absolutely possible and not that hard. It is just that most people don't take the time to learn how. The poster who mentioned Quality of Service (QOS) was correct. You will certainly want to read about traffic control and queueing disciplines.
Under Linux, use the traffic control (tc) command to configure bandwidth limits by adding or chaining queueing disciplines to your network interface. tc may not come pre-installed with your distribution, so you might have to find it.
At the end of this post is a script I wrote to limit bandwidth from my website, which limits anything going out of port 8000 to 2 Mbps, but can "borrow" up to 2 Mbps more when bandwidth is available (almost always on a 100 Mbps connection).
Since you can accidentally limit yourself to near nothing, you'll want a quick way to disable traffic control. The line below removes the "root" queueing discipline from the network interface, which also removes all the queueing disciplines that are chained from it.
tc qdisc del dev eth0 root
By modifying the parameters of the u32 filter (the "tc filter" command at the end of the script), you can quite easily limit based upon IP addresses or networks instead of ports.
This should get you started, but you really should read the traffic control documentation and understand how to configure this stuff. Don't just think you can tweak a few parameters in the script and get what you want. I'm not ashamed to admit that it took me a few hours to get a beginning grasp on it.
OK, here is the script...
# Add HTB queuing discipline to root of eth0 with handle 1:0
# unclassified traffic goes to class 1:99
tc qdisc add \
dev eth0 \
root \
handle 1: \
htb \
default 99
# Add a single class that will limit all bandwidth on this interface
# This is done so that we can borrow between the classes below
tc class add \
dev eth0 \
parent 1: \
classid 1:1 \
htb \
rate 100mbit
# Class 1:10 is guaranteed 2mbit/s but can borrow up to 2mbit/s more
# (ceil 4mbit) from its parent 1:1 when that bandwidth is unused;
# in practice the other 2mbit/s should almost always be available
tc class add \
dev eth0 \
parent 1:1 \
classid 1:10 \
htb \
rate 2mbit \
ceil 4mbit
# Class 1:99 is limited to 90mbit/s and cannot borrow any more (ceil equals rate)
tc class add \
dev eth0 \
parent 1:1 \
classid 1:99 \
htb \
rate 90mbit \
ceil 90mbit
# Use SFQ to load balance the connections within class 1:10
tc qdisc add \
dev eth0 \
parent 1:10 \
handle 10: \
sfq
# Use SFQ to load balance the connections within class 1:99
tc qdisc add \
dev eth0 \
parent 1:99 \
handle 99: \
sfq
# This filter selects all traffic from port 8000 as belonging to class 1:10
tc filter add \
dev eth0 \
protocol ip \
parent 1: \
prio 1 \
u32 match ip sport 8000 0xffff \
flowid 1:10
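As an illustration of the per-IP variant mentioned before the script, here is a small sketch that builds a u32 filter matching on a client address instead of the source port. It only prints the command, so you can read it before running it. It assumes the eth0 interface and class 1:10 from the script above; throttle_client_cmd and its arguments are invented for this example.

```shell
#!/bin/sh
# Sketch: build a u32 filter that steers traffic sent TO one client address
# or network into the throttled class. Dry run: it only prints the command.
# ASSUMPTIONS: interface eth0 and class 1:10 exist as in the script above;
# throttle_client_cmd and its arguments are hypothetical.
throttle_client_cmd() {
    client=$1               # address with prefix, e.g. 192.0.2.7/32 or 192.0.2.0/24
    dev=${2:-eth0}
    flowid=${3:-1:10}
    # Outbound packets carry the client as their destination, hence "ip dst".
    echo "tc filter add dev $dev protocol ip parent 1: prio 1 u32 match ip dst $client flowid $flowid"
}
```

To apply it, run the printed line as root, for example throttle_client_cmd 192.0.2.7/32 | sh. Note that this throttles the address from the first byte; it does not implement the "full speed for the first $WHATEVER megabytes" idea from the question, which would need something that counts bytes per client and adds the filter once the limit is passed.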
Ouch! The truth hurts!