Slashdot Mirror


Throttle Apache Bandwidth Based on IP Address?

BigBlockMopar asks: "A friend of mine runs a web site which offers a very large archive of files. He wishes to continue to offer free and unrestricted access to his archive, but his bandwidth consumption has been through the roof because of people using wget (and similar) to download his entire site. Current traffic is around 200 gigabytes per month with over 50% of that being clients who are downloading every document on the site. The server space is donated by a hosting provider who is understandably starting to become impatient with the traffic. I've checked out mod_throttle and mod_bandwidth, neither appears to do exactly what is desired. Does anyone have any suggestions?"

"Eventually, he plans to set up mirrors, but he'd like to get the greedy users under control first. Alternatives are adding a (free) log-in authentication system, or a text-in-image system like Network Solutions uses to weed out automated whois queries. But I think the best solution is to allow a given client IP address full-speed downloading for the first $WHATEVER megabytes, and then automatically reduce the speed of the transfer to that IP address. This would probably deter most leeches but continue to allow legitimate users to transfer more than an arbitrary limit."

20 of 75 comments

  1. BitTorrent by Directrix1 · · Score: 2, Interesting

    Well, you could always zip the whole site up, and put that up as a bittorrent link.

    --
    Occam's razor is the blind faith in the natural selection of least resistance and in universal oversimplification. -- EF
  2. bad idea, apache because 1 connection per process by nudelding · · Score: 4, Interesting

    If you just slow down the connection, you will end up with a lot of nearly idle Apache processes, and after a while no more clients will be able to connect.
    Either drop the connections outright, or put a single-process proxy with the required ability in front, which then forwards the requests to Apache.
    But restricting by IP can be dangerous if users are sitting behind an ISP's proxy (very common, at least in Germany).

  3. Javascript? by zcat_NZ · · Score: 2, Interesting

    Set up javascript links, which wget can't follow.

    Or set up a 'captcha' for each download, so that a human has to confirm each file one at a time.

    --
    455fe10422ca29c4933f95052b792ab2
  4. Another solution might be referer checking by Loualbano2 · · Score: 5, Insightful

    If you enable referer checking, this will stop most wget-type programs. Wget has an --referer=URL option, but I find that it doesn't work. Also, there are a lot of Windows clients that will spider a website and pull files based on extension, but again these don't usually have an option to set the referer, or if they do, most people aren't smart enough to turn it on.

    One exception to this is Pavuk, which does referer spoofing pretty well. This program is about 4 times harder to use than wget and isn't very popular (you don't see it included in distros too often).

    Of course this won't completely fix your problem, but it will probably stop about 90% of the people doing it now. It's an easy fix that you can implement quickly until you get something to throttle bandwidth properly.
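
    A minimal sketch of such a referer check in Python, not tied to any particular Apache module. The host names and the referer_allowed helper are hypothetical, and since some browsers and proxies strip the Referer header, a strict check like this can also lock out legitimate visitors.

```python
from urllib.parse import urlsplit

# Hypothetical host names; substitute the site's real ones.
ALLOWED_REFERER_HOSTS = {"your.site.org", "www.your.site.org"}


def referer_allowed(referer_header, allowed_hosts=ALLOWED_REFERER_HOSTS):
    """Return True if the Referer header points at one of our own pages."""
    if not referer_header:
        return False  # header missing or empty
    host = urlsplit(referer_header).hostname  # already lower-cased, or None
    return host is not None and host in allowed_hosts
```

    A file request would be served only when referer_allowed() returns True; anything else could be redirected to the index page.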

    -ft

    1. Re:Another solution might be referer checking by ncr53c8xx · · Score: 2, Informative
      Wget has an --referer=URL option, but I find that it doesn't work.

      Which version of wget are you using? The referrer option works fine for me--for one website when I don't use it, I get redirected to the main page. With the referrer option I can download the file. Although something that sets the referrer automatically would be best.

  5. Even if it works, it might not. by DDumitru · · Score: 4, Interesting

    If you want to limit bandwidth by IP address, this might be doable depending on what the server is. If the server is a Linux or "virtual Linux" box, you can probably use 'tc' (Traffic Control) in the kernel to meter bandwidth by subnet or address. Look at the advanced routing howto for info. It is a bear to set up, but it works quite well.

    The problem is that if these are bots grabbing your whole site, slowing them down to 10K/sec won't actually reduce the amount of traffic they pull from you. They may take all day to get the pages, but the bytes will still move.

    Some options that you have:

    * If the user really doesn't need the data, block their address entirely.
    * Consider blocking the bot's "client" signature (its User-Agent string). You can do this in Apache. "Respected" bots don't lie about who they are. If a bot does lie, then it is a DoS attack in disguise.
    * Contact the users, if you can.
    * If you want the user to get a mirror, set up something that is actually effective for mirroring. I would recommend running rsync.

    1. Re:Even if it works, it might not. by Mancide · · Score: 2, Informative

      Also, wget will listen to robots.txt, just specify what is allowed and what isn't allowed for wget to grab. Granted, this can be circumvented, but it should help with most of the users who are not smart enough to get around it.

      This link should help you out.
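
      As a rough illustration of how a robots.txt rule set is evaluated, here is a sketch using Python's standard urllib.robotparser. The robots.txt content, the /files/ path and the may_fetch helper are made up for the example, and it assumes the client identifies itself with a "Wget" token. It shows the parser's view of the rules, not wget's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: keep Wget out of /files/, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: Wget
Disallow: /files/

User-agent: *
Disallow:
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

      With these rules a well-behaved recursive downloader skips /files/ but can still read the index pages, and ordinary browsers are unaffected.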

      --
      "This amp is special, see all the knobs go up to 11, that means it is one louder than other amps"
  6. Wrong solution by insensitive+claude · · Score: 3, Informative

    I don't think a bandwidth limitation is going to be effective for this situation. They're still going to consume the same amount of bandwidth, just over a longer period. It's not like people usually sit and wait for the site sucker to do its thing. Bandwidth limiters like the ones you suggested are usually used to reduce the effects of slashdotting and the like.

    What you need is an anti-leech mod to limit the amount of data that can be downloaded from a specific IP. I know they're out there. Just do a bit of googling.

  7. BT by Gadzinka · · Score: 3, Insightful

    Does anyone have any suggestions?

    Yes, use the frelling BitTorrent, that's exactly what it was written for!

    Add to this some way of limiting bandwidth per connection (so people are mainly downloading from other BT clients, not from you) and you have the perfect means of distribution.

    Leave the possibility to download via HTTP, but limit it with QoS or some other way to a tiny little stream, plus advertise all over the site that people can achieve unlimited download speeds using BT.

    Publishing documents only to limit in every possible way access to them (like all the game files servers do) is unwise, to say the least. Especially if you don't have to.

    Robert

    --
    Bastard Operator From 193.219.28.162
  8. Seriously.. use Javascript! by zcat_NZ · · Score: 2, Interesting

    Don't bother trying to rate limit downloads; you'll get exactly the same number of people downloading everything, except that instead of doing it quickly they'll leave wget running all week, tying up your server's resources.

    Have a page "download.php?filename=foo.txt" that all your links point to, and have that page return <meta http-equiv="Refresh" content="1;URL=files/$filename">

    (pseudocode; my PHP scripting is not great, but you get the idea..)

    This totally breaks wget, although it's not too hard to script around. You'll cut spider traffic back by probably 95%, all the casual 'grab everything we can' downloaders, but people who really want to get all your files will still figure out how to.

    Or if you totally want to stop automated downloads, put each file behind a 'captcha'.

    --
    455fe10422ca29c4933f95052b792ab2
    1. Re:Seriously.. use Javascript! by BrookHarty · · Score: 2

      It might break wget, but HTTRACK will get around that.

      I would use PHP and log users and their IPs, and only allow so many files per user. You do want return traffic?

      And as others have said, use bittorrent.

  9. mod_perl is your friend by Etyenne · · Score: 2, Informative

    I strongly second the idea of offering your files via BitTorrent only. If, however, you must continue to offer them via plain HTTP, you should be able to cook up something with a custom Apache module. I suggest having a look at http://www.oreilly.com/catalog/wrapmod/

    --
    :wq
  10. Not quite what you want .. by stevey · · Score: 3, Informative

    I wrote an apache module which I call mod_curb (for Apache 1.3)

    This doesn't do exactly what you want, but I'm sure if you were to ask me or somebody else we could code something for you.

    The basic idea I have for your problem is to keep a database of currently active clients, be it MySQL or flat files, so that you can keep track of all data transferred by each address.

    Once a threshold has been reached you can either stop everything, or start throttling.

    However, throttling alone won't help you out: they'll still mirror you, just slowly.
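
    The bookkeeping could look roughly like the sketch below. It is not mod_curb's code. The ByteQuota class, its limits and the fixed per-client window are assumptions for illustration, and the in-memory dict would have to be replaced by a shared store (the MySQL or flat-file database mentioned above), because separate Apache processes do not share memory.

```python
import time


class ByteQuota:
    """Count bytes served per client IP inside a fixed time window."""

    def __init__(self, soft_limit, hard_limit, window_seconds=86400):
        self.soft_limit = soft_limit   # at or above this: throttle
        self.hard_limit = hard_limit   # at or above this: refuse
        self.window = window_seconds
        self.usage = {}                # ip -> (window_start, bytes_sent)

    def _current(self, ip, now):
        start, total = self.usage.get(ip, (now, 0))
        if now - start >= self.window:  # window expired: start over
            start, total = now, 0
        return start, total

    def record(self, ip, nbytes, now=None):
        """Add nbytes to the running total for ip."""
        now = time.time() if now is None else now
        start, total = self._current(ip, now)
        self.usage[ip] = (start, total + nbytes)

    def decision(self, ip, now=None):
        """Return 'full-speed', 'throttle' or 'block' for the next request."""
        now = time.time() if now is None else now
        _, total = self._current(ip, now)
        if total >= self.hard_limit:
            return "block"
        if total >= self.soft_limit:
            return "throttle"
        return "full-speed"
```

    Setting soft_limit to the "first $WHATEVER megabytes" from the question gives the full-speed-then-slow behaviour, while hard_limit covers the stop-everything case.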

  11. Re:WANTED by stevey · · Score: 2, Informative

    See my other comment about mod_curb which comes close to doing the right thing.

    You could hack it, or find somebody else to do so for you.

  12. How would this be a solution? by lorcha · · Score: 2, Insightful
    If you are trying to limit the actual amount of downloaded bytes, how would throttling by IP help? If Larry the leech types
    wget -l99 http://your.site.org/
    he's just gonna walk away from the machine and check back when it's done. If you serve up all those files in 1 minute, 1 day, or 1 week, it doesn't matter. He's still downloaded exactly the same amount of data from you. Your solution only works if you're trying to limit transfer rates, which you should be able to do with your mod_throttles of the world.

    If you're just trying to discourage people from downloading so much from you, you need to set up mirrors, bittorrents, or some other protection of your site. Maybe you could reduce the size of your site? Is it an archive of pictures? If so, maybe your friend could reduce the size of them? I mean, if he's offering 100,000 pictures that are 100k in size each, and he reduces the size/quality so they're only 30k each, then he'll really reduce his bandwidth.
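
    To put numbers on that example, here is the arithmetic as a short Python sketch. It assumes "k" means decimal kilobytes and that every file is downloaded once per full mirror; the archive_bytes helper is only for illustration.

```python
KB = 1000  # decimal kilobytes (an assumption; the comment just says "k")


def archive_bytes(num_files, kb_per_file):
    """Total size in bytes of num_files files of kb_per_file kilobytes each."""
    return num_files * kb_per_file * KB


before = archive_bytes(100_000, 100)   # 10,000,000,000 bytes, about 10 GB
after = archive_bytes(100_000, 30)     # 3,000,000,000 bytes, about 3 GB
saving = 1 - after / before            # fraction of traffic saved per mirror
```

    Under those assumptions each complete mirror drops from about 10 GB to about 3 GB, a 70% saving per leech.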

    If it's all text, maybe you could use some kind of compression. If it's video, maybe use a lower bitrate. You get the idea.

    But just limiting transfer rates by IP is probably not gonna help.

    --
    "Avoid employing unlucky people - throw half of the pile of CVs in the bin without reading them." -- David Brent
  13. Instead of slowing down, try stopping it entirely by spitzak · · Score: 2, Insightful

    As several people here have said, if you just slow it down the wget will just take all week, and perhaps use more resources (you will have to keep track of it to slow it down).

    Instead, when they pass the bandwidth limit (or more likely a number-of-requests limit) you should deliver a dead-end page from which there are no links to go anywhere else. Then when they wget it, they will get a lot of these dead-end pages instead of the data they want. If a normal user hits it, it can tell them to wait a few minutes and then reload the page.
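
    A rough sketch of the number-of-requests variant, using a sliding window per IP. The RequestLimiter class, the respond helper and the page text are all hypothetical, and how this would be wired into the web server is left open.

```python
from collections import defaultdict, deque

# Deliberately contains no links, so a spider has nowhere to go from here.
DEAD_END_PAGE = (
    "<html><body><p>Request limit reached. "
    "Please wait a few minutes and reload this page.</p></body></html>"
)


class RequestLimiter:
    """Allow at most max_requests per client IP in any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip, now):
        recent = self.hits[ip]
        while recent and now - recent[0] >= self.window:
            recent.popleft()             # forget requests outside the window
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


def respond(limiter, ip, now, real_page):
    """Serve the real page while under the limit, the dead end otherwise."""
    return real_page if limiter.allow(ip, now) else DEAD_END_PAGE
```

    Refused requests are not counted against the client, so a normal user who waits out the window gets the real page back on reload.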

    If the owner of the material does not mind, it does sound like a bittorrent download would help a lot too. Have the dead-end page give instructions on how to retrieve the bittorrent.

    Anyway these are just my ideas, I really have zero experience in web sites so feel free to dismiss them as stupid.

  14. auto-block bulk downloads by dj.delorie · · Score: 5, Insightful

    What I do is have a hidden link at the top of every page that links to a specially-named missing HTML file in that directory. The missing file handler checks for this special name and, if found, adds the client's IP to the .htaccess deny list. The access denied handler checks the .htaccess list and, if their IP is found, explains the acceptable use policy to them. A cron job expires the .htaccess entries quickly once they stop trying to bulk download.
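
    The moving parts of that scheme might be sketched as follows. This is not the poster's actual code. The trap file name, the DenyList class and the expiry time are invented for illustration, and the generated lines assume the older Order/Allow/Deny access syntax.

```python
TRAP_NAME = "do-not-follow.html"   # hypothetical name of the missing trap file


class DenyList:
    """Track trapped client IPs and render them as .htaccess deny lines."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_seen = {}   # ip -> time of last trap hit or denied request

    def on_missing_file(self, ip, path, now):
        """Missing-file handler: ban clients that follow the hidden link."""
        if path.rsplit("/", 1)[-1] == TRAP_NAME:
            self.last_seen[ip] = now

    def on_denied_request(self, ip, now):
        """Access-denied handler: keep the ban alive while they keep trying."""
        if ip in self.last_seen:
            self.last_seen[ip] = now

    def expire(self, now):
        """Cron-job step: drop entries that have been quiet for ttl seconds."""
        self.last_seen = {ip: t for ip, t in self.last_seen.items()
                          if now - t < self.ttl}

    def htaccess_lines(self):
        """Render the current deny list."""
        lines = ["Order Allow,Deny", "Allow from all"]
        lines += [f"Deny from {ip}" for ip in sorted(self.last_seen)]
        return lines
```

    A human never sees the hidden link, so only clients that blindly follow every href end up on the list, and they drop off again once they have been quiet for the expiry period.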

  15. Plan for them by DynaSoar · · Score: 2, Insightful

    If they're going to suck down the whole thing, plan for it.

    Offer it pre-zipped. This would reduce the bandwidth and download time. A plus for everyone.

    Make it easy for people who do this to obtain updates/additions by date.

    As part of accessing the zipped version, ask people to mirror it. If they're going to carry it all, offer it all. Arrange dynamic mirror updating with those willing.

    Find one or more secondary storage sites for the archive. Ask people to use these (put them highest on the list).

    If people persist in sucking down the whole thing and don't go for the archive, arrange a throttle with the sysadmin, and advertise it. Let people know that if they try to wget everything, things will start going real slow for them.

    Set up a small version without the files, in parallel to the real one, with a note saying "files temporarily unavailable". Allow the system owner to switch to the small version during times of high traffic so as not to bog down his other users, or alternatively, switch it yourself according to the owner's estimates of his traffic and times.

    --
    "I may be synthetic, but I'm not stupid." -- Bishop 341-B
  16. Totally Possible by yancey · · Score: 4, Informative


    Don't you hate it when everyone tells you something is impossible? It would be much more useful if they wouldn't, so that people who post solutions are easier to find.

    This is absolutely possible and not that hard. It is just that most people don't take the time to learn how. The poster who mentioned Quality of Service (QOS) was correct. You will certainly want to read about traffic control and queueing disciplines.

    Under Linux, use the traffic control (tc) command to configure bandwidth limits by adding or chaining queueing disciplines to your network interface. tc may not come pre-installed with your distribution, so you might have to find it.

    At the end of this post is a script I wrote to limit bandwidth from my website, which limits anything going out of port 8000 to 2 Mbps, but can "borrow" up to 2 Mbps more when bandwidth is available (almost always on a 100 Mbps connection).

    Since you can accidentally limit yourself to near nothing, you'll want a quick way to disable traffic control. The line below removes the "root" queueing discipline from the network interface, which also removes all the queueing disciplines that are chained from it.

    tc qdisc del dev eth0 root

    By modifying the parameters of the u32 filter, you can quite easily limit based upon IP addresses/networks.

    This should get you started, but you really should read the traffic control documentation and understand how to configure this stuff. Don't just think you can tweak a few parameters in the script and get what you want. I'm not ashamed to admit that it took me a few hours to get a beginning grasp on it.

    OK, here is the script...

    # Add HTB queuing discipline to root of eth0 with handle 1:0
    # unclassified traffic goes to class 1:99
    tc qdisc add \
    dev eth0 \
    root \
    handle 1: \
    htb \
    default 99

    # Add a single class that will limit all bandwidth on this interface
    # This is done so that we can borrow between the classes below
    tc class add \
    dev eth0 \
    parent 1: \
    classid 1:1 \
    htb \
    rate 100mbit

    # Class 1:10 is guaranteed 2mbit/s but can borrow up to 2mbit/s more (ceil 4mbit)
    # from its parent 1:1; in practice the extra 2mbit/s should almost always be available
    tc class add \
    dev eth0 \
    parent 1:1 \
    classid 1:10 \
    htb \
    rate 2mbit \
    ceil 4mbit

    # Class 1:99 is limited to 90mbit/s and can not borrow any more
    tc class add \
    dev eth0 \
    parent 1:1 \
    classid 1:99 \
    htb \
    rate 90mbit \
    ceil 90mbit

    # Use SFQ to load balance the connections within class 1:10
    tc qdisc add \
    dev eth0 \
    parent 1:10 \
    handle 10: \
    sfq

    # Use SFQ to load balance the connections within class 1:99
    tc qdisc add \
    dev eth0 \
    parent 1:99 \
    handle 99: \
    sfq

    # This filter selects all traffic from port 8000 as belonging to class 1:10
    tc filter add \
    dev eth0 \
    protocol ip \
    parent 1: \
    prio 1 \
    u32 match ip sport 8000 0xffff \
    flowid 1:10
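
    As a sketch of the per-IP variant mentioned above, the Python helper below builds a u32 filter command that steers traffic going to one client address or network into the limited class. The tc_filter_for_client name, its defaults and the example address are assumptions for illustration. It matches on the destination address because the shaping applies to packets leaving the server.

```python
import ipaddress


def tc_filter_for_client(client_net, flowid="1:10", dev="eth0", prio=1):
    """Build a tc u32 filter that sends traffic *to* client_net into flowid."""
    net = ipaddress.ip_network(client_net, strict=False)
    if net.version != 4:
        raise ValueError("this sketch only handles IPv4 ('protocol ip')")
    return ["tc", "filter", "add", "dev", dev, "protocol", "ip",
            "parent", "1:", "prio", str(prio),
            "u32", "match", "ip", "dst", str(net),
            "flowid", flowid]
```

    The returned list can be printed for review or handed to subprocess.run(cmd, check=True) as root, once the qdisc and classes from the script exist.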

    --
    Ouch! The truth hurts!
  17. Re:Ask the leeches by SuiteSisterMary · · Score: 2, Interesting

    Move it over to FTP, and allow only X number of simultaneous logins.

    --
    Vintage computer games and RPG books available. Email me if you're interested.