I've hosted my site with pair Networks for the last few years, and have been extremely happy with them.
It's amazing what you get with a webmaster account for only $29/month -- 120M disk (and extra is cheap), 400M/day bandwidth, virtual FTP server, modern Apache http service, CGI scripts anywhere, shell access, unlimited email aliases, etc.
And they're extremely well-connected (redundant DS3s); cumulative downtime over the last few years has been maybe a few hours.
I don't get anything for plugging them, I'm just a happy customer. Oh, and unlike most sites, their own web site doesn't suck.
I think this is an excellent idea, thanks.
We considered tarpitting before; I think we were always scared off by the prospect of having to keep tens of thousands of connections open.
Does anyone have specific software to recommend that is able to keep that many connections open on a typical cheap Linux box? (Lighttpd? Nginx? Varnish? Yaws?)
The implementation I'm thinking might work well is:
Switch www.w3.org to use some lightweight server software that is able to keep lots of connections open, and configure it to serve DTD files with an artificial 5 second delay. Proxy all the other requests to our existing Apache server running elsewhere (possibly on another port on the same system).
Most people shouldn't notice or care about the delay for DTD files; only the apps that request them hundreds or thousands of times in a row will notice.
W3C's current traffic is something like:
- 66% DTD/schema files (.dtd/ent/mod/xsd)
- 25% valid HTML/CSS/WAI icons
- 9% other
So we'd probably want to configure the lightweight server to serve those icons too (but then it would have to do content negotiation as well).
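To make the tarpit idea concrete, here is a rough sketch in Python's asyncio rather than any of the servers named above. The suffix list comes from the traffic breakdown and the 5 second delay from the proposal; the names (delay_for, handle, TARPIT_SUFFIXES), the port, and the placeholder response are all made up for illustration, and the proxying to Apache is left out entirely.

```python
import asyncio

# Assumed values: the suffixes mirror the traffic breakdown above and the
# delay mirrors the proposed 5 seconds; all names here are hypothetical.
TARPIT_SUFFIXES = (".dtd", ".ent", ".mod", ".xsd")
TARPIT_DELAY = 5.0


def delay_for(path: str) -> float:
    """Return how long to stall before answering a request for `path`."""
    # Drop any query string before looking at the extension.
    bare = path.split("?", 1)[0].lower()
    return TARPIT_DELAY if bare.endswith(TARPIT_SUFFIXES) else 0.0


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Read the request line, stall if it names a DTD-like file, then reply."""
    request_line = (await reader.readline()).decode("latin-1")
    parts = request_line.split()
    path = parts[1] if len(parts) >= 2 else "/"
    # A sleeping coroutine parks the connection without tying up a thread
    # or process, which is the point of using an event-driven server here.
    await asyncio.sleep(delay_for(path))
    body = b"placeholder: serve the file or proxy to the backend here\n"
    writer.write(b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body) + body)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def main() -> None:
    """Listen on an arbitrary local port; start with asyncio.run(main())."""
    server = await asyncio.start_server(handle, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()
```

Whether something like this would hold tens of thousands of connections on a cheap box is exactly the open question; the sketch only shows that the per-request logic is small.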
No no no, that's not the intent at all; documents should continue to point to DTDs on W3C's site. In fact the next version of W3C's markup validator will issue a warning if the FPI and system ID do not match.
People who are simply creating HTML documents generally don't need to worry about this issue at all. Sorry if the article was unclear.
650 times as many hits. (163 times as many bytes.) But that's just from a quick sample.
To try to help put these numbers into perspective, this blog post is currently #1 on slashdot, #7 on reddit, the top page of del.icio.us, etc; yet www.w3.org is still serving more than 650 times as many DTDs as this blog post, according to a 10-min sample of the logs I just checked.
At W3C we log almost everything as well, and we end up with way too much data as a result.
But we use the logs to detect and prevent certain classes of abuse as well (e.g. too many requests in a short time interval or re-requesting the same resources over and over), and we also want to be able to track trends over time, so we have been reluctant to just throw that data away.
I have a plan that I have yet to implement, which is to log only 0.001% of the requests for certain very popular resources (e.g. HTML DTDs and valid-HTML icons), which would allow us to monitor trends without logging tens of gigs of data per day; we'd just need to compensate for it when calculating stats later.
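A minimal sketch of what the sampling and the later compensation could look like, in Python. It treats 0.001% as "1 request in 100,000" so the arithmetic stays in integers; should_log and estimate_total are hypothetical names, not part of any existing script.

```python
import random

# 0.001% of requests is 1 in 100,000.
SAMPLE_ONE_IN = 100_000


def should_log(rng: random.Random, one_in: int = SAMPLE_ONE_IN) -> bool:
    """Decide independently for each request whether to write a log line."""
    return rng.randrange(one_in) == 0


def estimate_total(logged: int, one_in: int = SAMPLE_ONE_IN) -> int:
    """Scale a count of sampled log lines back up to an estimated request count."""
    return logged * one_in
```

With this scheme, 3 logged lines for a given DTD would be read as roughly 300,000 actual requests. The estimate gets noisy for resources that are only hit a few hundred thousand times per day, so the sampling is only sensible for the very popular ones.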
Then I planned to monitor for abuse by also logging every request to a script that watches for abusive traffic patterns, an easy adaptation from the current script that wakes up and skims the logs every 10 mins.
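For the "too many requests in a short time interval" case, the watcher could be little more than a sliding window per client. This is only a sketch under my own assumptions: the RateWatcher class, its thresholds, and the idea of feeding it (client, timestamp) pairs are hypothetical, and it assumes timestamps arrive in non-decreasing order for each client.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flag clients that exceed `max_requests` within `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window = window_seconds
        # One queue of recent request times per client address.
        self._hits = defaultdict(deque)

    def record(self, client: str, timestamp: float) -> bool:
        """Record one request; return True if the client is now over the limit."""
        hits = self._hits[client]
        hits.append(timestamp)
        # Forget requests that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window:
            hits.popleft()
        return len(hits) > self.max_requests
```

A real version would also need to expire idle clients so memory doesn't grow without bound.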
(In your journal entry, when you say you are MD5ing IP addresses for privacy reasons, are you adding a random bit of data to the IP address before calculating the MD5? If not, it's pretty easy to find out which IP address corresponds to a given MD5 sum.)
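To show why the unsalted version is weak, here is a small Python sketch. It reverses a plain MD5 by trying every address in a range (a /24 here to keep it quick; the whole IPv4 space is only about 4 billion candidates), then shows a keyed digest as one alternative. Note that I've swapped in HMAC-SHA256 with a random secret key rather than literally prepending random data to an MD5 input; the function names are made up for illustration.

```python
import hashlib
import hmac
import ipaddress
import secrets


def unsalted_md5(ip: str) -> str:
    """Hash an IP address the naive way, with no secret involved."""
    return hashlib.md5(ip.encode("ascii")).hexdigest()


def reverse_unsalted(digest: str, network: str):
    """Recover the address behind an unsalted digest by trying every candidate."""
    for addr in ipaddress.ip_network(network):
        if unsalted_md5(str(addr)) == digest:
            return str(addr)
    return None


def keyed_digest(ip: str, key: bytes) -> str:
    """Hash an IP address with a secret key (HMAC-SHA256, an assumed substitute)."""
    return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


# The key must stay secret; anyone who learns it can run the same search.
KEY = secrets.token_bytes(32)
```

The same address still maps to the same digest under one key, so per-client stats keep working, but someone without the key can't enumerate addresses to match a digest.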
I did:
grep 'Not authorized by SPF'
on our mail hubs.
some apache.org subdomains have txt records:
$ host -t txt xml.apache.org
xml.apache.org TXT "v=spf1 mx -all"
w3.org started rejecting forgeries based on SPF records about a week ago, and has been rejecting about 10000 forgeries/day since then, including:
52 jakarta.apache.org
18 xml.apache.org
a few other domains that have been forged and rejected according to their SPF records:
1628 amazon.com
222 gmail.com
175 redhat.com
129 lists.sourceforge.net
17 sourceforge.net
(numbers above are # of rejections in the first week)
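For anyone wondering how to read a record like "v=spf1 mx -all", here is a rough Python sketch that splits it into mechanisms and their qualifiers. It is a simplification and not a real SPF checker: it does no DNS lookups, skips modifiers such as redirect=, and parse_spf is a hypothetical name.

```python
# Qualifier prefixes and the result each one yields when its mechanism matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(record: str):
    """Split an SPF record into (result, mechanism) pairs, in evaluation order."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        # Simplification: treat name=value terms as modifiers and skip them.
        if "=" in term.split(":", 1)[0]:
            continue
        if term[0] in QUALIFIERS:
            qualifier, mechanism = term[0], term[1:]
        else:
            # A mechanism with no prefix defaults to "+", i.e. pass.
            qualifier, mechanism = "+", term
        parsed.append((QUALIFIERS[qualifier], mechanism))
    return parsed
```

For the xml.apache.org record this gives pass for hosts matching the domain's MX records and fail for everything else, which is what lets a receiving server reject the forgeries outright.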
No... if the MTA is SPF-enabled it can reject it immediately (while still talking to the sending relay), without causing a bounce to the forged address.
Regarding:
The World Wide Web is "the universe of network-accessible information", i.e. anything with a URI, including URIs that are not tied to a particular hostname.
The Web already includes non-location-based URIs like mid: (for referring to message-ids), and urn:sha1: for referring to a specific set of bits by their checksum.
This proposal seems like a decent way of bridging HTTP-space with URN-space, but please remember that the Web is more than just HTTP. (see also: URIs, URLs, and URNs)
Anyway, it seems to me that sites that tend to suffer from slashdotting are:
those that use dynamically-generated pages for what is basically static content: this problem can be fixed by sites making sure their content is cacheable, and further deployment of HTTP caches. (I'm not convinced a p2p-style solution is the solution here.)
those with large bandwidth needs (kernel images, linux distribution .iso's,
multimedia): as p2p software becomes more mature and widely deployed, everyone
will have a urn:sha1: resolver on their desktop (pointing to their p2p software
of choice), then whenever a new kernel is announced, the announcement can say:
Linux kernel version 2.4.20 has been released. It is available from:
Patch: ftp://ftp.kernel.org/pub/linux/kernel/v2.4/patch-2.4.20.gz
a.k.a. urn:sha1:OWXEOVAK2YJW3G6XSULXDWFCNWTX7B2K
Full source: ftp://ftp.kernel.org/pub/linux/kernel/v2.4/linux-2.4.20.tar.gz
a.k.a. urn:sha1:PPWXYMA32YNDNO35UD3IQTCWBVBYK5DC
and people can just fetch the files using urn:sha1 URIs instead of everyone hitting the same set of mirrors. (gtk-gnutella already supports searching on urn:sha1: URIs)
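As a sketch of how such an identifier can be produced and checked, assuming the 32-character form used in the announcement above is the base32 encoding of the 20-byte SHA-1 digest (that is what it looks like, and what Gnutella-style clients use). The function names are hypothetical.

```python
import base64
import hashlib


def sha1_urn(data: bytes) -> str:
    """Build a urn:sha1: identifier from the bytes of a file."""
    digest = hashlib.sha1(data).digest()
    # 20 bytes encode to exactly 32 base32 characters, so there is no padding.
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii")


def matches(data: bytes, urn: str) -> bool:
    """Check that downloaded bytes are the ones the URN names."""
    return sha1_urn(data) == urn
```

The useful property is that it doesn't matter where the bytes came from: a mirror, a p2p peer, or a local cache can all be verified the same way.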
Using "Click here to complete your purchase" as a regular hypertext link (i.e. href="foo") would be a violation of the HTTP protocol, so any sites that do that are broken, and should not be used. (see further reading, if you're interested.)
In general, it should be safe to prefetch any URLs using HTTP's GET method.
I am very happy to hear about prefetching in Mozilla -- I have been wanting this feature for years!
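A tiny Python sketch of the distinction, assuming a prefetcher that only looks at ordinary links: anchors are fetched with GET and so are candidates, while anything that changes state should sit behind a form submitted with POST and is never touched. The class and function names are made up, and this is not how Mozilla actually selects what to prefetch.

```python
from html.parser import HTMLParser


class PrefetchCandidates(HTMLParser):
    """Collect href values from <a> tags, which browsers fetch with GET."""

    def __init__(self) -> None:
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Forms are deliberately ignored: a POST must never be prefetched.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def prefetch_candidates(html: str):
    """Return the link targets in `html` that are safe to fetch speculatively."""
    parser = PrefetchCandidates()
    parser.feed(html)
    parser.close()
    return parser.urls
```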
I think widespread deployment of checksum-based URIs like urn:sha1 could help solve this problem.
Here's a screenshot, and here's the .bashrc stuff used to do it.
Why would anyone type 'ln -s /usr/bin/secsh /usr/bin/ssh' when you could just type 'ln -s /usr/bin/s{ec,}sh'?
I once saw a mouse that could change its own ball: see mpeg movie
Just tunnel IRC (or something) over ssh: works fine, is easy to set up, and you're not reinventing either wheel. (there are plenty of ssh and IRC clients available for most platforms)
Uh... the validator validates slashdot to HTML 3.2 because slashdot claims conformance with HTML 3.2. (in the doctype declaration, the first line of the file)
Given that HTML 3.2 is three and a half years old, what do you expect?