Yet another blatant error. In fact, it is not possible to pipeline the communication in merge-exchange sort. Each processor needs to see a response from one of its neighbors before deciding what to do next. Latency is a critical issue here.
I must admit to some curiosity at this point. Exactly which university grants a doctorate in computing to someone who doesn't know how the fundamental hypercube algorithms work, doesn't understand the difference between hypercubes and VLSI, and doesn't even understand the difference between a CPU and an FPU?
Let's see if I can boil the news down to something that Mr. Computing DPhil will understand.
The conventional wisdom is that each processor needs, asymptotically, a huge amount of memory for the number-field sieve. The amount of memory grows steadily with the computation time. The cost of the processor is eventually dwarfed by the cost of memory.
(These effects can already be seen to some extent in today's factorizations. You might think that the cost is about evenly balanced between processors and memory; however, the processors are constantly stalled waiting for random access to memory. A much less expensive processor could easily do the same job.)
My paper shows how to get away with, asymptotically, far less memory per processor (and, furthermore, very simple processors), while keeping the computation and communication costs under control.
These are asymptotic results: theorems saying that there is a huge cost improvement for extremely large factorizations. More work is required to figure out the exact cost of the new algorithms for, say, 1536-bit factorizations. It might turn out that the new algorithms are very helpful in practice. It might turn out that the new algorithms are useless in practice. At this point, nobody knows.
Merge-exchange is a hypercube algorithm. Every embedding of a size-n hybercube into real three-dimensional space has wires of average length at least proportional to n^(1/3+o(1)), and time proportional to n^(1/3+o(1)) for signals to traverse the wires, so the cost exponent is 5/3+o(1). In contrast, a three-dimensional mesh achieves a cost exponent of 4/3+o(1).
Mr. Computing DPhil's claim of an exponent of 1 (``by splitting the petabyte of memory into a million pieces and attaching a cpu to each, one speeds up the sorting by a factor of a million'') is simply wrong.
Of course, even 4/3+o(1) is overly optimistic, because tightly packed three-dimensional machines have all sorts of nasty heat-dissipation problems. My paper focuses on two-dimensional circuits, such as FPGAs. The exponent 3/2+o(1) for Schimmler's algorithm on an easy-to-build two-dimensional mesh is still better than the exponent 5/3+o(1) for merge-exchange on a hard-to-build three-dimensional hypercube.
The bottom line is that traditional computers, such as Mr. Computing DPhil's network of Pentiums, are not the right way to do extremely large factorizations. They're not even competitive. However, figuring out the ``extremely large'' cutoff requires a more detailed analysis.
When Schimmler's algorithm splits a huge sorting problem across one million processors, each processor of one millionth the size, it gains a factor of roughly one thousand in speed, without much increase in hardware cost.
Mr. Computing DPhil now claims that the factor is actually one million (and that everyone knows this already). In fact, such algorithms are not only unknown, but also guaranteed not to exist, by the Brent-Kung area-time theorem.
What's important is communication cost. You can see this even for a small number of processors. Exercise: If you have 16 processors, each with a billion numbers, how do you sort all 16 billion numbers? Each processor can easily sort its own numbers, but how do you merge the 16 lists?
One way to view the linear-algebra algorithms in my paper is as follows: it's possible to split that 120-terabyte matrix across a huge number of tiny computers with surprisingly small communication costs.
There's still more overhead here than in the sieving. The analysis shows that it's worth spending more time sieving to reduce the matrix size.
Let's look at this ``Pentium 4 processor'' with ``a few gigabytes of memory.'' We've spent a thousand dollars on that hardware. We can use it to sort a billion 32-bit integers in a matter of minutes, a few hundred billion clock cycles. That's high-speed computation, right?
Wrong. Schimmler's algorithm sorts a billion numbers on a 32768x32768 mesh in only a few hundred thousand parallel compare-exchange steps between adjacent numbers.
Of course, compare-exchange circuitry is more expensive than traditional memory, and a compare-exchange step could be slower than a Pentium-4 clock cycle. But these factors grow only logarithmically with the problem size. Meanwhile, there's a cycles/steps ratio of one million, and that grows as a power of the problem size, roughly the square root.
My paper explains how mesh circuits provide a similar speedup for the linear algebra in the number field sieve, and an even larger speedup for the sieving. (Identifying the asymptotic cost exponent is, obviously, much easier than working out the exact costs for particular input sizes; I simply don't know yet whether the new algorithm will be cost-effective for, say, 1536-bit factorizations.)
Contrary to the comments from Mr. Computing DPhil, this mesh-circuit parallelization is completely different from the standard parallelization of sieving, in which a billion independent billion-number problems are run on a bunch of separate machines. In particular:
The switch from standard computers to mesh circuits reduces the cost exponent. The standard parallelization does not reduce the cost at all.
The processors inside a mesh circuit are constantly talking to each other. The processors in the standard parallelization are practically silent.
The switch from standard computers to mesh circuits drastically reduces the amount of memory per processor and the cost per processor. The standard parallelization does not.
I have no idea how Mr. Computing DPhil arrived at the idea of putting vastly more memory than necessary into a single-processor computer: for example, using a million gigabytes of memory to sort a billion numbers. He is incorrect in claiming that such silly computers are the base for comparison in my paper.
I have electronic paper, and I love it
on
Electronic Paper
·
· Score: 1
Excellent contrast. Enough pixels to display dozens of long lines of readable text. The image changes at the push of a button. I can use this e-paper to see the latest news or to read my mail, exactly as Neal Stephenson predicted.
The device is fairly expensive, and it does draw some power, but the costs are not a big part of my yearly budget.
The manufacturer, some company called Compaq, has dropped the standard e-paper terminology. Instead they call this device a laptop. Catchy name, I think.
I've started using this e-paper to look at product catalogs. I've heard that it's even possible to place orders through the e-paper. Can you believe it? E-paper shopping! Amazing.
Several companies are racing to develop e-paper that's just as easy to read but isn't as expensive and doesn't draw as much power. I'm sure I'll switch once the new e-paper is available. But I want to emphasize the fact that current e-paper is already quite affordable. Go to your local hardware store and buy a Compaq laptop today!
``For example, he makes a big point of tinydns being better than BIND, because while the process is starting up, it still answers queries. While previous versions of BIND would not answer queries during startup, this is no longer a problem with BIND 9.''
False. If you have a lot of data in your zone files, BIND 9 will take quite a while to read that data into memory. All queries for not-yet-read data will fail.
``The problem is that if you make a mistake and munge the database and then rsync or rcp that to the backup servers, you're totally hosed.''
False. With tinydns, a syntax error will not be copied to the slaves. In fact, the master will continue supplying the previous data, unlike BIND, where the master starts responding SERVFAIL to all queries in that zone.
``Without a third party patch, tinydns does not support standard SRV records.''
``Without a patch from a third party, tinydns does not listen to more than one IP address.... Like tinydns, dnscache will not bind to more than one IP address without a third party patch.''
False on both counts. Adding another cache IP address, for example, is a simple matter of running dnscache-conf. Even better, on SMP machines, traffic to the second IP address is automatically handled by a separate CPU.
``By default, tinydns does not support the use of TCP at all. This most definitely violates the spirt [sic] of the RFCs, as well as the letter.''
False. See faq/tinydns.html#tcp for an explanation of when you need to set up TCP service. The RFCs do not require TCP service. (On the other hand, if you follow my instructions to upgrade from BIND, you'll have TCP service.)
``When an IQUERY is sent to a djbdns server, it will respond with opcode set to QUERY. (it should simply copy the opcode, not make something up).''
IQUERY support has never been required. The BIND company admits that IQUERY is unused and obsolete. The handling of IQUERY has no effect on DNS interoperability.
``DNSCACHE (the caching server) does not respond to queries with the RD bit clear in the query. (Instead of simply answering from cache without recursing the dns-tree).''
This is a security feature. It has no effect on DNS interoperability.
``By default, tinydns does not hand out referrals to questions it is asked about zones it does not control. I believe that this violates the spirt of the RFCs, if not the letter.''
Anyone who understands DNS knows that all such referrals are discarded by clients, and in fact must be discarded for security reasons. Omitting these referrals simplifies server configuration. It has no effect on DNS interoperability.
``Without a patch from a third party, tinydns does not support the standard "NOTIFY" protocol of informing secondary nameservers that the zone has been updated, and that they need to check the SOA serial number and download a new copy (if they don't already have it).''
This is an entirely optional standard. Omitting it has no effect on DNS interoperability. See faq/axfrdns.html#what for further discussion.
``Because they are separate programs, you can't have both tinydns and dnscache listening to the same IP address(es) on the same server.''
The DNS and BIND book, in the security section, tells BIND users to keep servers and caches separate. My instructions for upgrading from BIND explain how you can achieve this separation with a minimum of fuss.
``There aren't even any patches that can get djbdns to implement TSIG, Dynamic DNS, or DNSSEC, nor are they ever likely to be created (my understanding is that the author is strongly opposed to them).''
Notice how Knowles focuses on buzzwords rather than features. My users already have zero-outage database updates. My users already protect their networks with general-purpose tools such as ssh and IPSEC, so they don't need ad-hoc DNS-specific tools such as TSIG; Knowles is making a fool of himself when he suggests that TSIG helps ``VPNs based on IPSec.'' As for DNSSEC, see forgery.html. ``Strongly opposed'' is a complete misrepresentation of the situation.
``However, the author of this package simply refuses to accept that his code could be anything less than 100% perfect, and while he claims to have a "bounty" that he will pay for any bug that is found, in reality he is the one that gets to define what he accepts as a "bug", and has repeatedly demonstrated a tendancy to openly refuse to accept some purported bug, but then to quietly fix the code with future releases.''
This is typical Knowles slander; notice the lack of details. As for my security guarantee, read the last sentence on the page.
``Benchmarks published by Rick Jones [hp.com] have clearly shown that BIND can scale up to at least 12,000 DNS queries per second... The best benchmarks available for tinydns indicate that it can handle at least 500 queries per second.''
This is a ridiculous comparison:
The 12000 is a maximum: a test load pumped as high as possible until BIND choked (on a big machine, by the way).
The 500 is not a maximum: it is a real load for a production site, far below the maximum that tinydns could handle.
Knowles then compounds his error by comparing the non-maximum performance of dnscache for outgoing queries, which of course involve a complicated resolution process, to the server performance required for incoming root queries, which are simple database lookups. Time for a quick vote: Is this more stupid, or less stupid, than Knowles's latency comparison two months ago between a BIND server on a T1 and a tinydns server on a modem? Anyway, I'm not going to bother responding to Knowles's ignorant comments about fork/exec.
By the way, that production site is now part of Namezero, the largest domain-hosting company on the Internet, publishing more than two million second-level domains with tinydns.
``Under US law, a trademark registration entitles the owner to exclusive
use of the trademark as it is registered, in relation to the goods
and/or services for which it is registered,'' Tatu Ylonen tells us.
But that's not true.
What the owner obtains are exclusive rights (within one field)
to use the trademark in commerce.
Read the law:
15 U.S.C. 1114(1)(a)
applies only to actions ``in commerce.''
15 U.S.C. 1114(1)(b)
applies only to documents ``intended to be used in commerce.''
15 U.S.C. 1125(a)(1)
applies only to actions ``in commerce.''
The new cybersquatting prohibitions,
15 U.S.C. 1125(d)(1)(a), and 15 U.S.C. 1129(1)(A),
apply only to people acting with an ``intent to profit.''
I doubt that the name OpenSSH is likely to confuse or deceive people.
However, even if it is, non-commercial use of the name is legal.
I recently announced my plans to set up
freebugtraq,
a non-commercial competitor to bugtraq.
I was promptly threatened with a trademark lawsuit.
Where do trademark owners get the idea
that they can control non-commercial activities?
Professional antialiasing takes account of how your monitor responds to
voltage. Pixels in antialiased text are shades of gray, not just black
and white; the antialiasing software has to know how bright those shades
will be when they're displayed on your monitor.
This means that an antialiased picture that looks perfect on one CRT can
look horrible on another, with light or dark splotches at the edge
of every shape. Don't expect an antialiased picture to look good if it
hasn't been tuned for your monitor's voltage response!
(As several people have already commented, LCDs have the additional
feature of different positions for red and green and blue subpixels.
This allows even better antialiasing, as illustrated by ClearType, but
pictures that take advantage of this will look worse on a CRT.)
Voltage response can be summarized reasonably well by a single number,
called the gamma correction. Look for gamma correction options in all
your graphics software. I was much happier with xdvi on my laptop, for
example, after I put xdvi.gamma:1.8 into.Xresources.
In case anyone is wondering what antialiasing actually means, here's the
quick-and-dirty explanation. Pretend that you have a much nicer monitor,
twice the vertical resolution and twice the horizontal resolution. Then
convert each two-by-two array of pixels into a single pixel by simply
averaging the colors. For better results, change 2 to 10. (To change
antialiasing to motion blur, change resolution to refresh rate.)
I must admit to some curiosity at this point. Exactly which university grants a doctorate in computing to someone who doesn't know how the fundamental hypercube algorithms work, doesn't understand the difference between hypercubes and VLSI, and doesn't even understand the difference between a CPU and an FPU?
Let's see if I can boil the news down to something that Mr. Computing DPhil will understand.
The conventional wisdom is that each processor needs, asymptotically, a huge amount of memory for the number-field sieve. The amount of memory grows steadily with the computation time. The cost of the processor is eventually dwarfed by the cost of memory.
(These effects can already be seen to some extent in today's factorizations. You might think that the cost is about evenly balanced between processors and memory; however, the processors are constantly stalled waiting for random access to memory. A much less expensive processor could easily do the same job.)
My paper shows how to get away with, asymptotically, far less memory per processor (and, furthermore, very simple processors), while keeping the computation and communication costs under control.
These are asymptotic results: theorems saying that there is a huge cost improvement for extremely large factorizations. More work is required to figure out the exact cost of the new algorithms for, say, 1536-bit factorizations. It might turn out that the new algorithms are very helpful in practice. It might turn out that the new algorithms are useless in practice. At this point, nobody knows.
Mr. Computing DPhil's claim of an exponent of 1 (``by splitting the petabyte of memory into a million pieces and attaching a cpu to each, one speeds up the sorting by a factor of a million'') is simply wrong.
Of course, even 4/3+o(1) is overly optimistic, because tightly packed three-dimensional machines have all sorts of nasty heat-dissipation problems. My paper focuses on two-dimensional circuits, such as FPGAs. The exponent 3/2+o(1) for Schimmler's algorithm on an easy-to-build two-dimensional mesh is still better than the exponent 5/3+o(1) for merge-exchange on a hard-to-build three-dimensional hypercube.
The bottom line is that traditional computers, such as Mr. Computing DPhil's network of Pentiums, are not the right way to do extremely large factorizations. They're not even competitive. However, figuring out the ``extremely large'' cutoff requires a more detailed analysis.
Mr. Computing DPhil now claims that the factor is actually one million (and that everyone knows this already). In fact, such algorithms are not only unknown, but also guaranteed not to exist, by the Brent-Kung area-time theorem.
What's important is communication cost. You can see this even for a small number of processors. Exercise: If you have 16 processors, each with a billion numbers, how do you sort all 16 billion numbers? Each processor can easily sort its own numbers, but how do you merge the 16 lists?
There's still more overhead here than in the sieving. The analysis shows that it's worth spending more time sieving to reduce the matrix size.
Wrong. Schimmler's algorithm sorts a billion numbers on a 32768x32768 mesh in only a few hundred thousand parallel compare-exchange steps between adjacent numbers.
Of course, compare-exchange circuitry is more expensive than traditional memory, and a compare-exchange step could be slower than a Pentium-4 clock cycle. But these factors grow only logarithmically with the problem size. Meanwhile, there's a cycles/steps ratio of one million, and that grows as a power of the problem size, roughly the square root.
My paper explains how mesh circuits provide a similar speedup for the linear algebra in the number field sieve, and an even larger speedup for the sieving. (Identifying the asymptotic cost exponent is, obviously, much easier than working out the exact costs for particular input sizes; I simply don't know yet whether the new algorithm will be cost-effective for, say, 1536-bit factorizations.)
Contrary to the comments from Mr. Computing DPhil, this mesh-circuit parallelization is completely different from the standard parallelization of sieving, in which a billion independent billion-number problems are run on a bunch of separate machines. In particular:
- The switch from standard computers to mesh circuits reduces the cost exponent. The standard parallelization does not reduce the cost at all.
- The processors inside a mesh circuit are constantly talking to each other. The processors in the standard parallelization are practically silent.
- The switch from standard computers to mesh circuits drastically reduces the amount of memory per processor and the cost per processor. The standard parallelization does not.
I have no idea how Mr. Computing DPhil arrived at the idea of putting vastly more memory than necessary into a single-processor computer: for example, using a million gigabytes of memory to sort a billion numbers. He is incorrect in claiming that such silly computers are the base for comparison in my paper.The device is fairly expensive, and it does draw some power, but the costs are not a big part of my yearly budget.
The manufacturer, some company called Compaq, has dropped the standard e-paper terminology. Instead they call this device a laptop. Catchy name, I think.
I've started using this e-paper to look at product catalogs. I've heard that it's even possible to place orders through the e-paper. Can you believe it? E-paper shopping! Amazing.
Several companies are racing to develop e-paper that's just as easy to read but isn't as expensive and doesn't draw as much power. I'm sure I'll switch once the new e-paper is available. But I want to emphasize the fact that current e-paper is already quite affordable. Go to your local hardware store and buy a Compaq laptop today!
False. If you have a lot of data in your zone files, BIND 9 will take quite a while to read that data into memory. All queries for not-yet-read data will fail.
``The problem is that if you make a mistake and munge the database and then rsync or rcp that to the backup servers, you're totally hosed.''
False. With tinydns, a syntax error will not be copied to the slaves. In fact, the master will continue supplying the previous data, unlike BIND, where the master starts responding SERVFAIL to all queries in that zone.
``Without a third party patch, tinydns does not support standard SRV records.''
False. See newtypes.html.
``Without a patch from a third party, tinydns does not listen to more than one IP address. ... Like tinydns, dnscache will not bind to more than one IP address without a third party patch.''
False on both counts. Adding another cache IP address, for example, is a simple matter of running dnscache-conf. Even better, on SMP machines, traffic to the second IP address is automatically handled by a separate CPU.
``By default, tinydns does not support the use of TCP at all. This most definitely violates the spirt [sic] of the RFCs, as well as the letter.''
False. See faq/tinydns.html#tcp for an explanation of when you need to set up TCP service. The RFCs do not require TCP service. (On the other hand, if you follow my instructions to upgrade from BIND, you'll have TCP service.)
``When an IQUERY is sent to a djbdns server, it will respond with opcode set to QUERY. (it should simply copy the opcode, not make something up).''
IQUERY support has never been required. The BIND company admits that IQUERY is unused and obsolete. The handling of IQUERY has no effect on DNS interoperability.
``DNSCACHE (the caching server) does not respond to queries with the RD bit clear in the query. (Instead of simply answering from cache without recursing the dns-tree).''
This is a security feature. It has no effect on DNS interoperability.
``By default, tinydns does not hand out referrals to questions it is asked about zones it does not control. I believe that this violates the spirt of the RFCs, if not the letter.''
Anyone who understands DNS knows that all such referrals are discarded by clients, and in fact must be discarded for security reasons. Omitting these referrals simplifies server configuration. It has no effect on DNS interoperability.
``Without a patch from a third party, tinydns does not support the standard "NOTIFY" protocol of informing secondary nameservers that the zone has been updated, and that they need to check the SOA serial number and download a new copy (if they don't already have it).''
This is an entirely optional standard. Omitting it has no effect on DNS interoperability. See faq/axfrdns.html#what for further discussion.
``Because they are separate programs, you can't have both tinydns and dnscache listening to the same IP address(es) on the same server.''
The DNS and BIND book, in the security section, tells BIND users to keep servers and caches separate. My instructions for upgrading from BIND explain how you can achieve this separation with a minimum of fuss.
``There aren't even any patches that can get djbdns to implement TSIG, Dynamic DNS, or DNSSEC, nor are they ever likely to be created (my understanding is that the author is strongly opposed to them).''
Notice how Knowles focuses on buzzwords rather than features. My users already have zero-outage database updates. My users already protect their networks with general-purpose tools such as ssh and IPSEC, so they don't need ad-hoc DNS-specific tools such as TSIG; Knowles is making a fool of himself when he suggests that TSIG helps ``VPNs based on IPSec.'' As for DNSSEC, see forgery.html. ``Strongly opposed'' is a complete misrepresentation of the situation.
``However, the author of this package simply refuses to accept that his code could be anything less than 100% perfect, and while he claims to have a "bounty" that he will pay for any bug that is found, in reality he is the one that gets to define what he accepts as a "bug", and has repeatedly demonstrated a tendancy to openly refuse to accept some purported bug, but then to quietly fix the code with future releases.''
This is typical Knowles slander; notice the lack of details. As for my security guarantee, read the last sentence on the page.
``Benchmarks published by Rick Jones [hp.com] have clearly shown that BIND can scale up to at least 12,000 DNS queries per second... The best benchmarks available for tinydns indicate that it can handle at least 500 queries per second.''
This is a ridiculous comparison:
- The 12000 is a maximum: a test load pumped as high as possible until BIND choked (on a big machine, by the way).
- The 500 is not a maximum: it is a real load for a production site, far below the maximum that tinydns could handle.
Knowles then compounds his error by comparing the non-maximum performance of dnscache for outgoing queries, which of course involve a complicated resolution process, to the server performance required for incoming root queries, which are simple database lookups. Time for a quick vote: Is this more stupid, or less stupid, than Knowles's latency comparison two months ago between a BIND server on a T1 and a tinydns server on a modem? Anyway, I'm not going to bother responding to Knowles's ignorant comments about fork/exec.By the way, that production site is now part of Namezero, the largest domain-hosting company on the Internet, publishing more than two million second-level domains with tinydns.
But that's not true. What the owner obtains are exclusive rights (within one field) to use the trademark in commerce. Read the law:
- 15 U.S.C. 1114(1)(a)
applies only to actions ``in commerce.''
- 15 U.S.C. 1114(1)(b)
applies only to documents ``intended to be used in commerce.''
- 15 U.S.C. 1125(a)(1)
applies only to actions ``in commerce.''
- The new cybersquatting prohibitions,
15 U.S.C. 1125(d)(1)(a), and 15 U.S.C. 1129(1)(A),
apply only to people acting with an ``intent to profit.''
I doubt that the name OpenSSH is likely to confuse or deceive people. However, even if it is, non-commercial use of the name is legal.I recently announced my plans to set up freebugtraq, a non-commercial competitor to bugtraq. I was promptly threatened with a trademark lawsuit. Where do trademark owners get the idea that they can control non-commercial activities?
This means that an antialiased picture that looks perfect on one CRT can look horrible on another, with light or dark splotches at the edge of every shape. Don't expect an antialiased picture to look good if it hasn't been tuned for your monitor's voltage response!
(As several people have already commented, LCDs have the additional feature of different positions for red and green and blue subpixels. This allows even better antialiasing, as illustrated by ClearType, but pictures that take advantage of this will look worse on a CRT.)
Voltage response can be summarized reasonably well by a single number, called the gamma correction. Look for gamma correction options in all your graphics software. I was much happier with xdvi on my laptop, for example, after I put xdvi.gamma:1.8 into .Xresources.
In case anyone is wondering what antialiasing actually means, here's the quick-and-dirty explanation. Pretend that you have a much nicer monitor, twice the vertical resolution and twice the horizontal resolution. Then convert each two-by-two array of pixels into a single pixel by simply averaging the colors. For better results, change 2 to 10. (To change antialiasing to motion blur, change resolution to refresh rate.)