kent.dickey · Slashdot Mirror

Clever trick! on Origin of Quake3's Fast InvSqrt() · 2006-12-01 10:58 · Score: 5, Informative

To summarize, the article is about a piece of code to approximate 1/sqrtf(f):

float InvSqrt (float x){
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

The trick is the "i = 0x5f3759df" line. It's certainly a magic number.

The algorithm is simple Newton-Raphson -- make a good initial guess, then iterate, making the guess better. I think Newton-Raphson on 1/sqrt picks up 5-6 bits each try in the line "x = x*(1.5f - xhalf*x*x)". That line can be repeated to get a more accurate result each time.

The problem with Newton-Raphson is making a good first guess--otherwise, you need an extra iteration or two. And that's what the magic number is doing, making a good first guess.

So let's work out what a good first guess would look like for 1/sqrt(f), to see where this code came from.

Floating-point numbers are stored with a mantissa and an exponent: f = mantissa * (2 ^ exponent). For a 32-bit float, the exponent is 8 bits wide and the mantissa is 23 bits wide.

Let's take an example: 1/sqrt(16) would have f = 1.0 * (2 ^ 4). We want the result 0.25 which is f = 1.0 * ( 2 ^ -2).

So our first guess should take our exponent, negate it, and cut it in half. (Try more examples to see that this works--it's basically the definition of 1/sqrt(f)). We'll ignore the mantissa--if we can just get within a factor of 2 of the answer in one step, we're doing pretty well.

Unfortunately, the exponent is stored in FP numbers in an offset format. In memory,

exp_field = (actual_exp + 127) << 23

The mantissa is in the low 23 bits, and the most-significant bit is the sign (which will be 0 if we're taking roots). For now, let's just assume we have our exponents as 8-bit values, to work out what we need to do with the +127 offset.
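To make that layout concrete, here is a minimal sketch that pulls the three fields out of a float. It assumes float is a 32-bit IEEE 754 value, and the names float_fields and split_float are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit IEEE 754 float. */
struct float_fields {
    uint32_t sign;      /* 1 bit, most significant */
    uint32_t exp_field; /* 8 bits, stored as actual_exp + 127 */
    uint32_t mantissa;  /* low 23 bits */
};

static struct float_fields split_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    memcpy(&bits, &f, sizeof bits); /* copy the raw bits out of the float */
    out.sign = bits >> 31;
    out.exp_field = (bits >> 23) & 0xff;
    out.mantissa = bits & 0x7fffff;
    return out;
}
```

For 16.0 the exponent field comes out as 131 (that is 4 + 127), and for 0.25 it is 125 (that is -2 + 127), with a zero mantissa field in both cases.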

We want new_actual_exp = -(actual_exp)/2. But in memory, exp = (actual_exp + 127). Or, actual_exp = exp - 127.

Substituting gives (new_exp - 127) = -(exp - 127)/2. Simplify this to: new_exp = 127 - (exp - 127)/2 => new_exp = 3*127/2 - (exp / 2).
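As a quick check of that algebra, here is a small sketch working on the 8-bit exponent field alone. The function name inv_sqrt_exp_field is hypothetical, and doing the subtraction before the divide is my choice so the half in 3*127/2 is not thrown away early:

```c
/* new_exp = 3*127/2 - exp/2, written as (3*127 - exp)/2 so the half
 * is not lost to integer division before the subtraction. */
static int inv_sqrt_exp_field(int exp_field)
{
    return (3 * 127 - exp_field) / 2;
}
```

Feeding in 131 (the exponent field of 16.0) gives 125, which is the exponent field of 0.25. For odd actual exponents the result is rounded, which is the "within a factor of 2" case.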

Now the exponent is shifted 23 places in memory, so let's write out our code (and ignore the mantissa completely for now...):

i = (((3*127)/2) << 23) - (i >> 1);

rewriting as hex:

i = 0x5f400000 - (i >> 1);

Well, first, it's arguable whether it should be 0x5f000000 or 0x5f400000 (The "4" is actually in the mantissa). I'm guessing resolving that dilemma led to the original author discovering that choosing a particular pattern of bits in the mantissa can help make the initial guess even more accurate, leading to the 0x5f3759df constant.
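The two candidate constants fall out of how the half in 3*127/2 = 190.5 is treated. A minimal sketch, with function names invented for illustration:

```c
#include <stdint.h>

/* Integer division drops the half: (3*127)/2 is 190. */
static uint32_t magic_truncated(void)
{
    return (uint32_t)((3 * 127) / 2) << 23;
}

/* Keeping the half: 190.5 * 2^23 is the same as 381 * 2^22.
 * The extra bit lands in the top bit of the mantissa field. */
static uint32_t magic_with_half(void)
{
    return (uint32_t)(3 * 127) << 22;
}
```

The first returns 0x5f000000 and the second returns 0x5f400000.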

I haven't worked it out, but Chris Lomont http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf shows this first guess is accurate to about 4-5 bits of significance for all floating point values. That's a good result, considering that mucking with the exponents was just hoping to get us within 1-2 bits of significance.
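For anyone who wants to try different constants, here is a hedged sketch of the same idea with the magic number and the iteration count as parameters. It is not the Quake code: it uses memcpy instead of pointer casts to move the bits, assumes a 32-bit IEEE 754 float, and the names inv_sqrt_guess and inv_sqrt_approx are mine:

```c
#include <stdint.h>
#include <string.h>

/* First guess only: subtract half the bit pattern from a magic constant. */
static float inv_sqrt_guess(float x, uint32_t magic)
{
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);   /* float bits -> integer */
    i = magic - (i >> 1);
    memcpy(&y, &i, sizeof y);   /* integer -> float bits */
    return y;
}

/* First guess followed by `steps` Newton-Raphson refinements. */
static float inv_sqrt_approx(float x, uint32_t magic, int steps)
{
    float xhalf = 0.5f * x;
    float y = inv_sqrt_guess(x, magic);

    while (steps-- > 0) {
        y = y * (1.5f - xhalf * y * y);
    }
    return y;
}
```

With x = 16.0 the exponent-only constant 0x5f400000 happens to give exactly 0.25 as the first guess, since 16 is a power of 4 with a zero mantissa. With 0x5f3759df the first guess is a little low, and one Newton-Raphson step brings it to within about 0.2% of 0.25.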

Re:Well, that's a broken idea on Elastic Tabstops — An End to Tabs vs. Spaces? · 2006-07-03 08:37 · Score: 1

I played with his editor for a little while, and found the same kinds of problems the parent author found. It's especially bad if you try to cut&paste code from another area in place--the formatting goes crazy. I thought it was neat how the comments at the end of the lines would move around automatically, but it's easy to get it to do odd things and to not format code well at all. The whole thing feels like the computer is about to do something unexpected all the time.

I think the author is solving the problem the wrong way. His solution has the undesirable property that a relatively small change can change the formatting for a large block--and then the user must go fix things.

I know it's a hotly-debated topic about formatting code, but the #1 rule for formatting code is: You must have a rule for formatting your code. If you don't then your code is a lot less readable. Pick a style and stick to it.

My style: I find the constraints of using tabs to indent to 8-character tabstops and having a hard 80-column line limit help enforce good readability, for C/C++ and Verilog code. If your code starts crowding the right-hand margin, you have to move code into a function which you then call, so you can start over against the left margin in that function. These simple rules eliminate a lot of difficult-to-read code by simply not allowing you to get 6 levels deep with logic which should be refactored for readability. Sometimes working with and embracing some limits forces the coder to have better discipline. I don't like 2-space indents since it's just not enough to make it easy to scan through code, and it encourages a complex coding style (you only need 2-space indents if you're nesting many levels deep). And I think all code should have a line limit, and 80 is a natural choice. Sure, you could pick 145 or whatever, but I've found code using much longer lines to almost always be harder to read. And 80 columns prints 2-up at a still readable size on a printer, so I can view more lines at a time when scanning through printouts.

Power capacity of server rooms on Chipmakers Admit Your Power May Vary · 2006-06-10 17:49 · Score: 1

The article glosses over the real problems.

The first real problem is that blade servers are so small now, but require so much power, that companies can easily fit way more compute power in a server room than can be reasonably cooled. So they need more power-efficient servers to use their server space effectively.

And the problem isn't that power can't be measured--it can be measured just as easily as performance. The problem hinted at in the article is that firms focus on the positive results they have and push that way of measuring power, which is what they've been doing with performance for decades. Everyone can measure power, but what "benchmark" should we use?

And to correct other comments, chip dynamic power is proportional to fCV^2, where f=frequency, C=capacitance, V=voltage. Reducing a chip's frequency from 2GHz to 1.5GHz will at best save you only 25% of the power. But circuit speed is also roughly proportional to voltage, so if a chip at 1.2V can operate at 2GHz, there's a good chance it might operate at 1.5GHz at 1.0V (or maybe 1.1V). So the real power savings is in the voltage reduction: 1.2V at 2GHz uses almost twice the power of 1.0V at 1.5GHz. But chips waste power even at 0Hz, especially at 90nm and below, so it's not quite that good. I believe AMD and Intel both use voltage reduction to save power in their reduced power modes.
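The arithmetic behind those two figures can be sketched like this. It covers dynamic power only, assumes the capacitance is the same in both operating points (so it cancels out), and the function name is made up:

```c
/* Dynamic power relative to a baseline, using P ~ f * C * V^2.
 * Capacitance is the same chip in both cases, so it cancels. */
static double relative_dynamic_power(double f_ghz, double volts,
                                     double base_f_ghz, double base_volts)
{
    return (f_ghz * volts * volts) / (base_f_ghz * base_volts * base_volts);
}
```

Dropping only the frequency, relative_dynamic_power(1.5, 1.2, 2.0, 1.2) is 0.75, the 25% savings. Comparing 2GHz at 1.2V against 1.5GHz at 1.0V, relative_dynamic_power(2.0, 1.2, 1.5, 1.0) is about 1.92, the "almost twice" figure. Leakage is not modeled here.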

O-Zone (Numa Numa) on Viral Music Videos A Problem For RIAA · 2006-06-03 17:12 · Score: 5, Interesting

The music industry doesn't seem to know how to make money anymore.

Just take the Numa Numa video on the internet from a year ago. This is a potential hit song made popular in the US from the "Numa Numa" video at http://www.newgrounds.com/portal/view/206373 that went nowhere on the buying charts due to pure stupidity of the recording industry. If you liked this song, you couldn't buy it.

iTunes only added it to their collection well after the interest in it subsided (and I bought it then). Sure it was in Romanian, but that really wasn't a big deal--just look at the success of 99 Luftballons from 20 years ago.

The record industry is over-focused on piracy from folks who would never buy their music anyway. The positive word-of-mouth of a good song more than outweighs any piracy of a good song. And the greedy executives don't realize they'll make more money when teenagers grow up and *buy* music from nostalgia than they'll ever get from the same people when they are teenagers. But if the greedy recording companies force teenagers to get their music through piracy because they have no alternative, then those customers may be gone for good.

I'm old enough to know what I want in music, and as best as I can tell, the recording industry doesn't want to sell it to me at any price. They want to sell me their crap instead.

Re:Bugs by Design on Why Buggy Software Gets Shipped · 2006-05-25 16:09 · Score: 1

I agree with the parent article about design being very important.

But the article irked me in a number of ways. The author strikes me as being on shaky ground defending buggy software, especially since his bug examples mostly showed weak design and seemed to confuse bugs with new feature requests.

The best way to keep bugs out is to have pride in your work. Not an obsessive slavery to minutiae, but wanting to deliver a quality result.

Most software is written on a rushed schedule to meet some limited goals. Quality is never a high priority, since high quality does not deliver instant monetary results. If you don't plan on a future for your software, it will not have one.

The software industry, in my opinion, doesn't need any more excuses for bugs--we're pros at that already it seems. What we need to do is plan ahead a little more, and know that rushing out version 1.0 with bugs is going to cost us long-term. It's the rare market where a rush to market with junk makes or breaks a company. And if you're competing in a market like that, someone who's more ruthless than you will beat you up and take your lunch money anyway.

The article's author provides another example of a common software-industry effect: If you build for one platform, with one set of tools, with a narrow market, you tend to produce a product inferior to one which is multi-platform using multiple tools. And you could get stuck with these tools, and find it very hard to get away from them.

My experience is there are lots of easy bugs and design decisions where planning on a Mac/Linux/Windows version right from the start (even if you don't really port it--just plan to and prototype it a little bit at least) produces a much higher quality product. For example, if you're only compiling your product with one compiler, then you'll become stuck with that compiler even if you don't want to be. You can do all your production builds with one compiler, but at least make sure your product passes g++ -Wall.

Re:All true. on Sun to Release Java Source Code · 2006-05-16 13:43 · Score: 1

I can't keep track of the dates of each benchmark being run at the kano.net site, but you should visit: http://www.freewebs.com/godaves/javabench_revisited/. They say some of the Java benchmarks are flawed, and once they are made more fair to C++ (i.e., not being unfair in major ways), then C++ does much better. As in, C++ almost always beats Java.

Benchmarking is easy to do wrong, and if you're getting unusual results, you may want to look into it closely.

Re:Technical paper? on Chip Power Breakthrough Reported by Startup · 2006-05-08 17:26 · Score: 4, Informative

The press has a knack for distorting stories and making it very hard to figure out real technical details.

http://multigig.com/pub.html has some whitepapers. I read the ISSCC 2006 slide set, which let me know the general technique.

Basically, they produce a clock ring to produce a "differential" clock pair that after one lap swaps neg and pos, and so its frequency is tuned by its own capacitance and inductance. They call it a "moebius" loop since it's not really a differential pair, but the clock wave makes two round trips before getting back to the start. Neighboring loops can be tuned together (although if that's by just routing the wave throughout the chip I'm not sure). They didn't seem to mention synchronizing the period to outside sources, and I'm not sure how they'll be able to do that.

The clocking is not the interesting part to me, but rather their logic strategy. The trick is that logic itself has no connection to power or ground. The clock nets provide the "power and ground" and all logic must be done as differential (a and abar as inputs, q and qbar as outputs). This is where they get the power savings from--the swings are reduced and there's no path to power or ground to drain away charge. Without really discussing it, charge seems to just shift around on internal nodes between the differential logic states. They then use pure NMOS fets for logic, which removes all PMOS. The logic will never reach the power rail, though--it will always be a Vt drop. I just looked this over quickly, but it seems the full-swing clocks and lack of PMOS make this work out fine.

For quick adoption, they'll need to work out clever techniques to connect this logic to standard clocked logic. Otherwise, it looks only a little bit easier to use than asynchronous logic. The issues they face seem very similar to asynchronous logic issues--tool support, interface to standard clocked logic, debug, test, etc.

It's not vapor.

Re:Man, I still don't 'get' functional programming on Developing Applications With Objective Caml · 2004-12-01 15:01 · Score: 1

Thanks for the help. I was just seeing if the language was ready for general use. 50000 is a very low limit on recursion, and it also fails when using ocamlc. And the error messages are not very helpful. It's clearly not a jump-right-in language, which is too bad.

Re:Man, I still don't 'get' functional programming on Developing Applications With Objective Caml · 2004-12-01 03:35 · Score: 1

I tried that tutorial website and was reading about ocaml. Writing tutorials is very hard, but when he gave an example for a "range a b" function to give a list with integers from a to b, I just had to try to fix it so that it used tail-recursion in a nice way.

And when I did, I ran right into a problem: "Stack overflow during evaluation (looping recursion?)."

Here is my first ocaml code:

let rec range2 a b accum =
  if a > b then accum
  else range2 a (b-1) (b :: accum) ;;
let longlist = (range2 1 50000 []) ;;
print_endline (string_of_int (List.length longlist)) ;;
let list_str = List.map string_of_int longlist ;; (* This fails *)
print_endline (string_of_int (List.length list_str)) ;;

So it can't call a basic function map on a list of 50000 items? Can anyone point out what I'm doing wrong? The code works for lower values like 20000.

Other Variables on Berkeley Researchers Analyze Florida Voting Patterns · 2004-11-19 10:02 · Score: 1

Given the set of variables assumed, the data these researchers collected show that e-voting may have skewed the results for Bush.

All analyses like this depend on listing all the likely variables affecting the results. These researchers have a fairly short list, and I think they've missed a big one: they should have made the previous voting method in each county a variable as well. So they needed to include which counties previously had punch-cards, optical scan, etc.

For example, what if punch-cards were unfair to Republicans in 2000 and 1996, but e-voting made it "fair"--this hypothesis could explain the data as well. Their analysis does not take this effect into account.

There are other variables that maybe should be taken into account as well, such as population turnover, church attendance rates, unemployment rate, federal aid received this year, etc.

The reason they probably didn't include more variables is that it makes any sort of trend almost impossible to detect (and certainly not as bold as their analysis makes it) since the effect they are claiming is relatively small in the overall vote totals.

I dislike e-voting without a paper trail, but this fairly simple statistical analysis doesn't seem like very strong proof of a problem to me. I don't want too much crying-wolf talk to make normal people immune to the real risks of e-voting.

Re:Heat? on Affordable Modern Graphics Cards · 2004-09-24 15:46 · Score: 1

I'll try to explain chip technology a little more accurately, and answer the heat question.

CMOS, basically used for all CPUs and GPUs, beat out all other chip technologies about 15 years ago since it has the property that if the chip isn't changing state (and doing "work") then very little power is consumed (you can almost assume 0). This is very important for large caches on chips since otherwise they'd burn a ton of power just by existing. It also means that running at a slower clock speed will use less power. The formula for transistor power is Power = Frequency*Capacitance*(Voltage^2). Cutting the frequency by 2 reduces the power by 2. Capacitance is proportional to the size of each transistor (smaller is better), the number (more transistors raise the capacitance on the chip) and also the physical size of the chip.

So there is your answer: reduce frequency to save power; use smaller transistors to save power; reduce the chip's operating voltage to save a lot of power.

To make chips go fast, there is often a tradeoff designers can make between speed and power. There are lots of techniques, but a big one in a CPU is to have a circuit which is always drawing power, but able to react to a small change very fast. Normal circuits that do not waste power when they are not active take more time to change state and so are slower. CPU designers end up using a lot of these power-hungry circuits at very high speeds, but GPUs probably don't need to use them at all.

Also, cache transistors are counted in the transistor budget, but they don't need to use much power. So 100 million transistors which are mostly cache use a lot less power than 100 million transistors in a complex speedy CPU. To compare transistor counts meaningfully, you should always compare cache-to-cache transistors, and then the rest, but this is usually hard to do since vendors usually don't make the values obvious. In short, you can almost ignore cache transistors for power reasons.

But here's the bad news for CMOS power in the future: at 130nm and 90nm especially, CMOS now "leaks" a lot of power even when it isn't doing work. So most processes include slower transistors that are more like older processes that don't leak power, and power-conscious designs use more of these slower transistors. This raises chip costs since having all these transistor choices means making the chips is more expensive. Future process generations are now working harder on controlling power than on speed because of this new major effect.

Re:Google GLAT ( Google Labs Aptitude Test ) on Google's Math Puzzle · 2004-09-16 07:49 · Score: 2, Interesting

If you go a little higher, you get a fun number:

1111111110 == 1111111110

which is floor((10^10/9) - 1).

Note that since 500,000,001 worked, and 1 works and 500,000,000 has no 1's in it, then any working values from 0-499,999,999 will repeat with 500,000,000 added. After that, the one above is the only one through 2 billion (where I stopped looking).
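For anyone who wants to check these without a brute-force search, here is a minimal sketch. It assumes the puzzle is the usual GLAT one, where f(n) counts how many 1 digits are written in all the numbers from 0 to n and the goal is f(n) == n. The function name is made up, and it counts digit position by digit position rather than looping over every number:

```c
#include <stdint.h>

/* Count how many times the digit 1 is written when listing 1..n.
 * Works one decimal position at a time (p = 1, 10, 100, ...). */
static uint64_t count_ones_upto(uint64_t n)
{
    uint64_t total = 0;

    for (uint64_t p = 1; p <= n; p *= 10) {
        uint64_t high = n / (p * 10); /* digits above this position */
        uint64_t cur = (n / p) % 10;  /* digit at this position */
        uint64_t low = n % p;         /* digits below this position */

        total += high * p;            /* complete cycles of 0-9 here */
        if (cur > 1)
            total += p;               /* the whole run of 1s is included */
        else if (cur == 1)
            total += low + 1;         /* only part of the run so far */
    }
    return total;
}
```

With this, count_ones_upto(1111111110) comes out to 1111111110, and 1, 500000000 and 500000001 also map to themselves. The 64-bit type matters: p * 10 reaches 10^10 for these inputs, which does not fit in 32 bits.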

Re:Don't jump up and down yet... on Grokster Wins Big in Ninth Circuit · 2004-08-19 09:34 · Score: 1

More importantly, the decision said that this is only a partial decision--other matters were not resolved. The key paragraph from page 25 of the decision:

Resolution of these issues does not end the case. As the district court clearly stated, its decision was limited to the specific software in use at the time of the district court decision. The Copyright Owners have also sought relief based on previous versions of the software, which contain significant--and perhaps crucial--differences from the software at issue. We express no opinion as to those issues.

So grokster and others may still be found liable for older software versions.

Re:FPU intensive? on EM64T Xeon vs. Athlon 64 under Linux (AMD64) · 2004-08-09 05:28 · Score: 5, Informative

The "primegen" program listed where the Xeon beats the Athlon slightly does not do any floating point.

I looked at the code and played with it a little (I got it from http://cr.yp.to/primegen.html), and it seems the benchmark is mostly limited by the implementation of putchar().

My system was a dual AMD Opteron 1.8GHz running Win XP Pro with Cygwin. I modified the benchmark to not use putchar() but instead just write the characters to a 1MB buffer, and it got 16 times faster! To be specific, "primes 1 100000000 > file" went from 24.2 seconds to 1.497. Note that it's generating 51MB of output for primes under 100 million. I didn't bother running it for the 100 billion max, but would expect it to be around 50GB.

This is a very poor benchmark since it's just measuring your C library's implementation of putchar() and your system's ability to sink data to /dev/null, not anything useful.
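The change I made was along these lines. This is a sketch from memory of the idea, not the actual patch to primegen, and the names out_buf, out_char and out_flush are invented here:

```c
#include <stdio.h>
#include <stddef.h>

#define OUT_BUF_SIZE (1 << 20) /* 1MB buffer */

static char out_buf[OUT_BUF_SIZE];
static size_t out_len = 0;

/* Write out whatever has accumulated so far. */
static void out_flush(FILE *fp)
{
    if (out_len > 0) {
        fwrite(out_buf, 1, out_len, fp);
        out_len = 0;
    }
}

/* Stand-in for putchar(): store the byte, flush only when the buffer is full. */
static void out_char(FILE *fp, char c)
{
    if (out_len == OUT_BUF_SIZE) {
        out_flush(fp);
    }
    out_buf[out_len++] = c;
}
```

The caller has to remember one final out_flush() before exiting, or the tail of the output is lost. How much this helps will depend on how a given C library implements putchar().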

Re:1,000,000 is nothing... on How Would You Handle a $1,000,000 Coding Error? · 2004-07-19 17:09 · Score: 2, Interesting

I agree, $1 million is really not a big deal.

The problem is that customers of software do not really understand how they need to treat software upgrades.

Here's a useful analogy: A customer getting a software upgrade should treat it the same way they would treat being moved to a brand-new building. Sure, the building contractor might say the new building is exactly like the old one except for a minor change, and that they have installed exact copies of all the equipment from the first building.

Software upgrades are like this except the new building has no warranty, and to save money, the customer burns down the old building before even inspecting the new one.

So whose fault is this?

Macworld.co.uk says this is mostly urban legend on iPod: This Season's Must-Have for Muggers · 2004-03-30 11:21 · Score: 1

Macworld UK had a story about this earlier today:

'iPod mugging' latest media frenzy

Basically, this Roland guy is the one person any serious journalist has found to have been mugged for his iPod. That being said, carrying around a $400 device probably does raise your chances of being mugged. But there's no evidence iPods are being targeted. You're probably just as likely to be mugged for your battered Walkman.

Re:How smart u are.. on Recovering Secret HD Space · 2004-03-10 02:59 · Score: 5, Informative

The parent post is incorrect in regards to chip testing.

Manufacturers test every single chip pretty much identically. Different companies differ in how they determine speed of parts (run some patterns at full speed, measure the delay of some known circuits, etc.) but each part is tested. There is too much variation across the wafer to do much else.

It's always possible to run a chip faster than a manufacturer's testing especially if it is kept cooler than the max spec, voltage is within tighter tolerance than spec, or if the user doesn't care about correct answers. I find the last point is what usually allows the greatest overclocking.

Also, some large manufacturers (Intel, AMD) have marketing needs to sell certain speed grades. So if all parts run at 3.0GHz, but users are demanding the cheaper 2.8GHz parts, then they'll sell some faster parts marked at 2.8GHz. In general, this is a temporary situation since re-pricing to reflect the increased yield will probably move the 3.0GHz price down shortly to increase pressure on the competition.

Re:The Finder on Learning Unix for Mac OS X Panther · 2004-02-24 09:41 · Score: 1

I just use Terminal.app and xterm full time to do my work, but I thought I'd see if iterm or glterm were better.

First, speed. I started with a hex-dump (using my own program) of a 5MB file, but iterm and xterm were so slow that I used a 500K file instead.

Time to display 35000 lines of 75 character hex text:
Terminal.app: 5.5sec
xterm: 9.7sec
iterm: 9.1sec
glterm: 1.8sec

So glterm is fast. Only xterm seemed to have a high refresh rate, with lots of text blurring--the rest seemed to update a lot less often, which would make them faster. iTerm was uneven in its jumpiness, which is quite annoying.

But the killer for me is glterm and iterm do not support page-up/page-down to scroll (you have to hit shift) which is just a non-starter for me. Since iterm has source, it could be fixed, but I don't see an advantage in using iterm for myself.

Note that xterm does not directly support page-up/page-down either, but it will if you have the following resources set (all one line with no linefeeds) in ~/.Xresources:

XTerm*VT100.translations: #override <Key>Prior: scroll-back(1,page) \n <Key>Next: scroll-forw(1,page)

I find Terminal.app completely adequate and fine for my work. It's way better than cmd.exe!

Re:Intel wouldn't ditch Itanium... on Intel 64-bit Announcements at IDF · 2004-02-17 13:45 · Score: 2, Interesting

The "troll" comment is incorrect.

100,000 Itaniums is comparable to other server chips, considering that probably most of them were in fairly large systems, not cheap workstations. I agree Sun probably sells more, but that wasn't the point. Apple Computer shows that one can be successful without being the biggest player.

That being said, Itanium clearly is not where Intel hoped it would be. I doubt Itanium will ever recoup its investment, which was huge, unless something drastically changes. I worked on Itanium (when it was called something else) starting back at least in 1996 when I was at HP, so that's a lot of sunk cost to recover.

Re:DUPE. on USPTO Grants CA Lawyer Domain-Naming Patent · 2004-01-28 03:44 · Score: 1

Someone should file a patent on posting the same story on Slashdot over and over. He could get a fortune in royalties. Or, it might provide financial incentive to stop dupes.

Hard to say what's new here on Sun Unveils Direct chip-to-chip Interconnect · 2003-09-22 02:13 · Score: 2, Informative

The article is a bit vague as to what the innovation really is.

The article immediately made me think of multi-chip modules. Multi-chip modules are an idea which never really caught on in the industry (except for IBM), and I'm not sure how Sun's innovation isn't just a take-off on that idea. Multi-chip modules have failed due to costs, since much has to go right to get a multi-chip module that works.

Any practical chip-to-chip connectivity scheme had better have a good rework scheme. If it doesn't, it's just boutique technology that will not affect the industry overall.

Having worked on chips with multi-gigabit pins, I can say a huge problem is resynchronizing the signals. Creating a receiver to align one pin's data with 15 neighbors at 3GHz takes a whole lot more logic space on the die than a small driver (or receiver). The auxiliary logic basically makes shrinking the final driver FET almost meaningless.

Modern chip design is a constant trade-off between features and cost. And what's cheap is what everyone has been doing for years (or is an evolution of that).

Re:Access to fast machines required? on ICFP 2003 Programming Contest Results · 2003-09-14 12:00 · Score: 2, Interesting

I entered using a fairly average machine (Apple powerbook G4). It looks like I came in 30th out of about 90 entries (I'm "apple2gs"). I'm disappointed that I had to find out results through slashdot.

My strategy was to try to use "waypoints" to help guide an optimizing algorithm, but I gave up and just made a manual car simulator (meaning, you manually enter the commands, and my program just shows where the car is and if it's hit a wall yet). With more time, I could easily improve most tracks by at least 5% by just racing them through again. This would only move my rank by a few places, though.

So machine speed was not that big a factor in my case. Others made simple driving simulators as well, although I don't know how well they did overall.
