Mr+Z · Slashdot Mirror

Re:Yay more cores that I won't be using much of! on Intel's Knights Landing — 72 Cores, 3 Teraflops · 2014-01-04 18:09 · Score: 1

Did you miss the part in the article about 512-bit AVX and being able to do 32 double precision floating point operations per clock? Or the other part about running four-way SMT to hide memory system latency? Or the other, other part about 128 byte (1024-bit) L1D to CPU bandwidth?

These ain't plain ol' Atom processors.

For HPC workloads, these seem to be right up the alley of "heavy lifting."

Re:But... why? on Cairo 2D Graphics May Become Part of ISO C++ · 2014-01-04 17:45 · Score: 1

Well, you'd hardly get to VGA quality. You can't even get all the way to CGA resolution. The TMS9918A VDP supported a maximum of 16 colors at a 256 x 192 (sub-CGA) resolution. Furthermore, it had a limitation of no more than 2 colors per 8-pixel wide span (in Graphics II "bitmap" mode). You need a minimum of 24K bytes to support a 4bpp bitmap at 256 x 192, but the TMS9918A VDP could only address 16K bytes. The Graphics II mode (with its color limitations) fit into roughly 13K bytes. (6K for the pattern table, 6K for the color table and 768 bytes for the name table.)
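The arithmetic behind those numbers, as a quick sketch in C. The function names are made up for illustration; the table sizes are just the ones quoted above, and the 32 x 24 figure is simply 256 x 192 pixels divided into 8 x 8 character cells.

```c
#include <stddef.h>

/* Bytes needed for a packed bitmap of the given size and color depth. */
size_t bitmap_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    return width * height * bits_per_pixel / 8;
}

/* Graphics II footprint, using the table sizes quoted in the comment. */
size_t graphics2_bytes(void)
{
    const size_t pattern_table = 6 * 1024;  /* 1 bit per pixel */
    const size_t color_table   = 6 * 1024;  /* 2 colors per 8-pixel span */
    const size_t name_table    = 32 * 24;   /* one byte per 8x8 cell = 768 */
    return pattern_table + color_table + name_table;
}
```

bitmap_bytes(256, 192, 4) comes to 24,576 bytes (24K), which doesn't fit in 16K of VDP memory, while graphics2_bytes() comes to 13,056 bytes, which does.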

Of course, TI BASIC and TI Extended BASIC didn't offer this mode. For one thing, TI BASIC / TI XB programs got stored in graphics memory, because the console itself only had 256 bytes of CPU addressable memory, and relied over-heavily on the VDP's graphics memory for everything else, including storing BASIC programs, variables, and sprite motion information (on top of the sprite position data that the VDP accessed directly itself).

From BASIC, you had a subset of the character set that was redefinable. I forget how many characters; I think it was something like 192 in TI BASIC and 168 in TI Extended BASIC, but those numbers are from (faded) memory. So, you could (slowly) simulate bitmapped memory in TI BASIC / TI XB by drawing a small patch of screen (much less than full-screen), and redefining character tiles to simulate bitmaps. It wasn't terribly efficient, because CALL CHAR took character patterns as ASCII strings of hexadecimal digits. Useful for fixed bitmaps, but not for variable bitmaps. If you had a TI MiniMemory cartridge, though, you could use CALL PEEKV and CALL POKEV to do things more efficiently, or write a small assembly routine stored in the cartridge accessed through CALL LINK.

Not that I'd actually know how to use that old computer.... ;-)

Re:But... why? on Cairo 2D Graphics May Become Part of ISO C++ · 2014-01-04 16:33 · Score: 1

My DOS machine didn't have a VGA buffer, you insensitive clod! And on the machine I had before that, BASIC didn't even have bitmapped graphics!

Re:Vectorized factorials! on Comparing G++ and Intel Compilers and Vectorized Code · 2014-01-02 06:19 · Score: 1

I think you misunderstood here: There were two different levels of optimization that happened depending on the compiler I used. Both were much faster than a stacked if-else construction.

On the older GCCs and other tool-chains, I got code roughly equivalent to FORTRAN's computed GOTO. The switch-case became roughly goto label_table[ switchvar ], and each of the cases had a branch back to the common join point. That's not pessimizing at all. For general switch-case statements with relatively dense case values, that's pretty much what you need to do.

On newer GCCs, for this particular construct, where every case was of the form x = value, it replaced the whole switch with a lookup table on the value itself, removing nearly all the branches (except some basic range guards) and generating, roughly, x = lookup_table[ switchvar ]. That is quite a fundamental leap over the normal switch-case reduction.

So, to recap, the default behavior is to generate a lookup table of branch targets. That's what I expect as a baseline minimum for switch-case code generation, and I think we agree on that. On modern deep-pipeline processors, branches with variable targets tend to be very expensive (the BTB needs to guess correctly, and often it can't if the inputs vary wildly), but they still may be the best choice for some algorithms. The more advanced optimization that I was surprised to see GCC implement replaced the computed-GOTO construct with an actual lookup table on the value itself. This eliminates the computed branch entirely and is a huge win (order of magnitude!) on modern architectures, when it can be applied.
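To make the two shapes concrete, here's a rough sketch in C. The function names and the case values are invented for illustration; this isn't the code from the benchmark, just the same pattern: a switch where every case has the form x = value, next to the table form a compiler may reduce it to.

```c
/* A switch where every case just assigns a constant to x. A baseline
 * compiler would typically turn this into a jump table of branch targets. */
int weight_switch(unsigned switchvar)
{
    int x;
    switch (switchvar) {
    case 0:  x = 3;  break;
    case 1:  x = 14; break;
    case 2:  x = 15; break;
    case 3:  x = 92; break;
    default: x = -1; break;
    }
    return x;
}

/* The reduced form: a range guard plus a lookup table on the value itself,
 * so there is no computed branch at all. */
int weight_table(unsigned switchvar)
{
    static const int lookup_table[4] = { 3, 14, 15, 92 };
    return switchvar < 4 ? lookup_table[switchvar] : -1;
}
```

Both functions return the same result for every input. Whether a given compiler actually turns the first into the second is something to check in the generated assembly rather than assume.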

Re:First let's understand this x32 correctly. on Linux x32 ABI Not Catching Wind · 2013-12-25 03:42 · Score: 1

I think we're in violent agreement. The only reason I included Firefox in the list is that I've seen it top 4GB on my own system. Maybe it's an x86-64 related memory leak, since all the memory measurements I hear people touting for the 32-bit version are far lower. Or maybe it's just full of pointers, which wouldn't be too surprising, really. :-)

again, most people posting on the x32 subject don't even know what exactly is a mmap mapping, a shared memory segment, memory allocated from sbrk and other methods, stacks allocated dynamically, and jump to the conclusion a browser must require over 4GB of address space

Well, I only included Firefox because, on my computer right now, it has mapped over 3GB, and has 2.1GB resident. That still fits within the 3GB/1GB split of 32-bit Linux. I have seen it go as high as 6GB, but its usual steady state for me is around 4GB. I never close any tabs. You're right though that a web browser shouldn't require that much RAM under normal circumstances. FWIW, I am quite familiar with mmap, shared memory, sbrk, huge pages (including using mmap to map files on a hugetlbfs to get larger pages and improve my TLB performance), etc. I didn't include Firefox out of ignorance.

So, I reiterate: I think we're in violent agreement that x32 looks interesting and relevant, and the vast majority of applications don't need 64-bit, and many would benefit from smaller pointers.

A few applications I've written (large heuristic solvers, for example) benefit from 10GB - 15GB RAM. And from what I hear, some EDA apps they use at work could use 200+GB. And, of course, there's the ever present large databases, as you mentioned. But that's pretty specialized as compared to everything else.

Re:First let's understand this x32 correctly. on Linux x32 ABI Not Catching Wind · 2013-12-24 15:57 · Score: 1

I have a warm place in my heart for x32 for the reasons you mention: 95% - 98% of code is perfectly happy with 32-bit pointers. (I would posit 99%+ for many folks, actually. The kernel and a few big apps benefit from 64-bit, and the rest fit well in 4GB. I put "web browser" in the "big app" list, but it is only one app even if it's one of the biggest cycle-eaters.)

Now that the ABI has matured and presumably they've had some time to work the kinks out, I'd love to see them post some up-to-date benchmarks. The previous benchmarks were actually somewhat disappointing. I expected a bit more noticeable speedup.

Re:Totally missed memorable computers of the 80s on A Short History of Computers In the Movies · 2013-12-23 19:43 · Score: 2

I did see a mention of a Commodore 64 and an Amiga in there... but still, yeah, the 80s "didn't happen" for computers. It was mostly Burroughs and other big iron, a quick nod to the 80s, and suddenly it's all Vaios and Macs. WTF?

Totally missed memorable computers of the 80s on A Short History of Computers In the Movies · 2013-12-23 17:53 · Score: 2

Where were WarGames, Weird Science, TRON, Electric Dreams, etc.? Who gives a crap about a Vaio showing up in The Pink Panther 2? (Oh, and that's Steve Martin. Who's Steve Allen?)

Re:Probably optimizing for larger numbers on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 16:05 · Score: 1

Well, I just compared the vectorized implementation to the simple tail-call optimized version (ie. reduced to a simple loop) that GCC 4.4 produced.

The vectorized version was a paltry 6% faster on my box, measured across 65,536 iterations. (My machine is an AMD Phenom X4, 3.4GHz.)

So, it is faster, but not by a meaningful amount, and probably not enough faster to justify the hilarious code size increase.

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 13:40 · Score: 1

Ok, you know all this, and yet claim that "no compiler does this." That's rather a different claim than "In practice, it doesn't happen as often as one would like."

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 13:39 · Score: 1

So did you read the two paragraphs that followed? And in case you think I'm imagining these compilers, flip to PDF page 129 (page 119 in the text) and see:

High-level optimizations (HLO) exploit the properties of source code constructs, such as loops and arrays, in the applications developed in high-level programming languages, such as C++. They include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data layout optimizations, and others. The option that turns on the high-level optimizations is -O3.

The optimizations I highlighted all will affect the data access pattern, especially loop interchange, blocking and data layout optimization. Next, check the date on the document: 2003. That's a decade ago. It's reasonable to expect they've only gotten more aggressive since then.

This much shorter survey of modern compilers also goes into optimizations that might change access order. They even call this reordering out: "Since loop interchanging may change the access pattern, care must be taken to not introduce non-unit stride access."

Welcome to modern compilers.

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 12:48 · Score: 1

If you want to learn more about the state of the art, you might start here. It'll catch you up to where we were 20 years ago.

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 12:46 · Score: 1

Ugh... should have previewed. That third paragraph should read:

Consider a simple memory copy: for (i = 0; i < len; i++) dst[i] = src[i]; As long as dst and src do not overlap, you can vectorize this code. But, that implies reading N elements of src before writing N elements of dst. I don't know how you define reordered memory access, but that looks like reordered memory accesses to me.
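Here's a sketch of what that reordering looks like, written out by hand in plain C rather than with real vector instructions. The names and the width N = 4 are assumptions for illustration; a real vectorizer would pick the width from the target's vector registers.

```c
#include <stddef.h>

enum { N = 4 };  /* hypothetical vector width, in elements */

/* Source order: load src[0], store dst[0], load src[1], store dst[1], ... */
void copy_scalar(int *dst, const int *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Vector-style order: load N elements of src, then store N elements of dst. */
void copy_blocked(int *dst, const int *src, size_t len)
{
    size_t i = 0;
    for (; i + N <= len; i += N) {
        int tmp[N];
        for (size_t k = 0; k < N; k++)
            tmp[k] = src[i + k];        /* N loads first ... */
        for (size_t k = 0; k < N; k++)
            dst[i + k] = tmp[k];        /* ... then N stores */
    }
    for (; i < len; i++)                /* leftover elements, one at a time */
        dst[i] = src[i];
}
```

When dst and src don't overlap, the two give identical results. When they do overlap, they don't: starting from {1, 2, 3, 4, 5, 6} and copying 5 elements from buf to buf + 1, copy_scalar leaves {1, 1, 1, 1, 1, 1} while copy_blocked leaves {1, 1, 2, 3, 4, 4}. That difference is exactly why the compiler needs to know (or check at run time) that the buffers don't overlap before it reorders the accesses.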

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 12:44 · Score: 1

I don't know what world you live in. Nearly every modern C compiler majorly reorders memory accesses when optimization is enabled. Reordering memory accesses is part of what the new restrict keyword is about, after all. The volatile keyword is there to prevent the compiler from reordering or eliminating memory access.

Vectorized memory access nearly always implies reordered memory access.

Consider a simple memory copy: for (i = 0; i As long as dst and src do not overlap, you can vectorize this code. But, that implies reading N elements of src before writing N elements of dst. I don't know how you define reordered memory access, but that looks like reordered memory accesses to me.

Ok, I hear you shouting "That's not what I meant! I meant it still accesses array a in the same order before and after, and likewise for b!" Well, that's just a trivial example. I've worked with C compilers that interchange loop levels, unroll outer loops and jam the inner loops together, etc. Such compilers are more the rule than the exception these days.

So, I call BS on this bald, false statement of yours: "no optimizer of any C compiler changes the memory access pattern of your code." That pretty much hasn't been true since nearly the beginning. Even just having a register allocator changes the memory access pattern of your code, not to mention instruction scheduling, etc. But it's especially true these days. Google "loop interchange", "loop fusion", "loop tiling", "loop distribution", etc. You might be surprised what compilers might do to your code.

Re:Vectorized factorials! on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 12:24 · Score: 1

You mean on the recursive version? You get an order of magnitude speed up just switching to a trivial lookup table...

Re:Vectorized factorials! on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 11:55 · Score: 1

FWIW, I ran the same test on an older machine with G++ 3.4. (1.2GHz Duron, if you're curious.) That G++ doesn't optimize the switch-case into a lookup table. On that system, the switch-case code ran at 1/3rd the speed of an explicit lookup table. The switch-case turns into a jump table, so you end up with an indirect branch, move, and unconditional branch on any given path through the switch-case. The unconditional branch is easy for hardware to handle. The indirect branch, not so much, especially since I sent a pseudorandom sequence of inputs to the function.

So, this goes back to what someone said in a comment elsewhere on this article: Trust, but verify. Trust your compiler to do a good job, but verify that it actually is.

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 11:22 · Score: 1

How is Floating Point Operations Per Second not a real unit? It measures the amount of floating point computation performed in a unit of time. I can compute the theoretical peak FLOPS for a given architecture by multiplying the number of computational units by the operations each can complete per clock and by the clock rate. I can compute the achieved FLOPS for an algorithm by dividing the total work performed by the time taken. The ratio of achieved rate to theoretical peak even gives me a measure of efficiency, so I know what's performing well and what isn't.
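Those three calculations fit in a few lines of C. This is only a sketch with hypothetical function names; the one thing to be careful about is counting the peak in operations per clock across the whole chip, not just the number of units.

```c
/* Theoretical peak in FLOP/s: cores x FLOPs per core per clock x clock rate. */
double peak_flops(double cores, double flops_per_clock, double clock_hz)
{
    return cores * flops_per_clock * clock_hz;
}

/* Achieved rate in FLOP/s: total floating point work divided by elapsed time. */
double achieved_flops(double total_flop, double seconds)
{
    return total_flop / seconds;
}

/* Fraction of the theoretical peak actually reached (0.0 to 1.0). */
double efficiency(double achieved, double peak)
{
    return achieved / peak;
}
```

As a sanity check, take the Knights Landing figures quoted elsewhere in these comments (72 cores, 32 double precision operations per clock) and assume a 1.3GHz clock, which is my guess and not a published number: peak_flops(72, 32, 1.3e9) gives about 3.0e12, in line with the 3 teraflops headline.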

And, I would argue that cache is extremely important when considering vectorization, especially when considering loop nests. I might get much more impressive vectorization if I execute a loop nest in a particular order. But, if I get better cache locality by interchanging two loops, I may see much better performance in the second case. Matrix multiply is a poster child for this.
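For anyone who hasn't seen the matrix multiply case, here's a minimal sketch of the two loop orders, using flat row-major arrays. The function names are made up. Both compute the same product; the only thing that changes is the order in which memory gets touched.

```c
#include <stddef.h>

/* i-j-k order: the inner loop walks b down a column, a stride of n elements. */
void matmul_ijk(size_t n, double *c, const double *a, const double *b)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* i-k-j order (j and k interchanged): the inner loop walks both b and c
 * along a row, so every access in it is unit stride. */
void matmul_ikj(size_t n, double *c, const double *a, const double *b)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Which one wins, and by how much, depends on the matrix size and the cache hierarchy, so measure it on your own hardware before drawing conclusions.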

So if you're looking at the output of the compiler's optimizer and saying "compiler A is better than compiler B at vectorizing" looking only at the instruction sequence, and ignoring the actual memory access pattern and the effects of the cache, you might draw the wrong conclusion. The optimizer may have made other transformations to help the cache that have the effect of throttling vectorization, but result in better overall code. I recall seeing elsewhere that Intel's compiler is more aggressive about reordering loop nests to help out the cache, for example. When that happens, it might look like the Intel compiler didn't vectorize nearly as aggressively as another compiler, when that's not really what's at play.

Re:Vectorized factorials! on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 11:03 · Score: 1

I never meant to suggest that it is optimal. But it certainly is "optimized!" Vectorizing this function is simply ridiculous.

That said, I just ran a benchmark, comparing it to the more straightforward code output by G++ 4.4. The vectorized version produced by 4.8 is slightly faster, by about 12%. The recursive approach is still quite a bit slower than a lookup table or switch-case. Interestingly, the lookup table and switch-case versions got slightly slower in 4.8 compared to 4.4.

Re:Vectorized factorials! on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 10:56 · Score: 1

Interestingly, GCC 4.8 actually replaces that switch-case with a lookup table. On older GCCs and with compilers for other platforms, the switch-case is an order of magnitude slower or worse, as it actually resulted in branches. And switch-case branches are sometimes very difficult to predict, depending on the hardware branch predictor and the code around it.

It appears GCC has a switch conversion optimization (-ftree-switch-conversion) that handles a switch-case used in this way.

Re:Very different code on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 08:29 · Score: 1

I think you meant if ( ( a = b ) ), which highlights a different reason this construct is problematic: If you make that error outside the context of a control construct, you'll get a warning about a meaningless computation.

Your proposed fix isn't really a fix, though. It shuts up GCC, but it doesn't shut up RVCT, for example.

Re:Trust but verify on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 08:09 · Score: 1

I'm in the same camp.

It's also worth noting that C and C++ make it really easy to trip up the optimizer and disqualify code from certain optimizations. Little things like const, restrict, alignment and trip-count hints can go a long way, though. Reviewing the generated code can highlight places where these hints would be useful.
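As one example of such a hint, here's a sketch of the same loop with and without C99 restrict. The function names are invented. Alignment and trip-count hints are compiler-specific, so I've left them out.

```c
#include <stddef.h>

/* Without restrict, the compiler has to assume out might alias a or b,
 * which can block vectorization or force a run-time overlap check. */
void scale_add_plain(float *out, const float *a, const float *b,
                     float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * s + b[i];
}

/* With restrict, the caller promises the three arrays do not overlap,
 * so the compiler is free to reorder the loads and stores. */
void scale_add_hinted(float *restrict out, const float *restrict a,
                      const float *restrict b, float s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * s + b[i];
}
```

The two behave identically as long as the promise holds. Breaking it (passing overlapping arrays to the restrict version) is undefined behavior, so the hint is only safe where you really can guarantee it. Comparing the generated code for the two is a decent way to see whether the hint bought you anything.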

Re:Very different code on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 07:51 · Score: 1

I've used naked { } to scope things before. It's actually quite handy. In C, it serves two purposes: It makes sure that a variable's value doesn't unintentionally spill into code beyond it, and it gives you a new scope to declare temporaries without having to worry about clobbering some other value you didn't think of. (But, if you're doing things right, then that latter concern should be a lesser concern.) In C++, it also gives you a clear boundary where objects go out of scope that isn't just "by the end of the function."
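A small made-up example of the C usage: two blocks that each declare their own tmp, with neither one visible to the other or to the code that follows.

```c
/* Sum of squares plus sum of cubes of 1..n, with each partial sum
 * confined to its own naked block. */
int sum_of_squares_and_cubes(int n)
{
    int total = 0;

    {   /* tmp exists only inside this block */
        int tmp = 0;
        for (int i = 1; i <= n; i++)
            tmp += i * i;
        total += tmp;
    }   /* tmp is gone here; a stale value can't leak into later code */

    {   /* a fresh tmp, unrelated to the one above */
        int tmp = 0;
        for (int i = 1; i <= n; i++)
            tmp += i * i * i;
        total += tmp;
    }

    return total;
}
```

For n = 3 this returns 14 + 36 = 50.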

Re:Very different code on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 07:47 · Score: 2

Right, but would your server utilization go from 80% to 90% if you wrote it as two lines?

a = b;
if (a) {
    // yadda
}

Unless you have an incredibly crappy compiler, the two will generate identical code, but this second version won't give a warning.
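Here are the two spellings side by side as complete functions, in case anyone wants to compare the generated code themselves. The function names and the return values are invented for illustration.

```c
/* Assignment inside the condition. The extra parentheses are the usual
 * GCC idiom for saying the assignment is intentional. */
int assign_in_condition(int b)
{
    int a;
    if ((a = b)) {
        return a + 1;
    }
    return 0;
}

/* The same logic written as two statements. */
int assign_then_test(int b)
{
    int a;
    a = b;
    if (a) {
        return a + 1;
    }
    return 0;
}
```

Both return the same value for every input. Which compilers warn about the first form, and whether the extra parentheses silence them, varies from one toolchain to the next.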

Re:Vectorized factorials! on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 07:34 · Score: 3, Informative

Why isn't that just a lookup table? My point in mentioning factorial is that there's no point in vectorizing that thing. Even a simple loop would be small compared to the cost of a single L2 cache miss.
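For the record, the whole table is tiny. A sketch for 32-bit unsigned results, where 12! is the largest factorial that fits. The function names, and returning 0 for out-of-range input, are my own conventions for the example.

```c
#include <stdint.h>

/* Factorial by table lookup. Only 0! through 12! fit in 32 bits, so the
 * table has 13 entries. Returns 0 for inputs whose result would overflow. */
uint32_t factorial_table(unsigned n)
{
    static const uint32_t table[13] = {
        1, 1, 2, 6, 24, 120, 720, 5040, 40320,
        362880, 3628800, 39916800, 479001600
    };
    return n < 13 ? table[n] : 0;
}

/* The simple loop, for comparison. Valid for n <= 12. */
uint32_t factorial_loop(unsigned n)
{
    uint32_t result = 1;
    for (unsigned i = 2; i <= n; i++)
        result *= i;
    return result;
}
```

With at most 12 multiplies in the worst case, there just isn't enough work in the loop for vectorization to pay for itself.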

Re:News for nerds or not on Comparing G++ and Intel Compilers and Vectorized Code · 2013-12-19 04:38 · Score: 2

Also notably absent were any performance benchmarks. Two pieces of code might look very different but perform identically, while two others that look very similar could have very different performance. In any case, you should be able to work back to an achieved FLOPS number, for example, to understand quantitatively what the compiler achieved. You might have the most vectorific code in existence, but if it's a cache pig, it'll perform like a Ferrari stuck in mud.

User: Mr+Z

Comments · 3,254