marcansoft · Slashdot Mirror

Re:Audio/Videophiles Beware on THX Caught With Pants Down Over Lexicon Blu-ray Player · 2010-01-16 00:31 · Score: 1

I don't know if audiophile humans can even detect a single bit error every ten seconds.

Any human can detect that. With 16-bit or 24-bit words, there's a fair chance that that bit will hit one of the most significant bits of the audio word. This will cause an audible click that anyone can hear (especially with calm music). Actually, I'd say audiophiles may be less likely to detect it, since they tend to live in their own little lalaland where immeasurable BS sound qualities are more important than blatantly obvious clicking.

On the other hand, one single bit error every 10 seconds for audio cabling (relatively low speed) is ridiculously bad.
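To put rough numbers on both points, here is a minimal C sketch. It assumes 16-bit signed PCM samples and, for the error rate, a CD-style stream of 44100 Hz x 16 bits x 2 channels; the function names are made up for illustration.

```c
#include <stdint.h>

/* Flip one bit (0 = LSB, 15 = MSB) of a 16-bit PCM sample, the way a
 * single transmission error would. */
int16_t flip_bit(int16_t sample, unsigned bit)
{
    return (int16_t)((uint16_t)sample ^ (uint16_t)(1u << bit));
}

/* Size of the resulting jump in LSB steps (full scale is 65536 steps). */
int32_t flip_error(int16_t sample, unsigned bit)
{
    int32_t d = (int32_t)flip_bit(sample, bit) - (int32_t)sample;
    return d < 0 ? -d : d;
}

/* Bit error rate for a given number of errors over a given time. */
double bit_error_rate(double errors, double seconds, double bits_per_second)
{
    return errors / (seconds * bits_per_second);
}
```

A flip in bit 15 of a silent sample is a jump of 32768 steps, half of full scale, which is the click. One error every 10 seconds on the assumed 1,411,200 bit/s stream works out to a bit error rate of roughly 7e-8.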

Re:Audio/Videophiles Beware on THX Caught With Pants Down Over Lexicon Blu-ray Player · 2010-01-16 00:22 · Score: 3, Informative

The difference in wire length for I2S is either very audible or not audible at all. It does affect the DAC, and matching clock and data lengths is important, but it's a data corruption issue: if the lengths differ enough that the signal is out of spec with regard to the setup and hold times of the DAC, you get glitchy audio. This isn't an "analog" difference.

Clock jitter may be audible, and mismatched clock skew between outputs can be too, but skewed clock and data to a single DAC will not cause any audible changes until you exceed the specifications and then all hell breaks loose.
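The "fine until you exceed the spec" behaviour can be sketched as a pass/fail check. This is a deliberately simplified, hypothetical timing model (data launched on one clock edge and sampled half a bit-clock period later); the function name, parameters and numbers are illustrative and not taken from any particular DAC datasheet.

```c
#include <stdbool.h>

/* skew_ns is the extra delay on the data wire relative to the clock wire
 * (negative if the clock wire is the slower one). Assumed model: the DAC
 * samples half a period after the data is launched. */
bool i2s_timing_ok(double period_ns, double skew_ns,
                   double setup_ns, double hold_ns)
{
    /* Time the data is stable before the sampling edge, minus the need. */
    double setup_margin = period_ns / 2.0 - skew_ns - setup_ns;
    /* Time the data stays stable after the sampling edge, minus the need. */
    double hold_margin = period_ns / 2.0 + skew_ns - hold_ns;
    return setup_margin >= 0.0 && hold_margin >= 0.0;
}
```

Under this model the result is binary: any skew inside the margins gives bit-identical data, and any skew outside them gives corrupted samples.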

Re:Premature optimization is evil... and stupid on Cliff Click's Crash Course In Modern Hardware · 2010-01-15 02:54 · Score: 1

Nope. On an ARM9xxEJ-S (that's based on the ARMv5TEJ architecture), MUL takes 2 cycles plus an extra penalty cycle if the next instruction depends on the result. Data ops (which all have a free shift) take one cycle. This means you can use two ADDs (or a MOV and an ADD) to multiply by a constant with two bits set, plus an optional LSB set for free, and it will only take two cycles. That is as fast as a MUL, or faster if you have a dependency on the result in the next instruction.

Cortex-A8 (that's ARMv7, used e.g. in the iPhone 3GS) does away with the dependency penalty as far as I can tell, but MUL still takes two cycles as opposed to one for ADD/MOV (with free shift).
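The two-instruction trick from the first paragraph, written out in C as a rough sketch. The function names are invented for illustration, and the ARM instructions in the comments show the intended mapping, not guaranteed compiler output.

```c
#include <stdint.h>

/* x * ((1 << a) + (1 << b)): one MOV with shift plus one ADD with shift. */
uint32_t mul_two_bits(uint32_t x, unsigned a, unsigned b)
{
    uint32_t r = x << a;        /* MOV r, x, LSL #a    */
    return r + (x << b);        /* ADD r, r, x, LSL #b */
}

/* x * (1 + (1 << a) + (1 << b)): two ADDs, the "+1" (LSB) comes for free. */
uint32_t mul_two_bits_plus_one(uint32_t x, unsigned a, unsigned b)
{
    uint32_t r = x + (x << a);  /* ADD r, x, x, LSL #a */
    return r + (x << b);        /* ADD r, r, x, LSL #b */
}
```

For example, with a = 3 and b = 1 the first function multiplies by 10 and the second by 11.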

Re:Theory bites back on Airport Access IDs Hacked In Germany · 2010-01-15 02:12 · Score: 2, Interesting

There is no cipher. There is no security. These guys gave a talk on LEGIC Prime at the congress. The digest version is that LEGIC Prime is 100% obscurity and 0% security: LEGIC cards are wireless read/write memories with a tiny LFSR scrambler thrown on top to obfuscate things a bit. There are no keys. All the access controls are implemented in the reader/writer software. These cards are not only trivial to emulate, they're also trivial to modify.
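To illustrate why an LFSR scrambler with no key is obfuscation and not encryption, here is a generic sketch in C. It is NOT the LEGIC Prime generator: the 16-bit register, the taps and the function names are all assumptions picked for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of a generic 16-bit Fibonacci LFSR (taps 16, 14, 13, 11).
 * Illustrative only; not the generator used by any real card. */
static uint16_t lfsr_step(uint16_t s)
{
    uint16_t bit = (uint16_t)(((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u);
    return (uint16_t)((s >> 1) | (bit << 15));
}

/* XOR the buffer with the LFSR output stream, one bit per step. */
void lfsr_scramble(uint8_t *buf, size_t len, uint16_t seed)
{
    uint16_t s = seed;
    for (size_t i = 0; i < len; i++) {
        uint8_t ks = 0;
        for (int b = 0; b < 8; b++) {
            ks = (uint8_t)((ks << 1) | (s & 1u));
            s = lfsr_step(s);
        }
        buf[i] ^= ks;
    }
}
```

Running the same function twice with the same starting state gives back the original data, so anyone who knows the generator can descramble and rescramble at will.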

Re:Premature optimization is evil... and stupid on Cliff Click's Crash Course In Modern Hardware · 2010-01-14 13:19 · Score: 1

I was talking of multiplying by a power of two constant, of course. You're quite correct in saying that shift+add combinations may or may not be faster than multiplying by more complex constants, depending on the particular implementation. Usually, two shifts and one add is a fairly safe bet for simpler CPUs, but it can actually slow things down on modern superscalar CPUs where it creates undesirable dependencies in the pipeline.

Re:It's just outdated knowledge on Cliff Click's Crash Course In Modern Hardware · 2010-01-14 13:09 · Score: 1

Ok well that's changed now. Our more complex modern CPUs have special logic for clears, and doing a move to the register with 0 is faster. So it was a time limited trick, useful back when he started doing it, but no longer something worth trying.

I'm definitely no expert on x86, but my impression was that precisely because of this trick that everyone does, modern CPUs still do xor reg,reg at least as fast as moving 0. Because they want existing code to run as fast as possible, and in x86 compatibility-is-king land, that means optimizing for the common-if-weird cases, not the sane cases.

Re:Premature optimization is evil... and stupid on Cliff Click's Crash Course In Modern Hardware · 2010-01-14 13:00 · Score: 4, Informative

Which CPUs are those?

Those with a barrel shifter.

The fastest way to multiply today on AMD/Intel is to use the multiply instructions.

Then someone needs to beat the GCC developers with a cluestick.
$ cat test.c
int main(int argc, char **argv) {
    return 4*(unsigned int)argc;
}
$ gcc -march=core2 test.c -o test
$ objdump -d test
...
00000000004004ec <main>:
  4004ec:  55             push   %rbp
  4004ed:  48 89 e5       mov    %rsp,%rbp
  4004f0:  89 7d fc       mov    %edi,-0x4(%rbp)
  4004f3:  48 89 75 f0    mov    %rsi,-0x10(%rbp)
  4004f7:  8b 45 fc       mov    -0x4(%rbp),%eax
  4004fa:  c1 e0 02       shl    $0x2,%eax
  4004fd:  c9             leaveq
  4004fe:  c3             retq
  4004ff:  90             nop

yeah... it seems like only assembly language programs know this.

I program in assembly language, but not for x86. I usually program in ARM, which always has a barrel shifter. I guarantee shifts are faster than multiplies there.

Re:Not A Nerd? on Google Switching To EXT4 Filesystem · 2010-01-14 12:53 · Score: 1

Modern NAND cannot be written one byte at a time. You can only write full pages (that's the Flash term for a block, usually 2K or so). For MLC NAND Flash (most common these days, as it has higher density) you can only do this write once between erases, so you're stuck writing 2K at a time. For SLC NAND, you can write each page multiple times (usually 4 or so) between erases, though of course you can only flip bits from 1 to 0, not vice versa.
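The program/erase asymmetry can be modelled in a few lines of C. This is a toy model, assuming a 2048-byte page and ignoring spare areas, ECC and the fact that real NAND erases a whole block of pages at once; the names are made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAND_PAGE_SIZE 2048  /* assumed, matching "2K or so" */

/* Erase: every bit goes back to 1. (A single page is used here to keep
 * the sketch short; real devices erase in multi-page blocks.) */
void nand_erase(uint8_t page[NAND_PAGE_SIZE])
{
    memset(page, 0xFF, NAND_PAGE_SIZE);
}

/* Program: bits can only be cleared, so the result is old AND new. */
void nand_program(uint8_t page[NAND_PAGE_SIZE],
                  const uint8_t data[NAND_PAGE_SIZE])
{
    for (size_t i = 0; i < NAND_PAGE_SIZE; i++)
        page[i] &= data[i];
}
```

Programming 0xF0 and then 0x3C into the same erased byte leaves 0x30: the second write could clear more bits but could not set any back to 1.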

It is true that other technologies behave more like RAM, but so far none of them are viable for what we call SSDs today. This may change in the future. My comment was about current SSDs.

Re:Premature optimization is evil... and stupid on Cliff Click's Crash Course In Modern Hardware · 2010-01-14 12:00 · Score: 3, Interesting

Using shift to multiply is often a great idea on most CPUs. On the other hand, just about every compiler will do that for you (even with optimization turned off I bet), so there's no reason to explicitly use shift in code (unless you're doing bit manipulation, or multiplying by 2^n where n is more convenient to use than 2^n). However, a much more important thing is to correctly specify signed/unsigned where needed. Signed arithmetic can make certain optimizations harder and in general it's harder to think about. One of my gripes about C is defaulting to signed for integer types, when most integers out there are only ever used to hold positive values.
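One commonly cited case of signedness getting in the optimizer's way is division by a power of two. A small C sketch (function names invented for illustration):

```c
/* Unsigned: dividing by 4 is exactly a 2-bit right shift. */
unsigned udiv4(unsigned x)
{
    return x / 4;
}

/* Signed: C division truncates toward zero... */
int sdiv4_trunc(int x)
{
    return x / 4;
}

/* ...whereas rounding toward minus infinity, which is what a plain
 * arithmetic shift would give on typical hardware, is a different result
 * for negative inputs that are not multiples of 4. */
int sdiv4_floor(int x)
{
    int q = x / 4;
    if (x % 4 < 0)
        q--;
    return q;
}
```

For -7 the truncating result is -1 and the flooring result is -2, so for signed types the compiler generally has to emit a fix-up around the shift, while for unsigned types the shift alone is enough.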

Re:Code in high-level on Cliff Click's Crash Course In Modern Hardware · 2010-01-14 11:56 · Score: 3, Informative

Coding in x86 ASM is never fun. Weird and odd and masochistically pleasurable for some, maybe, but not fun. Other architectures, on the other hand (like ARM), can be fun. x86-64 manages to increase the "funness" value somewhat, but I still wouldn't quite qualify it as "fun".

On the other hand, it's very true that knowing some ASM can help you write code that the compiler will translate into better assembly code, without going through all of the trouble yourself.

Re:Not A Nerd? on Google Switching To EXT4 Filesystem · 2010-01-14 11:51 · Score: 2, Interesting

SSD (NAND Flash) is still a block device. In fact, it's even "more" block, insomuch as it requires a filesystem a lot more aware of blocks, their limitations, and the proper way of using them (wear leveling, error correction, etc). It also uses larger blocks and also addresses groups of blocks for certain operations (erase). You either need a Flash-specific filesystem, or a translation to a more typical block device via a flash translation layer (FTL). Furthermore, I'm not aware of a single NAND Flash device that is accessible as memory mapped storage, nor can you run code from NAND, nor do I know of any CPUs capable of booting from NAND (they tend to have built-in ROM bootloaders to do the job). NOR Flash is another matter, but it's not competitive for SSDs. Going from HDDs to SSDs is hardly anything like going to RAM, except for the "solid state" part.
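The core idea of a flash translation layer is a logical-to-physical page map with out-of-place writes. Here is a toy sketch in C, assuming tiny fixed sizes and leaving out garbage collection, wear leveling and error correction; every name in it is hypothetical.

```c
#include <stdint.h>

#define FTL_LOGICAL_PAGES  8
#define FTL_PHYSICAL_PAGES 16
#define FTL_UNMAPPED       UINT32_MAX

struct ftl {
    uint32_t map[FTL_LOGICAL_PAGES]; /* logical page -> physical page */
    uint32_t next_free;              /* next never-written physical page */
};

void ftl_init(struct ftl *f)
{
    for (uint32_t i = 0; i < FTL_LOGICAL_PAGES; i++)
        f->map[i] = FTL_UNMAPPED;
    f->next_free = 0;
}

/* "Overwrite" a logical page: the data goes to a fresh physical page and
 * the map is updated; the old page becomes garbage to be erased later.
 * Returns the physical page used, or FTL_UNMAPPED if the logical page is
 * out of range or the device is full (a real FTL would garbage-collect). */
uint32_t ftl_write(struct ftl *f, uint32_t logical)
{
    if (logical >= FTL_LOGICAL_PAGES || f->next_free >= FTL_PHYSICAL_PAGES)
        return FTL_UNMAPPED;
    f->map[logical] = f->next_free++;
    return f->map[logical];
}

/* Look up where a logical page currently lives. */
uint32_t ftl_lookup(const struct ftl *f, uint32_t logical)
{
    return logical < FTL_LOGICAL_PAGES ? f->map[logical] : FTL_UNMAPPED;
}
```

Writing the same logical page twice lands on two different physical pages, which is what lets a regular filesystem pretend it is overwriting sectors in place.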

Re:Debug key on Does Your PC Really Need a SysRq Button Anymore? · 2010-01-14 09:01 · Score: 2, Informative

Ubuntu's recent decision to disable Ctrl+Alt+Backspace by default is a separate issue.

It wasn't Ubuntu's decision, it was Xorg's. I had to explicitly map Ctrl+Alt+Backspace again under Gentoo after a recent Xorg update.

Re:Unless I'm mistaken... on Nintendo Wii To Get Netflix Streaming · 2010-01-13 10:17 · Score: 1

They could've sold the DVD player channel for that price though.

Re:Unless I'm mistaken... on Nintendo Wii To Get Netflix Streaming · 2010-01-13 05:55 · Score: 1

The Wii was DVD ready until the latest iteration of drives. Nintendo had DVD support in there all along, including in their SDK and firmware. All they would've had to do is release a DVD Player Channel with a system update and suddenly all Wiis would gain DVD playback capability. Why they didn't do that is anyone's guess.

They've since killed the DVD readback ability in newer drives because it was being used for piracy (if you can read DVDs, you can read DVD-Rs. If you can read DVD-Rs, you can use software hacks or hardware MITM modchips to play games from them).

Re:PS3 will go Disc Free in Late 2010 on Nintendo Wii To Get Netflix Streaming · 2010-01-13 05:52 · Score: 1

You aren't going to be streaming HD on a Wii anyway. It doesn't have enough CPU muscle to decode HD H.264 (or any other modern codec) and it doesn't have the capability to output HD anyway.

Re:PS3 will go Disc Free in Late 2010 on Nintendo Wii To Get Netflix Streaming · 2010-01-13 04:21 · Score: 3, Insightful

A player does not take much space, and discs can't add storage to a Wii anyway (for caching or what have you). The Wii's ridiculously small storage and lack of expandability do not affect this particular application.

Re:self-modifying code == JIT on Intel and LG Team Up For x86 Smartphone · 2010-01-13 02:48 · Score: 1

I looked up the ARM code for Linux 2.6.31 and it doesn't generally work the way you think it does.
The whole instruction cache gets invalidated.
Ouch. Even the Pentium-4 wasn't that awful.

Ah, so you're blaming a poor OS implementation now.

Normally they go together. ARM Linux has exactly one system call for this, and the name is "cacheflush".

That's probably because JIT isn't popular enough that someone cared to add proper support.

Well, if you want to be explicit. Normally it comes free.

No, it comes at a huge expense in die area and power, and at a drop in performance due to unnecessary flushing by the CPU logic. The only "free" thing about it is the programmer doesn't have to do anything.

You ignored my example showing that this is not the case. The delayed flushing of x86 lets you take advantage of normal background writeback and it lets you take advantage of the possibility that the data might never need to be flushed.

"Delayed" flushing isn't so "delayed". Stop pretending like x86 implementations are 100% perfect and ideal at flushing only when necessary. a) they aren't nearly as optimal in your case as you think, b) they flush tons of unneeded stuff because they guess wrong, and they can't afford to miss a flush and cause bugs.

You get full-size offsets and other constant values on x86

You're proving my point. This is a waste of code and decoding logic (all the variable-length instruction crap). And it makes translating RISC to x86 a lot slower because you have to somehow recognize things like (this is PowerPC) "lis 1, 0xdead" (more code) "stw 2, 0xbeef(1)" as a store to 0xdeadbeef. No one does this of course, you just have to waste all the time using complex and large CISC ops for trivial RISC ops.
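Here is a sketch in C of the address arithmetic a translator would have to recover from such a lis/stw pair. The helper names are made up. One wrinkle: the 16-bit displacement in stw is sign-extended, so a pair that really stores to 0xdeadbeef carries 0xdeae in the lis, and the literal halves 0xdead and 0xbeef add up to 0xdeacbeef.

```c
#include <stdint.h>

/* Address stored to by "lis rA, hi" followed by "stw rS, disp(rA)":
 * lis loads hi << 16, and the 16-bit displacement is sign-extended. */
uint32_t ppc_lis_stw_address(uint16_t hi, uint16_t disp)
{
    uint32_t base = (uint32_t)hi << 16;
    return base + (uint32_t)(int32_t)(int16_t)disp;
}

/* Split an absolute address into the (hi, disp) pair that reaches it. */
void ppc_split_address(uint32_t addr, uint16_t *hi, uint16_t *disp)
{
    *disp = (uint16_t)(addr & 0xFFFFu);
    /* Bump the high half when the low half will sign-extend negative. */
    *hi = (uint16_t)((addr + 0x8000u) >> 16);
}
```

Recognizing the pair, undoing the sign extension and emitting a single x86 store with a 32-bit address is exactly the kind of extra pattern matching that nobody bothers with.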

There is no need to screw around with painful restrictions.

What are these "restrictions" that you speak of? Hint: on x86, each register is optimized a different way, which means that for proper performance compilers and JITs have to deal with not only a restricted set of registers, but very particular and odd semantics as to which is the faster register to use. On ARM all registers are just about equal, and on PowerPC all but a couple special purpose ones are too.

Look at the constant generation offered by ARM and weep. It's very limited and even redundant.

Which is why you use PC-relative loads for large constants. ARM constant generation covers the vast majority of common cases. CPUs aren't about wasting space and time giving you the ability to do everything in one instruction. That's the outdated CISC mindset.
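For reference, the classic ARM (32-bit ARM mode) data-processing immediate is an 8-bit value rotated right by an even amount. Here is a small C check for whether a constant fits; the function name is made up, and Thumb-2 has a different scheme that is not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v fits the classic ARM data-processing immediate form:
 * an 8-bit value rotated right by an even number of bits. */
bool arm_is_immediate(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a right-rotation by rot. */
        uint32_t undone = rot ? ((v << rot) | (v >> (32 - rot))) : v;
        if (undone <= 0xFFu)
            return true;
    }
    return false;
}
```

Values like 0xFF, 0x3FC, 0xFF000000 and 0xF000000F all fit; something like 0x101 or 0x12345678 does not, and that is where a PC-relative load from a literal pool comes in.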

Jump offsets are limited too.

Bullshit. Jump range is +/-32MB. If your code is larger than 32MB then you just use loads for external calls. Most single units of code (executables, shared libraries, etc) are well under 32MB. The largest program on my box is blender, which is 15MB.
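The +/-32MB figure falls out of the ARM-mode B/BL encoding: a 24-bit signed offset counted in 4-byte words, relative to the branch address plus 8. A minimal C sketch of the range check (ARM mode only, function name invented):

```c
#include <stdbool.h>
#include <stdint.h>

/* ARM-mode B/BL: 24-bit signed word offset, relative to the address of
 * the branch + 8 (the PC value the instruction sees). */
bool arm_branch_reachable(uint32_t from, uint32_t to)
{
    int64_t offset = (int64_t)to - ((int64_t)from + 8);
    if (offset % 4 != 0)
        return false;  /* ARM-mode targets are word-aligned */
    return offset >= -(INT64_C(1) << 25) && offset < (INT64_C(1) << 25);
}
```

2^23 words of 4 bytes each is 2^25 bytes, i.e. 32MB in either direction.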

If you generate too much ARM code in one place, you can no longer reach constant values that had to be stored near the instruction stream.

Constant values are typically stored after each function. If your function is longer than the offset range (exceedingly rare), then all you have to do is put a pool in the middle. Laugh all you want (it's funny), but it's not a problem and does not affect performance. (The crap that x86 compilers have to do these days isn't funny, it makes me weep and cringe every time I try to read x86 code. x86-64 is somewhat better in that regard.)

Translating RISC to x86 requires a register allocator for decent performance. Fortunately, there is no value in RISC binaries and thus no need to translate from RISC to x86. :-)

Re:The probability is not 214 on Second 3G GSM Cipher Cracked · 2010-01-12 03:26 · Score: 2, Insightful

A p value of 214 would be an amazingly impossible probability. Probability goes from 0 to 1, and 1 is the highest (most likely). 2^-14 (1 in 16384) is "low" by human standards but amazingly high by crypto standards (most importantly, because computers can try something 16384 times in a split second).
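As a rough worked example of why 2^-14 is high by crypto standards, here is the chance of at least one success when you simply retry. It assumes independent tries, which is a simplification, and the function name is made up.

```c
/* Chance of at least one success in n independent tries, each with
 * probability p: 1 - (1 - p)^n, computed with a plain loop. */
double prob_at_least_one(double p, unsigned n)
{
    double miss = 1.0;
    for (unsigned i = 0; i < n; i++)
        miss *= 1.0 - p;
    return 1.0 - miss;
}
```

With p = 1/16384, 16384 tries already give about a 63% chance of success, and ten times as many tries push it past 99.99%.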

Re:Again, Failing ... on Second 3G GSM Cipher Cracked · 2010-01-12 03:23 · Score: 0

Hint: it takes on the order of milliseconds for a computer to do something 2^14 = 16384 times.

CPU downclocking is not news on Asus Promises 12-Hour Battery Life In New High-End Laptop · 2010-01-11 08:52 · Score: 2, Interesting

This thing is as old as my beat up Pentium III Inspiron 5000. Varying GPU clocks is also old.

What is interesting is seamless switching between GPUs. Everything else is just marketingese for "we do what everyone else does and we actually bothered to put some extra effort into power optimization".

Re:self-modifying code == JIT on Intel and LG Team Up For x86 Smartphone · 2010-01-11 07:07 · Score: 1

You don't JIT until you hit the branch. On x86, this is mildly bad because you add instruction cache pressure every time you code back into the JIT engine. On ARM, this is severely bad because you're doing a system call for every handful of JITed instructions.

System calls have tiny overhead on just about every system. Except x86.

Never mind that I doubt most JIT stuff out there works at a branch level; they tend to compile at a procedure level.

You batch things up. You'll JIT some stuff that never runs.

ICache flushes are going to do nothing for a JIT anyway (most of the time), so they are little overhead too. So we've eliminated system calls and ICache invalidates as valid points for your argument. You're down to claiming that a single DCache partial store at the right point is going to be more expensive than the ridiculously complex and poor auto-flushing that x86 does. Good luck with that.

ARM caches tend to be virtually indexed and/or virtually tagged. This normally means that you need to flush the whole thing.

That's bullshit. You can't JIT from one address space affecting another without telling them about it anyway (else how are they going to know what to run?), so they'll have a chance to invalidate the right thing anyway. (Hint: not flush, get your vocabulary right. If you write code, the writer has to store/flush it in DCache, and then those who want to run it have to invalidate it in ICache.)

Properly flushing a cache line is trivial; there aren't any aliasing issues.

Properly flushing a cache line is all but impossible on x86, since it only got a cache line flush instruction with the introduction of SSE2. Before that, the only real cache management instruction was "flush and invalidate everything in every cache all at once". x86 depends solely on guesswork by the CPU, and guesswork is by definition going to be worse than the programmer flushing what's needed ONLY when it's needed.

Finally, x86 is a horrible instruction set to JIT/translate to anyway. JIT for RISC is a lot more efficient and easier to write. And if you look at emulators that do binary translation, x86 is just about the worst possible arch for that. Translating RISC to x86 is horrendously inefficient.

Re:self-modifying code == JIT on Intel and LG Team Up For x86 Smartphone · 2010-01-11 00:57 · Score: 1

Yeah, I know. That's unusually stupid; on PowerPC at least they made those instructions unprivileged. System calls are not free.

Yes, PowerPC is a nice architecture too. I wouldn't mind a world with PowerPC desktops and laptops and ARM netbooks and cellphones. FWIW, there's nothing preventing ARM from adding unprivileged operations on the cache.

It's even cheaper on x86, because you can delay doing that work. Cache lines get flushed naturally as time passes. By the time the CPU needs things flushed, it might have already happened. Some JITed code will never run; it's best to never pay the cost of handling it.

If you're not going to run the JIT code why on earth are you compiling it anyway? There's a reason why it's called JIT. Not to mention that you can just stick the cache ops right before the JIT code is run, instead of as it is compiled.

Of course these details are only explanations of why the system as a whole might perform well or poorly. The simple fact is that ARM is terribly slow in real-world use. You can blame the compiler, the architecture, the fabrication process, the cache size, whatever... but the end result is the same.

Oh yeah, because saying an architecture as a whole is "slow" makes so much sense. You're aware that it beats anything made by Intel on performance/watt, right? Sure, there are no ARM chips currently competitive with desktop and laptop x86 offerings, but we're talking about cellphones and netbooks here.

Re:self-modifying code == JIT on Intel and LG Team Up For x86 Smartphone · 2010-01-10 14:33 · Score: 1

On ARM, you call into the OS and store DCache/invalidate ICache ranges. That means only blocks you touched are stored on the D side and invalidated on the I side. This invalidation on the I side is likely to cost near nothing because chances are you weren't running code out of there before.

x86 has to do the same thing, except the CPU has to devote a huge gob of logic to guessing when. And sometimes it guesses wrong and flushes stuff needlessly.

Just because something is easier on x86 doesn't mean it's cheaper. In fact, most "easy" stuff on x86 is expensive precisely because it goes against modern CPU design principles.
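From user space, the ARM-side sequence described above (write the code, store it out of DCache, invalidate ICache for just that range) is usually reached through a compiler builtin. Here is a minimal sketch assuming GCC or Clang; `jit_publish` is a made-up helper name, and allocating executable memory is left out.

```c
#include <stddef.h>
#include <string.h>

/* Copy freshly generated machine code into its final buffer, then make
 * it visible to instruction fetch for exactly that range. */
void jit_publish(void *dst, const void *code, size_t len)
{
    memcpy(dst, code, len);
    /* GCC/Clang builtin. As far as I know, on ARM Linux this ends up in
     * the cache flush system call for [dst, dst + len), while on x86 it
     * compiles to nothing. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

Only the range that was actually written is touched, which is the whole point: the programmer says what needs doing, once, right before the code is run.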

Re:you've read Hennessy/Patterson/Tannenwhatever on Intel and LG Team Up For x86 Smartphone · 2010-01-10 06:52 · Score: 2, Informative

If you're going to cheat, I may as well cheat too. I can use the MMX and XMM registers you see.

Except no compiler actually does that, while ARM Thumb compilers routinely make use of the extra registers for longer-term less-frequent storage within larger functions.

Re:you've read Hennessy/Patterson/Tannenwhatever on Intel and LG Team Up For x86 Smartphone · 2010-01-10 06:27 · Score: 2, Informative

Old x86 gives you 8, the same as ARM Thumb.

Bzzt, wrong. ARM Thumb gives you 16 registers, it's just that you can only really compute on 8. The others are still accessible by a few instructions (mov, add) and they are still extremely useful for storing values around during the life of a function without having to constantly hit the stack.

User: marcansoft

Comments · 1,245