That's simply not true, and hasn't even been true since the first computers I've used (like, 1980s). Only the most basic, cheap devices use polled I/O for all hardware access. Even an ancient floppy disk peripheral has a small FIFO it can simultaneously fill while the CPU is busy doing other things. I can't understand how you can pass comment given this apparent lack of basic architecture knowledge.
If we want efficient code, we have to figure out ways to reward the programmers that write it. I don't see any sign that people anywhere are interested in doing this. Anyone have suggestions for how it might be done?
It's happening, from a source people didn't expect: portable devices. Battery life is becoming a primary feature of portable devices, and a large fraction of that comes from software efficiency. Take your average cell phone: it's probably got a half dozen cores running in it. One in the wifi, one in the baseband, maybe one doing voice codec, another doing audio decode, one (or more) doing video decode and/or 3d, and some others hiding away doing odds and ends.
The portable devices industry has been doing multi-core for ages. It's how your average cell phone manages immense power savings: you can power on/off those cores as necessary, switch their frequencies, and so on. They have engineers who understand how to do this. They're rewarded for getting it right: the reward is it lives on battery longer, and it's measurable.
Yes, you can get lazy and say 'next generation CPUs will be more efficient', but you'll be beaten by your competitors for battery life. Or, you fit a bigger battery and you lose in form factor.
The world is going mobile, and that'll be the push we need to get software efficient again.
Hardly a day goes by without HTML5/H.264/Theora popping up here. Here's the facts, not the fiction:
Theora is not as good as H.264, but it's not far off. People seem to turn this into a huge issue. It's just not quite as good as H.264, but it's still perfectly usable (I would say better than just 'usable') as an alternative codec in pretty much every use case. A universally supported codec is far more useful than a slightly better but non-free codec.
Hardware support is a RED HERRING. You don't need hardware support in most of the cases mentioned. An iPhone is perfectly capable of playing 480p resolution Theora in realtime, in software only (yes I've tried this). Most of the newer Android handsets likewise (not tried this but they're similar or better spec). Please don't tell me you want 720p on your handset. Power consumption difference to hardware decode is noticeable, but it's not huge, and again - it works, and that's what matters.
Hardware support wouldn't involve re-inventing huge bits of codec. Theora is strikingly similar to the way MPEG-4 works in many ways. Most hardware codecs run configurable microcode anyway.
The bigger issue for most vendors is QA. Distributing software in the non-FOSS world isn't as simple as adding a library and flipping compiler switches. They'll have to support Theora from then on, test it for exploits, crashes, compliance, and keep on doing that every release. It's a support burden. It's chicken and egg: there isn't the demand, so they aren't supplying.
Mozilla can never include H.264 support as it would make the browser non-free-as-in-freedom, which is entirely against their mission. People suggesting they do this either do not understand the point of free software, are just selfish, or are (here anyway) trolls. Redistributors (Firefox) may be able to include H.264 support but it seems to be on shaky legal ground to me (IANAL).
Whether they can just support generic plug-ins is another issue altogether. My opinion: using system plug-ins is a bad idea because it takes the (presumably) well examined, mostly-exploit-free browser code and exposes it to (from experience) the much less well examined system plug-ins. But that argument is separate to everything else.
I'm guessing from the number of people that think Theora is unusable, too slow, requires hardware to run on handsets, and H.264 can just be popped into Mozilla, that there's the usual troll:signal ratio problem.
And an old work colleague reminds me of another annoyance of the format: page numbering. Yes, the article already touches on it being too big and unnecessary. It's also a pain when it comes to tagging.
All metadata/headers for Ogg files must be at the start of file. This is unfortunate, because it means if you re-tag a file, you need to resize the entire file and move the majority of it forward or backwards some bytes. Worse still, if you re-tag and end up creating a new page because it got large, you need to renumber the page numbers of the entire stream. That means reparsing the entire Ogg stream. Other containers stick metadata at the end of file, for good reason. This makes tagging utilities much more complicated for Ogg than other containers.
Either that or it just means that you're not that good of a coder.
And everyone at ffmpeg, and most of the previous companies I've worked for.
There's a reason there's such universal dislike for the Ogg container (outside of Xiph.org anyway). I really wish the criticism was taken seriously and changes made, rather than just dismissed as "Slashdot crowd trolling" or some kind of bad blood. I hope it's just a case of inexperience using other codecs - because there's a lot you could learn from them.
What the fuck are you talking about? There is absolutely no "latency" harm caused by the CRC, at least not on any hardware actually able to decode the formats much less encode. If performing the CRC on decode is so burdensome, you can stop checking it once you obtain sync and only check it if you obviously lose sync.
There may be, for example, 64KB pages, containing many packets. None of the packets can be decoded until the entire 64KB page is received and its CRC checked. This may sound small, but for a 32-64 kbit/s stream, that's roughly 8 to 16 seconds of latency right there. Alternatively, you can have 1 page per packet, but on 32-64 kbit/s streams you end up with about 5-10% overhead from the container. It is a REAL problem.
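To put rough numbers on that, here's a back-of-the-envelope sketch. The 27-byte figure is the fixed part of an Ogg page header; the 300 and 500 byte packet sizes are my assumption for a low bit rate audio stream, and the function names are made up for illustration.

```python
# Rough arithmetic for Ogg page buffering latency and per-page framing overhead.
OGG_PAGE_HEADER_BYTES = 27  # fixed part of an Ogg page header, before the lacing values


def page_latency_seconds(page_bytes, bitrate_bps):
    """Time needed to receive one whole page at the given stream bit rate."""
    return page_bytes * 8 / bitrate_bps


def one_packet_per_page_overhead(packet_bytes):
    """Fraction of the stream spent on framing when every page carries one packet."""
    # One lacing value per full 255 bytes, plus the final value below 255.
    lacing_bytes = packet_bytes // 255 + 1
    framing = OGG_PAGE_HEADER_BYTES + lacing_bytes
    return framing / (framing + packet_bytes)


if __name__ == "__main__":
    for kbit in (32, 64):
        seconds = page_latency_seconds(64 * 1024, kbit * 1000)
        print(kbit, "kbit/s:", round(seconds, 1), "s to fill a 64KB page")
    for size in (300, 500):  # assumed packet sizes, not measured ones
        percent = 100 * one_packet_per_page_overhead(size)
        print(size, "byte packets:", round(percent, 1), "% framing")
```

With those assumptions, a full 64KB page takes about 16.4 s at 32 kbit/s and 8.2 s at 64 kbit/s, and one packet per page costs somewhere between 5% and 9% in framing.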
So 5 times the decoding complexity, correctly masking out the right bit, just to save 7 bits out of a half million. Yea, I'll get right on it.
There is a version field on every page header, and it's 32 bits. It's a tiny waste, but it's still a waste. It's not so tiny a waste on the above mentioned low-latency, low-bitrate streams.
Ugh! so the amount of data that you must read in order to obtain a framing lock is then infinite?
Yes. Why not? Packets aren't infinite unless you're deliberately malforming a stream. Codecs generally have 'profiles' defining what the limits are. For example, Vorbis has a soft-limit of 8KB. Framing lock serves a purpose in some transports, but for on-disk, on-disc or WAN transport, it's not a big issue.
Or if at least the container had simplified framing that could be placed throughout large packets. There are huge advantages to packet streaming being as simple as possible. Copying the packets out of a video stream is a bad thing for CPU and power consumption.
Did you even bother to spend five minutes thinking before posting this crap? The designers of Ogg obviously spent a lot more time.
I've spent a very long time with the Ogg container format, as well as most of the others in common use. That's why I can recognize the problems with it, as can the ffmpeg developers, as can all the other developers I've worked with at various companies. It's universally hated by anyone who's had to deal with integrating it into a project already supporting other containers/codecs.
If you're reading this, Monty - it's not just bad blood with ffmpeg. I can't think of anyone I've worked on Ogg with who would admit to liking it, and who hasn't had to spend hours re-working their nice A/V streaming designs to work around its oddities.
You're right, in that Ogg could be improved to make it suck a lot less. Here's my list, which I'm sure is the same as everyone else's list:
Make CRC either not mandatory or attached to individual packets. Optional 8/16 bit (truncated) CRC. A single byte CRC per packet would do its job fine. For any stream less reliable, there's bigger problems to tackle. The CRC requirement alone is a big latency issue.
Use 'UTF-8' style fields. The lacing values look stupid when packets are large (video). E.g., 0x40 can be just 0x40. 0x800 can be 0x80,0x10 and so on.
Single byte version field. There has been 1 (One) Ogg format so far. I don't expect there to be 256 any time soon.
Disallow lacing values crossing pages. I did not enjoy implementing my own Ogg reader supporting stuff like that.
Disallow packets crossing pages. Allow larger pages to compensate. This removes any need to copy pages - they can be streamed directly into the decoder backend. That means page size doesn't matter so much as there's no buffering requirement.
Tie down the requirements on interleaved stream headers more tightly. Get all logical streams to 'register' themselves up front, or at least have their headers interleaved before anything else.
Remove the absurd requirement to support concatenation. It is intractable for implementations to support all cases of this.
Don't be so agnostic about contained streams. It would be nice to know some basics about them: audio, video, seek hints, that sort of thing. This isn't just 'that would be nice': it affects the start up order of your player. In Ogg's case, you need to attempt decode of each stream (or 'magic' detection) instead of just knowing it's an A/V stream before opening them.
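For the 'UTF-8 style' length field in the list above, here's a minimal sketch. The example bytes (0x800 becoming 0x80,0x10) match a little-endian base-128 varint, so that's the encoding I've assumed here; the function names are invented, and `ogg_lacing` is only there to compare sizes against the existing scheme.

```python
def encode_varint(n):
    """Little-endian base-128: 7 payload bits per byte, top bit set if more bytes follow."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # continuation bit: another byte follows
        else:
            out.append(low)
            return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, position of the byte after the field)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def ogg_lacing(n):
    """Ogg-style lacing for one packet: a run of 255s, then one value below 255."""
    return bytes([255] * (n // 255) + [n % 255])


if __name__ == "__main__":
    for size in (0x40, 0x800, 50000):
        print(size, "->", len(encode_varint(size)), "varint bytes vs",
              len(ogg_lacing(size)), "lacing bytes")
```

For a 50000 byte video packet that's 3 bytes of length field against 197 lacing values, and the decoder gets the length in one go instead of summing a run of 255s.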
The major issue I've had with Ogg is it just does everything a different way to everyone else. Everyone else did it another way for good reason. It's not that your decisions are terrible ones - they're just making it awkward to implement Ogg in players where there's already support for a bunch of other containers. You have to admit that's a valid complaint.
Copying packets/pages is a surprising hit on CPU and power consumption. When I was making MP3 players, it accounted for a few percent of power consumption. It's notably not something other codecs require, because they thought about that issue up-front.
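To show where the copy comes from, here's a toy sketch. It assumes the page payloads are already sitting in memory as separate buffers with the headers stripped; `packet_view` is a hypothetical helper, not anything from libogg.

```python
def packet_view(pages, start_page, offset, length):
    """Return (data, copied) for a packet starting at `offset` in pages[start_page].

    A packet that sits inside one page is handed out as a zero-copy memoryview.
    A packet that crosses a page boundary has to be stitched into a new buffer.
    """
    first = memoryview(pages[start_page])
    if offset + length <= len(first):
        return first[offset:offset + length], False

    parts = []
    remaining = length
    index = start_page
    while remaining > 0:
        page = memoryview(pages[index])
        chunk = page[offset:offset + remaining]
        parts.append(chunk)
        remaining -= len(chunk)
        offset = 0  # continuation pages are read from their start
        index += 1
    return b"".join(parts), True  # this join is the copy


if __name__ == "__main__":
    pages = [b"abcdef", b"ghijkl"]
    print(packet_view(pages, 0, 1, 3)[1])  # inside one page
    print(packet_view(pages, 0, 4, 5)[1])  # crosses into the next page
```

If packets were never allowed to cross pages, the second branch would not exist and the decoder could always be fed straight from the receive buffer.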
The complaint is that there's no up-front header declaring all the streams contained. This is actually absurd - in theory you need to scan the entire file in case someone's just concatenated a video file with an audio file. This was, also absurdly, one of the aims of the Ogg container spec: concatenation. It's awesome to ask implementations to do this.
Overhead
One of Ogg's aims was to try to be less than 1% of the total stream space. It does achieve that, but the 'lacing values' end up looking pretty stupid for anything with large packets. It's like the article says: you end up with long strings of '255' summing up to 32-64KB packets, and hey just for extra complexity's sake, you'll have to split them across multiple not-quite-64KB pages. And then figure out where in that mess you're supposed to stick a timestamp: and here's a hint, your first page in that sequence has timestamp 0xffffffff which is nice if you randomly seeked to that place to find a position. God, what a mess that is to implement.
Then there's decode CPU overhead: the above basically means you end up copying the bitstream, which is a significant few percent overhead when you're talking about video.
Latency
You didn't understand his point. The latency is inherent in Ogg due to the large pages (not packets) required to reduce its size overhead, and in the position of the CRC (at the front of the page rather than the end). Reducing the page size makes the page headers start taking significant percentages of size if it's a low bit rate stream, e.g. internet audio.
Random Access
Try pre-caching a 2GB video file. Or try pre-caching a 2GB video stream coming off the internet where the other end of the pipe is the other side of the world. Random access in these two realistic cases (if you'll admit that) requires a look up table, and it's precisely why many containers DO.
Complexity
The lacing values crossing pages, packets crossing pages, position of CRC, position of timestamp between packets/pages especially when cross-page, timestamps between logical streams (elementary streams), and other oddities/idiocies all ADD UP to make it a bloody mess to deal with. You end up just making copies of packets out of the stream, which is inefficient. In fact, that's exactly what the official Xiph codecs do: they make ugly copies. On real world MP3 players (and I've worked on some) that accounts for about 10% of your battery play time right there. I kid you not.
What this guy is expressing is what everyone who's worked on the Ogg container format itself has found out: it's just BAD at EVERYTHING. It needs replacing with something that doesn't suck, and there are free/open alternatives around. Maybe Vorbis 2 should switch container.
I sadly have to agree, and I've voiced the same objections for a long time. It really is like he tells it: it's just bad at everything it was intended to achieve. It's a source of bugs, it's horrendously complicated to support, and it's horrendously inefficient at anything but audio (and even then, not so good).
It seems to me, most of what went wrong was trying to support concatenation of Ogg streams. This is a nice idea, but actually quite a rare case. It's also incredibly naive for the specification document to request that Ogg implementations detect this. What, I'm supposed to scan the entire file in case that happens? No. I'll just not be compliant to that, thank you very much.
I even wrote my own Ogg/Vorbis decoder from scratch a while back (and dabble every now and then), and found Ogg to be a never-cooling, never-extinguishing steaming pile of hippo crap left over from consuming a dog. It just made everything so difficult to do. Seeking a stream involves divide-and-conquer - not necessarily a bad thing, but when you have huge streams the number of seeks can be bad. Not to mention if your stream has an endpoint the other side of the Atlantic Ocean. Why oh why did they pick timestamps being at the END of a page and indicating the output byte count produced by the END of that page? That little detail alone probably cost me days of debug.
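The divide-and-conquer seek looks roughly like this. It's a simplified sketch: I'm pretending the end-of-page positions are in a list, whereas a real Ogg seek bisects on byte offsets and has to resync to a page boundary after every jump. `bisect_seek` and the probe counter are made up for illustration.

```python
def bisect_seek(page_end_positions, target):
    """Find the first page whose END position reaches `target`.

    Returns (page_index, probes). Each probe stands in for one seek plus one
    read on the real stream, which is a network round trip for a remote file.
    """
    lo, hi = 0, len(page_end_positions) - 1
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if page_end_positions[mid] < target:
            lo = mid + 1  # target lies in a later page
        else:
            hi = mid      # this page or an earlier one contains the target
    return lo, probes


if __name__ == "__main__":
    # 1024 pages, each ending 1000 output samples after the previous one.
    ends = [1000 * (i + 1) for i in range(1024)]
    print(bisect_seek(ends, 500500))
```

The probe count grows with log2 of the number of pages, so it's about 10 probes for 1024 pages. That's fine on a local disk and painful when every probe crosses the Atlantic.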
I almost gave up at one point and went to a container format of my own which would have worked much better. Header: 'CONTAINER v1'. Packet: 'MAGIC', 4 byte Length, 4 byte Output pos. Job done. The sad fact is, that's easier than Ogg, smaller than Ogg (unless you're talking really low bit rate), and does entirely the job of Ogg without the complexity.
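That format is small enough to sketch in full. The 'CONTAINER v1' header, the 'MAGIC' marker and the two 4 byte fields are as described above; the little-endian byte order and the function names are my assumptions, and there's no error recovery beyond noticing a bad marker.

```python
import io
import struct

FILE_HEADER = b"CONTAINER v1"
PACKET_MAGIC = b"MAGIC"
# 4 byte length, then 4 byte output position. Little-endian is an assumption.
PACKET_FIELDS = struct.Struct("<II")


def write_stream(out, packets):
    """Write (output_pos, payload) pairs to a binary file object."""
    out.write(FILE_HEADER)
    for output_pos, payload in packets:
        out.write(PACKET_MAGIC)
        out.write(PACKET_FIELDS.pack(len(payload), output_pos))
        out.write(payload)


def read_stream(src):
    """Yield (output_pos, payload) pairs from a binary file object."""
    if src.read(len(FILE_HEADER)) != FILE_HEADER:
        raise ValueError("not a CONTAINER v1 stream")
    while True:
        magic = src.read(len(PACKET_MAGIC))
        if not magic:
            return  # clean end of stream
        if magic != PACKET_MAGIC:
            raise ValueError("lost framing")
        length, output_pos = PACKET_FIELDS.unpack(src.read(PACKET_FIELDS.size))
        yield output_pos, src.read(length)


if __name__ == "__main__":
    buf = io.BytesIO()
    write_stream(buf, [(0, b"first packet"), (1024, b"second packet")])
    buf.seek(0)
    for pos, payload in read_stream(buf):
        print(pos, payload)
```

That's 13 bytes of framing per packet, each packet is contiguous so it can go straight to the decoder, and the output position sits in front of the data it describes.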
I'm probably going to add a Matroska container to my codec just to see how easy they are to produce. The spec looks fantastic, but the devil's always in the details - although seeing the praise on various (engineer) forums, it looks like the way to go.
So, Ogg, please die. We need you to get out of the way.
Your objection is to armchair engineers? Ok, well I'm not an armchair engineer. I've written my own Ogg/Vorbis decoder from scratch in the past (here). I've worked on codecs for about 10 years. I'm a fan of Vorbis and Theora, but Ogg needs to die a horrible death.
Ogg was by far the most bug-inducing part of the code. It's just AWFUL. It's ill-designed. It's incredibly complicated. It's inherently inefficient (copy sometimes required).
In short, it's the worst container format I've used in any serious application, and I've used pretty much all the common ones.
The irony of what you're saying, is that actually Ogg is what you'd end up with if an armchair engineer designed an audio codec container from scratch.
I liked the look of the language as far as it went but what disappointed me a bit was that it still didn't seem suitable for lower level stuff, e.g. embedded and kernel work. Those guys are still stuck with C, which serves them well but isn't as nice as it could be either. If I wanted a replacement for C in userland without so much complexity as C++ or Obj-C offer, I'd think Go would be a relevant choice, although it probably needs to mature a bit first.
I like Go, but it's wrong to call it a systems language. I have no idea why they're marketing it as such. I can't think of a single system (as in, low-level, embedded, or even PC system software) where it would be suitable.
What they should have said is it's a NATIVE compiled language, or maybe just 'static'. There's huge advantages to that, even if it's not suitable for systems programming. Sadly, I think that niche is already pretty much covered by Java and C#.
Depends on the ARM CPU. ARM7/ARM9 are alignment sensitive. ARM Cortex has a bus/cache interface that allows arbitrary alignment. Porting to the former may be difficult depending on the software, or may simply be tedious; the latter is usually as easy as a recompile if the platform and toolchain are similar (e.g. Linux+gcc).
C code which accesses data unaligned is undefined behaviour anyway. Correctly written C and C++ programs which do not assume structure packing, size or alignment will compile and run without changes.
Sadly, there are a lot of programs written assuming otherwise. Usually stuff written for an x86/Windows environment. You'd be surprised just how much GNU/Linux source compiles and runs without a single change: basically the entirety of Ubuntu. Hell, I used to run a Debian ARM system back in 1999. This isn't news to me, at least.
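The 'do not assume structure packing, size or alignment' point is easy to see with Python's struct module, which can lay fields out either with the platform's native padding or with none. This is only an illustration of the idea (in C you'd memcpy each field out of the buffer instead); the record layout and `read_record` are made up.

```python
import struct

# A 1 byte tag followed by a 4 byte unsigned integer.
PACKED_SIZE = struct.calcsize("<BI")  # standard sizes, no padding: always 1 + 4
NATIVE_SIZE = struct.calcsize("@BI")  # native alignment: padding depends on the platform


def read_record(buf, offset):
    """Decode (tag, value) field by field from any byte offset, aligned or not."""
    return struct.unpack_from("<BI", buf, offset)


if __name__ == "__main__":
    print("packed:", PACKED_SIZE, "native:", NATIVE_SIZE)
    data = b"\x00\x07\x78\x56\x34\x12"
    print(read_record(data, 1))  # the integer starts at odd offset 2
```

Code that hard-codes one of those two sizes, or casts a byte pointer at an odd offset to a struct pointer, is the kind that breaks on ARM7/ARM9. Code that decodes field by field doesn't care.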
The Itanium is more than just superscalar, it is explicit parallelism.
Except it isn't, really, and that's precisely the problem with Itanium. It was intended to be Explicitly Parallel Instruction Computing (EPIC), but then Intel remembered they were Intel and instead designed Extremely Complicated Parallel Computing (EPIC FAIL). Yes, the instruction bundles do explicitly indicate dependencies, but they still have to worm their way through a huge reorder and retirement buffer, and just to top things off there's STILL register renaming. These are supposed to be things that EPIC allows you to remove from a design!
It's a shame, because it could have been a very straightforward design, with a shorter pipeline, and most importantly smaller and cheaper. Or just tons of cores. I think Intel just had POWER envy and wanted to reclaim the crown of King of Fucking Complicated back from IBM, which is funny because at least IBM ditched the complexity when they saw it wasn't working so well (POWER6).
I do mean burn in an FPGA, yes. Or more likely: extend the existing silicon in a video codec for the relatively minor differences in bitstream (minor in that we're not talking wavelet vs DCT or anything). It would be a very small amount of extra silicon required, if any, as you'll find many codecs actually run uploaded microcode.
Decoding 1080p Theora may be asking a lot in software, except for high-end desktops and laptops. But that doesn't rule out having 480p or even 720p versions. The point is there's always going to be a baseline version of content, and that version might as well be in a format that can be universally used: Theora.
I do wonder about mapping the features of Theora to a GPU - might be something to try some rainy weekend. It would mean there's one less (lame) excuse not to use Theora.
By virtue of the de facto status, it seems like anything that the majority of people use will never be superseded by anything that barely matches or only slightly improves on the de facto standard. From what I've read [reddit.com] Theora is quite bare-bones compared to H.264 and hasn't been designed with hardware decoding in mind.
And if you actually read what you linked you'll see it immediately debunked. Theora is up to scratch and has been designed with hardware decoding in mind. It's slightly behind H.264, but come on, we're not talking double the bit rate or anything. It never stopped MP3 being the de facto standard when better stuff was around. Universal availability trumps technical excellence always.
Why is a lack of hardware codec a problem? For a modern smartphone, it's possible to decode half-VGA resolution in real time. Yes, this is slightly (not majorly) more power expensive than hardware decode, but at least it provides a universal baseline.
You also seem to be expending a lot of energy writing an email for something you don't care about?
So, mjg59, I would kindly request that you stop making such claims. I doubt you have worked on embedded devices -- I believe you work on server level stuff -- but you most certainly haven't worked extensively on the SoCs being used on Android phones.
I have worked on embedded devices - Linux based and others - for over a decade, and mjg59 is right. If you need something as brutal as hard suspend mode, you're doing it wrong. Any serious low power optimized ARM SoC can run down to very low current when idle, and has peripherals which can be individually clocked down and/or gated. I did work in a periphery way on the G1 at Google, and was very surprised at the way power gating was done. Put simply: the only other embedded OS in the same class as Android which does power gating like this is Windows Mobile. Everyone else learned it was unnecessary a long time ago. I was fairly shocked how few engineers had ever done serious embedded work before, and the result shows.
I know the Qualcomm parts have horrendously stupid design decisions in them which prevent decent idle current, but it's a wash compared to the other sources of battery drain. It's also a wash compared to the damage it does to your code design to support full suspend as part of normal per-second operation.
The Linux maintainers are right: Android is just doing it the wrong way. If there's any one feature I think Android could have done without, it's wake-locks. Learn how to use fine-grained clock switching and gating, not brutally-coarse-grained suspend. This isn't a bloody laptop. And no, I'm not saying this as a back seat driver: I really have done this kind of crap for a decade.
Simply invert the statistic and state what this new portion represents (usually the opposite). In this case:
A full 44 percent of visitors to Google News scan headlines without accessing newspapers' individual sites
... turns into:
A full 56 percent of visitors to Google News scan headlines and access newspapers' individual sites
Wow, doesn't that sound better? Not only that, but it makes the next step easily seen: how many people scan Google News? What's 56% of that number? How many clicks is that? Isn't that a gigantic portion of a news site's revenue?
But hey, the stat sounds much more evil when you say it the other way around.
The main take-away from this talk is that the modern software engineer needs to pay more attention to memory access and data dependency.
For some reason, the Slashdot luddites have come out in force to declare that it was actually about how inaccessible modern architectures are and how it's more proof that you should never use anything but a high level language. Nonsense.
I see this happen every time the subject of low level architecture comes up. There's a (sadly) large proportion of engineers who vehemently refuse to learn anything below the highest levels of programming. This turns into a silly justification backed by the evidence of how complex modern architectures are.
Some variants of this luddite behavior emerge as 'premature optimization is the root of all evil'. Yes, it's a good quote, but it's not referring to what you're referring to. There's nothing wrong with knowing in advance where the bottlenecks in a system will likely be. That's called experience. It's called knowing the characteristics of your platform. Those who stubbornly design systems without thought to performance are doomed to produce code which is inefficient, slow, and worst of all - incapable of being optimized without a re-write. Premature optimization may be bad, but preemptive optimization is a good quality to have.
That's the second take-away, in my opinion, from the talk: Engineers are all going to have to learn how to optimize code for the architecture, because your free ride on the MHz and CPI slope has ended. Here's a clue: if you're someone who knows how it all works, can preemptively optimize their designs to better fit their system, and can use their knowledge to debug issues, you are a far more valued engineer than the others. Bear this in mind the next time you find that a million outsourced engineers can do exactly the same job as you.
As an ex-Google employee who worked on Android and (peripherally) Chrome, I can tell you the grandparent is absolutely right. They share WebKit, but they were written using branches that were not merged until after public release. None of the rest shares any code whatsoever. Hell, the 'chrome' (not 'Chrome') in the Android browser is written in JAVA.
Android's browser is NOT Chrome, shares absolutely no code outside of WebKit, and even that is via general-purpose merging of changes via branches. I have no idea why various Google directors/PMs have suggested otherwise in the past, other than of course just being incorrect without realizing it.
This stuff is all visible in public repositories. Go have a look. There's no code shared other than WebKit.
And, of course, it doesn't even send the same User Agent string so it won't be counted as Chrome in any case.
What kind of ARM cores? 15 year old ARM7 cores, maybe. A single Cortex-A8 core (found in most of the high-end smart phones) at 1GHz would be on a par with a P3-500. These things are dual-issue and have very fast, low latency memory access compared to what was available on P3s.
If you had Quad 2GHz Cortex-A9s, I think a more appropriate comparison would be a 1.8GHz Athlon XP.
Most of the difference in speed you see isn't the core anyway. It's the video, storage (flash), wireless and other peripherals which bog things down. In fact, the GPUs in today's smart phones are faster than what was available in the P3-500 days too (at least, the SGX based parts are).
I think it would be cool to take RISC one step further where a single instruction is broken into 5-7 parts (or however deep the pipeline is), with each part controlling each step in the pipeline. But the assembler exposed a very traditional RISC instruction set similar to MIPS or Sparc, with a naked pipeline model.
This is pretty much what microcode is. The trouble is you need a huge amount of control signals to drive the pipeline - much more than 32 bits for a start. The vast majority of combinations also don't make sense (99% of signals are halt-and-catch-fire). So you'll have to encode the signals to some extent. At which point you've rediscovered instruction set encoding:) An ancient 6502 has about 130 signals for its non-pipelined architecture, and last I checked x86 (which was years ago) it was in the thousands!
To be fair, there are architectures where driving the pipeline is much more explicit. A lot of (older) DSPs are like this. You'll find some that don't have any hazard protection (e.g. Motorola 56k): you're not supposed to write (or generate) code that has input/output conflicts, or accesses values before they're ready. Some vector DSPs have explicit input/output driven by the number of cycles between instructions, and yes that's fairly insane.
The reason it's not common is it's not much benefit. There's not a huge amount of extra silicon to have a statically scheduled pipeline, e.g. like ARM. Even Cortex-A8, which has dual pipelines, admits to having a few instructions with unnecessary extra cycles of delay due to simplifying the scheduling (so it doesn't end up out-of-order scheduled).
It is very tempting to have an instruction set designed exactly to a single generation of CPU and its exact pipeline. Transmeta's CPUs are the closest anyone's really come to this - it's a shame the timing of their venture was bad, because in the current low power market, with much better compilers and tools, I think it's a really good idea to test some more.
How does multi-threading work without complex atomic instructions?
There are a few exceptions to the simple instruction / simple pipeline. They tend to stall everything but they're rare and don't complicate the design. For example on ARM:
Before ARMv6, the 'swp' instruction performed a bus-locked swap of a register with a word of memory, returning the existing value. That was all you needed.
ARMv6 and up, there's 'ldrex/strex', which perform a load/store of a word, with the store failing if a non-ldrex instruction or other core accessed the same cache line before the strex. Slightly more efficient, and all you need.
Even on x86 you'll find that the vast majority of multi-threading is done using just 'cmpxchg'. You don't need a huge amount of instruction set to get it done.
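The pattern all of those instructions get used for is the same retry loop. Here's a sketch of it in Python, with the loud caveat that Python has no real compare-and-swap: the `AtomicCell` class is hypothetical and uses a lock to stand in for what 'cmpxchg' or 'ldrex/strex' guarantee in hardware. Only the shape of `atomic_add` is the point.

```python
import threading


class AtomicCell:
    """Emulated machine word with compare-and-swap (a lock fakes the hardware guarantee)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`. Return True on success."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def atomic_add(cell, delta):
    """Read, compute, try to swap; if another thread got in first, go round again."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old + delta


def run(threads=4, increments=1000):
    """Hammer one cell from several threads and return the final value."""
    cell = AtomicCell()

    def worker():
        for _ in range(increments):
            atomic_add(cell, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()


if __name__ == "__main__":
    print(run())
```

Locks, counters, lock-free queues and the rest are all built out of that one loop, which is why one small primitive is enough.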
Go has garbage collection and lacks pointer arithmetic. So... it won't replace C++, then?
Why was that so easy and quick to say? I really don't understand the repeated banging-head-against-wall that language inventors are doing. There's a good reason why C++ is still in wide and very popular use: precisely because it does have explicit memory management and pointer arithmetic. C++ is a static, explicit language. Go is not. It will not replace C++, and no language will until that is understood.
The problems C++ needs fixing are elsewhere. The syntax needs cleaning up. The ABI needs rationalizing between architectures. Multiple inheritance needs some taming (ditch 'virtual' multiple inheritance - it's insane), but not removing. Interface-only classes need promoting to a full type rather than inferred from being 100% pure virtual (and even then there's usually a non-pure-virtual destructor for stupid foot-bullet-avoiding reasons). There needs to be saner syntactic sugar for repeated operations (like python's 'with' keyword). Templates syntax needs to be less verbose and more automatic (already being worked on for C++0x but at this rate will be C++1x, keyword 'auto').
Stop trying to replace C++ with a language that does not fulfill every aspect C++ covers. If you ARE a language inventor and reading my comment, answer this: can you write a cache/MMU interface or an interrupt handler in your language? If the answer is no, go back to the drawing board.
There's a lot to be said for exercise - it makes you healthier except in exceptional circumstances (like overdoing it, or if you have a heart condition).
Muscle mass is also a good way to lose weight long term. Short term, it weighs more than fat, so you get the surprising (to naive people) result that exercise can make you put weight ON if nothing else changes (and subconsciously you get more hungry due to the calorie burning).
Long term, muscle mass needs feeding. That's why your body gets rid of it if you don't use it - it's a waste of energy. You put muscle mass on, you burn calories whether you use it or not. Granted, it takes a lot. The best to focus on (so I'm told) is leg muscle, as they're already big and building them up is relatively easy (running/cycling/walking all do it).
But sure - exercise alone or diet alone isn't going to lose you weight. You need to do both.
What the fuck are you talking about? There is absolutely no "latency" harm caused by the CRC, at least not on any hardware actually able to decode the formats much less encode. If performing the CRC on decode is so burdensome, you can stop checking it once you obtain sync and only check it if you obviously lose sync.
There may be, for example, 64KB pages, containing many packets. None of the packets can be decoded until the entire 64KB page is received and its CRC checked. This may sound small, but for a 32-64kbit/s stream, that's around 10 seconds of latency right there. Alternatively, you can have 1 page per packet, but on 32-64kbit/s streams you end up with about 5-10% overhead from the container. It is a REAL problem.
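To put rough numbers on that, here is a back-of-envelope sketch in Python. The 27-byte fixed page header and the one-lacing-byte-per-255-bytes rule are taken from the Ogg framing spec; the 256 and 512 byte packet sizes are purely illustrative assumptions, not measurements from any real stream.

```python
# Back-of-envelope figures for Ogg page buffering and per-page overhead.
# Assumptions (not from the comment): 27-byte fixed page header, one lacing
# byte per started 255 bytes of packet, and illustrative packet sizes.

OGG_FIXED_HEADER = 27  # bytes in a page header before the segment table


def page_fill_seconds(page_bytes, bitrate_bps):
    """Seconds of stream needed to fill one page before its CRC can be checked."""
    return page_bytes * 8 / bitrate_bps


def one_packet_per_page_overhead(packet_bytes):
    """Header bytes as a fraction of the whole page when a page holds one packet."""
    lacing_bytes = packet_bytes // 255 + 1
    header = OGG_FIXED_HEADER + lacing_bytes
    return header / (header + packet_bytes)


if __name__ == "__main__":
    for kbps in (32, 64):
        wait = page_fill_seconds(64 * 1024, kbps * 1000)
        print(f"{kbps} kbit/s: a 64KB page takes {wait:.1f} s to arrive")
    for size in (256, 512):
        pct = 100 * one_packet_per_page_overhead(size)
        print(f"{size}-byte packets, one per page: {pct:.1f}% container overhead")
```

Under those assumptions a 64KB page takes about 8 seconds to fill at 64kbit/s and about 16 at 32kbit/s, and one-packet-per-page framing costs roughly 5-10% for packets of a few hundred bytes.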
So 5 times the decoding complexity, correctly masking out the right bit, just to save 7 bits out of a half million. Yea, I'll get right on it.
There is a version field on every page header, and it's 32 bits. It's a tiny waste, but it's still a waste. It's not so tiny a waste on the above mentioned low-latency, low-bitrate streams.
Ugh! so the amount of data that you must read in order to obtain a framing lock is then infinite?
Yes. Why not? Packets aren't infinite unless you're deliberately malforming a stream. Codecs generally have 'profiles' defining what the limits are. For example, Vorbis has a soft-limit of 8KB. Framing lock serves a purpose in some transports, but for on-disk, on-disc or WAN transport, it's not a big issue.
Or if at least the container had simplified framing that could be placed throughout large packets. There are huge advantages to packet streaming being as simple as possible. Copying the packets out of a video stream is a bad thing for CPU and power consumption.
Did you even bother to spend five minutes thinking before posting this crap? The designers of Ogg obviously spent a lot more time.
I've spent a very long time with the Ogg container format, as well as most of the others in common use. That's why I can recognize the problems with it, as can the ffmpeg developers, as can all the other developers I've worked with at various companies. It's universally hated by anyone who's had to deal with integrating it into a project already supporting other containers/codecs.
If you're reading this, Monty - it's not just bad blood with ffmpeg. I can't think of anyone I've worked on Ogg with who would admit to liking it, and who hasn't had to spend hours re-working their nice A/V streaming designs to work around its oddities.
You're right, in that Ogg could be improved to make it suck a lot less. Here's my list, which I'm sure is the same as everyone else's list:
The major issue I've had with Ogg is it just does everything a different way to everyone else. Everyone else did it another way for good reason. It's not that your decisions are terrible ones - they're just making it awkward to implement Ogg in players where there's already support for a bunch of other containers. You have to admit that's a valid complaint.
Copying packets/pages is a surprising hit on CPU and power consumption. When I was making MP3 players, it accounted for a few percent of power consumption. It's notably not something other codecs require, because they thought about that issue up-front.
I'll do some analysis for you:
Generalities/codec mapping
The complaint is that there's no up-front header declaring all the streams contained. This is actually absurd - in theory you need to scan the entire file in case someone's just concatenated a video file with an audio file. This was, also absurdly, one of the aims of the Ogg container spec: concatenation. It's awesome to ask implementations to do this.
Overhead
One of Ogg's aims was to try to be less than 1% of the total stream space. It does achieve that, but the 'lacing values' end up looking pretty stupid for anything with large packets. It's like the article says: you end up with long strings of '255' summing up to 32-64KB packets, and hey just for extra complexity's sake, you'll have to split them across multiple not-quite-64KB pages. And then figure out where in that mess you're supposed to stick a timestamp: and here's a hint, your first page in that sequence has timestamp 0xffffffff which is nice if you randomly seeked to that place to find a position. God, what a mess that is to implement.
Then there's decode CPU overhead: the above basically means you end up copying the bitstream, which is a significant few percent overhead when you're talking about video.
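The lacing scheme complained about above can be sketched in a few lines of Python. This assumes the usual Ogg rules (each lacing byte is 255 except the last, which is 0-254, and a page's segment table holds at most 255 of them); the function names are made up for illustration.

```python
# Sketch of Ogg-style lacing: a packet's length is stored as a run of bytes,
# each 255 except the last, which is 0-254 and terminates the packet.

MAX_SEGMENTS_PER_PAGE = 255  # the per-page segment count is a single byte


def lacing_values(packet_len):
    """Return the lacing bytes that encode a packet of packet_len bytes."""
    full, rest = divmod(packet_len, 255)
    return [255] * full + [rest]


def pages_needed(packet_len):
    """Minimum number of pages whose segment tables can hold the lacing run."""
    count = len(lacing_values(packet_len))
    return -(-count // MAX_SEGMENTS_PER_PAGE)  # ceiling division


if __name__ == "__main__":
    for size in (100, 255, 8192, 65536):
        lace = lacing_values(size)
        print(size, "bytes:", len(lace), "lacing values,",
              pages_needed(size), "page(s)")
```

A 64KB packet needs 258 lacing values, so it cannot fit in one page's segment table and has to spill onto a second page, which is where the continuation and timestamp headaches start.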
Latency
You didn't understand his point. The latency is inherent in Ogg due to the large pages (not packets) required to reduce its size overhead, and in the position of the CRC (at the front of the page rather than the end). Reducing the page size makes the page headers start taking significant percentages of size if it's a low bit rate stream, e.g. internet audio.
Random Access
Try pre-caching a 2GB video file. Or try pre-caching a 2GB video stream coming off the internet where the other end of the pipe is the other side of the world. Random access in these two realistic cases (if you'll admit that) requires a look up table, and it's precisely why many containers DO have one.
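As a sketch of what such a look up table buys you: assuming a hypothetical index of timestamps and byte offsets sorted by time, a seek becomes one in-memory search followed by a single read. The index layout and names here are invented for illustration and don't correspond to any particular container.

```python
import bisect

# Hypothetical seek index: parallel lists sorted by time. Entry i says
# "the data for index_times[i] seconds starts at byte index_offsets[i]".


def find_offset(index_times, index_offsets, target_seconds):
    """Return the byte offset of the last index entry at or before the target."""
    i = bisect.bisect_right(index_times, target_seconds) - 1
    return index_offsets[max(i, 0)]  # clamp targets before the first entry


if __name__ == "__main__":
    times = [0, 10, 20, 30]
    offsets = [0, 1000, 2100, 3300]
    print(find_offset(times, offsets, 25))
```

No probing of the file is needed to find the position; the only I/O is the read at the returned offset.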
Complexity
The lacing values crossing pages, packets crossing pages, position of CRC, position of timestamp between packets/pages especially when cross-page, timestamps between logical streams (elementary streams), and other oddities/idiocies all ADD UP to make it a bloody mess to deal with. You end up just making copies of packets out of the stream, which is inefficient. In fact, that's exactly what the official Xiph codecs do: they make ugly copies. On real world MP3 players (and I've worked on some) that accounts for about 10% of your battery play time right there. I kid you not.
What this guy is expressing is what everyone who's worked on the Ogg container format itself has found out: it's just BAD at EVERYTHING. It needs replacing with something that doesn't suck, and there are free/open alternatives around. Maybe Vorbis 2 should switch container.
I sadly have to agree, and I've voiced the same objections for a long time. It really is like he tells it: it's just bad at everything it was intended to achieve. It's a source of bugs, it's horrendously complicated to support, and it's horrendously inefficient at anything but audio (and even then, not so good).
It seems to me, most of what went wrong was trying to support concatenation of Ogg streams. This is a nice idea, but actually quite a rare case. It's also incredibly naive for the specification document to request that Ogg implementations detect this. What, I'm supposed to scan the entire file in case that happens? No. I'll just not be compliant to that, thank you very much.
I even wrote my own Ogg/Vorbis decoder from scratch a while back (and dabble every now and then), and found Ogg to be a never-cooling, never-extinguishing steaming pile of hippo crap left over from consuming a dog. It just made everything so difficult to do. Seeking a stream involves divide-and-conquer - not necessarily a bad thing, but when you have huge streams the number of seeks can be bad. Not to mention if your stream has an endpoint the other side of the Atlantic Ocean. Why oh why did they pick timestamps being at the END of a page and indicating the output byte count produced by the END of that page? That little detail alone probably cost me days of debug.
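For a sense of scale, here's a toy model of that divide-and-conquer seek in Python. It assumes fixed-size 4KB pages and a fake page reader, both simplifications (real Ogg pages vary in size, and after each jump you have to scan forward for the capture pattern), and it just counts how many probes the bisection needs.

```python
# Toy bisection seek. PAGE_BYTES and SAMPLES_PER_PAGE are assumed values
# chosen for illustration; fake_read_page stands in for a real seek + read.

PAGE_BYTES = 4096
SAMPLES_PER_PAGE = 1024


def fake_read_page(offset):
    """Pretend to seek to offset; return (page_start, end_timestamp)."""
    k = offset // PAGE_BYTES
    return k * PAGE_BYTES, (k + 1) * SAMPLES_PER_PAGE


def bisect_seek(read_page, file_bytes, target_sample):
    """Find the first page whose end timestamp reaches target_sample.

    Returns (byte_offset, probes), where probes counts calls to read_page.
    """
    lo, hi = 0, file_bytes // PAGE_BYTES - 1
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        _, end_ts = read_page(mid * PAGE_BYTES)
        probes += 1
        if end_ts < target_sample:
            lo = mid + 1   # page ends before the target: look later
        else:
            hi = mid       # page ends at or after the target: look here or earlier
    return lo * PAGE_BYTES, probes


if __name__ == "__main__":
    offset, probes = bisect_seek(fake_read_page, 2 * 1024 ** 3, 1_000_000)
    print("offset", offset, "after", probes, "probes")
```

For a 2GB file with these assumed page sizes that's 19 probes, each of which is a separate seek, or a separate round trip if the stream is remote.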
I almost gave up at one point and went to a container format of my own which would have worked much better. Header: 'CONTAINER v1'. Packet: 'MAGIC', 4 byte Length, 4 byte Output pos. Job done. The sad fact is, that's easier than Ogg, smaller than Ogg (unless you're talking really low bit rate), and does entirely the job of Ogg without the complexity.
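That throwaway format is simple enough to sketch in full. This is only an illustration of the layout described above: the little-endian byte order, the function names and the error handling are assumptions, not part of any real format.

```python
import struct

HEADER = b"CONTAINER v1"
MAGIC = b"MAGIC"
# 4-byte length then 4-byte output position; little-endian is an assumption.
FIELDS = struct.Struct("<II")


def write_stream(packets):
    """Encode an iterable of (output_pos, payload_bytes) into one byte string."""
    out = bytearray(HEADER)
    for output_pos, payload in packets:
        out += MAGIC + FIELDS.pack(len(payload), output_pos) + payload
    return bytes(out)


def read_stream(data):
    """Yield (output_pos, payload_view) per packet; ValueError on bad framing."""
    if not data.startswith(HEADER):
        raise ValueError("missing container header")
    view = memoryview(data)
    pos = len(HEADER)
    while pos < len(data):
        if data[pos:pos + len(MAGIC)] != MAGIC:
            raise ValueError(f"bad packet magic at byte {pos}")
        pos += len(MAGIC)
        length, output_pos = FIELDS.unpack_from(data, pos)
        pos += FIELDS.size
        if pos + length > len(data):
            raise ValueError("truncated packet payload")
        # The payload is handed out as a view into the buffer, not a copy.
        yield output_pos, view[pos:pos + length]
        pos += length


if __name__ == "__main__":
    blob = write_stream([(0, b"first packet"), (1024, b"second packet")])
    for output_pos, payload in read_stream(blob):
        print(output_pos, bytes(payload))
```

The framing cost is a flat 13 bytes per packet, and because every packet is contiguous the reader can hand the decoder a view straight into the buffer rather than reassembling it.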
I'm probably going to add a Matroska container to my codec just to see how easy they are to produce. The spec looks fantastic, but the devil's always in the details - although seeing the praise on various (engineer) forums, it looks like the way to go.
So, Ogg, please die. We need you to get out of the way.
Your objection is to armchair engineers? OK, well I'm not an armchair engineer. I've written my own Ogg/Vorbis decoder from scratch in the past. I've worked on codecs for about 10 years. I'm a fan of Vorbis and Theora, but Ogg needs to die a horrible death.
Ogg was by far the most bug-inducing part of the code. It's just AWFUL. It's ill-designed. It's incredibly complicated. It's inherently inefficient (copy sometimes required).
In short, it's the worst container format I've used in any serious application, and I've used pretty much all the common ones.
The irony of what you're saying is that Ogg is actually what you'd end up with if an armchair engineer designed an audio codec container from scratch.
I like Go, but it's wrong to call it a systems language. I have no idea why they're marketing it as such. I can't think of a single system (as in, low-level, embedded, or even PC system software) where it would be suitable.
What they should have said is it's a NATIVE compiled language, or maybe just 'static'. There's huge advantages to that, even if it's not suitable for systems programming. Sadly, I think that niche is already pretty much covered by Java and C#.
Depends on the ARM CPU. ARM7/ARM9 are alignment sensitive. ARM Cortex has a bus/cache interface that allows arbitrary alignment. Porting to the former may be difficult depending on the software, or may simply be tedious; the latter is usually as easy as a recompile if the platform and toolchain are similar (e.g. Linux+gcc).
C code which accesses data unaligned is illegal anyway. Correctly written C and C++ programs which do not assume structure packing, size or alignment will compile and run without changes.
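Python's struct module makes the same distinction visible, as a rough illustration: the '@' prefix follows the platform's native C layout, padding included, while '<' is a fixed, packed little-endian layout. The record format below (one byte followed by a 32-bit integer) is a made-up example.

```python
import struct

# Explicit layout: no padding, little-endian, identical on every platform.
RECORD = struct.Struct("<BI")


def sizes():
    """Return (native_size, packed_size) for a byte followed by a 32-bit int."""
    return struct.calcsize("@BI"), struct.calcsize("<BI")


def read_record(buf, offset):
    """Decode one record at any byte offset without assuming alignment."""
    return RECORD.unpack_from(buf, offset)


if __name__ == "__main__":
    native, packed = sizes()
    print("native layout:", native, "bytes; packed layout:", packed, "bytes")
    data = b"\x00" + RECORD.pack(7, 0x12345678)  # record starts at odd offset 1
    print(read_record(data, 1))
```

The packed size is always 5 bytes, while the native size depends on the platform's alignment rules; code that hard-codes either layout for the other case is the kind that breaks when ported.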
Sadly, there are a lot of programs written assuming otherwise. Usually stuff written for an x86/Windows environment. You'd be surprised just how much GNU/Linux source compiles and runs without a single change: basically the entirety of Ubuntu. Hell, I used to run a Debian ARM system back in 1999. This isn't news to me, at least.
The Itanium is more than just superscalar, it is explicit parallelism.
Except it isn't, really, and that's precisely the problem with Itanium. It was intended to be Explicitly Parallel Instruction Computing (EPIC), but then Intel remembered they were Intel and instead designed Extremely Complicated Parallel Computing (EPIC FAIL). Yes, the instruction bundles do explicitly indicate dependencies, but they still have to worm their way through a huge reorder and retirement buffer, and just to top things off there's STILL register renaming. These are supposed to be things that EPIC allows you to remove from a design!
It's a shame, because it could have been a very straightforward design, with a shorter pipeline, and most importantly smaller and cheaper. Or just tons of cores. I think Intel just had POWER envy and wanted to reclaim the crown of King of Fucking Complicated back from IBM, which is funny because at least IBM ditched the complexity when they saw it wasn't working so well (POWER6).
I do mean burn in an FPGA, yes. Or more likely: extend the existing silicon in a video codec for the relatively minor differences in bitstream (minor in that we're not talking wavelet vs DCT or anything). It would be a very small amount of extra silicon required, if any, as you'll find many codecs actually run uploaded microcode.
Decoding 1080p Theora may be asking a lot in software, except for high-end desktops and laptops. But that doesn't rule out having 480p or even 720p versions. The point is there's always going to be a baseline version of content, and that version might as well be in a format that can be universally used: Theora.
I do wonder about mapping the features of Theora to a GPU - might be something to try some rainy weekend. It would mean there's one less (lame) excuse not to use Theora.
By virtue of the de facto status, it seems like anything that the majority of people use will never be superseded by anything that barely matches or only slightly improves on the de facto standard. From what I've read [reddit.com] Theora is quite bare-bones compared to H.264 and hasn't been designed with hardware decoding in mind.
And if you actually read what you linked you'll see it immediately debunked. Theora is up to scratch and has been designed with hardware decoding in mind. It's slightly behind H.264, but come on, we're not talking double the bit rate or anything. Better stuff being around never stopped MP3 being the de facto standard. Universal availability always trumps technical excellence.
Why is a lack of hardware codec a problem? For a modern smartphone, it's possible to decode half-VGA resolution in real time. Yes, this is slightly (not majorly) more power expensive than hardware decode, but at least it provides a universal baseline.
You also seem to be expending a lot of energy writing an email for something you don't care about?
So, mjg59, I would kindly request that you stop making such claims. I doubt you have worked on embedded devices -- I believe you work on server level stuff -- but you most certainly haven't worked extensively on the SoCs being used on Android phones.
I have worked on embedded devices - Linux based and others - for over a decade, and mjg59 is right. If you need something as brutal as hard suspend mode, you're doing it wrong. Any serious low power optimized ARM SoC can run down to very low current when idle, and has peripherals which can be individually clocked down and/or gated. I did work in a peripheral way on the G1 at Google, and was very surprised at the way power gating was done. Put simply: the only other embedded OS in the same class as Android which does power gating like this is Windows Mobile. Everyone else learned it was unnecessary a long time ago. I was fairly shocked how few engineers had ever done serious embedded work before, and the result shows.
I know the Qualcomm parts have horrendously stupid design decisions in them which prevent decent idle current, but it's a wash compared to the other sources of battery drain. It's also a wash compared to the damage it does to your code design to support full suspend as part of normal per-second operation.
The Linux maintainers are right: Android is just doing it the wrong way. If there's any one feature I think Android could have done without, it's wake-locks. Learn how to use fine-grained clock switching and gating, not brutally-coarse-grained suspend. This isn't a bloody laptop. And no, I'm not saying this as a back seat driver: I really have done this kind of crap for a decade.
... turns into:
Wow, doesn't that sound better? Not only that, but it makes the next step easily seen: how many people scan Google News? What's 55% of that number? How many clicks is that? Isn't that a gigantic portion of a news site's revenue?
But hey, the stat sounds much more evil when you say it the other way around.
The main take-away from this talk is that the modern software engineer needs to pay more attention to memory access and data dependency.
For some reason, the Slashdot luddites have come out in force to declare that it was actually about how inaccessible modern architectures are and how it's more proof that you should never use anything but a high level language. Nonsense.
I see this happen every time the subject of low level architecture comes up. There's a (sadly) large proportion of engineers who vehemently refuse to learn anything below the highest levels of programming. This turns into a silly justification backed by the evidence of how complex modern architectures are.
Some variants of this luddite behavior emerge as 'premature optimization is the root of all evil'. Yes, it's a good quote, but it's not referring to what you're referring to. There's nothing wrong with knowing in advance where the bottlenecks in a system will likely be. That's called experience. It's called knowing the characteristics of your platform. Those who stubbornly design systems without thought to performance are doomed to produce code which is inefficient, slow, and worst of all - incapable of being optimized without a re-write. Premature optimization may be bad, but preemptive optimization is a good quality to have.
That's the second take-away, in my opinion, from the talk: Engineers are all going to have to learn how to optimize code for the architecture, because your free ride on the MHz and CPI slope has ended. Here's a clue: if you're someone who knows how it all works, can preemptively optimize their designs to better fit their system, and can use their knowledge to debug issues, you are a far more valued engineer than the others. Bear this in mind the next time you find that a million outsourced engineers can do exactly the same job as you.
As an ex-Google employee who worked on Android and (peripherally) Chrome, I can tell you the grandparent is absolutely right. They share WebKit, but they were written using branches that were not merged until after public release. None of the rest shares any code whatsoever. Hell, the 'chrome' (not 'Chrome') in the Android browser is written in JAVA.
Android's browser is NOT Chrome, shares absolutely no code outside of WebKit, and even that is via general-purpose merging of changes via branches. I have no idea why various Google directors/PMs have suggested otherwise in the past, other than of course just being incorrect without realizing it.
This stuff is all visible in public repositories. Go have a look. There's no code shared other than WebKit.
And, of course, it doesn't even send the same User Agent string so it won't be counted as Chrome in any case.
A quad-2ghz arm would be on par with a P3-500.
What kind of ARM cores? 15 year old ARM7 cores, maybe. A single Cortex-A8 core (found in most of the high-end smart phones) at 1GHz would be on a par with a P3-500. These things are dual-issue and have very fast, low latency memory access compared to what was available on P3s.
If you had Quad 2GHz Cortex-A9s, I think a more appropriate comparison would be a 1.8GHz Athlon XP.
Most of the difference in speed you see isn't the core anyway. It's the video, storage (flash), wireless and other peripherals which bog things down. In fact, the GPUs in today's smart phones are faster than what was available in the P3-500 days too (at least, the SGX based parts are).
I think it would be cool to take RISC one step further where a single instruction is broken into 5-7 parts (or however deep the pipeline is), with each part controlling each step in the pipeline. But the assembler exposed a very traditional RISC instruction set similar to MIPS or Sparc, with a naked pipeline model.
This is pretty much what microcode is. The trouble is you need a huge amount of control signals to drive the pipeline - much more than 32 bits for a start. The vast majority of combinations also don't make sense (99% of signals are halt-and-catch-fire). So you'll have to encode the signals to some extent. At which point you've rediscovered instruction set encoding :) An ancient 6502 has about 130 signals for its non-pipelined architecture, and last I checked x86 (which was years ago) it was in the thousands!
To be fair, there are architectures where driving the pipeline is much more explicit. A lot of (older) DSPs are like this. You'll find some that don't have any hazard protection (e.g. Motorola 56k): you're not supposed to write (or generate) code that has input/output conflicts, or accesses values before they're ready. Some vector DSPs have explicit input/output driven by the number of cycles between instructions, and yes that's fairly insane.
The reason it's not common is it's not much benefit. There's not a huge amount of extra silicon to have a statically scheduled pipeline, e.g. like ARM. Even Cortex-A8, which has dual pipelines, admits to having a few instructions with unnecessary extra cycles of delay due to simplifying the scheduling (so it doesn't end up out-of-order scheduled).
It is very tempting to have an instruction set designed exactly to a single generation of CPU and its exact pipeline. Transmeta's CPUs are the closest anyone's really come to this - it's a shame the timing of their venture was bad, because in the current low power market, with much better compilers and tools, I think it's a really good idea to test some more.
How does multi-threading work without complex atomic instructions?
There are a few exceptions to the simple instruction / simple pipeline. They tend to stall everything but they're rare and don't complicate the design. For example on ARM:
Before ARMv6, the 'swp' instruction performed a bus-locked swap of a register with a word of memory, returning the existing value. That was all you needed.
ARMv6 and up, there's 'ldrex/strex', which perform a load/store of a word, with the store failing if a non-ldrex instruction or other core accessed the same cache line before the strex. Slightly more efficient, and all you need.
Even on x86 you'll find that the vast majority of multi-threading is done using just 'cmpxchg'. You don't need a huge amount of instruction set to get it done.
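Here is a toy, single-threaded Python model of the ldrex/strex retry loop, to show the shape of it. This is a simulation for illustration only: the class and method names are invented, it tracks one word rather than a cache line, and real hardware has more corner cases than this.

```python
class ExclusiveWord:
    """Toy model of one memory word with an exclusive-access reservation."""

    def __init__(self, value=0):
        self.value = value
        self._reserved = False

    def ldrex(self):
        """Load the word and take out a reservation on it."""
        self._reserved = True
        return self.value

    def store(self, value):
        """An ordinary store from elsewhere; it clears any reservation."""
        self.value = value
        self._reserved = False

    def strex(self, value):
        """Store only if the reservation survived. Returns 0 on success, 1 on failure."""
        if not self._reserved:
            return 1
        self.value = value
        self._reserved = False
        return 0


def atomic_add(word, amount, interfere=None):
    """Retry load-exclusive / store-exclusive until the store sticks.

    interfere, if given, is called once between the first load and store to
    simulate another core writing the word. Returns (old_value, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        old = word.ldrex()
        if interfere is not None and attempts == 1:
            interfere(word)
        if word.strex(old + amount) == 0:
            return old, attempts


if __name__ == "__main__":
    w = ExclusiveWord(5)
    print(atomic_add(w, 1), w.value)
    w = ExclusiveWord(5)
    print(atomic_add(w, 1, interfere=lambda m: m.store(100)), w.value)
```

The same loop shape covers locks, counters and compare-and-swap: load, compute, attempt the store, and go round again if someone else got in first.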
Go has garbage collection and lacks pointer arithmetic. So... it won't replace C++, then?
Why was that so easy and quick to say? I really don't understand the repeated banging-head-against-wall that language inventors are doing. There's a good reason why C++ is still in wide and very popular use: precisely because it does have explicit memory management and pointer arithmetic. C++ is a static, explicit language. Go is not. It will not replace C++, and no language will until that is understood.
The problems with C++ that need fixing are elsewhere. The syntax needs cleaning up. The ABI needs rationalizing between architectures. Multiple inheritance needs some taming (ditch 'virtual' multiple inheritance - it's insane), but not removing. Interface-only classes need promoting to a full type rather than inferred from being 100% pure virtual (and even then there's usually a non-pure-virtual destructor for stupid foot-bullet-avoiding reasons). There needs to be saner syntactic sugar for repeated operations (like python's 'with' keyword). Template syntax needs to be less verbose and more automatic (already being worked on for C++0x but at this rate will be C++1x, keyword 'auto').
Stop trying to replace C++ with a language that does not fulfill every aspect C++ covers. If you ARE a language inventor and reading my comment, answer this: can you write a cache/MMU interface or an interrupt handler in your language? If the answer is no, go back to the drawing board.