Slashdot Mirror


The Sacrifices of Portability?

hackwrench asks: "There is lots of talk about writing portable programs, but this pursuit has resulted in a lot of processor features going unused. One example is being able to write a program that purposely uses a combination of 16-bit and 32 bit. I know there are arguments that writing solely in one or the other is a performance advantage, but what are the factors involved? Is the slowness of such a combination inherent in its design or is it a result of current hardware? We are beginning to replace systems and programs designed primarily to run in pure 32-bit mode with systems designed to run in pure 64-bit mode, so I ask: Is such purity really worth it?"

95 comments

  1. half used registers? by tom8658 · · Score: 0

    As I understand it, 16 bit code can only address 64k of flat address space. I suppose this would be a performance hit.

    Also, wouldn't it waste space in a 32 bit register to hold a 16 bit integer? If so, half of your register space is wasted... I'm not sure about this though. If anyone knows better, please correct me

  2. Compiler Optimizations by jaredmauch · · Score: 2, Informative

    Is the problem that the compiler optimizations are not producing the right output, or that too much of the code is compiled with debug flags (i.e. -g)? I would expect the compiler to handle things, but I've found that I rarely want to run the non-debug code: things rarely go south, and when they do I'd rather have an easy way of solving the problem available to me. There are some cases where I don't do this because performance matters, but that's rare in my experience. People have done many studies of which compiler optimizes better, e.g. gcc vs. the Intel compiler, or gcc vs. the Sun compiler. Generally the one written by the vendor does a slightly better job.

  3. The industry by Device666 · · Score: 2, Interesting

    The hardware/software industry (generally speaking) doesn't care about quality, as long as its members are so busy competing with each other at a high pace. Because companies are competing, they will seek features others don't have, and most of the time the relevance of those features is very small, especially if the company has become very big (there are always exceptions, of course). The crowd isn't very picky either, though it is clear open source has put value on the development of portable software. People buy an amd64 (not only because of the price; of course some geeks also buy it for the 64 bit feature), but the majority still run 32 bit Windows binaries on it. So most people don't care so much, I would say. They just buy something because it's cheap, to play games on and to do things like Patience, FreeCell and basic things like Word. There is so much technology to know about that at some point people don't care anymore. The next day there will always be newer and faster hardware, even when you only wanted to make a simple document. But then you have to use Vista (for some), so then you would need a faster computer. Portability is only useful for people who don't want to keep buying software and are fed up with it. A very few of them get their hands dirty and migrate to open source/free software or start to write alternatives themselves. To those people these things really matter. They want to make something durable, and it simply takes time for software to mature. Portability of code really matters if there are more open source/free software users and developers. Then people will experience the benefits of portable code.

    1. Re:The industry by __david__ · · Score: 2, Insightful
      Portability is only useful for people who don't want to keep buying software and are fed up with it.
      No, portability is more useful to those writing software that has to run in 2 (or more) environments. Say I want to write a game that runs on the xbox and the ps2. The more portable I make my code, the happier I will be in the long run (and the cheaper the price will be for the port to whichever platform comes second).

      -David
  4. 64 / 32 bit: it depends for what use by Device666 · · Score: 1, Informative

    See http://en.wikipedia.org/wiki/64-bit: "A change from a 32-bit to a 64-bit architecture is a fundamental alteration, as most operating systems must be extensively modified to take advantage of the new architecture. Other software must also be ported to use the new capabilities; older software is usually supported through either a hardware compatibility mode (in which the new processors support an older 32-bit instruction set as well as the new modes), through software emulation, or by the actual implementation of a 32-bit processor core within the 64-bit processor die (as with the Itanium2 processors from Intel). One significant exception to this is the AS/400, whose software runs on a virtual ISA which is implemented in low-level software. This software, called TIMI, is all that has to be rewritten to move the entire OS and all software to a new platform, such as when IBM transitioned their line from 32-bit POWER to 64-bit POWER. While 64-bit architectures indisputably make working with huge data sets in applications such as digital video, scientific computing, and large databases easier, there has been considerable debate as to whether they or their 32-bit compatibility modes will be faster than comparably-priced 32-bit systems for other tasks. Theoretically, some programs could well be faster in 32-bit mode. Instructions for 64-bit computing take up more storage space than the earlier 32-bit ones, so it is possible that some 32-bit programs will fit into the CPU's high-speed cache while equivalent 64-bit programs will not. However, in applications like scientific computing, the data being processed often fits naturally in 64-bit chunks, and will be faster on a 64-bit architecture because the CPU will be designed to process such information directly rather than requiring the program to perform multiple steps. 
    Such assessments are complicated by the fact that in the process of designing the new 64-bit architectures, the instruction set designers have also taken the opportunity to make other changes that address some of the deficiencies in older instruction sets by adding new performance-enhancing facilities (such as the extra registers in the AMD64 design)."

  5. 16 bit is often slower than 32 bit by dtfinch · · Score: 4, Informative

    In 32 bit protected mode, 16 bit instructions require a prefix to tell it that the following instruction is 16 bit, wasting a byte and a CPU cycle. In 16 bit real mode, the same is true of 32 bit instructions. But modern processors aren't optimized to preserve 16 bit performance. If they can improve 32 bit performance just a little, they'd be willing to sacrifice a lot of 16 bit performance to do it. Also, if you're mixing 16 and 32 bit variables in C/C++, it'll do a lot of expensive conversions to make it all work. I've done very little with 64 bit though, aside from playing with MMX on one occasion.

    1. Re:16 bit is often slower than 32 bit by Nyall · · Score: 1

      The original question mentioned mixing 16 and 32 bit, not explicitly variables. When I code C I'll use 32 bit integers for variables without worrying. But for arrays, data structures, and arrays of data structures I'll put some thought into whether I need a field to be 32 or 16 bits.

      You are right that mixing 32 bit and 16 bit variables is a recipe for slowness.

      --
      http://en.wikipedia.org/wiki/Jury_nullification
    2. Re:16 bit is often slower than 32 bit by drxenos · · Score: 1

      First of all, 16 and 32 bit MODES have nothing to do with 16 and 32 bit variables. Secondly, variables smaller than int are ALWAYS promoted to int or unsigned int automatically in expressions. So, assuming int is 32 bit, 16 bit variables are always promoted to 32 bit, regardless of whether all variables involved are 16 bit or not. Besides, as long as the variables are aligned in memory (which the compiler will take care of), there is no penalty.
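
      The promotion rule described here can be checked directly. What follows is a minimal sketch in C, assuming a platform where int is wider than 16 bits; the function names are made up for illustration and are not from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the type produced by adding two 16-bit values.
 * The operands are promoted before the addition, so this is
 * sizeof(int) rather than 2. sizeof does not evaluate the sum. */
size_t promoted_sum_size(void)
{
    uint16_t a = 1, b = 2;
    return sizeof(a + b);
}

/* The addition itself happens in int, so nothing wraps at 16 bits
 * (assumption: int is wider than 16 bits on this platform). */
long sum_before_narrowing(void)
{
    uint16_t a = 0xFFFF, b = 1;
    return (long)(a + b);
}

/* Wraparound only appears when the result is converted back
 * to a 16-bit type. */
uint16_t sum_after_narrowing(void)
{
    uint16_t a = 0xFFFF, b = 1;
    return (uint16_t)(a + b);
}
```

      On a typical 32 bit or 64 bit target, the first function reports the size of int, the second returns 65536 and the third returns 0, which suggests the 16 bit operands never take part in a 16 bit addition at the language level.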

      --


      Anonymous Cowards suck.
  6. Think memory usage, not size... by hackwrench · · Score: 1

    You can fit two 16 bit integers in the space of a 32-bit register or any other memory device. Existing 16 bit code shows that you can code useful routines that fit in 64k. Also, it's not like 16-bit code and 32-bit code can't communicate with each other. 32-bit code can have several 16-bit routines within its space.

    1. Re:Think memory usage, not size... by be-fan · · Score: 2, Insightful

      First, using 16-bit components of registers incurs a stall on most modern x86 CPUs. Remember, they are RISC processors underneath, which have no conception of partial GPRs. Second, RAM is dirt cheap, so let's not even consider blowing RAM. Things get interesting when talking about fitting things in cache, but the simple truth is that if your data doesn't fit in cache, the benefit from just halving its size is usually minimal. As soon as your data set grows, you've blown the cache again. You're almost always better-served trying to figure out how to get your code to operate on data in cache-sized chunks, so your performance stays constantly good, instead of being great with one data set and piss-poor with another.

      --
      A deep unwavering belief is a sure sign you're missing something...
    2. Re:Think memory usage, not size... by __david__ · · Score: 1
      You can fit two 16 bit integers in the space of a 32-bit register or any other memory device.
      Yes. Most compilers let you optimize for size, or speed--that is, they are mutually exclusive. What you are suggesting is hand optimizing for size. This isn't necessarily bad, but is pointless for structures that exist in 1 or 2 places (what, you're saving 2 bytes total?). In a huge multi-megabyte array it can make a dramatic size difference. But it can also slow the crap out of your code in certain situations, so you'd have to know your architecture well before making the judgement call. Most architectures (though possibly not x86) require double the instructions to load a non-machine word length number. You first have to load it in, then you have to sign extend or unsigned extend. These are almost always separate instructions.
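
      To make the trade-off concrete, here is a rough sketch in C. The element count and function names are hypothetical; the point is only that the narrower element type halves the array, while every element still has to be widened before it is used in arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element count for a large table. */
#define SAMPLE_COUNT ((size_t)1000000)

/* Storage needed for the table with 16-bit and with 32-bit elements. */
size_t bytes_as_int16(void) { return SAMPLE_COUNT * sizeof(int16_t); }
size_t bytes_as_int32(void) { return SAMPLE_COUNT * sizeof(int32_t); }

/* Summing 16-bit elements: each value is widened (sign extended)
 * before the add. Whether that widening costs a separate instruction
 * depends on the architecture. */
int64_t sum16(const int16_t *v, size_t n)
{
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t widened = v[i];   /* load plus sign extension */
        total += widened;
    }
    return total;
}
```

      With a million elements the 16 bit table takes 2,000,000 bytes against 4,000,000 for the 32 bit one. Whether the saving outweighs the widening cost is something to measure on the target rather than assume.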

      That being said, most times you won't care about size or speed and so ints will do just fine. You shouldn't be hand optimizing at all unless you've determined that something is too large or too slow.

      -David
    3. Re:Think memory usage, not size... by Nutria · · Score: 1

      You can fit two 16 bit integers in the space of a 32-bit register or any other memory device.

      In 16-bit "mode", the x86 lets you access the upper and lower halves of the [abcd]x registers as [abcd]l and [abcd]h.

      While the registers were extended to the 32-bit e[abcd]x registers, and the lower 16 bits are still accessible via [abcd]x, there is not, TTBOMK, any way to access the upper 16 bits.

      The same goes for the 64-bit r[abcd]x registers.

      --
      "I don't know, therefore Aliens" Wafflebox1
  7. Wth? by ratatask · · Score: 0, Flamebait

    So you decided to post a /. story but you didn't have anything to say ?

    >One example is being able to write a program that purposely uses a combination of 16-bit and 32 bit.
    Idiotic example. What are you talking about? 16 and 32 bit data types? If so, you don't trust your compiler to optimize? Damn. Oh, you're talking about code then? A combination of 16/32 bit is amazingly rare. For all practical purposes it means running DOS programs on Windows, and speed isn't an issue here. Sorry.

    >We are beginning to replace systems and programs designed primarily to run in pure 32-bit
    >mode with systems designed to run in pure 64-bit mode,
    Converting a 32bit application to 64 bit will mean nothing, unless it's a special purpose program that can take advantage of the expanded address space. Consider it close to nil percent of desktop software, but important for those few that use it.

    However, hardware vendors will jump to 64 bit, they will support it, develop it and 64 bit systems will in short be the ones pushing more GHz through marketing ads. And running in 32 bit compatibility will have a (small) performance hit.
    So yes. It's worth it from a performance point of view for laymen, in the near future, but likely they won't have any use for the gains.
    Which area of "worth it" did you want to discuss? Performance, reliability, investment or something else?
    And for whom? Weather centers needing big iron to predict next week's weather?

    1. Re:Wth? by hackwrench · · Score: 1

      a combination of 16/32 bit is amazingly rare.

      That's my point. They're rare only because the tools to make code are designed to make them rare.

      Converting a 32bit application to 64 bit will mean nothing, unless it's a special purpose program that can take advantage of the expanded address space.

      Accesses to hard drives make 64-bit addressing more useful. It's too early for exploration of 64-bit architecture to have yielded applications that run best in 64-bit mode.

    2. Re:Wth? by Rezonant · · Score: 1

      (assuming we're talking x86) 16-bit code is invariably slower than 32-bit code on modern processors. 32-bit code can deal with 16-bit values just fine and has a much simpler and more efficient memory model. Why would anyone want to run 16-bit code except when running old DOS programs?

    3. Re:Wth? by AvitarX · · Score: 1

      If we are talking about x86 there is a significant improvement from 32 to 64 bit.

      It has to do with the fact that the 64 bit instructions clean up a lot of the mess (including removing the 16 bit ones) and add extra pipelines.

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
  8. How Protected Mode works. by hackwrench · · Score: 1

    In 32-bit protected mode, there are 32-bit segments and 16-bit segments. The determination of which is which is stored in a flag in a descriptor stored in a descriptor table. In 32-bit segments, 16-bit instructions require a prefix and in 16-bit segments, 32-bit instructions require a prefix. However, both segments can and do exist side-by-side.

  9. Ideally, your code is clean enough by Frumious+Wombat · · Score: 4, Interesting

    that this transition isn't all that painful.

    My personal experience with this was Linux on Alpha, where certain programs assumed a 32-bit environment, rather than querying the system they were built on for size of int, pointer, etc. As a result many programs were funky on the Alpha, and the 'pc-isms' (what we once would have called Vaxocentrisms) caused great waste of time as they had to be tracked down and eliminated.

    Your code, if you've been worrying about anything other than 32-bit PCs, should already be 64-bit clean, as you've had 15 years of Alpha, SGI, Power, Itanium, and Sun 64-bit systems to support. If it isn't, hopefully it's something such as user interface which will still run in the 32-bit environment, though not necessarily optimally.

    Personally, I think that writing robust, portable, code is worth the effort. Unless you're talking about running on an embedded system where every byte counts, it doesn't hurt you at all to design clean algorithms and data structures, and put in checks to actually determine the size of ints, longs, pointers, etc, rather than just assuming that everyone will run x86 (or MIPS-64 or whatever) from now until the end of time. I have research programs that were written in the 70s (in their original form), on Cyber 205 and similar long-gone architectures, which still work because they were written in a mostly portable manner, with only the most critical nasty bits tied specifically to that machine. Your code is going to be in use longer than you think; be nice to your successors and make it portable now.
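
    As a small illustration of checking sizes instead of assuming them, here is a sketch in C11. The macro and function names are invented for this example; the standard pieces it leans on are CHAR_BIT, _Static_assert and uintptr_t.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* State the one assumption this sketch makes, so that a build on an
 * unusual machine fails loudly instead of misbehaving quietly. */
_Static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

/* Width of a type in bits, computed rather than assumed. */
#define BITS_OF(type) (sizeof(type) * CHAR_BIT)

/* Nonzero if a pointer can be stored in an int without losing bits.
 * True on many 32-bit systems, false on typical 64-bit ones. */
int pointer_fits_in_int(void)
{
    return sizeof(void *) <= sizeof(int);
}

/* Round-trip a pointer through an integer. uintptr_t is wide enough
 * for this wherever it is provided; plain int is not guaranteed to be. */
void *round_trip(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}
```

    Code that stores pointers in int is the kind of 'pc-ism' that breaks on a 64-bit machine such as the Alpha; asking the compiler for the sizes, or using the fixed-width types from stdint.h, avoids baking that assumption in.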

    --
    the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
    1. Re:Ideally, your code is clean enough by cerberusss · · Score: 1
      Personally, I think that writing robust, portable, code is worth the effort.

      I hope your manager does, too. What does he say if you're already late with the project, but you tell him you'd like to test it on another architecture?

      --
      8 of 13 people found this answer helpful. Did you?
    2. Re:Ideally, your code is clean enough by Taladar · · Score: 3, Insightful

      If you have to test it on another architecture you are not writing portable code. Portability is less about specific architectures and more about "don't assume anything that might not be true on other architectures", like endianness, sizeof(int), ...
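
      One common way to avoid assuming endianness is to read and write multi-byte values one byte at a time with shifts, instead of copying the in-memory representation. The sketch below is only an illustration; the function names and the choice of a little-endian wire format are assumptions, not something from the thread.

```c
#include <stdint.h>

/* Write v as four bytes, least significant byte first. The shifts
 * work on the value, not its memory layout, so the result is the
 * same on big-endian and little-endian hosts. */
void store_u32_le(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v & 0xFFu);
    out[1] = (unsigned char)((v >> 8) & 0xFFu);
    out[2] = (unsigned char)((v >> 16) & 0xFFu);
    out[3] = (unsigned char)((v >> 24) & 0xFFu);
}

/* Read the four bytes back into a value, again without depending
 * on host byte order. */
uint32_t load_u32_le(const unsigned char in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

      Storing 0x11223344 this way always yields the bytes 44 33 22 11 in that order, whatever the host, which is the property a file format or network protocol needs.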

    3. Re:Ideally, your code is clean enough by cerberusss · · Score: 1, Insightful
      If you have to test it on another architecture you are not writing portable code

      I understand the fact that you can at least prepare for portability. However, I would always want to run it through an alpha, beta (and maybe acceptance) environment before saying it'll work.

      --
      8 of 13 people found this answer helpful. Did you?
    4. Re:Ideally, your code is clean enough by Anonymous Coward · · Score: 0

      > If you have to test it on another architecture you are not writing portable code.

      Right. Taking it further, if you have to test (at all), you're not writing code (at all).
      Uh huh.

  10. Detailed Response to Cliff and HackWrench by woolio · · Score: 2, Interesting

    Does "hackwrench" even know how to program? Does he know anything about Computer Architecture? Do "Hennessy" or "Patterson" ring a bell? Sounds like "Cliff" likes to feed trolls. Maybe "hackwrench" will choke while digesting this one:

    What is the inherent "slowness" of "16 bit code"? WTF is "16 bit code" anyway? Sounds like he has been duped by the marketing droids...

    So-called "32-bit" processors are typically designed to perform (up to) 32-bit arithmetic efficiently. For integer operations, 8-bit, 16-bit and 32-bit arithmetic usually each take the same amount of time for a given operation (8-bit add = 16-bit add != 16-bit multiply).

    Because "32-bit" processors can do "32-bit" arithmetic efficiently, it makes sense for them to use (up to) 32 bits for addressing. Arithmetic involving addresses comes up more often than you would think... (Branch/Jump instructions, memory operations, and even the basic updating of the program counter). Since these processors' data paths are (typically) 32 bits wide, instructions are typically encoded using up to 32 bits. (In a 32-bit RISC processor, most of the instruction bits are reserved to allow large immediate operands for memory offsets, jump targets, and arithmetic/logic operations).

    The only thing a "32 bit" processor typically isn't good for is "64 bit" arithmetic (and any arithmetic over 32 bits, for that matter). This means that on these, a "64 bit" addition could be performed using 3 "32-bit" additions and a branch. "64-bit" multiplications get even worse...

    But if a program doesn't access much memory ( packed arithmetic whereby it can treat a 32-bit integers as a pair of 16 bit integers and a single operation can calculate both results... But this by itself is hardly justification alone for using such a processor.

    So guess what folks: There will likely never be a "1024bit" processor. (At least not for general purpose computing). I'm not trying to sound like Bill Gates with his "640k is enough" quote, but I don't see why processors will ever use much more than 64 or 128bit addressing. (Keep in mind that EACH BIT *doubles* the range of integer numbers/addresses the processor can handle efficiently).

    Yes we now can have 2^32 bytes of memory in computers (4GB). But WTF is anyone going to do with 2^64 bytes of ram? That's probably many orders of magnitude greater than the total capacity of all electronic devices ever produced from the 1950s until now...

    In conclusion, WTF? Mod Editor Down!

    1. Re:Detailed Response to Cliff and HackWrench by Device666 · · Score: 1

      You're right. But it also depends on what kind of bits for what kind of processor, and on how innovation in software use (which affects the architecture of that processor) will go. I think before they will (never) sell 1024 bit "processors", there might be some qubits and quantum processors and a plethora of different processor architectures as well (if we haven't become backwards cavemen due to centuries of patent wars).

    2. Re:Detailed Response to Cliff and HackWrench by Lorkki · · Score: 1

      Seeing as you summon Computer Architecture [sic] onto the field, I'd like to take this chance to remind you about the existence of MMUs and memory mapping. It's not all core memory you see in that address space. Even if that were not the case, 1 GiB of core memory is no longer a rarity, and 2 GiB is getting there as well. It's not difficult to guess the direction from there on.

      As for "anyone", there's this bunch of meteorologists, biologists and astrophysicists I'd like you to meet...

    3. Re:Detailed Response to Cliff and HackWrench by a_ghostwheel · · Score: 1

      NNN-bit does not necessarily define the size of addressable memory - it might as well be just the size of the internal processor bus. For pretty much every modern problem that relies on heavy number crunching (like, e.g., anything based on FEM) you will benefit a lot from, say 128-bit (or 256-bit) native floating point operations even if address space still would be 64-bit. Another use would be large number vector registers (to support which you would need wide internal bus) - so I could easily imagine useful "1024bit" processor which allows you to perform single cycle operations on multiple 1024-bit registers each holding, e.g., 1x1024/2x512/4x256/8x128/16x64 -bit values in 64-bit address space.

    4. Re:Detailed Response to Cliff and HackWrench by Nyall · · Score: 1

      The m68k is a good counter example. 8 bit and 16 bit adds take 4 clocks, but 32 bit math takes 8 clocks. But if you aren't working with embedded systems you don't need to worry about things like this.

      Also, most processors have a carry/extend flag (there are exceptions), so a 64 bit add with 32 bit registers can be done with 2 adds.
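
      The two-add scheme can be written out in portable C. This is only a sketch: C has no direct access to the carry flag, so the carry is recovered by comparing the low-word result with one of its inputs, and the type and function names are hypothetical.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves (hypothetical type). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Add the low halves, detect the carry out of them, then add the
 * high halves plus that carry. In assembly this is typically an
 * add followed by an add-with-carry. */
u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;                /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);    /* 1 if the low add wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* Helper for comparing against the native 64-bit type. */
uint64_t to_u64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}
```

      Adding 1 to 0x00000000FFFFFFFF this way gives lo = 0 and hi = 1, matching the native 64 bit result. A compiler targeting a 32 bit CPU generally emits much the same sequence for a uint64_t addition.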

      >>In a 32-bit RISC processor, most of the instruction bits are reserved to allow large immediate operands for memory offsets, jump targets, and arithmetic/logic operations

      In the procs I've studied only half the instruction is used for immediate values, and that's only for instructions that need them. Instructions that don't have an immediate value put these bits to use encoding other things.

      --
      http://en.wikipedia.org/wiki/Jury_nullification
    5. Re:Detailed Response to Cliff and HackWrench by Anonymous+Brave+Guy · · Score: 3, Insightful
      Yes we now can have 2^32 bytes of memory in computers (4GB). But WTF is anyone going to do with 2^64 bytes of ram?

      I don't know. Then again, ten years ago, if you'd told me that an e-mail client or web browser would require tens of megabytes of memory just to load, or it would require over 100MB just to store the quick start-up code for an office application, I'd have laughed. Right now, that's exactly what Firefox, Thunderbird and OpenOffice 2.0 are claiming on the PC where I'm writing this.

      Actually, I'm still laughing, because that says more than words about the design of those applications and the tools used to compile them. But the applications have expanded to fill the space nevertheless.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    6. Re:Detailed Response to Cliff and HackWrench by joto · · Score: 1
      For pretty much every modern problem that relies on heavy number crunching (like, e.g., anything based on FEM) you will benefit a lot from, say 128-bit (or 256-bit) native floating point operations even if address space still would be 64-bit.

      Agreed. There's still room for improvement when it comes to floating point formats. While 2^63-1 is a ridiculously large number for integer calculations, with floating point, you will still see the benefit going from 64-bit to 256-bit. And then, there's also funkier things, like SIMD, and more futuristic improvements, like interval arithmetic in hardware.

      Another use would be large number vector registers (to support which you would need wide internal bus) - so I could easily imagine useful "1024bit" processor which allows you to perform single cycle operations on multiple 1024-bit registers each holding, e.g., 1x1024/2x512/4x256/8x128/16x64 -bit values in 64-bit address space.

      I very much doubt this is the way vector operations in the future will go. While it's the way they work on Intel architectures now, it's not a particularly good way. MMX/3DNow!/SSE is notoriously difficult to actually make use of in real applications. Hopefully future improvements would be more programmer/compiler-friendly. The trick is to come up with an instruction set that the compiler can easily use, and that at the very least doesn't decrease performance when used in an obvious way.

    7. Re:Detailed Response to Cliff and HackWrench by woolio · · Score: 1

      Yes. I too can imagine that in the days of 16-bit 'desktop' processors, the idea of 4GB of RAM must have been absurd... But it's really not when it comes to manipulating hi-resolution photos, multi-user databases, etc...

      However, think about how large 2^64 is... Isn't it like the same order as the number of atoms in the universe or something like that??

      Let's guess that in the near future, there will be 10 billion people in the world (~10^10). Let's say we wanted a single computer that could store something about every person in the world. 2^64/10^10 = 1.7 GB of data per person!!! I think it is safe to say that for most (home/corporate) purposes, 2^64 bytes of RAM will always be extremely absurd. Although terabyte storage today is becoming more common (and widely used), 2^64 bytes = 1.67*10^7 TB!!!! If every person in New York City had 1TB of data, 2^64 byte storage would still have plenty left over... It is scary to imagine reasonable uses for such storage.
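
      The figures in this estimate can be reproduced with a few lines of C. This is a sketch that assumes binary units (1 GB taken as 2^30 bytes and 1 TB as 2^40 bytes), which is what the quoted numbers appear to use; the names are made up for illustration.

```c
#include <stdint.h>

/* 2^64 as a double. It is a power of two, so the constant is exact. */
#define TWO_POW_64 18446744073709551616.0

/* Bytes of a 2^64-byte store available per person. */
double bytes_per_person(double people)
{
    return TWO_POW_64 / people;
}

/* The same share expressed in units of 2^30 bytes. */
double gib_per_person(double people)
{
    return bytes_per_person(people) / 1073741824.0;
}

/* 2^64 bytes expressed in units of 2^40 bytes: 2^(64-40) = 2^24. */
uint64_t total_tib(void)
{
    return (uint64_t)1 << (64 - 40);
}
```

      For 10^10 people the share works out to roughly 1.72 binary gigabytes each, and the whole store to 16,777,216 binary terabytes, in line with the "1.7 GB" and "1.67*10^7 TB" figures.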

      I can only think of one entity in the US that might ever be interested in 16 Million TB of storage... And it probably ain't pretty...

    8. Re:Detailed Response to Cliff and HackWrench by Anonymous Coward · · Score: 0

      You also have 2^64 bytes of virtual memory.

      This way realloc()s and mmap()s always work even when you are using nearly 100% of the machine's memory.

    9. Re:Detailed Response to Cliff and HackWrench by Anonymous Coward · · Score: 0

      Well, going from 1MB to 4GB got us the ability to easily manipulate sound and high resolution images (without paging files in a bit at a time, a la the code overlays from back when we were dealing with a few tens of kilobytes). Seems like going beyond 4GB may similarly simplify dealing with video. (Keep in mind, even if you don't have more than 4GB of memory, a 64 bit address space means you can just memory map an entire drive and not worry about it.)

    10. Re:Detailed Response to Cliff and HackWrench by Taladar · · Score: 1
      However, think about how large 2^64 is... Isn't it like the same order as the number of atoms in the universe or something like that??
      No, that number is closer to 2^256 (and even then you are missing a few bits I believe).
  11. It depends by sfcat · · Score: 4, Insightful
    There are many factors that go into deciding how to write code. Portability is just one consideration of many. I would say that it is worth it if speed is of critical importance and development expenses are of no consequence.

    For instance, consider a video game. The faster it is the more likely it is that players will like it. But there are many more important factors including is the game just plain fun. So in video games, there is really a basic threshold of speed that needs to be met and after that is met, other factors are more important.

    Next consider a real time system for trading stocks. This system is all about speed and reliability. You can control the deployment hardware and it is economically worthwhile to spend a lot in development if it makes more money in the long run. So coding your own memory pooler that uses the size of the pointer and a specific struct to make the code allocate and deallocate memory in constant time (it is very possible) is worthwhile because it can save a lot of time per transaction.
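
    One plausible shape for such a pooler is a fixed-size free list. The sketch below is an illustration only, not anyone's production design: the order_t payload, the pool size and all function names are hypothetical. Each free slot reuses its own storage to hold the pointer to the next free slot, so allocating and freeing are a couple of pointer moves each.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical payload; a real system would pool its own struct. */
typedef struct {
    double price;
    long quantity;
} order_t;

/* A slot is either on the free list (next is valid) or handed out
 * to the caller (order is valid), never both at once. */
typedef union slot {
    union slot *next;
    order_t order;
} slot_t;

#define POOL_SLOTS 1024   /* hypothetical capacity */

typedef struct {
    slot_t slots[POOL_SLOTS];
    slot_t *free_list;
} pool_t;

/* Thread every slot onto the free list. Done once at startup. */
void pool_init(pool_t *pool)
{
    for (size_t i = 0; i + 1 < POOL_SLOTS; i++)
        pool->slots[i].next = &pool->slots[i + 1];
    pool->slots[POOL_SLOTS - 1].next = NULL;
    pool->free_list = &pool->slots[0];
}

/* Constant time: pop the head of the free list. NULL when exhausted. */
order_t *pool_alloc(pool_t *pool)
{
    slot_t *s = pool->free_list;
    if (s == NULL)
        return NULL;
    pool->free_list = s->next;
    return &s->order;
}

/* Constant time: push the slot back on the free list. */
void pool_free(pool_t *pool, order_t *order)
{
    assert(order != NULL);
    slot_t *s = (slot_t *)order;   /* a union and its member share an address */
    s->next = pool->free_list;
    pool->free_list = s;
}
```

    The price of the constant-time behaviour is a fixed capacity and a single object size, which is acceptable when the deployment hardware and workload are under your control, as in the scenario described.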

    But all of these issues come down to what exactly you are writing and both the technical and business requirements of your project. Without knowing those in advance, we can't really answer your question.

    --
    "Those that start by burning books, will end by burning men."
    1. Re:It depends by forkazoo · · Score: 4, Interesting

      A lot of people are writing responses that tend to assume it is impossible to write code that is portable, and also optimised for a specific platform. I recently read a book called "Vector Game Math Processors" (everybody needs a hobby, right?). Looking at how the examples were coded in that book sort of shifted my assumptions about how I should do things.

      Basically, the book covers the major vector instruction sets: Altivec, PS2, SSE, etc. Naturally, a program written with hand optimised SSE assembly won't run very well on a PowerMac G4. So, the approach the author used was to start by coding a vector math function in plain C. He only calls this function by a function pointer. So, instead of calling sw_vector_foo directly, he calls vector_foo. He then goes on to write altivec_foo, and sse_foo, and gamecube_foo. With some simple #ifdefs at compile time, the function pointer is assigned to the most optimal code path for the platform.

      So, the result is that by thinking about portability going in, he has to do hardly any work to have fairly optimal hand-tuned vector routines for a new architecture.
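
      A rough sketch of that pattern in C follows. It is not the book's code: the vector_add name, the choice of SSE as the tuned path and the use of the compiler-defined __SSE__ macro are assumptions made for illustration.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics, only on targets that have them */
#endif

typedef void (*vector_add_fn)(float *dst, const float *a,
                              const float *b, size_t n);

/* Portable reference version in plain C. */
void sw_vector_add(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#if defined(__SSE__)
/* Tuned version: four floats per step, with the plain C routine
 * handling any leftover elements. */
void sse_vector_add(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    sw_vector_add(dst + i, a + i, b + i, n - i);
}

/* Callers only ever use vector_add; the #if picks its target. */
vector_add_fn vector_add = sse_vector_add;
#else
vector_add_fn vector_add = sw_vector_add;
#endif
```

      Supporting another architecture then means adding one more branch to the #if chain with its own tuned routine, while every call site keeps using vector_add unchanged.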

      In general, code written to be portable is also much cleaner, and better commented, and whatnot, just because the author was forced to spend an extra few minutes thinking about how things ought to be put together. I really can't think of any normal case where portability shouldn't be a consideration. On some obscure embedded systems, you might really want to optimise to a super specific piece of hardware, but it is seldom worth it.

      Think about writing GUI apps for a Palm Pilot before the switch to ARM CPUs. A programmer could have said, "hey, I'm using the Palm OS APIs, and they only run on Coldfire CPUs, so I have no reason to make anything portable." Then, a little while later, Palm OS starts running on ARM. If he had invested a smidgen of extra effort to write his code in a portable way, he could easily start to take advantage of the ARM stuff right away. Since most of the issues of portability are in the planning phase, and get handled at compile time, the memory footprint need not be appreciably larger. (Like a bunch of hand coded ASM for a different platform, which gets #ifdef'd away, or sizeof() operators...)

    2. Re:It depends by Anonymous Coward · · Score: 0

      Looking at how the examples were coded in that book sort of shifted my assumptions about how I should do things.
      So, the approach the author used was to start by coding a vector math function in plain C. He only calls this function by a function pointer. So, instead of calling sw_vector_foo directly, he calls vector_foo. He then goes on to write altivec_foo, and sse_foo, and gamecube_foo. With some simple #ifdefs at compile time, the function pointer is assigned to the most optimal code path for the platform.


      Wow! You discovered polymorphism!

    3. Re:It depends by Anonymous Coward · · Score: 0
      Wow! You discovered polymorphism!
      Give the guy a break. He is probably a number cruncher; good with diff eq:s and Fortran but doesn't get the simple stuff.
    4. Re:It depends by forkazoo · · Score: 1

      Yeah, I was aware of function pointers before reading the book. I'd just never used them in quite the way the book does it. The book also gets into all sorts of other stuff, dealing with alignment on different platforms, things like that which I didn't bother to get into.

      Everything that's done in the book is perfectly understandable to somebody who knows C, but it's not something I usually see done in that way, all together. I've written plenty of software that works on MacOS/Linux/Windows/PPC/x86/SPARC/IRIX... I'd just never done it as well as the ways presented in the book. If you happen to notice it in a book store, it's quite interesting to flip through, IMHO.

    5. Re:It depends by ultranova · · Score: 1

      Basically, the book covers the major vector instruction sets: Altivec, PS2, SSE, etc. Naturally, a program written with hand optimised SSE assembly won't run very well on a PowerMac G4. So, the approach the author used was to start by coding a vector math function in plain C. He only calls this function by a function pointer. So, instead of calling sw_vector_foo directly, he calls vector_foo. He then goes on to write altivec_foo, and sse_foo, and gamecube_foo. With some simple #ifdefs at compile time, the function pointer is assigned to the most optimal code path for the platform.

      This doesn't make sense. If you are going to use #ifdefs, why not simply put the code for different versions of vector_foo inside them? Why play around with function pointers, when they are completely unnecessary for this?

      Or is he detecting the presence of SSE/MMX/3DNow!/Whatever at runtime? Then a function pointer would make sense...
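
      For the runtime case, a sketch might look like this. cpu_has_sse() and select_vector_foo() are made-up names, sse_foo here is a plain C stand-in rather than real SSE code, and the probe relies on the GCC/Clang builtin __builtin_cpu_supports, so treat it as an illustration under those assumptions.

```c
#include <stddef.h>

typedef void (*vector_fn)(float *dst, const float *a, const float *b);

/* Portable fallback: adds two 4-float vectors (assumed meaning of "foo"). */
void sw_vector_foo(float *dst, const float *a, const float *b)
{
    for (size_t i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
}

/* Stand-in for a hand-tuned SSE routine. Written in plain C here so the
 * sketch builds on any platform; real code would use intrinsics or asm. */
void sse_foo(float *dst, const float *a, const float *b)
{
    dst[0] = a[0] + b[0];
    dst[1] = a[1] + b[1];
    dst[2] = a[2] + b[2];
    dst[3] = a[3] + b[3];
}

/* Hypothetical probe: asks the CPU at runtime on x86 with GCC/Clang,
 * and reports "no SSE" everywhere else. */
static int cpu_has_sse(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse");
#else
    return 0;
#endif
}

/* Called once at startup; the result is stored in a function pointer. */
vector_fn select_vector_foo(void)
{
    return cpu_has_sse() ? sse_foo : sw_vector_foo;
}
```

      Here the pointer earns its keep: one binary can carry both code paths and pick between them on the machine it actually lands on.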

      --

      Forget magic. Any technology distinguishable from divine power is insufficiently advanced.

  12. Does it matter? by Jah-Wren+Ryel · · Score: 4, Insightful

    It used to be that computers were expensive and people were relatively cheap. Nowadays, the reverse is generally true.

    So, unless these systems have performance critical portions, like high-speed digital signal processing where every FLOP counts, it really isn't worth the extra effort to optimize your code for the platform - you'll just end up having to hand-tweak (or even worse, un-tweak) it again on the next hardware upgrade.

    --
    When information is power, privacy is freedom.
    1. Re:Does it matter? by richg74 · · Score: 5, Insightful
      It used to be that computers were expensive and people were relatively cheap. Nowadays, the reverse is generally true.

      For most applications, the potential performance gains from hand optimization for a specific platform aren't enough to matter. (And, as I think Brian Kernighan said, trying to outsmart the compiler defeats the purpose of using one.) Big performance gains come, in most cases, from figuring out a better way (~algorithm) to solve the problem, not from tweaks.

      There's another aspect of portability that doesn't get mentioned too much: the portability of the programmer. If you are in the habit of writing portable code, it's much easier to shift to working on a different platform. (I'd also say, from my own experience, that it makes your work less error-prone.) That versatility is potentially of significant value to your employer, and of course is of value to you personally.

  13. This is the compiler's job. by stienman · · Score: 3, Insightful
    There is lots of talk about writing portable programs, but this pursuit has resulted in a lot of processor features going unused.

    This is the compiler's job. If your compiler targets a particular processor poorly, get a better compiler.

    There is no such thing as portable code:
    • There is code that is written according to the language specification (ANSI C, Java, etc), which is what one normally considers "portable" only because standards-compliant compilers exist for several platforms.
    • There is code that uses processor/platform/OS/compiler specific extensions, which is normally considered unportable because libraries don't exist for all platforms.

    When most developers talk about portability they are talking about OS portability. The portable-to-other-processors debate has long since left the building largely due to incredible speed increases in processors. There's no reason, apart from esoteric algorithm tweaking, to code something in a processor specific manner.

    Code porting to another OS is only an issue because operating systems and the hardware they run on are still changing at a dramatic pace. There is no standardized language that covers all the common aspects of a modern operating system, because they are aiming at a moving target. Even the ultra-portable Java has to be extended outside of the official specification to cover serial ports, complex sound, complex graphics, etc.

    Portability hasn't been about processor speed for a very long time, and at this point it shouldn't be - a better compiler or a faster processor is a *ton* cheaper (time, money) than writing processor specific code in all but a few extraordinary cases.

    -Adam
    1. Re:This is the compiler's job. by be-fan · · Score: 1

      It's also interesting to point out that more recent processors are designed not to have any special features that would need explicit support. Even x86 processors generally only fully support a RISC-y subset of the ISA, and microcode the weird, complex instructions.

      --
      A deep unwavering belief is a sure sign you're missing something...
    2. Re:This is the compiler's job. by __david__ · · Score: 1
      This is the compiler's job. If your compiler targets a particular processor poorly, get a better compiler.
      Hear hear. You are 100% correct here.
      There's no reason, apart from esoteric algorithm tweaking, to code something in a processor specific manner.
      Agreed, but it sometimes takes some thought to even realize you are coding in a processor specific manner. For instance, if you've ever programmed for a Mac or written networking code on a PC you realize that all binary data formats were created with a certain endianness in mind, and if your processor's endianness is different you have to swap some bytes. No, it's not hard to do, but if you're implementing, say, a FAT filesystem on a little endian machine you might not even notice. And then, boom, your lovely code turns out to be not very portable after all.
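
      One common way to sidestep the issue, sketched here with the made-up helper names read_le16 and read_le32, is to assemble little-endian fields byte by byte instead of casting the buffer to an integer pointer. The same source then gives the same answer on either kind of host.

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from a byte buffer.
 * Works identically on big- and little-endian hosts. */
uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}

/* Read a 32-bit little-endian value from a byte buffer. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

      The tempting shortcut, *(uint16_t *)p, happens to work on a little endian machine, which is exactly why the bug goes unnoticed until the port.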

      -David
  14. I'm looking at it for experiment and isolation, by hackwrench · · Score: 1

    A 16-bit memory access instruction can only access 16-bits of memory, period. It can't trash more than that. That's a rather trivial benefit, but it exists and if it exists there might still be others which would require experimentation. Here's a better one: The instruction is smaller so you can fit more instructions in RAM which means less flushes to disk. Attacking problems from a "every byte counts" perspective can help you decide what you want to do when every byte doesn't count. Besides, all things being equal, why not go for the smaller code size?

    I used to code for QuickBasic. It didn't have routine pointers and a friend wrote a routine that checked the return address on the stack, scanned for the next CALL assembly instruction, put the pointer for the routine into DX:AX, popped the return address and jumped to the instruction after the call to the next routine. You could declare two names for a routine, one with no parameters and one with, and set the pretend call to the parameters name after the address finding routines. It seems that the tools today are set up to make such poking around impossible.

    Oh, and there's also this code: SuperPUT replaces the innards of QB's PUT

    1. Re:I'm looking at it for experiment and isolation, by renoX · · Score: 1

      > A 16-bit memory access instruction can only access 16-bits of memory, period. It can't trash more than that.

      I think that in RISCs, memory access is word aligned, so if you do a 16-bit load, what the HW will do is fetch a 32-bit word and then put 16 bits in your register.
      I'm not sure how writes are handled though.

    2. Re:I'm looking at it for experiment and isolation, by __david__ · · Score: 1

      This depends more on the bus than the CPU architecture. I know some 32 bit busses have lines to say whether the transfer is 8, 16, or 32 bits. If you read a byte of data on the bus (let's call it 0x0d), then 0x0d0d0d0d is actually sent across the bus. Other busses will work like you said and transfer 32 bits of data from a rounded address and then let the CPU pick out the correct byte. Some will just transfer 0xXXXXXX0d where XX is something random.

      -David

    3. Re:I'm looking at it for experiment and isolation, by be-fan · · Score: 0

      Here's a better one: The instruction is smaller so you can fit more instructions in RAM which means less flushes to disk.

      If your instructions don't fit in RAM completely, then you're screwed. Buy more RAM.

      Attacking problems from a "every byte counts" perspective can help you decide what you want to do when every byte doesn't count.

      I don't see how.

      Besides, all things being equal, why not go for the smaller code size?

      Because, all things are generally not equal. Worrying about this stuff makes sense if your code is already feature-complete, bug-free, and uses the absolute state-of-the-art in algorithms, but who has such code that they can worry about these things?

      --
      A deep unwavering belief is a sure sign you're missing something...
  15. What Makes An Operating System "Portable"? by hubertf · · Score: 1
    Seeing you talk about mixing 16(?!) and 32bit code, you're probably on a completely different problem set, but maybe this article helps a bit understanding some other problems involved in portability:

    ``As an introduction the properties of a "hardware platform" are described, and it's showen that getting the same behaviour of software on different hardware platforms isn't "portability". After repeating the tasks of an operating system, it is explained what an operating system needs to provide in the lower layers to be portable. The article ends with a case study of the NetBSD operating system.''

    Full article here.

    1. Re:What Makes An Operating System "Portable"? by Anonymous Coward · · Score: 0

      Yeah, I'll trust a source that thinks "showen" is a word?

  16. Answers to your question. by hackwrench · · Score: 1

    16-bit code is code written with 16-bit addressing. 16-bit code is slow on processors designed to perform reads on 32-bit or 64-bit alignment boundaries. 32-bit code has 32-bit addressing. The Intel processors that do 32-bit addressing are designed to read memory 32-bits at a time on 32-bit alignments. For some reason, they can't read 32-bits from the second, third or fourth byte positions. I haven't progressed my understanding beyond this, but there are probably other mechanisms in play. 16-bit addresses means smaller code. Smaller code means less flushes to disk, more calcs per read, and less calcs per instruction.

    1. Re:Answers to your question. by woolio · · Score: 1
      For some reason, they can't read 32-bits from the second, third or fourth byte positions.
      Ah, NO. The Intel x86 ISA allows non-aligned memory accesses... (It is probably one of the few commonly used ISAs that do this).
      16-bit addresses means smaller code. Smaller code means less flushes to disk, more calcs per read, and less calcs per instruction.
      That may be, but are you still referring to the Intel x86 ISA? It uses variable-length instructions. These are a nightmare to decode (for hardware) but are fairly efficient in terms of memory storage... Many instructions are only 1 or 2 bytes long... In terms of designing an x86 processor, this aspect makes the "fetch" and "decode" stages extremely complex (as compared to "trivial" for many RISC ISAs)
    2. Re:Answers to your question. by be-fan · · Score: 1

      Intel processors can perform unaligned memory accesses. They just incur an enormous performance hit in doing so.

      --
      A deep unwavering belief is a sure sign you're missing something...
    3. Re:Answers to your question. by Anonymous+Brave+Guy · · Score: 1
      The Intel x86 ISA allows non-aligned memory accesses...

      Yes, it does, but they're significantly slower than optimally aligned accesses. Why do you think good C and C++ compilers on Intel boxes still add padding to structures, even where it's not strictly required to access the members concerned?
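
      The padding is easy to see for yourself. In this sketch the struct and the helper names are invented for illustration, and the exact numbers depend on the compiler and ABI, which is why nothing here hard-codes them.

```c
#include <stddef.h>
#include <stdint.h>

/* Six bytes of actual data, but most compilers will lay it out larger
 * so that 'value' starts on a 4-byte boundary. */
struct record {
    uint8_t  tag;
    uint32_t value;
    uint8_t  flag;
};

/* Total size of the struct, including any padding the compiler added. */
size_t record_size(void)
{
    return sizeof(struct record);
}

/* Where 'value' really lives; 1 would mean no padding after 'tag'. */
size_t value_offset(void)
{
    return offsetof(struct record, value);
}

/* Bytes spent purely on alignment. */
size_t padding_bytes(void)
{
    return sizeof(struct record) - (2 * sizeof(uint8_t) + sizeof(uint32_t));
}
```

      On a typical x86 compiler, printing these shows 'value' pushed out to offset 4 and several bytes of pure padding, but that is an expectation about common ABIs, not something the language guarantees.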

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  17. That's a close issue to my point by hackwrench · · Score: 1

    Current ideas that fall into the current portability mindset have more to do with making the program know as little as possible about its environment. The result is a compiler munging your code and data structures into whatever it perceives the processor is happiest with while getting the same apparent behavior across machines, instead of switching the processor into different modes to deal with code that is more efficient one way or another.

    1. Re:That's a close issue to my point by hubertf · · Score: 1

      Well, so let your compiler pick the right data types, recompile, and be happy on your new platform, and achieve eternal happiness.

      Trying to remain binary compatible between all those platforms is just too much of a PITA. :)

  18. I don't really want my compiler to be very smart. by hackwrench · · Score: 1

    I want to be able to tell the compiler:

    preserveargs funct1(arg1, arg2,arg3)
    preserveargs funct2(arg1, arg2,arg3)
    preserveargs funct3(arg1, arg2,arg3)
    flushargs funct4(arg1, arg2,arg3)
    and be able to call any combination of funct1,2,3 in any order and finalize with 4 instead of depending on whether or not the compiler will figure out that doing this will result in faster code.
    It doesn't hurt for the compiler to pass speculations up to me, or even to generate potentially more efficient sample source code, but I want to have the final decision on the result of my code, and to have optimizations reflected in the code. That way, no matter what compiler I use, I can be sure to get the same optimizations even if one compiler guesses better than another. This also enables me to pass the code through different compilers and adopt the best optimization results from both into my code. Got a new platform? Run it through a compiler for that platform and have it explain to you why optimizations that were better on another platform are now not so good on the new platform. This helps you be knowledgeable about the different systems you work on which can be used to write better code.

  19. The problem I have with portability: by hackwrench · · Score: 1

    The way it is implemented currently, it makes it so that code in no way reflects the computing architecture. It's like having the abstraction of functional languages without the benefits of functional languages. One portability implementation can result in code that is equally suitable for callee popped arguments and caller popped arguments, but if the algorithm favors leaving parameters on the stack for several procedures to access, well sorry, that functionality is not generic enough, so you can't specify that in your solution.

    Also, this code is currently impossible:
    routine32bit{
    do 32-bit stuff
    16bit segment border start
    32-bit land call to 16 bit code
    Jump overroutine16code1

    routine16code1{
    16-bit routines
    16-bit return
    }
    overroutine16code1:
    More32bitstuff
    ret32
    }

  20. The performance question by be-fan · · Score: 4, Insightful

    A couple of points about optimization.

    1) Premature optimization is evil. Everybody says this, but so many people do not take it to heart. I'd rather have software that works, than software that is fast but crashes. As a programmer, it's nice to work on non-buggy software, even if it's not as fast as it could be.

    2) Target-specific optimization is generally evil, unless you're sure your code will not live very long (eg: a game). The thing is that micro-optimizations generally tune for a particular processor, and actually pessimize the code in the long run. In comparison, if you write good general code, it'll still be fast ten years from now when processors look very different.

    3) The bottlenecks that people, especially C/C++ programmers, worry about are usually not the bottlenecks that matter. If you worry that your code could be faster/more memory efficient if you use a 16-bit field here or there instead of a 32-bit one, your algorithms better be absolutely perfect. Most code does not use perfect algorithms. That's why so much software is still so slow. Most programmers just don't get the time to use the best algorithms, much less get down to the level of micro-optimizations.

    That's why I always find language performance debates entertaining. C/C++ programmers will freak out if you tell them language X is very productive, but is maybe two-thirds as fast as C (something that is true of a number of high-level, but compiled, languages). Meanwhile, they will write code that runs at maybe 1/3 of what the machine is capable of, because they spend so much time writing the code they have little time to optimize it.

    --
    A deep unwavering belief is a sure sign you're missing something...
    1. Re:The performance question by Anonymous Coward · · Score: 0

      That's why I always find language performance debates entertaining. C/C++ programmers will freak out if you tell them language X is very productive, but is maybe two-thirds as fast as C (something that is true of a number of high-level, but compiled, languages). Meanwhile, they will write code that runs at maybe 1/3 of what the machine is capable of, because they spend so much time writing the code they have little time to optimize it.

      As it has been written...
      http://www.catb.org/~esr/writings/unix-koans/ten-thousand.html
  21. Re:I don't really want my compiler to be very smar by be-fan · · Score: 2, Insightful

    Generally, the time you spent adding useless annotations to your source code would be better-spent with a pencil and paper trying to figure out a way to improve your algorithm. Compilers, generally, are good enough these days. Especially now that GCC is decent and runs on most of the interesting processors. The gains in performance, and this is something that even the Linux kernel guys have realized, are going to come from good algorithms. This is especially true because of the recent multi-core phenomenon. More and more, "good code" is going to be code that implements good scalable algorithms. Lower complexity beats smaller constant factors any day of the week.

    --
    A deep unwavering belief is a sure sign you're missing something...
  22. Ever heard of playing just to see what will happen by hackwrench · · Score: 1

    You shouldn't be hand optimizing at all unless you've determined that something is too large or too slow.

    That's not true at all. There is nothing inherently wrong with hand-optimizing just because you feel like it.

    You also say that size and speed are mutually exclusive. While that is generally the case on current x86 architectures, that doesn't always have to be the case. I don't know what causes the penalty for unaligned reads, but Intel could redo its architecture to grab 32 or 64 bits at a time from any base byte. The current tools, though, blithely accept the current limitation and don't let coders explore how their code might be different if such a barrier were removed, so Intel has no incentive to do so, and that's one of my points.

    ...so you'd have to know your architecture well...

    That's another one of my points. The current focus of portability results in programmers not knowing their hardware well. There's plenty of room for compilers to explain to the coder why the compiler thinks that a given optimization is best suited for the machine, but the current focus has the coder blindly accept whatever the compiler thinks is best.

  23. Both 1 and 2 registers at the same time by hackwrench · · Score: 1

    I don't see any reason why the CPU can't see the register as both 1 32-bit register and 2 16-bit registers. After all, MMX reused the floating point registers.

    The problem with writing portable code as things now stand is that it is oblivious to fitting things into cache, as it must remain cache-size independent. Since current tools are built with that sort of attitude about portable code, the designers refuse to implement features to allow the coder to code to cache sizes.

    1. Re:Both 1 and 2 registers at the same time by be-fan · · Score: 1

      I don't see any reason why the CPU can't see the register as both 1 32-bit register and 2 16-bit registers. After all, MMX reused the floating point registers.

      There are a number of problems with partial registers. At the CPU level, it comes when trying to figure out instruction dependencies. Supporting half registers makes things a lot more complicated when you see that instruction 1 writes to EAX, while instruction 2 reads from AX. Second, it makes register allocation a lot more complicated.

      The problem with writing portable code as things now stand is that it is oblivious to fitting things into cache, as it must remain cache-size independent.

      Being cache-size independent doesn't mean being oblivious about fitting things into cache. The Right Thing (TM) to do is to make your memory access patterns predictable for the cache. That means that no matter how big your data set gets, or how small a cache you run on, you won't suffer catastrophic performance decreases.

      Since current tools are built with that sort of attitude about portable code, the designers refuse to implement features to allow the coder to code to cache sizes.

      You don't want to code to the cache size. That's not what caches are for. All you'll end up doing is screwing your performance when your data gets twice as large, or you want to run on a CPU with less cache. Again, what you want to do is design your algorithms to perform cache-friendly memory accesses. Treat the cache as a cache, not a local memory.
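
      To make "cache-friendly" concrete, here is a small sketch with invented function names. Both functions compute the same sum over a matrix stored row by row; the first walks memory in the order it is laid out, the second jumps a whole row ahead on every step. Neither mentions a cache size anywhere.

```c
#include <stddef.h>

/* Matrix is stored row by row in one flat array: element (r, c) lives
 * at m[r * cols + c]. */

/* Sequential walk: consecutive iterations touch adjacent addresses,
 * so each cache line that gets fetched is fully used. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Strided walk: consecutive iterations are 'cols' elements apart, so
 * large matrices keep evicting lines before they are reused. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

      The sequential version tends to stay fast whether the cache is large or small, which is the point: the access pattern is tuned, not the size.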

      --
      A deep unwavering belief is a sure sign you're missing something...
  24. Source code annotations, pencil and paper... by hackwrench · · Score: 1

    Done properly they can achieve the same goals. Think of the annotations as ways to improve your algorithm, and you might begin to see what I'm getting at here.

    1. Re:Source code annotations, pencil and paper... by be-fan · · Score: 1

      Annotations are not ways to improve your algorithm. I'm talking about improving your algorithm at the theoretical level, to say run with O(log(N)) complexity instead of O(N), or scale as O(N) with number of processors instead of O(sqrt(N)). The annotations you suggested are nothing more than mucking with the compiler's business.
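
      As a small illustration of the O(N) versus O(log(N)) gap, here is a sketch of two lookups over a sorted array that each count their comparisons. The function names and the step counter are made up for the example.

```c
#include <stddef.h>

/* O(N): check every element in turn. Returns the index or -1;
 * *steps receives the number of elements examined. */
long linear_search(const int *v, size_t n, int key, size_t *steps)
{
    *steps = 0;
    for (size_t i = 0; i < n; i++) {
        (*steps)++;
        if (v[i] == key)
            return (long)i;
    }
    return -1;
}

/* O(log N): halve the search window [lo, hi) each step.
 * Requires v to be sorted in ascending order. */
long binary_search(const int *v, size_t n, int key, size_t *steps)
{
    size_t lo = 0, hi = n;
    *steps = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        (*steps)++;
        if (v[mid] == key)
            return (long)mid;
        if (v[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

      Looking for the last element of a 1024-entry array costs the linear version 1024 comparisons and the binary version about ten, and no calling-convention annotation will close a gap like that.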

      --
      A deep unwavering belief is a sure sign you're missing something...
  25. But a compiler is only as good as its language by Anonymous+Brave+Guy · · Score: 1
    This is the compiler's job. If your compiler targets a particular processor poorly, get a better compiler.

    That's true, of course, but the compiler can only be as good as the language it's compiling permits.

    In higher level languages, you can express design intent more completely than you can in lower level languages. C isn't a high level language, it's a portable assembly language. That's a role it plays very well, but as long as programmers are writing in C, the compiler will have to deal with aliasing, for example. In a higher level language, perhaps the compiler could deduce exactly where a specific piece of data would be accessed, know that there won't be any aliasing, and optimise accordingly.
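
    A tiny sketch of the aliasing problem, with invented function names. In the first function the compiler has to assume a and b might point at the same int, so it must re-read *b after the first store. The second uses C99's restrict to promise they don't overlap, which lets *b be read once.

```c
/* No promise about aliasing: if a == b, the first += changes *b,
 * so the second += must reload it. */
int add_twice(int *a, const int *b)
{
    *a += *b;
    *a += *b;
    return *a;
}

/* restrict is the programmer's promise that a and b do not overlap.
 * Calling this with a == b would be undefined behavior. */
int add_twice_noalias(int *restrict a, const int *restrict b)
{
    int t = *b;   /* safe to read once, given the promise */
    *a += t;
    *a += t;
    return *a;
}
```

    With x = 1, add_twice(&x, &x) yields 4 rather than 3, which is precisely the case the compiler has to keep working unless the language gives it a way to rule it out.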

    That's a rather specific optimisation, but more generally, think of compiling for architectures that have multiple processors, dual-core chips, hyper-threading, or some other form of true parallel execution. If your program is written in C, or C++, or Java, it's going to be hard to take advantage of that extra processing ability without touching the code to give the compiler a hint. On the other hand, in many declarative programming languages, it's entirely possible that the compiler could analyse the data flow, find independent paths, and assign them to separate threads on separate (pseudo-)processors algorithmically, without any further help from the programmer.

    So, while it may be possible to write somewhat portable code in any language above assembly, the degree of portability you can have, and the degree to which a compiler can help you with it, will always be limited by the expressive power of the programming language itself.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
  26. Sure, but premature pessimization is evil, too by Anonymous+Brave+Guy · · Score: 1
    Premature optimization is evil. Everybody says this, but so many people do not take it to heart.

    Not quite everyone says that. While I agree with the general principle, premature pessimization is the root of naff code, particularly when insufficient allowance is made for fixing it up once the code is working correctly but slowly.

    Consider, for example, passing a large bit of data as a parameter to a function. In languages that use pass-by-reference semantics, this will typically be cheap. In languages that use pass-by-value semantics, this will typically be expensive. In C++, you have a choice, but the natural (that is, default) is by value. Would you tell a C++ programmer not to use const-reference parameter types from the start, because it's a premature optimization?

    In some types of software, you simply have to plan for performance from the start. Obviously algorithmic improvements make more difference than anything else, but even so, there's a scale between large-scale algorithm and data structure changes and assembly-level micro-optimisation, not a switch. If you write all your code to be beautifully maintainable, yet fail to consider the continuous nature of this scale from the start, you will never catch up with those who did, no matter how much time you invest in micro-optimisations at the end of the project.

    --
    If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    1. Re:Sure, but premature pessimization is evil, too by be-fan · · Score: 2, Insightful

      Consider, for example, passing a large bit of data as a parameter to a function. In languages that use pass-by-reference semantics, this will typically be cheap. In languages that use pass-by-value semantics, this will typically be expensive. In C++, you have a choice, but the natural (that is, default) is by value. Would you tell a C++ programmer not to use const-reference parameter types from the start, because it's a premature optimization?

      I would tell a C++ programmer that worrying about a bit of extra data copy in the function call is generally useless. It's really not that expensive unless your structs are monstrously large. Generally the question you're interested in is semantics. Do the semantics lend themselves to a pass-by-value, or a pass-by-reference? If, after profiling, you find that this is a problem, use the passing style on those few functions that the profiler points out. Doing it for everything else is useless.

      In some types of software, you simply have to plan for performance from the start.

      Yes, but planning for performance from the start doesn't mean optimizing from the start. It means designing good algorithms and implementing them without any grossly stupid performance mistakes. Optimization can happen after implementation, where profiling shows the need for more hand-tuning.

      Obviously algorithmic improvements make more difference than anything else, but even so, there's a scale between large-scale algorithm and data structure changes and assembly-level micro-optimisation, not a switch.

      It's a scale, but one very biased towards high-level optimizations. Compilers do an excellent job of the low-level stuff. Even at the data structure level, you get a lot more benefit from considering things like ordering your access patterns for cache-friendliness than you do from saving a byte or two here or there.

      --
      A deep unwavering belief is a sure sign you're missing something...
    2. Re:Sure, but premature pessimization is evil, too by Anonymous+Brave+Guy · · Score: 1
      I would tell a C++ programmer that worrying about a bit of extra data copy in the function call is generally useless. It's really not that expensive unless your structs are monstrously large. [...] If, after profiling, you find that this is a problem, use the passing style on those few functions that the profiler points out. Doing it for everything else is useless.

      The thing is, that's not true. In isolation, it might be, but the overhead of passing a data structure that is a few words of memory by value becomes significant if your function is called from inside a loop, and potentially horrible if you do it routinely throughout your program.

      The kicker is, profiling won't help you here. Lazily using pass-by-value where pass-by-const-reference would suffice will not cause a huge hit in one function, it will cause an x% hit in all the affected functions, and a profiler won't help you with that. The approach you advocate systematically reduces performance across the whole application, for absolutely no benefit in maintainability or readability, just a saving of a few keystrokes and some basic knowledge of how C++'s object model works.

      In other words, this is exactly why the "premature optimization is evil" argument is fundamentally flawed.

      Yes, but planning for performance from the start doesn't mean optimizing from the start. It means designing good algorithms and implementing them without any grossly stupid performance mistakes. Optimization can happen after implementation, where profiling shows the need for more hand-tuning.

      But the sort of issue I mentioned above isn't a fatal mistake in one place. It's more like death by a thousand cuts. And again, no post-processing with a profiler and hand-tuning will fix a system that is inherently slow because of such lazy coding practices.

      It's a scale, but one very biased towards high-level optimizations. Compilers do an excellent job of the low-level stuff.

      No, they really don't, at least not in many common languages such as C++ or Java. Just take a look at the assembly output from even quite simple functions: while the opcodes are generally well-ordered, a whole heap of slightly higher-level stuff gets missed, often because these languages aren't expressive enough for the compiler to appreciate what is possible. Highly portable compilers, such as GCC, are particularly vulnerable to this; the code from something specialised like Intel's compiler is usually far better at low-level optimisations.

      Sometimes, there are problems with compilers being a little too clever with their low level assumptions, too, as anyone who works with serious floating point maths can no doubt testify. Compilers, and indeed the programming languages we ask them to compile for us, aren't yet sufficiently clever to do this all by themselves. Once again, I stand by my claim that assuming that they are, and that a final profiling and hand-tuning phase is sufficient, will leave you well behind the leading edge in performance-sensitive applications.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    3. Re:Sure, but premature pessimization is evil, too by be-fan · · Score: 1

      The thing is, that's not true. In isolation, it might be, but the overhead of passing a data structure that is a few words of memory by value becomes significant if your function is called from inside a loop, and potentially horrible if you do it routinely throughout your program.

      If it's in a critical area, then the profiler will point it out. If it's not in a critical area, then it doesn't matter. Plus, do you have any idea what the overhead really is? It's tiny. I just tried a benchmark calling a very simple function (consisting of two floating-point additions) in a loop. One of the parameters of the additions was passed via a struct that took up 32 bytes (8 words). The difference between passing it by value and passing it by reference was less than 10%. On a more complicated function, it would be noise.
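
      For anyone who wants to repeat that kind of measurement, here is a minimal sketch in C. It is not the benchmark that was actually run; the struct layout and all function names are made up for illustration, and C's pointer-to-const stands in for a C++ const reference. Timing the two run_* functions for a large n would give a rough comparison, and the numbers will depend on compiler, flags and inlining.

```c
#include <stddef.h>

/* Hypothetical 32-byte parameter block: eight 4-byte floats. */
struct params {
    float v[8];
};

/* Pass by value: the caller copies the whole struct for each call. */
float add_by_value(struct params p, float x)
{
    return (x + p.v[0]) + p.v[1];
}

/* Pass by pointer to const: only an address is passed, nothing is copied. */
float add_by_ref(const struct params *p, float x)
{
    return (x + p->v[0]) + p->v[1];
}

/* Call the by-value variant n times so it can be timed on its own. */
float run_by_value(const struct params *p, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = add_by_value(*p, acc);
    return acc;
}

/* Same loop, by-reference variant. */
float run_by_ref(const struct params *p, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = add_by_ref(p, acc);
    return acc;
}
```

      Both loops compute the same result, so any timing difference comes from the calling convention (unless the optimizer inlines the calls, in which case the difference may vanish entirely).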

      The kicker is, profiling won't help you here. Lazily using pass-by-value where pass-by-const-reference would suffice will not cause a huge hit in one function, it will cause an x% hit in all the affected functions, and a profiler won't help you with that.

      The profiler will tell you what functions use 80% of your runtime. The programmer is smart enough to check those functions and see if the parameter passing convention could possibly be an issue. The other functions don't matter. Even if it's a constant 20% hit in all the other functions, we're talking about a 4% overall performance decrease, which is insignificant.
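
      The 4% figure is just the share of runtime multiplied by the per-function hit. A tiny sketch of that arithmetic, with a hypothetical helper name:

```c
/* Overall slowdown when code that accounts for `fraction_of_runtime`
   of the total runtime becomes slower by `hit` (both given as 0..1). */
double overall_slowdown(double fraction_of_runtime, double hit)
{
    return fraction_of_runtime * hit;
}
```

      With the numbers above, overall_slowdown(0.20, 0.20) is about 0.04. The same formula also shows the other side of the argument: if a 10% hit applies to code covering the whole runtime, overall_slowdown(1.0, 0.10) is the full 0.10.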

      But the sort of issue I mentioned above isn't a fatal mistake in one place. It's more like death by a thousand cuts. And again, no post-processing with a profiler and hand-tuning will fix a system that is inherently slow because of such lazy coding practices.

      Profiling fixes the code in which performance matters. The rest of the code can be slow, because it doesn't matter!

      the code from something specialised like Intel's compiler is usually far better at low-level optimisations.

      Numerous benchmarks show that Intel's compiler is rarely more than 20% faster than GCC, unless it can take advantage of auto-vectorization. The assembly output doesn't mean much. As far as the CPU is concerned, it's just bytecode: it gets translated and rescheduled before getting executed anyway.

      --
      A deep unwavering belief is a sure sign you're missing something...
    4. Re:Sure, but premature pessimization is evil, too by Anonymous+Brave+Guy · · Score: 1
      Even if it's a constant 20% hit in all the other functions, we're talking about a 4% overall performance decrease, which is insignificant.

      I understand what you're saying, but I think you're still missing my point. This is just one trivial but routine efficient coding practice, and you've just demonstrated that if a lot of your functions are simple things working on moderately complex data, the overhead of not doing it can be as high as 10%, all because of a stubborn insistence that no optimisation should occur until the end of the project and without the guidance of a profiler, even where it's a well-known and universal good practice that requires a mere seven extra keystrokes. (You're ignoring the hit on the stack in this case, BTW; if any of your functions are recursive, sloppy parameter passing is a crash waiting to happen in C++, too.)

      What I'm obviously not conveying very well here is that this is only one example of the impact of following lazy coding practices throughout a project. What if many functions inhibit things like the named return value optimisation, something that's all too easy to do if you just write code completely naturally, but often very easy to fix if you understand the implications of the code you write? There go another few percent. Now suppose you have a lot of classes that have member functions performing relatively simple calculations, but you do those calculations every time the function is called rather than caching the results? That's a few percent more. And so it goes on, eating away at performance one little piece at a time, using a little unnecessary memory because a few more KB can't possibly matter in a machine with RAM capacity counted in GB.

      As I said before, I agree with you that in isolation such carelessness is unlikely to be a major problem if the area of code where it happens isn't called repeatedly, but the point is that it all adds up. A little here, a little there, and now your program runs twice as slowly, or uses 5x the RAM, and eventually, you get the sort of absurd bloat that I mentioned elsewhere in this discussion, and no profiler on the planet will help you undo the damage. This is the result of assuming there's no need to think about performance until the very end, and IMHO it's a most unwelcome one.

      --
      If you disagree, post your argument. (-1, Overrated) isn't your personal censorship tool for views you don't like.
    5. Re:Sure, but premature pessimization is evil, too by Anonymous Coward · · Score: 0

      The parent poster seems not to understand that systematic programming errors (e.g. passing a read-only parameter by value, which is always an error except when the parameter is smaller than the reference) lead to bloat, and that lazy programming is the cause of the actual waste of computing resources.

      Premature optimization is evil.
      Good enough solutions are just bad solutions.
      Programming should be something in between.

  27. Re:Ever heard of playing just to see what will hap by __david__ · · Score: 1
    That's not true at all. There is nothing inherently wrong with hand-optimizing just because you feel like it.
    Have you ever heard the saying, "premature optimization is the root of all evil?"

    The problem is that when you profile a piece of code, quite often the slowest routine is something you never would have expected. Hand optimizing a routine that gets run 1% of the time is pointless. Quite often this sort of hand optimization makes code ugly, and why make your code ugly for no reason?

    Another argument is that generally you don't ever have to do crazy platform specific optimizations to make your program run blazingly fast. For instance, if you determine that the slowest point of your program is the part that for loops through 1000 strings looking for a specific one, the smart thing to do isn't to go nuts hand optimizing the assembly of the loop or making the sure the strings are cache aligned or something. Switch your algorithm so the strings are kept sorted and then do a binary search. Or maybe make a hash table. Either way it's going to be an instant order of magnitude speedup--way way more than you could ever get by hand optimizing the original for loop.
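
    A rough sketch of that algorithm switch in C, using only the standard library's qsort and bsearch. The function names are made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Linear scan: up to n string comparisons. Returns the index or -1. */
int find_linear(const char *const *strings, size_t n, const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(strings[i], key) == 0)
            return (int)i;
    return -1;
}

/* Comparator for qsort/bsearch over an array of char pointers. */
static int cmp_str(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort once, up front. */
void sort_strings(const char **strings, size_t n)
{
    qsort(strings, n, sizeof *strings, cmp_str);
}

/* Binary search on the sorted array: about log2(n) comparisons.
   Returns the index in the sorted array or -1. */
int find_sorted(const char *const *strings, size_t n, const char *key)
{
    const char *const *hit = bsearch(&key, strings, n, sizeof *strings, cmp_str);
    return hit ? (int)(hit - strings) : -1;
}
```

    For 1000 strings the linear version does up to 1000 comparisons per lookup and the sorted version about 10, which is where the big win comes from, provided the list is searched far more often than it changes.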

    You also say that size and speed are mutually exclusive. While that is generally the case on current x86 architectures, that doesn't always have to be the case. I don't know what causes the penalty for unaligned reads, but Intel could redo its architecture to grab 32 or 64 bits at a time from any base byte, but the current tools that blithely accept the current limitation and don't let coders explore how their code might be different if such a barrier was removed don't give Intel an incentive to do so, and that's one of my points.
    I'm not sure I understand your assertion that the tools are holding you back. Take this structure:
    struct {
        char a;
        short b;
        long c;
    } d;
    There is no implied packing in the C standard--C specifically strives to be platform neutral. Every compiler I've seen will put padding bytes in there to make the accesses fast (or doable at all--most ARM buses can't do unaligned transfers). But if Intel magically put together some hardware that made unaligned accesses just as fast as aligned accesses then the compiler would be able to remove the padding bytes on that architecture.

    If you don't need the speed and need your structure smaller then you can tell the compiler to pack the structure (on gcc I think it's something like __attribute__((packed))). So there's nothing holding you back--they just make the default sane for most situations.
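
    To see the padding for yourself, here is a small sketch that assumes GCC or Clang (the packed attribute is a compiler extension, not standard C). The struct and function names are made up for illustration:

```c
#include <stddef.h>

/* Default layout: the compiler may insert padding to keep members aligned. */
struct padded {
    char  a;
    short b;
    long  c;
};

/* GCC/Clang extension: no padding between members. */
struct __attribute__((packed)) tight {
    char  a;
    short b;
    long  c;
};

/* Helpers that report the layout each variant actually gets. */
size_t padded_size(void)        { return sizeof(struct padded); }
size_t tight_size(void)         { return sizeof(struct tight); }
size_t padded_offset_of_c(void) { return offsetof(struct padded, c); }
size_t tight_offset_of_c(void)  { return offsetof(struct tight, c); }
```

    The exact numbers depend on the platform's type sizes and alignment rules. The packed version is always the plain sum of the member sizes, while the default version is typically larger; the price of packing is that members may then sit at unaligned addresses.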

    The point of C is to let the compiler do the stupid little architecture optimizations for you. So it seems to me like your main point is that you don't trust the compiler. Maybe you should try some of your ideas out in assembly and see if they are worthwhile. If you can show that a particular optimization is worthwhile write it up and send it to the gcc guys.

    -David
  28. Re:Cache-friendly by hackwrench · · Score: 1

    Let's say you have a 16*32 element array. If you have a 512 element cache, you don't have to have the overhead of logic to group instructions on the data. If you have a 256 element cache, to execute efficiently you would have to employ logic to break your instructions into three pools. One for instruction groups that can work the first half of data and then the second, one for the reverse, and one for instruction groups that thrash the cache, unless of course you can demonstrate that that's always avoidable without adding extra logic. A 128 element cache incurs even more logic. I would really like to see a demonstration that I am wrong, but this is the point where the other person either doesn't reply or tells me to read the documentation without telling me what the documentation is that I should be reading, or tells me to read something that is highly technical that requires me to have done prior reading, and I can't figure out what those books are.

  29. Re:Ever heard of playing just to see what will hap by hackwrench · · Score: 1
    Have you ever heard the saying, "premature optimization is the root of all evil?"

    Yes, and the first hit for "premature optimization is the root of all evil" demonstrates my point exactly. To paraphrase, a good software developer will have developed a feel for where performance issues will cause problems. Making it easy to hand optimize can only help one to develop the feel.

    You say, "The point of C is to let the compiler do the stupid little architecture optimizations for you." and you also say "Quite often this sort of hand optimization makes code ugly, and why make your code ugly for no reason?"

    C has a conflict of interest. If it has structures to allow you to write beautiful hand-optimization it loses its reason for existing, so guess what, it doesn't. Ugly hand-optimization is a fault of C, not of hand-optimization.

    The statement, "There's no implied packing in C" is a bit inaccurate. It's more accurate to say that there's implied non-packing in C, which gets in the way of beginners trying to write to a file data format.

    My main point is more like, "The C compiler is not an agent of my will. It has a mind of its own and isn't interested in telling me what's on its mind."

    When I programmed in QuickBASIC, I could depend on the fact that a string was coded as a pointer and a length and a memory area described by the two. I could depend on the fact that arrays were coded as all of the dimension entries and a pointer to a memory area. I knew exactly how long a structure was, and if I wanted to write a routine that accepted a variable of a different structure than the one passed to it, I could do that simply by putting a different declaration of the routine in the calling file.
    I would know, for example, that:
    TYPE PalDef1
    a as string *1
    r as string *1
    g as string *1
    b as string *1
    END TYPE

    and

    TYPE PalDef2
    PalEnt as LONG
    END Type
    were identical in size and I could call the same routine, passing one or the other with
    DECLARE SUB RoutDef1 ALIAS "MyRout" (A as PalDef1)
    and
    DECLARE SUB RoutDef1 ALIAS "MyRout" (A as PalDef2)

    Your string optimization routine

    Because QB always maintained the length of a string, I knew that the fastest way to find an unsorted string was to:
    A=LEN(SearchString)
    FOR I = 1 to NumOfStrings
    IF LEN(StringList(I))=A then exit FOR
    NEXT
    Interesting how that doesn't come up as a potential solution for you in your string performance scenario.
  30. Re:I don't really want my compiler to be very smar by Frumious+Wombat · · Score: 1

    The only final decision you should be worried about with a modern compiler is, "is that result correct?" You've fallen into the trap of believing that your programs are the first to be written that way (they aren't), and that you're smarter than the teams of people who write compilers, and the computer scientists whose algorithms those compilers employ (chances are, you aren't).

    Once again, from the real world, I have moved a quarter of a million line parallel Fortran program to a new 64-bit architecture, which was easy because the authors had isolated the machine-specific bits, and had abstracted them as far as possible, so very little actually had to be changed. After that, it was a matter of verifying the code via the official test suite and some personal results (basically, similar jobs, but larger). That code is heavily optimized by algorithmic choice, does a few (carefully isolated) architecture-dependent tricks to save memory, and makes heavy use of system math libraries (i.e. good algorithms, carefully tuned). The kind of bit-twiddling you're advocating, while probably personally fulfilling, is making very little difference in the overall performance of your code, if not actively reducing it.

    Your idea of second-guessing the compiler, optimization by optimization, is (politely) impractical, on anything much over a couple of hundred lines. Go to http://www.g95.org/g95_status.html/, pick one of the simulation codes listed, such as AbInit, and convince yourself that you will consistently hand-generate better code than a modern compiler. I know some groups who do hand-coding in assembler, but that tends to be for mathematical primitives. Higher-level functions are written to call those primitives, and left to the compiler to optimize.

    I said this elsewhere, but it applies here as well, "Machines should work, People should Think".

    --
    the more accurate the calculations became, the more the concepts tended to vanish into thin air. R. S. Mulliken
  31. Re:Cache-friendly by be-fan · · Score: 1

    What happens when you have a 32*32 element array? That's the problem with optimizing to a given data size, you're rarely in the situation (at least in modern programs), where you can count on your data being a given size. It's a far better idea to make your program cache friendly. Depending on your algorithm, perhaps it already is. If you're just reading data linearly, then you probably won't see a significant performance loss going from a 512 element cache to a 256 element cache, even with 1024 elements of data. If you do random access, it'd be worth it to get your program to operate on the data in tiles (say, in 32-element pieces), so your working set will still usually fit in cache.
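
    As a sketch of what operating in tiles can look like, here is a tiled transpose of a square row-major matrix in C. The tile edge of 32 and the function name are illustrative choices, not tuned values:

```c
#include <stddef.h>

#define TILE 32  /* hypothetical tile edge, in elements */

/* Transpose an n x n row-major matrix one TILE x TILE block at a time,
   so the parts of src and dst being touched stay in cache together. */
void transpose_tiled(const float *src, float *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t i = ii; i < n && i < ii + TILE; i++)
                for (size_t j = jj; j < n && j < jj + TILE; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

    The code works for any n, including sizes that are not a multiple of the tile edge; only the speed depends on how well the tile fits the cache.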

    --
    A deep unwavering belief is a sure sign you're missing something...
  32. Huh? by joto · · Score: 1
    There is lots of talk about writing portable programs, but this pursuit has resulted in a lot of processor features going unused.

    Name one such processor feature. What on earth are you talking about?

    One example is being able to write a program that purposely uses a combination of 16-bit and 32 bit.

    Huh? You are not making sense. What does this have to do with portability? Are you talking about memory models or sizes of variables holding data? In either case it doesn't make any sense. Nobody "purposely uses a combination of 16-bit and 32 bit" because they want to be "portable". If they "purposely uses a combination of 16-bit and 32 bit", it's because that's what the spec says they should, or because the business logic dictates it.

    I know there are arguments that writing solely in one or the other is a performance advantage, but what are the factors involved? Is the slowness of such a combination inherent in its design or is it a result of current hardware.

    If you are talking about memory models, it's not just a performance advantage, it's an advantage. While a computer can do just about anything, there are things that are easy to do, and there are things that are not so easy to do. Mixing 16- and 32-bit memory models in a program is not easy, it introduces a lot of extra complexity, which will no doubt result in more bugs. It was pretty much what made windows 95/98/ME such a mess (and that was the operating systems division, which are paid to handle this sort of thing, imagine the troubles if someone in the app-division did the same!)

    If you are talking about mixed-mode arithmetic with arguments of different data sizes, then one can argue that this is in fact a limitation of current hardware. But it's also a limitation that's here to stay. Making everything fast results in a combinatorial explosion of complexity in a CPU. The way it's done today is to either make it slow, or to make it impossible (i.e. require a 16-bit variable to be "promoted" to a 32-bit variable before calculations are done). This allows the common case to be fast, without making the processor more complex. It's likely to stay that way forever!
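
    C makes this promotion visible at the language level. A small sketch, assuming a platform where int is at least 32 bits wide (true on mainstream desktop and server targets); the function names are made up for illustration:

```c
#include <stdint.h>

/* Both 16-bit operands are promoted to int before the addition, so the
   sum is computed at the wider width and does not wrap at 16 bits. */
int32_t add16_wide(int16_t a, int16_t b)
{
    return a + b;
}

/* Truncating the result back to 16 bits has to be requested explicitly. */
uint16_t add16_wrap(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

    So add16_wide(30000, 30000) gives 60000 even though that does not fit in 16 bits, while add16_wrap(65535, 1) wraps around to 0 because of the explicit conversion.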

    We are beginning to replace systems and programs designed primarily to run in pure 32-bit mode with systems designed to run in pure 64-bit mode, so I ask: Is such purity really worth it?"

    The purity is definitely worth it (as explained above). Whether it's worth it to replace working 32-bit systems with new 64-bit systems is an entirely different question. Do you need 64 bits now? Will you ever need it in the future? Will you need it within 1 year, or 10 years? How long do you plan to care for your systems? Are you working on embedded systems, desktop systems, or server systems? Etc...

    32-bit systems are going to stay for a long time. Especially in the embedded space, where there is no imminent need for anything more.

  33. be nice to your successors by r00t · · Score: 1

    Your successors will want employment, right? Unportable code provides jobs. Maybe it even provides a future job for you.

  34. Re:Ever heard of playing just to see what will hap by __david__ · · Score: 2, Insightful
    Yes, and the first hit for "premature optimization is the root of all evil" demonstrates my point exactly. To paraphrase, a good software developer will have developed a feel for where performance issues will cause problems.
    Yes, I totally agree with you and the linked essay on this.
    Making it easy to hand optimize can only help one to develop the feel.
    I disagree here. Read the page you linked to again. The point is that you have to have a feel for the overall design of the program you are making and how that design will work in the end. It is not about how fast you can make memcpy() go (for example)--that can only get you so far. Take for example:
    Because QB always maintained the length of a string, I knew that the fastest way to find an unsorted string was to [search linearly]. Interesting how that doesn't come up as a potential solution for you in your string performance scenario.
    That is because in the context of C (which the discussion is about) the lengths of strings are not known (quickly). For large numbers of strings your algorithm is still orders of magnitude slower than keeping the strings sorted and doing a binary search. That was my real point, and Knuth's point. That optimizing your overall algorithm can yield vast improvements that hand optimizing little sections of code just cannot come close to. This is what the linked essay says that good programmers develop a feel for, not silly little tricks to speed up a single for loop. That's the kind of thing you do very last and only if you have an intensely speed critical application and you've already exhausted optimizing your algorithms--because you're only going to speed things up by small percentages.
    C has a conflict of interest. If it has structures to allow you to write beautiful hand-optimization it loses its reason for existing, so guess what, it doesn't. Ugly hand-optimization is a fault of C, not of hand-optimization.
    If the reason you are talking about is some semblance of portability then you are right. Have you ever read the C rationale? It explains the reasoning of the decisions the C committee made and helps you see things from their point of view. It was very enlightening for me when I first read it. It apparently used to be part of the C standard but they broke it off into a separate document at some point.

    -David
  35. Re:Ever heard of playing just to see what will hap by triso · · Score: 1
    Because QB always maintained the length of a string, I knew that the fastest way to find an unsorted string was to:

            A=LEN(SearchString)
            FOR I = 1 to NumOfStrings
            IF LEN(StringList(I))=A then exit FOR
            NEXT

    Interesting how that doesn't come up as a potential solution for you in your string performance scenario.
    Here is a classic example of changing algorithms vs. optimizing your existing algorithms. Clearly at some time this search may become a bottleneck, perhaps when there are over 10,000 names to scan--I'm thinking phone-book here. Only an algorithm change will save this; so consider sorting the list and binary searching, a binary tree, a hash-table or whatever.

    It is up to the developers to find out how many strings there will be in a typical scenario and to profile the program to see where the bottlenecks are hiding.

  36. Re:Cache-friendly by Anonymous Coward · · Score: 0

    So, programming 'cache friendly' is simply reading linearly (tiles have size, so we are back to cache size considerations). What a great idea!

    I still remember when cache was only for high level hardware, and a 0 sized cache was the norm.
    I agree with the parent poster, programming should be free of those kinds of considerations (cache sizes, hardware prioritizations, etc).

    It would be much better to use our expanding computing capabilities to create self-examining machines... think about a system able to detect that a program is incurring recurring cache misses and to transparently reconfigure itself to improve the program's performance.

  37. The registry situation by hackwrench · · Score: 1

    In 32-bit mode you can still access the upper and lower halves of the lower 16 bits as [abcd]h and [abcd]l. There is also a command to swap the lower 16 bits with the upper 16 bits.

    1. Re:The registry situation by Nutria · · Score: 1

      In 32-bit mode you can still access the upper and lower halves of the lower 16 bits as [abcd]h and [abcd]l. There is also a command to swap the lower 16 bits with the upper 16 bits.

      Ok, yeah, you're right.

      Still, there's no way to access the high 16-bits.

      --
      "I don't know, therefore Aliens" Wafflebox1
  38. I don't know, ask the Itanium team... by drgonzo59 · · Score: 1

    They'll tell you all about how portability is not that important and how everyone will just embrace the new 64 bit architecture.

  39. EAX, AX, AH and AL. by Mr+Z · · Score: 1

    EAX vs. AX isn't usually the problem, since references to AX can be understood as references to EAX. The real hiccups come when you use AH and AL relative to either AX or EAX, since you can have a mixture of reads and writes to portions of AX/EAX. I'm pretty sure this is why there's no way to address the upper half of EAX while ignoring the lower half.
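
    For readers who don't have the register layout in their head, here is a model of it in portable C, using masks and shifts on a 32-bit value. This is only an illustration of which bits each name covers, not how a compiler allocates registers, and the function names are made up:

```c
#include <stdint.h>

/* AX: the low 16 bits of EAX. */
uint16_t low16(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }

/* AL: the low byte of AX. */
uint8_t low8(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }

/* AH: the high byte of AX (bits 8..15 of EAX). */
uint8_t high8(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }

/* The upper 16 bits have no register name of their own; reaching them
   takes a shift... */
uint16_t high16(uint32_t eax) { return (uint16_t)(eax >> 16); }

/* ...or a rotate by 16, which swaps the two halves. */
uint32_t swap_halves(uint32_t eax) { return (eax << 16) | (eax >> 16); }
```

    With eax = 0x12345678, AX is 0x5678, AL is 0x78, AH is 0x56, the unnamed upper half is 0x1234, and swapping the halves gives 0x56781234.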

  40. Re:Cache-friendly by Mr+Z · · Score: 1

    Caches fundamentally reward spatial and temporal locality. Proper data structure and algorithmic design will help you on any modern system. If you want to go back to the bad-old-days of 10MHz and slower machines, we can get rid of the cache. Caches let you go faster and get the GHz rates we're hooked on.

    I don't think you should hard-code for a cache size unless your target is very specific. One place where you would code to particular cache sizes and layouts is the embedded space. (I should know, I work there.) Even then, you can do so in a parameterized manner: #define is your friend here.

    As for self-examining systems... We have that today, to a certain extent. Look up Intel's VTune. It's hard to see how you'd change a particular data structure's implementation at run time, though, based on such fine-grained detail. That sort of fine detail is best given to the programmer. Really, what you want to do is have a well-known way for CPUs to inform programs of cache sizes so they can size arrays, queues and other scalable structures appropriately as part of the program's initialization. For instance, if you're operating on large matrices and you can tile your accesses to that matrix, you would set your tile size based on the size of your L1D cache. Otherwise you risk having a tile that's too small (to cater to a large number of systems easily, but incurs a larger fraction of loop overhead) or too big (and thus thrashes some systems).
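
    A sketch of the sizing step, assuming the program has already obtained the L1D size in bytes by some platform-specific means (not shown here). The function name and the power-of-two policy are illustrative choices:

```c
#include <stddef.h>

/* Largest power-of-two tile edge such that one edge x edge tile of
   elem_size-byte elements fits in cache_bytes. Returns at least 1. */
size_t tile_edge_for_cache(size_t cache_bytes, size_t elem_size)
{
    size_t edge = 1;

    if (elem_size == 0)
        return edge;

    /* Keep doubling while the next larger tile would still fit. */
    while ((2 * edge) * (2 * edge) * elem_size <= cache_bytes)
        edge *= 2;
    return edge;
}
```

    For a 32 KiB cache and 4-byte elements this picks a 64 x 64 tile (16 KiB). A real kernel usually touches more than one tile at a time, so in practice you would pass in some fraction of the cache size rather than all of it.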

    Other neat tricks include using things like space-filling curves (such as Hilbert curves) to define your traversal through large multidimensional arrays. These sorts of curves have the neat property that they localize accesses pretty close to optimally regardless of the cache size. Their downside is that they're not super general. For instance, I don't see how to readily implement a matrix multiply by traversing the matrix in Hilbert-space order. Large image filters though, sure.
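
    For the curious, this is the widely published iterative conversion from a position along the Hilbert curve to grid coordinates, sketched in C. It assumes an n x n grid with n a power of two; the function names are illustrative:

```c
/* Rotate/flip a quadrant so the sub-curve has the right orientation. */
static void hilbert_rot(int s, int *x, int *y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = s - 1 - *x;
            *y = s - 1 - *y;
        }
        /* Swap x and y. */
        int t = *x;
        *x = *y;
        *y = t;
    }
}

/* Map distance d along the curve (0 <= d < n*n) to coordinates (x, y)
   in an n x n grid, where n is a power of two. */
void hilbert_d2xy(int n, int d, int *x, int *y)
{
    int t = d;
    *x = 0;
    *y = 0;
    for (int s = 1; s < n; s *= 2) {
        int rx = 1 & (t / 2);
        int ry = 1 & (t ^ rx);
        hilbert_rot(s, x, y, rx, ry);
        *x += s * rx;
        *y += s * ry;
        t /= 4;
    }
}
```

    Walking d from 0 to n*n - 1 visits every cell exactly once, and consecutive cells are always neighbours in the grid, which is what keeps the accesses local at every scale.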

    --Joe

  41. Re:Cache-friendly by Anonymous Coward · · Score: 0

    Yes, promoting locality, and so on... all those things allow us to fine-tune the program at the source code level; my proposal is more radical.

    I am thinking of a system (hardware/software) that examines its own behaviour and adapts itself to optimize execution time or power consumption or other external criteria.

    Think about repeating usage patterns, i.e., libs are almost always in the same place, but each program keeps searching for each one of them at each execution.
    Why not speed up the lib detection process from outside the program?
    If a program loads libs A, B, C in that order at start, then why not allow the system to preload libs B and C while loading lib A at program start?

    If the boot or login sequence has not changed, why is it not optimized after some executions?

    A lot of code is about environment detection/usage; typically that code just goes from generic to particular (load libs, reserve mem, file detection, etc.) and it is by nature very static (1 load lib A, 2 check dir B, etc.). Why does the system not take that into consideration and speed up the whole process?

    Only the system can 'predict' what the program will ask for at a given point and adapt accordingly.
    That's an indirection level that I think deserves consideration.

  42. Gibberish by cow-orker · · Score: 1

    Whenever you are optimizing, the first thing to do is to use a smarter algorithm or an advanced data structure. No amount of bit twiddling will gain as much as an improvement from say O(n^2) to O(n log n) does. Coding an advanced algorithm on top of low level "performance tuned" code is next to impossible. Therefore, write high level, portable code. After tuning, it is still high level, still portable, and it also performs.

    If performance still is not adequate (don't guess, ask the profiler), isolate the ugly bits behind a clean interface, then code machine specific implementations if there's something to be gained.
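
    A small sketch of that layering in C, using a bit count as the "ugly bit". It assumes GCC or Clang for the fast path (the __builtin_popcount builtin) and falls back to plain C elsewhere; the function names are made up for illustration:

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
unsigned popcount32_portable(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* The one interface the rest of the program calls. */
unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__)
    /* Compiler builtin; may compile to a single instruction where available. */
    return (unsigned)__builtin_popcount(x);
#else
    return popcount32_portable(x);
#endif
}
```

    Callers only ever see popcount32(), so the machine-specific part can be swapped or dropped without touching the rest of the code, and the portable version doubles as a reference to test the fast one against.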

    However, this statement alone

    One example is being able to write a program that purposely uses a combination of 16-bit and 32 bit.

    tells me that you're far from "seeing it". If you're counting bits, your code will be slow despite being non-portable.

  43. You don't have a choice in the mainstream by bluGill · · Score: 1

    Odds are you do not have a choice. x86-64 is coming fast. Microsoft has Windows running on it, and is likely to make it mainstream sometime soon. They have promised they will anyway.

    Apple is moving from PPC to x86 (no word that I know of on 32 or 64, but I would assume both).

    Linux runs on so many systems that anything other than portable code will get you flames if you are open source. If it runs on linux it better run on at least all 4 BSDs, and Solaris, if not more.

    This is good. In my experience, porting your code more than makes up for the costs, even if the portable code isn't used outside of test. There are too many 1 in a million bugs on your target platform that happen every time on the other system.

  44. Portability by The+Real+Stainless · · Score: 1

    I have used many compilers, on many platforms, and not one of them has ever generated code that is ALWAYS better than what I could hand code in assembler. Compilers are large, complicated pieces of software that do a damn good job. So what if you could make some parts of your code run better on this processor or that, do you want to have to sit down and code routines for every processor? In the bad old days we had to do this, we didn't have compilers, we coded in assembler. That's why it took months to port code from one device to another. A friend of mine has just ported two full EA games in 30 hours. Can you imagine how long it would have taken had you had to port it from Z80 to 6502? Far more problematic is porting from one language to another, but that's another whole can of worms. You have a straight choice, write production quality code in your preferred language and live with any minor performance issues, or write software for a processor and live with the fact that you will probably have to code the lot again in the future. Having said that there is one alternative, use Elate, then you have binary portability and you shouldn't have to compile the damn thing ever again.

  45. Writing portable programs ... by Anonymous Coward · · Score: 0

    has nothing to do with compiler optimisation nor which compiler is used. Of course if you use an environment, a toolset, a framework which isn't well suited for cross-platform development, you won't get optimal results. But if you use a suited environment like wxWidgets (http://www.wxwidgets.org/) you'll achieve identical results as with platform-specific development.

    O. Wyss