Slashdot Mirror


User: James_Intel

James_Intel's activity in the archive.

Stories: 0
Comments: 16

Comments · 16

  1. Re:Performance Comparison? on Intel Releases Threading Library Under GPL 2 · · Score: 1

    TBB's not a replacement for OpenMP or MPI. If either works for you - you should use it. OpenMP is usually used as a shared memory programming model, and works well for Fortran and much C code. TBB is aimed at C++ and C programs, also shared memory. MPI is a distributed memory programming model.

    First and foremost - TBB is easy to use, and still high performance.

    A critical item for high performance in parallel programs is scaling - and TBB helps get better scaling more easily than you would find with hand coding. But OpenMP and MPI generally encourage/lead to scaling too.

    Another key on shared memory machines is managing caches well. TBB, again, does very well.

    Benchmarks... it would be good to get ideas on what we should show. Here is what we have looked at / seen:
    (1) Comparing with code written using pthreads/windows threads, TBB code is much easier to write and debug. We've seen serious programmers who don't write lots of parallel code be unable to get the hand threaded code to work, but get TBB to work. In each case I've seen - the programmer had to stop with hand threading because they just had to go work on something else after their effort to add pthreads to a big program didn't work well enough. We've seen experienced programmers get good scaling with TBB the first time, something they have to spend time on with hand coded threads.
    (2) If a program can be written with OpenMP, we've seen comparable performance between OpenMP and TBB on first implementations of code - but then further tuning can lead to OpenMP outperforming TBB if the OpenMP dynamic scheduling can be turned off as an advantage (use 'static scheduling' for a boost in speed). TBB is always dynamic, and will get beaten in such cases. Of course, dynamic scheduling can be a huge win for many cases with problems that are even a little bit irregular.
    (3) Distributed memory code (using MPI or a 'cluster' version of OpenMP) - usually outperforms everything else. This is because, as a developer, you need to work out how to write a program with minimal dependencies between code running on each node of the computer. This work is not easy... but once done (if it can be done) - the program has few synchronizations, no memory contention, etc. I don't recommend MPI for anyone thinking of coding for 2-4 cores and starting parallelism for the first time.
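The static versus dynamic scheduling trade-off in point (2) can be sketched with two OpenMP loops. This is a minimal illustration, not a benchmark: the function names and the `chunk` parameter are made up for the example, and the pragmas only take effect when the compiler is run with OpenMP enabled (for instance `-fopenmp` with gcc). Without that flag the pragmas are ignored and both loops run serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Static scheduling: iterations are divided among threads up front.
// Low overhead, and a good fit when every iteration costs about the same.
long long sum_squares_static(const std::vector<int>& v) {
    long long total = 0;
    const int* data = v.data();
    const long long n = static_cast<long long>(v.size());
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long long i = 0; i < n; ++i) {
        total += static_cast<long long>(data[i]) * data[i];
    }
    return total;
}

// Dynamic scheduling: threads grab `chunk` iterations at a time as they
// become free. More bookkeeping, but it balances irregular workloads.
long long sum_squares_dynamic(const std::vector<int>& v, int chunk) {
    assert(chunk > 0);
    long long total = 0;
    const int* data = v.data();
    const long long n = static_cast<long long>(v.size());
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (long long i = 0; i < n; ++i) {
        total += static_cast<long long>(data[i]) * data[i];
    }
    return total;
}
```

For a uniform loop like this one the static version would be expected to have the edge; the dynamic version tends to pay off once per-iteration cost varies.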

  2. Re:Intriguing on Intel Releases Threading Library Under GPL 2 · · Score: 1

    It is fully compatible with gcc - and has been tested with gcc on Linux, Mac OS X and Windows for Intel/AMD processors as well as G5 processors on Mac OS X. I think the FreeBSD and Solaris builds use gcc too, but I'm not 100% sure. That's just so far - no reason to expect it to be hard to port/build for other systems.

  3. Re:Memory requirements - bummer on Intel Releases Threading Library Under GPL 2 · · Score: 2, Informative

    Sorry - the requirements are more about the likely memory needed on a system for that OS + a C++ program, etc. In other words - have a system with 512M in general, not for TBB. This is confusing me now!

    TBB really has minimal requirements of its own... using TBB won't really change the memory needs enough to worry about it.

    I'll see if we can find a way to update the web page so it makes sense. Sorry for the confusion.

    The concurrent containers are much more scalable than those in STL - much more scalable. The queue, vector and hash table we provide are much better choices in a threaded application (with or without the other features of TBB) than using STL containers.

    The scalable memory allocator is definitely a gem. The library for it is completely separate from the rest of TBB - so definitely a good place to start if you have a threaded application which still calls malloc().

  4. Re:Compatibility kinda sucks on Intel Releases Threading Library Under GPL 2 · · Score: 1

    Solaris for x86 yes with gcc - download it and it will build. We think it all works (we've tested some) - but we'd obviously love feedback if we missed something (we don't think so - but it hasn't been ported for as long as others for TBB). We're working on making sure the Sun compiler works too - and SPARC is something we're hoping to get some help with (some Sun engineers are in touch and want to help us with it - so hopefully we'll have that working in time).

  5. Re:Compatibility kinda sucks on Intel Releases Threading Library Under GPL 2 · · Score: 4, Informative

    We've been supporting Linux, Windows and Mac OS X for x86, x86-64 and Itanium processors in the commercial product for a year. And, yes, those include Intel and AMD processors. The commercial product information only lists those.

    The commercial product information quoted does not include some ports which were completed for the open source project only days before the open source release.

    Preparing for open source, we were able to get G5 for Mac OS X as well as support for Solaris and FreeBSD (both x86 and x86-64) working before releasing on Tuesday. It was tight - but they made it. I wasn't sure until the week before what we would have - but the team got them working. I think it will be easier now that the project is started - and we can let others join in to help us.

    I should also say we got a bunch more Linux distributions working for builds too. We have tested them enough to see no issues - but we haven't enough experience to call them supported on the product pages (commercial product). Please look for the latest ports on the open source project threadingbuildingblocks.org. We'll work with anyone who has processors/system expertise and needs any advice we can offer. Understandably, we don't have a lot of non-Intel hardware inside Intel to test upon and we are hoping others can help a bit with that.

    For compilers - we have gcc, Intel, Microsoft and Apple (gcc in Xcode environment) compilers all working with the builds. It seems like we may have something to do to get Sun's compilers and/or environment working - some Sun engineers are in touch and helping us double check this. No schedule - just working together - which I have faith will get results to put out in an updated open source copy in the not too distant future - non-binding wish - this is not a promise ;-) We're talking about what to do together to add SPARC support too - which shouldn't be too hard but will take some work.

    The biggest issue from processor to processor is knowing how to implement a few key locks, and atomic operations, best in assembly language. Since we have support for processors with both weak and strong memory consistency models - we know TBB is up to the task.
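To make the "few key locks and atomic operations" concrete, here is a minimal sketch of a test-and-set spinlock and an atomic increment. It is not TBB's implementation: it uses the `std::atomic` facilities that C++ standardized later as a stand-in for the hand-written assembly the comment mentions, and the names `SpinLock` and `fetch_and_increment` are illustrative only. The acquire/release orderings are where weak and strong memory models differ in cost.

```cpp
#include <atomic>

// Minimal test-and-set spinlock built on std::atomic_flag.
class SpinLock {
public:
    void lock() {
        // Acquire ordering: accesses inside the critical section cannot
        // be reordered before the lock is taken.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Busy-wait until the current holder clears the flag.
        }
    }

    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock() {
        // Release ordering: writes made while holding the lock become
        // visible to the next thread that acquires it.
        flag_.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Atomic counter bump; returns the value the counter held before the add.
int fetch_and_increment(std::atomic<int>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```

On a strongly ordered processor the acquire and release orderings typically cost nothing extra; on a weakly ordered one the compiler has to emit barrier instructions, which is the per-processor work a port involves.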

    TBB is very strongly tied to shared memory, and so a port to a Cell processor (or a GPU) would be a bit more challenging - but might be doable for the Cell. We've had only a few discussions/thoughts - no progress I know of figuring out a good approach there. That will almost certainly take someone with more Cell experience than we have at this time. I'm open to learning - but I'd need a teacher for sure.

  6. Re:Difficult to implement on Intel Releases Threading Library Under GPL 2 · · Score: 1

    Definitely use OpenMP if it works for you. We (at Intel) are going to keep supporting OpenMP - we do not see TBB lessening the need for OpenMP. Data parallel programs which have nice loop nests and code up in C and/or Fortran work very well with OpenMP, and do not need the extra complexities of using TBB. The extra complexity of TBB helps with task oriented and C++ code and data structures. TBB can also handle nested parallelism and irregular problems or more complex arrangements like pipelines.

  7. Re:Intel - The Software Company on Intel Updates Compilers For Multicore CPUs · · Score: 1

    For the mix you suggest - optimal is a different binary for each (that is true for every compiler - no matter what anyone claims), near optimal (close enough) is one of the -ax options as you suggest since you know you don't have Pentiums, Pentium II, etc. The real headache is you can't build an optimal binary with lots of paths... i-cache kills you... in fact... 2 paths (highly specific and generic) is hard to make run faster, and 3 paths is nearly impossible to make faster reliably with the compiler. Forget a path for every processor - the compiler really gets to make 2. You can force more, but your results are not likely to be good. The options let you pick how specific you want the 'highly specific' to be - the other is generic (the compiler switches let you pick how generic - so you can say 'assume MMX and SSE' so you don't have to be completely braindead). Unfortunately, our default caters to those who don't read the manual - and so generic includes 386 processors. Many of our customers would be upset if we produced binaries which don't run "everywhere." Many would be happy if P6 was the lowest we supported or even Pentium II.
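The two-path idea (one 'highly specific' path, one generic path) can be sketched as a function pointer chosen once at startup. This is a hand-written illustration, not what the compiler generates: the names are hypothetical, the "specific" kernel is just an unrolled loop standing in for real SSE code, and the feature check is passed in as a plain bool rather than read from the processor.

```cpp
#include <cstddef>

// Generic path: plain scalar code that runs on any processor.
float dot_generic(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Stand-in for the 'highly specific' path. A real one would use SIMD
// instructions; four independent accumulators mimic that shape portably.
float dot_specific(const float* a, const float* b, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    float sum = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i) {  // leftover elements
        sum += a[i] * b[i];
    }
    return sum;
}

using DotFn = float (*)(const float*, const float*, std::size_t);

// Choose a path once; every later call goes through the returned pointer.
DotFn select_dot(bool has_required_features) {
    return has_required_features ? dot_specific : dot_generic;
}
```

Each extra path duplicates the hot code, which is the instruction cache pressure the comment describes.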

  8. Re:Intel - The Software Company on Intel Updates Compilers For Multicore CPUs · · Score: 1

    There are several processors it will fail for - from several vendors. They are not recent processors, but they exist based on our research. Since they are not Intel processors, I'm not going to list them because I cannot find vendor information acknowledging the issue. Since it actually is not an issue (since that is not the CPUID usage defined as valid) - I'm not surprised it is not documented. At least some were not AMD processors either based on my memory - but I don't have the information at my fingertips any more.

    I do know Intel and AMD document the approved process very carefully. And since our designers have agreed upon the sequence they will support - it is not safe to use a different sequence for detection and assume it will work in the future.
    You are proposing a sequence for detection neither Intel nor AMD agree to support. You observe it works for most processors you know of. I agree that is true.
    I just disagree this is the right approach - I prefer to see the approved and supported sequence used, all processors covered, and no future issues.

  9. Re:Riddle me this on Intel Updates Compilers For Multicore CPUs · · Score: 1

    Right now, we are designed to assume SSE4 support will include SSE3, SSE2, SSE and MMX. Since that fits all processors I know of - it is a reasonable dependency. If that changes in the future, we could revisit it.

  10. Re:Intel - The Software Company on Intel Updates Compilers For Multicore CPUs · · Score: 1

    The patches you point to will cause the compiled code to fail on older Intel processors and older AMD processors by bypassing checking and forcing code to run regardless of processor. We don't support that nor encourage doing that.

    The 10.0 compilers have SSE3 tested and supported for Intel and AMD - as I said before, the previous version was designed before AMD had SSE3 support. The 10.0 documentation has more information on this topic, worth a read.

  11. Re:Looks like something they rushed out on Intel Updates Compilers For Multicore CPUs · · Score: 1

    It's not up to our standards. It doesn't deserve all the abuse, some of that was explained elsewhere - but it's still not our best work. We'll update it (soon). I can't promise we'll fire the person - heck, I write like that on a bad day... I don't think I wrote this piece... but I'd better check.
    Thanks for the encouragement and entertainment. :-)

    P.S. the comments about where our development team is - were not correct... I don't think the sun sets on our team - plenty of our compiler team is in the U.S. - and the threading tools folks are mostly in Illinois (but they are worldwide too). I'm not inclined to believe it was a second language issue.

  12. Re:Moore's law onto programmers?! on Intel Updates Compilers For Multicore CPUs · · Score: 1

    I don't remember programs running twice as fast on a 2GHz processor as a 1GHz processor - oh, unless they fit in cache. I'm not arguing that single threaded gets a lot of benefit from multicore (it gets some from offloading other tasks, from OS services, other services it might use, etc.) - but I can't stand the myth that doubling clock rate was doubling performance of single threaded applications. It drives me NUTS (not far to go?)
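A back-of-the-envelope model shows why doubling the clock does not double performance unless the program fits in cache. It is a deliberate simplification with made-up function names: run time is split into a compute part that scales with the clock and a memory stall part that is assumed not to scale at all.

```cpp
#include <cassert>

// Run time = compute part (shrinks as the clock rises) + time spent
// waiting on memory (assumed independent of the clock in this model).
double run_time_seconds(double compute_cycles, double clock_hz,
                        double memory_stall_seconds) {
    assert(clock_hz > 0.0);
    return compute_cycles / clock_hz + memory_stall_seconds;
}

// Speedup from moving the same program from old_hz to new_hz.
double speedup_from_clock(double compute_cycles, double memory_stall_seconds,
                          double old_hz, double new_hz) {
    return run_time_seconds(compute_cycles, old_hz, memory_stall_seconds) /
           run_time_seconds(compute_cycles, new_hz, memory_stall_seconds);
}
```

With no stalls (everything in cache), going from 1GHz to 2GHz gives a speedup of exactly 2. If half of the original one-second run is memory stalls, the same clock doubling takes the run from 1.0s to 0.75s, a speedup of only about 1.33.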

    issue 1: Memory wall - memory speed wasn't keeping up with processor speed; multicore slows this nuisance... at least clockspeed-wise... bandwidth is a challenge when caches aren't enough - and solving this isn't easy (at least not on a budget)

    issue 2: ILP wall - out of order execution and more and more exotic hardware to try to maintain a myth of double clock - double perf... at a high cost... multicore stops this rat race

    issue 3: power wall - double clock, double power consumption (okay - this is a lie, double clock means 4X the power... but if you wait 18 months... you can halve the size and get to only 2X) - multi-core stops this (even reversed it a bit)

    So - you got parallelism to deal with. Work - yes. But the three walls in front of us weren't going away.

  13. Re:Intel - The Software Company on Intel Updates Compilers For Multicore CPUs · · Score: 5, Informative

    (Yes - I work for Intel - post for myself - tell it like it is) Cute story if it was true. However - Intel compilers and libraries are designed to use features - but we don't come out every day with an update. The new compilers support SSE 4, but Intel only. AMD support comes after the processors exist that support it. Libraries aren't quite there yet with SSE 4 (I guess we hate Intel processors too - flame us). But AMD support for SSE 3 is there - now that it is in their processors. It wasn't there when we developed version 9 of the compilers. We do test our compilers/libraries on other implementations - because believe it or not - we care if it works. It doesn't always - and we adjust the compiler/library to make it work. We had a beta a few years ago which blew up on Intel processors and worked on AMD processors (yeap - I said it right - imagine the embarrassment when a customer told us about that combination). Oops. I heard that was because we released support before we tested that it worked on that processor. So we learned not to do that too often. By the time we release product - it should work on all processors. I would say "does" or "guaranteed to" - but the lawyers would freak - because nothing in life is guaranteed. We are clearly not trying to screw our customers though - you know... the developers who count on our software. It is annoying when people suggest that might be our goal.

    My favorite complaint: Intel checks "CPUID"
    No duh - that's where the feature information is.

    Next favorite: Intel checks for "GenuineIntel".
    Another "no duh" - RTFM from Intel or AMD - the feature flags checking has to come AFTER you determine the manufacturer AND family of the processor...
    unless you don't care about running on all processors
    (spare pointing out to me that you can skip the first two checks - look at the SSE flag - and it is usually right - unless say you pick just the right older processor)
    We do the checks the way Intel and AMD manuals say we have to... if that is evil... so be it.
    We even start by testing if the CPUID instruction exists (it didn't before Pentium processors).
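The ordering described above (vendor string, then family, then feature flags) can be sketched as pure decoding functions over raw CPUID register values. This is an illustration under stated assumptions, not Intel's compiler logic: the struct and function names are hypothetical, the list of accepted vendors and the `min_family` threshold are placeholders a caller would choose, and the step of checking that the CPUID instruction exists at all is left out. The bit positions are the documented ones for leaf 0 and leaf 1.

```cpp
#include <cstdint>
#include <string>

// Raw register values returned by one CPUID leaf.
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

// Leaf 0: the 12-character vendor string is stored in EBX, EDX, ECX,
// in that order, least significant byte first.
std::string vendor_string(const CpuidRegs& leaf0) {
    const std::uint32_t regs[3] = {leaf0.ebx, leaf0.edx, leaf0.ecx};
    std::string vendor;
    for (std::uint32_t reg : regs) {
        for (int k = 0; k < 4; ++k) {
            vendor.push_back(static_cast<char>((reg >> (8 * k)) & 0xFF));
        }
    }
    return vendor;
}

// Leaf 1 EAX: base family is bits 11:8; the extended family (bits 27:20)
// is added only when the base family is 0xF.
unsigned display_family(std::uint32_t leaf1_eax) {
    const unsigned base = (leaf1_eax >> 8) & 0xF;
    const unsigned extended = (leaf1_eax >> 20) & 0xFF;
    return base == 0xF ? base + extended : base;
}

struct SimdFeatures {
    bool mmx, sse, sse2, sse3, sse4_1;
};

// Leaf 1 feature flags: MMX/SSE/SSE2 live in EDX, SSE3/SSE4.1 in ECX.
SimdFeatures decode_features(const CpuidRegs& leaf1) {
    SimdFeatures f;
    f.mmx = ((leaf1.edx >> 23) & 1u) != 0;
    f.sse = ((leaf1.edx >> 25) & 1u) != 0;
    f.sse2 = ((leaf1.edx >> 26) & 1u) != 0;
    f.sse3 = ((leaf1.ecx >> 0) & 1u) != 0;
    f.sse4_1 = ((leaf1.ecx >> 19) & 1u) != 0;
    return f;
}

// Vendor first, family second, feature flag last.
// The vendor list and min_family are illustrative policy, not a spec.
bool may_use_sse2(const CpuidRegs& leaf0, const CpuidRegs& leaf1,
                  unsigned min_family) {
    const std::string vendor = vendor_string(leaf0);
    if (vendor != "GenuineIntel" && vendor != "AuthenticAMD") {
        return false;  // unknown vendor: fall back to the generic path
    }
    if (display_family(leaf1.eax) < min_family) {
        return false;
    }
    return decode_features(leaf1).sse2;
}
```

On x86 with gcc or clang, the register values could be filled in with `__get_cpuid` from `<cpuid.h>`; keeping the decoding separate makes it testable on any machine.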

  14. Re:Anyone want to... on Intel Updates Compilers For Multicore CPUs · · Score: 3, Interesting

    The compiler will try like crazy to do that - and sometimes it does a great job. Most of the time - you'll have work to do (it won't do it for you). What we've found though - is that anything a programmer can do to express tasks that are splittable - makes the automation more and more possible. OpenMP (11 years old now) has carried that into the multicore world from the world of supercomputing - for loops. Don't have loops? Well, that there would be a tough one.
    Threading Building Blocks is a good option for C++ developers - because it pushes you to rewrite key parts of the code - for thread safety (too bad C++/C doesn't force that) and for this automation of splitting. Often this is easier than you'd think - and then you're in easy city. I'm not saying it is easy, nor a cure-all - but it is useful to look at it and see if it isn't the best idea so far - and see what else we can do.
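What "expressing tasks that are splittable" can look like is sketched below with a hand-rolled range type. This is not the TBB API: `Range`, `is_divisible`, `split` and `sum_range` are written for this example, and the recursion runs serially. The point is only that once work is described as a range that can be halved, a scheduler is free to hand the halves to different cores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A half-open index range [begin, end) that knows how to split itself.
struct Range {
    std::size_t begin, end, grain;

    // Worth splitting only while it holds more than `grain` items.
    bool is_divisible() const { return end - begin > grain; }

    // Keep the lower half in *this and return the upper half.
    Range split() {
        assert(is_divisible());
        const std::size_t mid = begin + (end - begin) / 2;
        Range upper{mid, end, grain};
        end = mid;
        return upper;
    }
};

// Divide-and-conquer sum. The two recursive calls touch disjoint data,
// so a task scheduler could run them in parallel; here they run serially.
long long sum_range(const std::vector<int>& data, Range r) {
    assert(r.grain >= 1 && r.end <= data.size());
    if (r.is_divisible()) {
        Range upper = r.split();
        return sum_range(data, r) + sum_range(data, upper);
    }
    long long total = 0;
    for (std::size_t i = r.begin; i < r.end; ++i) {
        total += data[i];
    }
    return total;
}
```

The grain size is the tuning knob: too small and the splitting overhead dominates, too large and there are not enough pieces to keep every core busy.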

  15. Re:And to make vector ops even simpler than in Par on Intel Updates Compilers For Multicore CPUs · · Score: 2, Informative

    You're right - vectorization - by itself can't handle step 11 dependent on step 10... and assuming there isn't a magical way to rewrite the loop to remove the dependence (which is the first thing the compiler will try to discover and do for you - but usually it can't) - then you need to look at pipelining - software pipelining on a single core, or parallelism on multi-core... but you'll have to have the right interconnect processor to processor to match the work to get multiprocessor pipelining to do what you want. Software pipelining can be very effective on loops with dependencies loop to loop.
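The difference between a loop the vectorizer can handle and one where "step 11 depends on step 10" is easiest to see side by side. The two functions below are a minimal sketch with illustrative names; whether a given compiler actually vectorizes the first one depends on its flags and its alias analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: a[i] depends only on b[i] and c[i], so several
// elements can be computed at once. A natural candidate for vectorization.
void add_arrays(const std::vector<float>& b, const std::vector<float>& c,
                std::vector<float>& a) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
    }
}

// Loop-carried dependence: iteration i reads a[i - 1], which iteration
// i - 1 has just written. The iterations cannot simply run side by side.
void running_sum(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] = a[i - 1] + b[i];
    }
}
```

The second loop is the kind where the compiler either finds a rewrite that removes the dependence or falls back to techniques such as software pipelining.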

  16. Re:Anyone want to... on Intel Updates Compilers For Multicore CPUs · · Score: 5, Informative

    Automagical - we try. Vectorization, parallelization - I dare say the Intel compilers are at least as good at it as any compiler ever has been. Bold statement - yeah. I believe it is true.

    A more interesting question is "Is that good enough?" For vectorization, the answer is 'usually' - so some additional work/headaches happen when it isn't enough. For parallelization - the answer is at best 'sometimes.' So I'll get flamed two ways: (1) by people very happy with it - who say that I've understated how good it is - and it is all they need, (2) by people with programs which don't get magical auto-parallelism to solve their needs. There are more people in #2 than #1 - but this ain't a 1-size-fits-all-world. Not a bad deal if it solves your problems - otherwise - you got work to do... but that ain't the compiler's fault... parallelism requires work for most of us.

    About languages...
    Virtually every Fortran, C and C++ compiler these days supports OpenMP, which is not part of the official standard - but is there to use. It is loop oriented, and is very Fortran-like and fits into C well enough... but is definitely not C++ like.

    Fortran and C/C++ don't support threading in the language, you need to write your code to be thread-safe, and you need to use a threading package like Windows threads or POSIX threads (pthreads). Boost threads offer a portable interface to hit on the key threading needs - essentially wrappers for pthreads and Windows threads, etc. - the standards are likely to add a portable interface officially in the future. One thing Java did from the start.

    Intel compilers -> Intel CPUs -> all compatible processors
    The Intel compilers and libraries aim to beat other compilers and libraries regardless of the processor they are run on. No one will get it right all the time - so this is not a dare to find single examples of little code samples to prove me wrong. But if a real program doesn't get the best results from Intel - we want to know. (yeap - I work at Intel - I post for myself)