Slashdot Mirror


Ars Technica on Hyperthreading

radiokills writes "Ars Technica has a highly-informative technical paper up on Hyper-Threading. It's a technical overview of how simultaneous multithreading works, and what problems it will introduce. It also explains why comparing the technology to SMP is Apples to Oranges, in a sense. Starting with the 3 GHz Pentium 4, this tech will be standard in Intel's desktop lines (it's already in the Xeon), so this is important stuff."

14 of 235 comments

  1. Hyperthreading versus SMT by sielwolf · · Score: 3, Interesting

    I'm personally more partial to calling it Simultaneous Multi-Threading (SMT) rather than Hyperthreading, which is the brand name Intel created for the concept. Sort of like Xerox versus Photocopy. Of course, there are some mix-ups among those who think of the multi-threading as OS-based rather than hardware-based. Eh, personal preference.

    --
    What is music when you despise all sound?
  2. Oracle, W2K Enterprise by Perdo · · Score: 2, Interesting

    They love Hyperthreading. Licensing is determined per CPU reported to the OS, not per actual piece of silicon.

    Double your licensing cost for a 5% to 30% performance improvement? I don't think so. Hyperthreading is DOA for enterprise.

    Luckily MS has decided to enable 2 CPUs in XP Home, so you don't have to ante up another hundred bucks for XP Professional for the 5% to 30% performance improvement.

    Junkware.

    --

    If voting were effective, it would be illegal by now.

  3. Tera/Cray MTA by astroboy · · Score: 5, Interesting

    The company that now owns the name Cray does something very much like this on a fairly grand scale on its own architecture, the MTA (Multi-Threaded Architecture). Here, each processor switches between 128(!) hardware threads to take advantage of the sort of concurrency you can get while waiting for memory access, etc.
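
A rough way to see why switching among many hardware threads helps: if each thread computes briefly and then stalls on memory, a single thread leaves the processor mostly idle, while enough threads can cover the stall. The sketch below is a deliberately simplified model, not a description of the MTA's actual pipeline; the function name and the cycle counts are hypothetical.

```python
def utilization(threads: int, compute_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles the processor spends doing useful work.

    Simplified model (an assumption, not measured data): every thread
    alternates between `compute_cycles` of work and `stall_cycles` of
    waiting on memory, and switching between threads costs nothing.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


# Hypothetical numbers: 1 cycle of work, then a 127-cycle memory stall.
one_thread = utilization(1, 1, 127)
many_threads = utilization(128, 1, 127)
```

Under these made-up numbers a lone thread keeps the processor busy for a small fraction of the time, and the model saturates once there are enough threads to fill every stall.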

  4. Re:SMP performance by FuzzyMan45 · · Score: 2, Interesting

    Actually, I've used a hyper-threaded system (dual 2.0 GHz Xeons) and it's really not that much faster. Maybe Intel fixed some stuff in the final spec, but the chips felt faster when not in HT mode...

  5. Re:"It's already in the Xeon" by jc42 · · Score: 4, Interesting

    > developers have to make their programs multithreaded. In the Windows world, this happens already, far less so in the Linux world.

    There's a good reason for this. The biggest problem with debugging multithreaded code is preventing the threads from shooting each other in the foot. On unix-like systems, there's a simple, elegant solution to this: processes. If you use independent processes with shared memory, you can limit the foot-shooting problems to only the shared segments, and the rest of the code is safe. You also have several kinds of inter-process communication that are easy to program and fairly failsafe.
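
As a minimal sketch of that idea (Unix-only, and assuming an anonymous shared mapping stands in for the shared segment), the child below scribbles on both an ordinary variable and the shared mapping; only the shared write is visible to the parent. The function name is made up for illustration.

```python
import mmap
import os
import struct


def child_writes_private_and_shared():
    """Fork a child that modifies a private variable and a shared segment."""
    shared = mmap.mmap(-1, 8)   # anonymous mapping, shared across fork() on Unix
    private = [0]               # ordinary memory: the child gets its own copy

    pid = os.fork()
    if pid == 0:
        # Child process: both writes succeed here, but only one escapes.
        try:
            private[0] = 99
            shared[0:8] = struct.pack("q", 42)
        finally:
            os._exit(0)         # leave immediately without running parent code

    os.waitpid(pid, 0)          # parent: wait until the child has finished
    (shared_value,) = struct.unpack("q", shared[0:8])
    shared.close()
    return private[0], shared_value
```

The parent gets back its own untouched `private` value alongside whatever the child put in the shared mapping, so any cross-process damage is confined to those 8 bytes.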

    On Windows, you don't really have these things. Developers rarely take advantage of multiprogramming, because the inter-process communication tools are so complex. So the model is a single huge program that does everything. The natural development is toward an emacs-like system, in which everything is a module in one huge program. In such a model, it makes sense to want to use threads, so that some tasks can proceed when others are blocked.

    One way to get unix/linux developers to adopt threads is to make it more difficult to use the basic unix multi-processing and IPC tools. If they can be made more complex than threads, then people will adopt the Windows model.

    Alternatively, the threads library could be made as easy to use as the older unix approach. But so far, there's little sign of this happening.

    Threads are a debugging nightmare, and a programmer who has lost months trying to debug a threadized program, only to find that the end result runs even slower than the original, is going to be shy about doing it again.

    Also, calling the developers dummies isn't very persuasive. They mostly hear such insults as a euphemism for "It's too complicated for your simple mind." When I hear things like that as answers to my questions, I tend to agree with my critic, and revert to things that I can understand and get to work right.

    --
    Those who do study history are doomed to stand helplessly by while everyone else repeats it.
  6. Re:SMP performance by really? · · Score: 2, Interesting

    ... and benchmarked a bit faster in non-HT mode for me. (FreeBSD 4.6.2, with an 8-port 64-bit 3ware controller)

    --

    "Consistency is contrary to nature, contrary to life. The only completely consistent people are the dead." A. Huxley
  7. Increasing pain of Mispredicts and IO Access by brandido · · Score: 3, Interesting

    When Intel switched from the P3 architecture to the P4 architecture, they increased the depth of their pipeline from 10 to 20 stages, I believe. My understanding was that this significantly increased the performance penalty for branch mispredicts and whatnot that require a flush of the pipeline. I am curious whether adding SMT to this will increase the penalty for mispredicts even more, depending on whether both threads must be flushed or only the one. If it does, are there cases where the penalty would outweigh the benefit?

    --
    First Falcon-1 to orbit, then Falcon-9. Then I can die a happy man.
  8. Re:SMP performance by Anonymous Coward · · Score: 1, Interesting
    Actually, I've used a hyper-threaded system (dual 2.0 GHz Xeons) and it's really not that much faster. Maybe Intel fixed some stuff in the final spec, but the chips felt faster when not in HT mode...

    Don't you really need to recompile to take advantage of hyper-threading? On top of that, current UIs just barely take advantage of conventional multithreading in the first place. So it's not surprising that it didn't feel faster.

  9. Re:Might not speed up benchmarks... by Ost99 · · Score: 2, Interesting

    SMT isn't necessarily a good idea for desktop computers, perhaps especially from a GUI / responsiveness point of view. SMP machines don't share cache, and have problems running tightly coupled threads, because the threads have to check the other CPU's cache when reading / writing to a (cached) shared resource. On an SMT machine this is not a problem, as the cache is shared.

    SMP does a very good job of "hiding" all the processes the OS runs from a desktop user; you'll never experience slowdowns when the OS or another app writes to or reads from disk (unless it's out of memory and has to use the swap file). On older systems this could be a significant problem. Playing games from CD-ROM was often impossible, as the CD-ROM drive used 40%-60% CPU when reading / seeking. With SMP you had one CPU to do your stuff while the OS did its stuff on another (not strictly true, of course, but close).

    An SMT PC won't necessarily benefit the same way as an SMP machine when running such unrelated processes simultaneously, especially cache-intensive processes (cache is a shared, limited resource).

    I think SMT will benefit processor-intensive programs like simulations and (multithreaded) games.

    If some way of restricting each process's / thread's use of cache isn't implemented, realtime scheduling on these processors will be all but impossible (it's rather hairy on SMP as well).

    - Ost

    --
    ---- Sig. gone.
  10. Re:Might not speed up benchmarks... by Kashif+Shaikh · · Score: 4, Interesting

    UNIX developers to stop being afraid of multithreading and maybe some of us UNIX users would be able to take advantage of this

    Do you know why they are afraid? In my view, threads re-introduce the problem where you have a bunch of processes that can freely share any memory at will, use any means of communication, and are a pain in the Ass with a capital A to debug/trace properly (without using internal debuggers). Try debugging a single process with dozens of different threads (i.e. threads with different entry points), where each thread has another dozen instances of itself. Now try using traditional debugging tools like strace, gprof (for tracing), or gdb.

    In traditional multi-process environments, multiple processes are forced to communicate using well-designed message passing interfaces (pipes, Unix domain and network sockets, FIFOs, message queues, shared memory). Sure, you can use shared memory, but it's done in a more restricted way (you share a buffer) so that it's not abused. Badly written threads, in my experience, use global variables and literally hundreds of flags (I'm not joking) for communicating what to do, what the state is, etc. Debugging processes is easier IMO, because all processes can dump their core, and you can pause a process in action and see exactly what it's currently doing (tracing).
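
To make the message-passing style concrete, here is a small Unix-only sketch using a pair of pipes between a parent and a forked worker. The function name and the "uppercase the request" job are invented for the example; the point is that the two processes share nothing except the bytes that cross the pipes.

```python
import os


def ask_worker(text: bytes) -> bytes:
    """Send one request to a forked worker over a pipe and return its reply."""
    req_r, req_w = os.pipe()    # parent -> worker
    rep_r, rep_w = os.pipe()    # worker -> parent

    pid = os.fork()
    if pid == 0:
        # Worker: sees only what arrives on the request pipe.
        try:
            os.close(req_w)
            os.close(rep_r)
            request = os.read(req_r, 4096)
            os.write(rep_w, request.upper())
        finally:
            os._exit(0)

    # Parent: close the ends it does not use, then send and receive.
    os.close(req_r)
    os.close(rep_w)
    os.write(req_w, text)
    os.close(req_w)
    reply = os.read(rep_r, 4096)
    os.close(rep_r)
    os.waitpid(pid, 0)
    return reply
```

Because the whole interface is "bytes in, bytes out", tools like strace can show every exchange between the two processes. The sketch assumes short messages that fit in a single read.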

    I want to ramble more, but I'm tired. Anyone have more input on threads vs. processes?

  11. Lock Granularity by Ben+Jackson · · Score: 3, Interesting
    All the systems I have seen are either broken or have so many locks in them that they may as well be single-threaded.
    Don't you mean they had so few locks in them that they might as well be single-threaded? Having more locks isn't a bad thing unless your critical sections have to hold more than one or two locks at a time. After all, you've got to have some kind of mutual exclusion when modifying global data, and you can only have as many threads holding locks as there are locks to hold!

    To scale well you want to lock data rather than code and that can lead to many locks when you are operating on many structures. Ideally these locks each have less contention and better data sharing than "bigger" locks.
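
A small illustration of "lock data rather than code", written in Python with hypothetical names: each account carries its own lock, so threads working on different accounts never contend with each other. In CPython the interpreter lock limits real parallelism, so treat this as a sketch of the structure rather than a benchmark.

```python
import threading


class Account:
    def __init__(self, balance: int = 0):
        self.balance = balance
        self.lock = threading.Lock()    # the lock lives with the data it guards


def deposit(account: Account, amount: int) -> None:
    with account.lock:                  # only this one account is locked
        account.balance += amount


def run_deposits(n_accounts: int = 4, deposits_per_thread: int = 1000):
    accounts = [Account() for _ in range(n_accounts)]

    def worker(account: Account) -> None:
        for _ in range(deposits_per_thread):
            deposit(account, 1)

    # Two threads per account: they contend with each other,
    # but never with threads that work on a different account.
    threads = [threading.Thread(target=worker, args=(a,))
               for a in accounts for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [a.balance for a in accounts]
```

A single global lock around `deposit` would give the same totals but would force every thread to queue behind every other one.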

    1. Re:Lock Granularity by spitzak · · Score: 3, Interesting

      What I meant is the programs I have seen lock a piece of critical data in such a way that it is impossible for any two threads to be unlocked at any time. The code typically was like this:

      for (;;) {
        lock(big_lock_shared_by_everybody);
        figure_out_what_to_do();
        lock(small_lock_around_my_work);
        do_about_95_percent_of_the_work();
        unlock(big_lock_shared_by_everybody);  /* held for nearly all the work */
        do_about_5_percent_of_the_work();
        unlock(small_lock_around_my_work);
        do_a_bit_more_that_should_be_locked_anyway();
        wait_for_next_message();
      }
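
One possible restructuring, sketched in Python with hypothetical names: hold the shared lock only long enough to decide what to do, and do the work itself outside it, so other threads can pick up their own tasks in the meantime. This is only one way to shrink the critical section, not necessarily what those programs should have done.

```python
import threading
from collections import deque


def process_all(items, n_threads: int = 4):
    """Square every item; the shared lock guards only the task queue."""
    pending = deque(items)
    queue_lock = threading.Lock()      # plays the role of the "big" lock
    results = []
    results_lock = threading.Lock()    # small lock around the shared output

    def worker() -> None:
        while True:
            with queue_lock:           # short: just figure out what to do
                if not pending:
                    return
                item = pending.popleft()
            value = item * item        # the bulk of the work, no lock held
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Here the big lock is released before the work starts, instead of being held across 95% of it as in the loop above.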

  12. Re:"It's already in the Xeon" by joib · · Score: 3, Interesting

    Umm, could you elaborate on this? Do you mean some kind of COW (copy-on-write) approach (i.e. MVCC in database terminology)? That is, if you write to a locked resource, a new copy of the resource is created, and when the writing is completed and no one is reading the original resource, the new one is copied over the old one? I think someone was experimenting with this for the Linux kernel; they came to the conclusion that for data structures which are mostly read-only this is faster than the traditional locking approach, but if you need to write a lot, it is slower because of the overhead of copying.
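
A toy version of the copy-on-write idea, in Python with made-up names. Note one simplification: instead of copying the new version back over the old one, the writer publishes it by swapping a reference, and old snapshots stay valid for readers still holding them. It relies on reference assignment being atomic in CPython, which is an assumption of this sketch.

```python
import threading


class CowTable:
    """Readers take no lock; writers copy, modify, then swap in the new copy."""

    def __init__(self, initial: dict):
        self._current = dict(initial)
        self._write_lock = threading.Lock()   # serializes writers only

    def snapshot(self) -> dict:
        # Readers must treat the returned dict as read-only.
        return self._current

    def update(self, key, value) -> None:
        with self._write_lock:
            new_version = dict(self._current)  # the copy: cost grows with size
            new_version[key] = value
            self._current = new_version        # publish by swapping the reference


table = CowTable({"mode": "single"})
before = table.snapshot()
table.update("mode", "hyper")
after = table.snapshot()
```

The full copy on every `update` is exactly the overhead mentioned above: cheap when writes are rare, expensive when they are frequent.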

  13. Hyperthreading at UCSD, and why the Tera Sucks by ShakaUVM · · Score: 3, Interesting

    UC San Diego has been a leader in research on hyperthreading. We used to have the Tera MTA, which kinda pioneered the whole field, and we have Dr. Dean Tullsen (and his lab of students), whose hyperthreading architecture was used in the new, now-cancelled, Alpha chip.

    References: The Tera: http://www.cs.ucsd.edu/users/carter/Tera/tera.html
    Dean Tullsen: http://charlotte.ucsd.edu/users/tullsen/

    I was one of the first five students to use the Tera after it came out of development. I decided to take a different approach in evaluating its performance. I didn't like what the Tera corporate benchmarkers were doing, which was taking applications with known parallelism, writing a serial version of the code, and then posting glowing reviews of the Tera automatically finding parallelism, ignoring that the number of pragmas they had to put into the code to allow the compiler to discover parallelism was more work than just writing parallel code oneself.

    I instead called them on their advertising that their compiler could discover latent parallelism in any computation-heavy code. I noticed John Carmack's .plan file at the time openly questioned the same claim, so I took the single-threaded, computation-intensive utility for Quake2 (BSP; LIGHT & VIS are multithreaded) and ran it on the Tera. Nutshell: it couldn't find parallelism. The 300 MHz Tera supercomputer ran at the equivalent speed of a 600 MHz Pentium. Which is crap considering the incredible memory bandwidth and number of computational units it had available.

    When I reported the results to Carmack, his response was, "I have never been a big believer in magically parallelizing dusty deck codes. I don't mind specifying explicitly parallel activities and threads, especially with the large payoffs involved."

    Cheers,
    Bill Kerney