The one distinction between a vector processor and a GPU used as a vector processor is that the vector CPUs have reasonable scalar performance. Most matrix math programs are MOSTLY vector math, but with a few scalar bottlenecks. What's the latency of running a branch-heavy decision tree through the long pipeline of a GPU? How big a program can you fit on the graphics card?
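To put rough numbers on the scalar-bottleneck point, here is a minimal sketch using Amdahl's law. The function name, the 95% vector fraction, and the speedup and slowdown figures are made-up illustrations, not measurements of any GPU or vector CPU.

```python
def amdahl_speedup(vector_fraction: float,
                   vector_speedup: float,
                   scalar_slowdown: float = 1.0) -> float:
    """Overall speedup when only the vector part of a program is accelerated.

    vector_fraction: share of the original run time that is vector math (0..1)
    vector_speedup:  how much faster the accelerator runs that share
    scalar_slowdown: how much SLOWER the accelerator runs the scalar remainder
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    scalar_fraction = 1.0 - vector_fraction
    # New run time, with the original run time normalized to 1.0.
    new_time = vector_fraction / vector_speedup + scalar_fraction * scalar_slowdown
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical program: 95% vector math, accelerator is 20x on that part.
    for slowdown in (1.0, 5.0, 10.0):
        print(f"scalar slowdown {slowdown:>4}x -> overall speedup "
              f"{amdahl_speedup(0.95, 20.0, slowdown):.2f}x")
```

The point of the third parameter is that a chip with poor scalar performance doesn't just fail to speed up the scalar 5%, it can make that 5% dominate the run time.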
The advantage of the GPU is that you already have it on the system. But if you really need to do this complex mathematical analysis, a DSP chip is probably the better tool.
If there were many programs that made use of SIMD-style math, the CPUs would all have co-processors to do that math really well. Oh look! They all do. That's what AltiVec / SSE / etc. are.
Where is the economic impetus for such a design? Any time you do custom work, it's not cheaper than commodity designs.
An SSD can be made with an embedded microcontroller to handle the fibre channel interfaces and aggregate the IO. The memory can be off-the-shelf ECC RAM. You throw in some redundant power connections, multiple batteries, some off-the-shelf hard drives and some clever firmware, and off you go. With devices like this you're not necessarily paying for the hardware; the cost comes from all the engineering time divided across a very small market.
Wafer-scale integration was once a holy grail for CPU designers, back when transistor count was hard to get. By the time anyone got close to figuring out how to connect execution units in a way that you could tolerate having some not function, CPU design had moved on. It is no longer a lack of ALUs that holds back CPUs, but rather the coordination of pipelines. Now the challenge is keeping the pipelines full in the face of high memory latency. Thus wafer-scale integration is of marginal importance to CPU designers. (note below)
As for memory applications, to be successful, wafer-scale integration would have to offer something that a multi-chip solution can't offer. Since you have to limit your signal speed and electrical design to talk across the memory bus anyway, does it really hurt much to have a multi-chip module (a DIMM)? Yes, it adds some to the physical packaging size, but it simplifies quality control, removes complexity, and is usable across the technology marketplace. Economies of scale are much more important than the component costs.
One area where wafer-scale integration (sort of) works in the marketplace is cache. On modern CPUs, only a tiny fraction of the transistors are used for the actual processor; the majority of the chip is actually used for L2 cache. This is made of SRAM latches, so it is very bulky. On the Athlon 64 for sure, and probably on other chips, half the L2 cache can be disabled. This allows AMD to sell chips with a manufacturing defect as a "reduced-performance" chip. They drop the performance rating by 200 compared with the same speed chip with a full-sized cache. This is the basic idea of wafer-scale, just slightly more specific. As cache sizes increase (they likely will, as chips are only becoming more vulnerable to RAM latency), this will likely become more important.
You're right, and for most people, as well as the bulk of corporate profits, the real question is how well can they push the technology down-market, and at what rate will they do that? Most gamers buy $150 graphics cards. Most consumers use the one that comes on the motherboard. Laptops make up almost half of computer purchases these days. What might be more interesting is a 4-pipe version of this card, clocked at 300 MHz, but integrated on a southbridge, or with embedded DVR functionality. Features and price are a better benchmark for market success than raw speed.
I'm thinking more like the SGI InfiniteVision system, which really is a bunch of Radeon (9X00 series or some derivative) chips ganged together and connected to the CrayLink Routers. It doesn't make one screen any faster, but it matters a lot when you're doing immersive environments with 30 million pixels.
Silicon Graphics, another early bay-area unix workstation success, was in a much smaller niche, even at its peak. SGI has been circling the bowl now since the late 90's and still hasn't gone away. They barely even lost any money last quarter.
Sun has a much more stable market of business buyers. They have to be selective to get back to profitability, but it's definitely possible, even without a radical change in market. People still pay big money for mid-range and high-end servers. People still pay big money for solid enterprise software. Business customers are willing to pay real money for real solutions. A company like Sun just needs to make sure that it solves today's hard problems, and does it at a price that's similar to the competition.
A slump doesn't mean a fall. A re-org doesn't mean a death knell. Sun has lots of chances left to redefine itself, and figure out how to be profitable. They just might have to lose market share and girth in the process.
Oh, and SGI makes you buy their RAIDs. (Rebranded LSIs, I think.) They aren't bad storage, but they are definitely a LOT more expensive. SGI is selling a little further up-market than Xraids.
Having played with the Xraid, I can say that it's no dog. It's a very nice entry-level RAID box. It's very dense (GB/rack-unit), very inexpensive, and performs pretty well. It does lack redundant controllers, though this is true of most entry-level RAIDs. I wouldn't be surprised to see a respin of this product in the next 12 months.
The SAN market is changing. There are more switch vendors, and they are all having to compete with iSCSI, so the cost per port is coming down. While it's true that Apple isn't competing in the market space where SANs have TRADITIONALLY been deployed, they are competing in the area where SANs are beginning to be deployed. "Only for graphic artists" isn't a joke. They announced these products at NAB (National Association of Broadcasters). This is no small trade show. This is a full convention center with multi-million dollar booths. TV and movie houses buy billions of dollars worth of computer equipment, and a sizable chunk of this is Apple.
These little Apple clusters are a joke compared to enterprise SANs of Symmetrix, Sunfire, and P690 boxes, but it's still a many billion dollar market. (Especially if you sell both the hardware and the software.)
Which is why this is important news. Apple can't be left behind on SANs. They don't have to dominate the world to be successful. It's just important that a customer doesn't ever have to say: "Gee. I really like this Final Cut Pro thing, but without a cluster filesystem I just don't see us using it. Let's go buy some HPs instead."
Which is why the Xserve RAID uses IDE disks. Only the host-side interface is fibre channel. This is a low-end, limited-feature RAID box. However, the price is excellent. It's WAY less expensive than entry-level RAIDs from Dell/EMC, Sun, IBM, LSI, or Hitachi. It's almost at JBOD (Just a Bunch Of Disks) prices.
According to the apple.com/xsan pages, it looks like you can use an Xserve as the metadata server. It even refers to using several of them with failover for high redundancy. Look at the video editing cluster section: it looks like you can use one of the clients as the metadata server. (It talks about an "optional" dedicated metadata server.) Considering how little an extra Xserve costs compared to the rest of the setup, I don't know why you would choose not to have a dedicated server.
Which will be interesting. To cut the costs and electrical requirements of a low-end / midrange card what will they do? The strength of this card seems to be the 16 pipelines, and sophisticated vertex units. How do you preserve the advances over the NV3x series without keeping most of those transistors?
The top-of-the-line card is always cool to drool over, and a few people with too much money will undoubtedly run out and buy this monster. However the mid-range and budget derivatives are generally much more interesting. (compare the number of GF5600/RA9600 cards sold to the number of GF5950/RA9800 sold)
They made this haul ass by doubling the number of pipes, but the first thing they are going to do when they put out a mid-range card is to halve, or quarter the number of pipes. How much has been done to refine this card, and how much impact will the new design have for those of us with $150 to spend on a video card?
Cray uses linux, both on the XD1 and on Red Storm. Cray wants nothing more than to ditch UNICOS/mp (an irix derivative) which was forced upon them by SGI when SGI owned the company.
Cray's target isn't linux clusters in particular, but rather clusters in general. They are trying to dispel the (partial) myth that clusters are the only way to do supercomputing.
Maybe. Maybe not. What are the real-world costs of running a linux cluster? (Don't just include the raw cost in hardware.) When you add up the cost of compute nodes, interconnect, storage, and the computer-room space, then add the recurring costs of managing, cooling, and powering the system, clusters get less attractive. The number of administrators required to run a cluster is generally MUCH higher than for vector/MPP systems. Then you have to figure out how much work ACTUALLY gets done. What is the mean time between failure of these systems? If one node fails, does that bring down the entire program? If so, how often do you need to do checkpoints? How much does that lower efficiency? Check out www.ahpcrc.org for an interesting study of the total costs of deploying a cluster. It's pretty high.
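The checkpoint questions can be roughed out with Young's classic first-order approximation for the checkpoint interval. This is a sketch under loud assumptions: node failures are independent, any single node failure kills the whole job, and the node MTBF and checkpoint cost below are invented numbers, not data from any real cluster.

```python
import math


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole job, assuming independent failures and that
    losing any one node brings the program down."""
    return node_mtbf_hours / nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation: interval = sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(checkpoint_cost_hours: float, mtbf_hours: float,
                      interval_hours: float) -> float:
    """First-order fraction of run time lost: time spent writing checkpoints
    plus, on average, half an interval of work redone after each failure."""
    writing = checkpoint_cost_hours / interval_hours
    redoing = interval_hours / (2.0 * mtbf_hours)
    return writing + redoing


if __name__ == "__main__":
    # Hypothetical cluster: 1000 nodes, 50,000 h MTBF each, 6 min per checkpoint.
    mtbf = system_mtbf(50_000.0, 1000)
    tau = young_interval(0.1, mtbf)
    print(f"system MTBF:         {mtbf:.1f} h")
    print(f"checkpoint interval: {tau:.2f} h")
    print(f"lost to overhead:    {overhead_fraction(0.1, mtbf, tau):.1%}")
```

The thing to notice is that the system MTBF shrinks linearly with node count, so the efficiency loss grows as you add nodes even if each node is individually reliable.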
Which is most curious, as the XD1 IS a linux cluster. It's a very well designed linux cluster, with very high bandwidth DMA between nodes. However, it is programmed, and behaves very much like more traditional clusters of microprocessors.
The XD1 is NOT the same as the big vector-processor X1s.
Cray could easily be at or close to the top of the top500 list; their X1 architecture will extend that far. However, for a lot of really important supercomputing codes, it's no contest: the cray will trounce the clusters (linux or otherwise). Those #19 crays are only 256 processors. To get similar performance a stack of xeons requires thousands of processors. Some tasks just can't be split apart that easily.
A cray processor has eight floating-point units running at 800 MHz. The big Mac cluster (for example) uses G5 processors which have 2 FPUs at 2000 MHz. Counting raw FPU-cycles, that is 6.4 billion per second against 4.0 billion, so the cray has a ~60% advantage (or, put the other way, the G5 is ~40% behind). However, the G5 processor has ~4GB/s memory bandwidth. The Cray has ~50GB/s memory bandwidth. If you have a problem that needs to do a HUGE amount of math on a tiny amount of data, the G5 will rock. If you have a problem that needs to do a HUGE amount of math on a GINORMOUS amount of data, buy the cray. (for a GINORMOUS amount of money too)
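For anyone who wants to check the arithmetic, here is a small sketch of the comparison using only the figures quoted above. It assumes one operation per FPU per clock, which is a simplification (fused multiply-add would change the absolute numbers), and the class and field names are mine.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    fpus: int            # floating-point units per processor
    clock_ghz: float     # clock rate in GHz
    mem_bw_gbs: float    # memory bandwidth in GB/s

    @property
    def peak_gops(self) -> float:
        """Billions of FPU operations per second, assuming 1 op/FPU/clock."""
        return self.fpus * self.clock_ghz

    @property
    def bytes_per_op(self) -> float:
        """Machine balance: bytes of memory traffic available per operation."""
        return self.mem_bw_gbs / self.peak_gops


# Figures as quoted in the comment above.
cray = Processor("Cray", fpus=8, clock_ghz=0.8, mem_bw_gbs=50.0)
g5 = Processor("G5", fpus=2, clock_ghz=2.0, mem_bw_gbs=4.0)

if __name__ == "__main__":
    for p in (cray, g5):
        print(f"{p.name:>5}: {p.peak_gops:.1f} Gops peak, "
              f"{p.bytes_per_op:.2f} bytes per op")
    print(f"peak ratio (Cray/G5): {cray.peak_gops / g5.peak_gops:.2f}")
```

The bytes-per-op figure is the one that matters for big data sets: the gap in raw math is less than 2x, but the gap in how fast each chip can be fed is far larger.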
Similarly, infiniband (à la the big mac) is really hot in the cluster interconnect space because it gives 2.5GB/s per node. The Cray gives you 51GB/s. You need to move a little data, buy a cluster. You need to move a lot of data, buy the Cray.
All PC makers are struggling with profitability in the consumer space. On a $350 ipod, you just can't make a lot of money. Even if $100 of that is profit, you first have to pay for all your R&D and marketing costs. Whatever is left has to first pay generic corporate overhead, and then maybe some profit. In short you can sell a lot of ipods (or ibooks for that matter), at a unit profit of $100, before you generate ANY corporate profit. Gateway is hurting, compaq is hurting, Dell's server business is good, but in the consumer space they are only making money because they move A LOT of volume. Selling into the consumer space SUCKS.
This is one reason you see Apple struggling to enter the low-end server space. IF they can charge $4500 for a reasonably equipped xserve with a service contract (let's suppose a per-unit profit of $1100), then they can afford to sell significantly fewer units and still make a profit. The server space isn't a gimme, and they are going up against a lot of competition, but it's a much more attractive space than the consumer market.
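The unit-profit argument is just break-even arithmetic, sketched below. The $100 and $1100 unit profits are the ones supposed above; the $50M of R&D, marketing, and overhead is a number I made up purely to show the shape of the comparison.

```python
import math


def break_even_units(fixed_costs: float, unit_profit: float) -> int:
    """Units that must be sold before fixed costs (R&D, marketing,
    corporate overhead) are covered and any corporate profit appears."""
    if unit_profit <= 0:
        raise ValueError("unit_profit must be positive")
    return math.ceil(fixed_costs / unit_profit)


if __name__ == "__main__":
    fixed = 50_000_000  # hypothetical fixed costs, same for both products
    for label, profit in (("consumer gadget", 100), ("low-end server", 1100)):
        print(f"{label:>15}: {break_even_units(fixed, profit):,} units to break even")
```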
I'm personally impressed with Apple's current server offerings. Now if they'd bring in a 4-CPU box and a RAID box with redundant controllers, I think you'd begin to see them make some headway. I'm cautiously optimistic.
Most opteron users are also going to be running 32-bit OSes, so benchmarking them as such is probably more useful than otherwise. Additionally, it's nice to see benchmarks that attempt to test with 1 variable (hardware) instead of comparing Microsoft's variations of software as well.
The fact that opteron is 64bit-capable is cool, but in performance analysis it's the underlying architecture that matters, not the register width.
They don't have hypertransport from every CPU to every other. An opteron has 3 HT links. Since some of those need to be used to connect to the system I/O devices, you have 2 left for your mini-NUMA system. Thus the processors would have to be connected in a ring.
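What a ring costs you is hops. A minimal sketch, assuming a plain bidirectional ring with two links per processor as described above (the function names are mine, and real boards may wire things differently):

```python
def ring_hops(n: int, a: int, b: int) -> int:
    """Fewest link traversals between processors a and b in a ring of n."""
    d = abs(a - b) % n
    return min(d, n - d)


def average_remote_hops(n: int) -> float:
    """Average hop count from one processor to each of the other n - 1."""
    return sum(ring_hops(n, 0, j) for j in range(1, n)) / (n - 1)


def worst_case_hops(n: int) -> int:
    """Hop count to the processor on the far side of the ring."""
    return n // 2


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n}-way ring: average {average_remote_hops(n):.2f} hops, "
              f"worst case {worst_case_hops(n)}")
```

Every extra hop is another trip through somebody else's memory controller, which is why where the OS puts the data matters.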
Linux is capable of intelligent memory layout. It can migrate data to the processor on which the threads are running, is intelligent about which processor runs which threads, and can make duplicate copies of read-only data. It works reasonably well. (some of this is the stuff SCO is in a huff about) However, I doubt this functionality is turned on in any off-the-shelf distros. If the benchmarker compiled a kernel with NUMA in mind, this would work, otherwise I doubt it.
Incidentally, since the two streams of a hyperthreading-capable xeon share the L2 and L3 caches, they benefit from NUMA grouping also.
Itaniums are expensive, but not outrageous compared to other high-end processors like Power4 or ultrasparc. They also perform quite well. They are definitely better performers than xeons for most of our apps.
The problem with itanium is not that they aren't a good technology, but rather that intel is trying to shove them into the high-end of the market, which is a difficult place to compete. sparc, power, pa-risc, alpha have all been around for years, have established customer bases, and lots of businesses have invested tons of money in running them. It's a difficult place to introduce any new products.
Intel has been stymied trying to sell ia64 into that space, and has undercut itself, by continuing to improve xeon, which performs pretty well and is comparatively inexpensive. Most segments are going to migrate to the all-american mantra of "GOOD-ENOUGH, and CHEAP!" which describes xeons/opterons perfectly. The market segments that won't migrate in that direction are willing to pay the big bucks for stability, and reliability. They are very slow movers. Intel might sell some itaniums to these customers, but they'd better be willing to wait a long time.
I think a lot of people judge itanium by the yardstick of Xeon, and maybe should not. If itanium ends up simply as a replacement for pa-risc, alpha, and MIPS in the SGI and HP portfolio, that may be a success by some measures.
You will find that most high-end Xeon systems are also NUMA systems. IBM, Unisys, HP all construct their really big xeon boxes as NUMA-clusters of 4-processor SMPs. They create a distributed memory machine at the chip-set level. This is actually what the opteron does, except that the chip-set (well, the memory controller part of it) is built into the processor.
I think the above poster had the correct idea about NUMA, but worded it in a misleading way. A NUMA design (either of opterons, or of Xeon-quads) will have to do some memory access through the memory controllers on other nodes. This increases the latency of memory access, and can clog up the inter-processor links if lots of memory loads/stores go to remote memory. Thus NUMA-aware operating systems and system libraries are necessary to maximize the amount of memory access that is local, and minimize the usage of the inter-processor links.
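A back-of-the-envelope model of why locality matters is sketched below. The 80 ns local and 140 ns remote latencies are placeholder figures I picked for illustration, not measurements of an opteron or any xeon chipset, and the model ignores contention on the inter-processor links.

```python
def average_latency_ns(local_fraction: float,
                       local_ns: float,
                       remote_ns: float) -> float:
    """Mean memory latency when local_fraction of accesses hit local memory
    and the rest go through another node's memory controller."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    LOCAL_NS, REMOTE_NS = 80.0, 140.0  # hypothetical latencies
    # 0.25 is roughly what a 4-node box gets with no NUMA awareness at all;
    # 0.90 and up is what a NUMA-aware OS is trying to achieve.
    for frac in (0.25, 0.50, 0.90, 1.00):
        print(f"{frac:.0%} local -> "
              f"{average_latency_ns(frac, LOCAL_NS, REMOTE_NS):.0f} ns average")
```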
While the opteron design is elegant, and fast, it is not the only smart way to do things. It offers great aggregate memory bandwidth, but can slow things down in the worst case. Most large NUMA systems are created by linking 4-way SMP nodes. (Examples: Sunfire, HP alphaservers, Cray X1, NEC SX-6, Unisys 7000, IBM xseries 4xx xeon, IBM xseries 4xx itanium,...) Apart from opteron systems, the only systems I can think of that do NUMA per processor are the cray T3E, SGI origin, and intel paragon, all of which are massively parallel supercomputers.
It is safe to say, however, that a shared bus system does not scale well beyond a few processors. This is best demonstrated by the 36 processor SGI-challengeXL, which was significantly bottle-necked at the memory bus.
Stamps are Cheap. This was true in the paper / postal service days of job hunting too. It's part of the cost of doing business.
These recruiting sites arose at the end of the .com boom, when the supply and demand suddenly reversed. Thus, the number of applicants per posting has jumped hugely. If you're looking for a specific skill set, you have to cast a wide net and then be selective. If you're more concerned about a good employee, but have broader experience requirements, get a referral. Existing employees are still an employer's best filter - nobody wants to work with a jerk.
There are no commercial interconnects that can out-pace even vaguely modern CPUs. In clusters the interconnect is always slower than the CPUs/RAM.
Infiniband isn't exactly like a bus to the CPUs, it's just a higher-bandwidth, lower-latency network, that serves the same purpose as ethernet. Infiniband off-loads some of the overhead from the CPUs onto a processor on the PCI card, but it is really just a network card.
The importance of the interconnect varies greatly from program to program. For some tasks ethernet is acceptable. Seti@home is a distributed computer for which 56K modems are an acceptable interconnect. For more complex problems infiniband is necessary. For the most complicated of problems, even infiniband is inadequate. For those problems one must use an MPP system like those from NEC, CRAY, SGI. Even these systems are running on something that is very similar to an interconnect, though at MUCH higher bandwidth / lower latency than even infiniband.
The connection machines never were real speed demons in their day. They were built to be used for AI codes (lots of one-bit integer ALUs), but AI groups don't have any money, so they re-tooled it to do floating-point math. Even then they really only sold them with the help of DARPA subsidies.
http://www.cray.com/company/history.html
As for the old crays, you probably don't want any of those from the 80's. Even the new cray X1 processors have a theoretical peak of 12.8 GFLOPS, a little less than twice the G5. But it's important to remember that this doesn't tell the whole story. The X1 has 34 GBps/CPU memory bandwidth and 77 GBps/CPU to cache. It also has 400GBps of interconnect bandwidth for a single cabinet. It's always been cheaper per flop to buy small computers and gang them together. It's not ALWAYS the best solution.
What the big mac performs really well on are hugely parallel computations with few dependencies between each piece of the computation. (Like Linpack, for example.) When there are a lot of dependencies between pieces of the computation, large shared-memory machines work much more efficiently. Thus a bunch of DOD and DOE labs (plus meteorological sites and boeing) are still interested in paying the premium for custom vector supercomputers.
CVFS is not made by veritas. It is made by ADIC. FYI.
NO!
There's no one solution for all problems.
All PC makers are struggling with profitability in the consumer space. On a $350 ipod, you just can't make a lot of money. Even if $100 of that is profit, you first have to pay for all your R&D and marketing costs. Whatever is left has to first pay generic corporate overhead, and then maybe some profit. In short you can sell a lot of ipods (or ibooks for that matter), at a unit profit of $100, before you generate ANY corporate profit. Gateway is hurting, compaq is hurting, Dell's server business is good, but in the consumer space they are only making money because they move A LOT of volume. Selling into the consumer space SUCKS.
This is one reason you see apple struggling to enter the low-end server space. IF they can charge $4500 for a reasonably equiped xserve with a service contract, (lets suppose a per unit profit of $1100) then you can afford to sell significantly fewer units, and still make proffit. The server space isn't a gimme, and they are going up against a lot of competition, but it's a much more attractive space than the consumer market.
I'm personally impressed with Apple's current server offerings. Now if they'd bring in a 4-CPU box and a RAID box with redundant controllers, I think you'd begin to see them make some headway. I'm cautiously optimistic.
Most opteron users are also going to be running 32-bit OSes, so benchmarking them as such is probably more useful than otherwise. Additionally it's nice to see benchmarks that attempt to test with 1 variable (hardware) instead of comparing Microsoft's variations of software as well.
The fact that opteron is 64bit-capable is cool, but in performance analysis it's the underlying architecture that matters, not the register width.
They don't have hypertransport from every CPU to every other. An opteron has 3 ht links. Since some of those need to be used to connect to the system I/O devices, you have 2 left for your mini-numa system. Thus the processors would have to be connected in a ring.
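To see what the ring costs, here's a small Python sketch that counts hops between CPUs when each one is wired only to its two neighbours. It assumes the two-links-for-CPUs layout described above. The CPU counts in the example are arbitrary, and hop count is only a stand-in for real latency.

```python
def ring_hops(src, dst, n_cpus):
    """Shortest number of link hops between two CPUs on a bidirectional ring."""
    d = abs(src - dst) % n_cpus
    return min(d, n_cpus - d)


def average_hops(n_cpus):
    """Mean hop count from CPU 0 to every other CPU (the ring is symmetric)."""
    others = range(1, n_cpus)
    return sum(ring_hops(0, dst, n_cpus) for dst in others) / (n_cpus - 1)


def worst_case_hops(n_cpus):
    """Hop count to the CPU on the far side of the ring."""
    return max(ring_hops(0, dst, n_cpus) for dst in range(1, n_cpus))


if __name__ == "__main__":
    for n in (2, 4, 8):  # arbitrary example sizes
        print(n, "CPUs:", round(average_hops(n), 2), "avg hops,",
              worst_case_hops(n), "worst case")
```

The worst case grows as half the ring size, which is why a ring is fine for a handful of processors and gets painful after that.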
Linux is capable of intelligent memory layout. It can migrate data to the processor on which the threads are running, is intelligent about which processor runs which threads, and can make duplicate copies of read-only data. It works reasonably well. (some of this is the stuff SCO is in a huff about) However, I doubt this functionality is turned on in any off-the-shelf distros. If the benchmarker compiled a kernel with NUMA in mind, this would work, otherwise I doubt it.
Incidentally, since the two streams of a hyperthreading-capable xeon share the L2 and L3 caches, they benefit from NUMA grouping also.
Itaniums are expensive, but not outrageous compared to other high-end processors like Power4 or ultrasparc. They also perform quite well. They are definitely better performers than xeons for most of our apps.
The problem with itanium is not that it isn't a good technology, but rather that intel is trying to shove it into the high end of the market, which is a difficult place to compete. sparc, power, pa-risc, alpha have all been around for years, have established customer bases, and lots of businesses have invested tons of money in running them. It's a difficult place to introduce any new products.
Intel has been stymied trying to sell ia64 into that space, and has undercut itself, by continuing to improve xeon, which performs pretty well and is comparatively inexpensive. Most segments are going to migrate to the all-american mantra of "GOOD-ENOUGH, and CHEAP!" which describes xeons/opterons perfectly. The market segments that won't migrate in that direction are willing to pay the big bucks for stability, and reliability. They are very slow movers. Intel might sell some itaniums to these customers, but they'd better be willing to wait a long time.
I think a lot of people judge itanium by the yardstick of Xeon, and maybe should not. If itanium ends up simply as a replacement for pa-risc, alpha, and MIPS in the SGI and HP portfolios, that may be a success by some measures.
You will find that most high-end Xeon systems are also NUMA systems. IBM, Unisys, HP all construct their really big xeon boxes as NUMA-clusters of 4-processor SMPs. They create a distributed memory machine at the chip-set level. This is actually what the opteron does, except that the chip-set (well, the memory controller part of it) is built into the processor.
I think the above poster had the correct idea about NUMA, but worded it in a misleading way. A NUMA design (either of opterons, or of Xeon-quads) will have to do some memory access through the memory controllers on other nodes. This increases the latency of memory access, and can clog up the inter-processor links if lots of memory loads/stores go to remote memory. Thus NUMA-aware operating systems and system libraries are necessary to maximize the amount of memory access that is local, and minimize the usage of the inter-processor links.
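A toy model makes the point. The Python sketch below weights a local and a remote access latency by the fraction of accesses that stay on the local node. The 100 ns / 200 ns figures are placeholders I made up for illustration, not measurements of any opteron or xeon box.

```python
def _check_fraction(local_fraction):
    """Reject fractions outside the 0..1 range."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")


def effective_latency_ns(local_fraction, local_ns, remote_ns):
    """Average memory latency when a given fraction of accesses is node-local."""
    _check_fraction(local_fraction)
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def remote_traffic_gb(total_gb, local_fraction):
    """Amount of data that has to cross the inter-processor links."""
    _check_fraction(local_fraction)
    return total_gb * (1.0 - local_fraction)


if __name__ == "__main__":
    LOCAL_NS, REMOTE_NS = 100.0, 200.0  # hypothetical latencies
    for frac in (0.25, 0.5, 0.9):
        print("local fraction", frac, "->",
              round(effective_latency_ns(frac, LOCAL_NS, REMOTE_NS), 1), "ns average,",
              round(remote_traffic_gb(8.0, frac), 2), "GB over the links per 8 GB touched")
```

A NUMA-aware OS is basically trying to push that local fraction as close to 1 as it can.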
While the opteron design is elegant, and fast, it is not the only smart way to do things. It offers great aggregate memory bandwidth, but can slow things down in the worst case. Most large NUMA systems are created by linking 4-way SMP nodes. (Examples: Sunfire, HP alphaservers, Cray X1, NEC SX-6, Unisys 7000, IBM xseries 4xx xeon, IBM xseries 4xx itanium,...) Apart from opteron systems, the only systems I can think of that do NUMA per processor are the cray T3E, SGI origin, and intel paragon, all of which are massively parallel supercomputers.
It is safe to say, however, that a shared bus system does not scale well beyond a few processors. This is best demonstrated by the 36 processor SGI-challengeXL, which was significantly bottle-necked at the memory bus.
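The arithmetic behind that is simple enough to sketch in Python. On a shared bus every processor divides one fixed bandwidth. In a NUMA design each node brings its own memory path, so per-CPU bandwidth stays flat as nodes are added (as long as accesses stay local). The bandwidth numbers below are hypothetical placeholders, not the specs of the Challenge or any other machine.

```python
def shared_bus_per_cpu(bus_gb_s, n_cpus):
    """Bandwidth each CPU gets when all CPUs contend for one memory bus."""
    return bus_gb_s / n_cpus


def numa_per_cpu(node_gb_s, cpus_per_node):
    """Local bandwidth per CPU in a NUMA node; independent of the node count."""
    return node_gb_s / cpus_per_node


def numa_aggregate(node_gb_s, n_nodes):
    """Total local memory bandwidth across all NUMA nodes."""
    return node_gb_s * n_nodes


if __name__ == "__main__":
    BUS_GB_S = 1.0   # hypothetical shared-bus bandwidth
    NODE_GB_S = 1.0  # hypothetical per-node memory bandwidth
    for cpus in (4, 12, 36):
        nodes = cpus // 4  # assume 4-way SMP nodes
        print(cpus, "CPUs: shared bus", round(shared_bus_per_cpu(BUS_GB_S, cpus), 3),
              "GB/s each; NUMA", numa_per_cpu(NODE_GB_S, 4), "GB/s each,",
              numa_aggregate(NODE_GB_S, nodes), "GB/s aggregate")
```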
food for thought.
Stamps are Cheap. This was true in the paper / postal service days of job hunting too. It's part of the cost of doing business.
These recruiting sites arose at the end of the .com boom, when the supply and demand suddenly reversed. Thus, the number of applicants per posting has jumped hugely. If you're looking for a specific skill set you have to cast a wide net and then be selective. If you're more concerned about a good employee, but have broader experience requirements, get a referral. Existing employees are still an employer's best filter - nobody wants to work with a jerk.
There are no commercial interconnects that can out-pace even vaguely modern CPUs. In clusters the interconnect is always slower than the CPUs/RAM.
Infiniband isn't exactly like a bus to the CPUs, it's just a higher-bandwidth, lower-latency network, that serves the same purpose as ethernet. Infiniband off-loads some of the overhead from the CPUs onto a processor on the PCI card, but it is really just a network card.
The importance of the interconnect varies greatly from program to program. For some tasks ethernet is acceptable. Seti@home is a distributed computer for which 56K modems are an acceptable interconnect. For more complex problems infiniband is necessary. For the most complicated of problems, even infiniband is inadequate. For those problems one must use an MPP system like those from NEC, CRAY, SGI. Even these systems are running on something that is very similar to an interconnect, though at MUCH higher-bandwidth / lower-latency than even infiniband.
The connection machines never were real speed demons in their day. They were built for AI codes (lots of one-bit integer ALUs), but AI groups don't have any money, so they were re-tooled to do floating-point math. Even then, they really only sold with the help of DARPA subsidies.
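One rough way to think about "acceptable" is the share of each work unit's wall-clock time that gets spent on the wire. Here's a minimal Python sketch using a latency-plus-size-over-bandwidth model. All the link parameters and work-unit sizes are illustrative guesses on my part, not measured figures for modems, ethernet, or infiniband.

```python
def transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def comm_fraction(compute_s, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of a work unit's total time spent communicating."""
    comm = transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s)
    return comm / (compute_s + comm)


if __name__ == "__main__":
    # Hypothetical loosely coupled job: an hour of compute per small
    # work unit, shipped over a slow modem-class link.
    loosely = comm_fraction(compute_s=3600.0, message_bytes=350e3,
                            latency_s=0.2, bandwidth_bytes_per_s=7e3)
    # Hypothetical tightly coupled job: a millisecond of compute between
    # 1 MB exchanges over a fast, low-latency link.
    tightly = comm_fraction(compute_s=1e-3, message_bytes=1e6,
                            latency_s=1e-5, bandwidth_bytes_per_s=1e9)
    print("loosely coupled:", round(loosely, 4))
    print("tightly coupled:", round(tightly, 4))
```

With these made-up inputs the slow link costs the loosely coupled job under 2% of its time, while the fast link still eats about half of the tightly coupled one. The program decides how much interconnect you need, not the other way around.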
http://www.cray.com/company/history.html
As for the old crays, you probably don't want any of those from the 80's. Even the new cray X1 processors have a theoretical peak of 12.8 GFLOPS, a little less than twice the G5. But it's important to remember that this doesn't tell the whole story. The X1 has 34 GBps/CPU memory bandwidth and 77 GBps/CPU to cache. It also has 400GBps of interconnect bandwidth for a single cabinet. It's always been cheaper per flop to buy small computers and gang them together. It's not ALWAYS the best solution.
What the big mac performs really well on are hugely parallel computations with few dependencies between each piece of the computation (like Linpack, for example). When there are a lot of dependencies between pieces of the computation, large shared-memory machines work much more efficiently. Thus a bunch of DOD and DOE labs (plus meteorological sites and Boeing) are still interested in paying the premium for custom vector supercomputers.
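One textbook way to see why dependencies hurt is Amdahl's law, which treats the dependent work as a serial fraction that can't be spread across nodes. That's a simplification I'm swapping in for illustration (real codes lose time to communication too, not just serialization), and the node count and fractions below are hypothetical.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / p)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


def parallel_efficiency(serial_fraction, n_procs):
    """Speedup divided by processor count; 1.0 means perfect scaling."""
    return amdahl_speedup(serial_fraction, n_procs) / n_procs


if __name__ == "__main__":
    NODES = 1000  # hypothetical cluster size
    for s in (0.0, 0.001, 0.01, 0.1):
        print("serial fraction", s, "-> speedup",
              round(amdahl_speedup(s, NODES), 1), "efficiency",
              round(parallel_efficiency(s, NODES), 3))
```

Even 1% of dependent work caps the speedup below 100x no matter how many nodes you add, which is roughly the argument for paying the premium on the tightly coupled machines.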