IPv6 traffic has been doubling year over year for the past 3 years, but total traffic has also been growing quite rapidly, which makes IPv6 growth look smaller than it is.
What he's saying is that some places determine the closest server by the single registered location of the DNS server. A few days back, I was getting routed to a CDN in Europe from the USA. Changed my DNS back to my ISP, problem "fixed", now going to Chicago.
Many of these optimizations that are done "manually" can be done because I know certain things about the use case that the RDBMS does not. It can guess and use current metadata, but those guesses are not always correct.
Let's make an example. Say table A is a small table with a relation to table B, and table B is several orders of magnitude larger than A. Now say table B has a relation to table C, but table C is only a few times larger than B.
Let's assume there is also a reverse, where table D is a few times smaller than table C, and table E is several orders of magnitude smaller than table D. Say I have a join that returns a large data set and I need to filter on tables A and E. It is hard for the optimizer to know about the logical relations between these tables, so it may join A->B->C->D->E or the reverse, E->D->C->B->A. The problem that arises is that the set of ABC is relatively small, and the same goes for EDC, but the set of ABCD is HUGE and so is EDCB.
One way around this is to manually join ABC with DE or EDC with AB. It would be hard for the optimizer to figure this out. The DB probably could have had a better design, but that's what I was given.
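To make the ABC-then-DE idea concrete, here is a rough sketch using Python's built-in sqlite3. The table and column names (a, b, c, d, e, region, kind) and the tiny row counts are all made up for illustration. Materializing ABC into a temp table is just one way to pin the order; other engines have their own hints for this.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory A->B->C<-D<-E schema (all names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE a (a_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE b (b_id INTEGER PRIMARY KEY, a_id INTEGER);
        CREATE TABLE e (e_id INTEGER PRIMARY KEY, kind TEXT);
        CREATE TABLE d (d_id INTEGER PRIMARY KEY, e_id INTEGER);
        CREATE TABLE c (c_id INTEGER PRIMARY KEY, b_id INTEGER, d_id INTEGER);
        """
    )
    conn.executemany("INSERT INTO a VALUES (?, ?)",
                     [(i, "east" if i % 2 else "west") for i in range(4)])
    conn.executemany("INSERT INTO b VALUES (?, ?)", [(i, i % 4) for i in range(40)])
    conn.executemany("INSERT INTO e VALUES (?, ?)",
                     [(i, "x" if i % 2 else "y") for i in range(4)])
    conn.executemany("INSERT INTO d VALUES (?, ?)", [(i, i % 4) for i in range(40)])
    conn.executemany("INSERT INTO c VALUES (?, ?, ?)",
                     [(i, i % 40, i // 10) for i in range(400)])
    return conn


def single_query(conn):
    """One big join; the optimizer is free to pick any join order."""
    sql = """
        SELECT c.c_id
        FROM a
        JOIN b ON b.a_id = a.a_id
        JOIN c ON c.b_id = b.b_id
        JOIN d ON d.d_id = c.d_id
        JOIN e ON e.e_id = d.e_id
        WHERE a.region = 'east' AND e.kind = 'x'
        ORDER BY c.c_id
    """
    return [row[0] for row in conn.execute(sql)]


def split_query(conn):
    """Materialize the filtered ABC set first, then join it against D and E."""
    conn.execute("DROP TABLE IF EXISTS abc")
    conn.execute(
        """
        CREATE TEMP TABLE abc AS
        SELECT c.c_id AS c_id, c.d_id AS d_id
        FROM a
        JOIN b ON b.a_id = a.a_id
        JOIN c ON c.b_id = b.b_id
        WHERE a.region = 'east'
        """
    )
    sql = """
        SELECT abc.c_id
        FROM abc
        JOIN d ON d.d_id = abc.d_id
        JOIN e ON e.e_id = d.e_id
        WHERE e.kind = 'x'
        ORDER BY abc.c_id
    """
    return [row[0] for row in conn.execute(sql)]
```

Both functions return the same rows. The only difference is that the second one never gives the engine a chance to build ABCD before the filter on E has been applied.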
This is what happens when someone designs a database that is technically properly normalized, but does not play well with actual usage. In their normal use cases, they're just seeking, so it works fine, but when I have to do large joins with few seeks and mostly merge joins, then it's crazy slow. This is why designers need to understand how their stuff is actually used. Being technically correct doesn't mean it's a good design.
Actual question, how does "mandatory SSE2" play into open source where the compiler can detect SSE2 and compile for it? If you can properly detect and target SSE2 at compile time, assumptions aren't needed.
Don't complain about memory prices when your company only buys from price-gouging companies. Like I posted above, the going rate of DDR3-1600 ECC RDIMMs is about $50/4GB.
You can get DDR3-1600 16GB Crucial for $200 brand new with a lifetime warranty for parts and labor. That's $50/4GB. If you're paying $120/4GB, it's because you don't know how to shop or you'll void the warranty on your overpriced POS server.
Yes and no. The larger your cache, the higher its latency. Can't get around this. L1 caches tend to be small to keep the execution units fed with typically 1 or 2 cycle latencies. L2 caches tend to be about 16x larger, but have about 10x the latency.
L2 cache may have high latency, but it still has decent bandwidth. To help hide the latency, modern CPUs have automatic pre-fetching and also async low-priority pre-fetching instructions that allow the programmer to tell the CPU to attempt to load data from memory into L1 prior to needing it, and only if the CPU finds an open slot for memory access.
After a certain size, "normal" cache is slower than main memory. That's why we're starting to see integrated eDRAM, which is mostly just system memory built into the CPU or package. The other issue you need to be careful about is that each layer of cache adds a fixed latency that accumulates on a miss.
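The trade-off can be put into numbers with the usual average-access-time arithmetic. This is only a sketch: the latencies and miss rates in the usage note are invented, not measurements of any real CPU, and the function names are mine.

```python
def average_access_cycles(cache_levels, memory_latency):
    """Expected cycles per access for a cache hierarchy.

    cache_levels: list of (latency_cycles, miss_rate) tuples, L1 first.
    memory_latency: cycles to reach main memory after the last level misses.
    """
    expected = float(memory_latency)
    # Work from memory back toward L1: every access pays the level's own
    # latency, and a miss additionally pays the expected cost of what is below.
    for latency, miss_rate in reversed(cache_levels):
        expected = latency + miss_rate * expected
    return expected


def full_miss_cycles(cache_levels, memory_latency):
    """Cycles for an access that misses every level: the fixed costs stack up."""
    return sum(latency for latency, _ in cache_levels) + memory_latency
```

With hypothetical levels `[(2, 0.1), (20, 0.5)]` and 200 cycles to memory, the average works out to 14 cycles and a full miss to 222. Squeezing in a third level at `(40, 0.5)` drops the average to 11 but pushes the full miss to 262, which is the cumulative latency mentioned above.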
The easiest way to hide high latency is to have lots of concurrent work going on. Hyper-threading banks a lot on this. When one virtual core is stalling on memory access, the other core can step in and make use of any free execution units on a per-cycle basis. Because there are lots of units, there is usually an idle one somewhere that can be used. Ironically, having two virtual cores sharing the same resources means resources are split, primarily the L1 cache. While hyper-threading helps hide latency caused by memory access, it also increases the chance of an address getting evicted from L1.
To help with this, Intel increased the size of their L1 cache on the more recent CPUs, but this also increased the latency from 1 cycle to 2 cycles. To help compensate, they increased the bandwidth of the L1 and allowed larger loads. Twice the bandwidth, twice the size, but twice the latency. Single-threaded code takes a minor hit, but concurrent work stands to gain a decent amount.
Increasing cache sizes is not as simple as it seems.
Until the cache is so big that everything fits in it, you always win if you can double what you can cram into it.
Which is all nice and good except this implies your data structure was mostly pointers to begin with, so if you want to increase cache efficiency forget about pointer size and redesign them for better locality.
I suspect this is the real reason why this ABI has not caught wind: anyone who cares has already taken steps that render it pointless.
Exactly. Their target audience learned to use a single array of structs with a single pointer instead of allocating thousands of individual objects and tracking their individual pointers. Then they can just use whichever power-of-two offset size they want. 16bit offsets if they want, which is even smaller than 32bit pointers. A single large array should have better page hits than a bunch of objects. The allocator can easily see a single large allocation and use a large 2MB page instead of 4KB.
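Here is a minimal sketch of the offsets-instead-of-pointers layout, using Python's `array` module so the index width is explicit. The `NodePool` class and its method names are hypothetical. Python won't show the cache benefit itself; the same layout in C or similar is where it pays off.

```python
from array import array
import struct

NIL = 0xFFFF  # sentinel index meaning "no next node"


class NodePool:
    """Linked-list nodes kept in two flat arrays and linked by 16-bit indices."""

    def __init__(self):
        self.values = array("i")  # payload of each node
        self.next = array("H")    # unsigned 16-bit index of the next node

    def push_front(self, head, value):
        """Add a node in front of `head` and return the new head index."""
        index = len(self.values)
        if index >= NIL:
            raise OverflowError("pool is full for 16-bit indices")
        self.values.append(value)
        self.next.append(head)
        return index

    def walk(self, head):
        """Yield payloads by following indices instead of pointers."""
        while head != NIL:
            yield self.values[head]
            head = self.next[head]

    def link_bytes_vs_pointer_bytes(self):
        """Bytes per link in this pool versus bytes per native pointer."""
        return self.next.itemsize, struct.calcsize("P")
```

Start with `head = NIL`, push 1, 2 and 3, and `list(pool.walk(head))` gives `[3, 2, 1]`. Every link costs `pool.next.itemsize` bytes rather than a full pointer, and all the nodes sit in one contiguous allocation.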
You need to run in 64bit mode if you want to take advantage of many of the instructions that reduce cache evictions and increase IPC. If you want this benefit while keeping your pointer size to a minimum, then you need x32 mode, aka 64bit mode with truncated pointers. You can probably gain 10%-15% performance over true 32bit mode with few changes. A lot of that is hidden when using 64bit pointers because of the reduced data density for some workloads.
x32 mode is great for anything that can take advantage of the new 64bit-specific instructions, but does not need 64bit addressing. 32bit mode has a lot of weird backwards compatibility issues, so to keep things simple, they reserved some features for 64bit mode only, where they deprecated some of the most annoying aspects of 32bit.
When your queries start getting into 10-table joins, the join optimizer starts making educated guesses because of the number of possible join arrangements. The metadata used is based on samples of the current data. To avoid having to keep this metadata perfectly up to date, which would be very expensive and slow, the RDBMS only samples a subset.
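For a feel of why the optimizer has to guess, here is the standard counting argument as a small sketch. It counts every ordering, including ones a real optimizer would prune early (cross products, for instance), so treat the numbers as an upper bound on the search space rather than what any particular engine actually examines.

```python
from math import factorial


def left_deep_orders(n):
    """Number of left-deep join orders for n tables: every permutation."""
    return factorial(n)


def bushy_orders(n):
    """Number of bushy join trees for n tables.

    n! leaf orderings times Catalan(n - 1) tree shapes,
    which simplifies to (2n - 2)! / (n - 1)!.
    """
    return factorial(2 * n - 2) // factorial(n - 1)
```

For 10 tables that is 3,628,800 left-deep orders and 17,643,225,600 bushy trees, which is why costing each one exactly is off the table.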
While this works most of the time, there are some cases where it doesn't. I've had quite a few times where I had to force join orders and/or join types to get the query to perform well. I'm talking about differences of 1-4 orders of magnitude in performance, many times. Since I have control over the DB, I know how the data relates and can force the DB into certain join orders.
Sometimes, breaking the query up and loading the output into temp tables can speed things up. I do not recommend this as "normal", but some cases warrant it.
For databases that fit in memory GPU makes a lot of sense.
A bit more selective than that. For datasets that fit in memory, where memory access patterns are sequential, and the queries have almost no branching. GPUs are very picky.
Rule of thumb, if your dataset can fit in memory, it probably won't benefit from GPUs. Talking about 10TB+ datasets and few long running Data Warehouse style queries, not small OLTP style queries. GPUs take a crap if you have any branching, so all queries used must not have any conditions that can cause different rows to take different branches to be useful, so very basic WHERE statements.
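As a rough illustration of what "no branching" means for a basic WHERE, here is the same filter written with a branch and in predicated form. Plain Python on a CPU is obviously not a GPU. This only shows the shape of the computation, where every row does identical work and the condition becomes a 0/1 multiplier. The function names are made up.

```python
def filtered_sum_branchy(rows, threshold):
    """SUM(value) WHERE value > threshold, with a per-row branch."""
    total = 0
    for value in rows:
        if value > threshold:
            total += value
    return total


def filtered_sum_predicated(rows, threshold):
    """Same result, but every row runs the same instructions."""
    total = 0
    for value in rows:
        keep = int(value > threshold)  # 0 or 1, no divergent code path
        total += value * keep
    return total
```

Both return the same sum. Once the condition needs different work per row (CASE logic, per-row lookups), there is no such simple rewrite.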
Yay, HTML5, with a DRM plugin so you can use a DRM blob that can make system calls. At least Netflix will work. chroot that browser though. That's assuming someone makes a Netflix blob to use with HTML5. Many content providers won't allow their content to be used on any platform that doesn't support a "secure" path for the DRM. This may require a custom kernel blob module.
Yes, running Linux with a custom DRM binary. You can get the GPL'd Linux code from the Blu-ray maker, but you won't get the part that's required to decode the streams.
100 engineers cranking away 40 hours per week of code for the past 10 years, all targeting MS-SQL, ASP.Net, Windows, and several other MS-specific services. Yes, that's easy to just switch over. /sarc
Strangely enough, as we move more into RESTful services, the easier it will get to whole-sale replace MS in modules, but this is a more recent change. Maybe we can switch to a Mono+nginx setup.
... because they can upgrade for free, which is not the case with windows...
That is not the case for anything. NOTHING in this world is free. Everything costs at least time, and time is money. In the case of servers, Linux is mostly free because of automation, but when it comes to end users, any changes at all require training.
Yeah, Microsoft should have to support XP for all of eternity. Pay $100 once, get 10,000,000,000 years of support. Get $100 ever!
From Microsoft's position, supporting XP is not only costing money in the form of programmer time, but it is taking away programmer time from new projects. And once you get to the size of Microsoft, you can't just hire more programmers, because you get negative scaling from management overhead.
I have no *nix machines at home or work, but even I know Linux and FreeBSD run on nearly every server. You almost have to try to build one that doesn't work. Nearly every HBA and server grade NIC has *nix support.
Audi uses a combination of electronically lockable Torsen differentials and the brakes.
It's entirely a data locality issue and L1 caches aren't getting much larger.
Sorry, DDR3-1600 16GB ECC Registered
Performance and scaling should have been addressed in the design phase
8bit IO is just the width, kind of like how modern Intel CPUs use a 4bit interconnect.
If you have a sweet server
what it's like to do back breaking manual labor in the freezing cold for 12 hours a day
Sounds like something Google should automate. We need to completely replace physical labor with robots, no one should have to do that stuff.
Dark matter creates issues by messing with mass distribution.