This is marketing. It is marketers capitalizing on the Streisand effect. Why else tell everyone you contained the situation?
There are at least 100,000 copies of the movie out there. This must be the Three Mile Island version of "containment". The release version of the film was contained, but everyone who wants to see the pre-release version will have it by Easter.
I think we should bring back segmented addressing while we are at it too. I was so much more productive when I had to use DS, SS, GS, FS, and CS. I miss them so. All those segments made the computer run so fast, too!
Is there any other part of a 20-year-old computer that you could still use for day-to-day tasks?
At my university, a professor still uses WordPerfect 4.2 for DOS running on a vintage TI PC from circa 1984. It has an 8088 microprocessor, and runs MS-DOS 2.11 (I think). It is Texas Instruments' attempt to copy the original IBM PC. The computer was somewhat famous at the time for its high resolution graphics. Later, Hercules emulation and VGA took over in the PC world, but that took a long time.
The keys on the keyboard click loudly too. The professor bought a few extra PCs to work as spares. I think he also ordered some extra floppy disks, because those 5.25" disks are getting really hard to find.
The professor retired a few years back. I wouldn't want to be the person that makes him upgrade.
Ammonium nitrate was also used in the Oklahoma City bombing. The yield was in excess of what some people might have expected too. The mixing and quality of the explosive affect its effectiveness, and hence the blast radius.
I picked the Covariance as an example because it is an "embarrassingly parallel" problem.
The difference between your approach and my approach might relate to the difference between Computer Science and Computer Engineering. Try examining this problem from an engineering perspective.
For example, solving the Covariance matrix is an O(n^2 * m) problem, requiring about 12*n^2 RAM for my application, and data on disk occupies 32*n*m. Assume disk speed is 100MB/s of which only 10% can be easily utilized, RAM is 2 GB/computer, floating point speed is about 1E9/s given the constants from the O(n^2*m) calculation. This gives:
- If n=1E3, m=1E6, we need 12 MB of RAM, 32 GB of files, 0.9 hours disk time, and 0.3 hours CPU time. Disk is the bottleneck.
- If n=1E4, m=1E6, we need 1.2 GB of RAM, 320 GB of files, 9 hours disk time, and 27 hours CPU time. CPU and RAM are bottlenecks.
- If n=1E4, m=1E9, we need 1.2 GB of RAM, 320 TB of files, 9000 hours disk time, and 27000 hours CPU time. Supercomputer/cluster time.
The goal for this application was n=1E6, m=1E9, so obviously algorithmic improvements are required too.
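A quick way to redo these back-of-the-envelope numbers is sketched below. It only uses the constants stated above (12*n^2 bytes of RAM, 32*n*m bytes on disk, 10% of 100 MB/s usable disk bandwidth, 1E9 operations per second). The function name covariance_budget is made up for illustration.

```python
def covariance_budget(n, m, disk_bytes_per_s=100e6 * 0.10, ops_per_s=1e9):
    """Estimate resources for an n x n covariance over m samples.

    Returns (ram_bytes, disk_bytes, disk_hours, cpu_hours) using the
    constants quoted in the post; they are rough figures, not measurements.
    """
    ram_bytes = 12 * n ** 2                    # covariance matrix in RAM
    disk_bytes = 32 * n * m                    # input data on disk
    disk_hours = disk_bytes / disk_bytes_per_s / 3600
    cpu_hours = (n ** 2 * m) / ops_per_s / 3600
    return ram_bytes, disk_bytes, disk_hours, cpu_hours


for n, m in [(1e3, 1e6), (1e4, 1e6), (1e4, 1e9)]:
    ram, disk, disk_h, cpu_h = covariance_budget(n, m)
    print(f"n={n:.0e} m={m:.0e}: RAM {ram / 1e6:.0f} MB, "
          f"files {disk / 1e9:.0f} GB, disk {disk_h:.1f} h, CPU {cpu_h:.1f} h")
```

Plugging in the target of n=1E6, m=1E9 shows why a cluster alone is not enough: the matrix needs 12 TB of RAM before any data is read.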
The problem with your approach is that it obfuscates what the real issues are. For really hard problems, you need to start at the bare metal and work up. Otherwise, you can code your multi-threaded application, watch the user load a real data set, and have the program bomb on out-of-memory before you even encounter the disk space and CPU speed issues.
In practice, a balance between top-down (high-level abstraction down to machine code) and bottom-up (machine code to high level) is needed. I just don't see anything even remotely on the horizon that lets a programmer do this. The newer programming languages are making performance problems very non-obvious. In one scenario above, with 2 GB of RAM/computer, we were both CPU and RAM limited, so adding more cores and more threads wouldn't help. Some programmers are shocked to discover that a single-threaded algorithm can outperform a multi-threaded algorithm by a large margin. Sometimes statically allocated variables outperform code using new and/or malloc by a large margin, because new and malloc are expensive calls and zero-initialized static memory is cheap. For any application, critical design trade-offs can become very non-obvious, very easily. Parallelization makes many of these issues much more complex. How are the languages helping us to understand these problems?
When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.
Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
- Every UNIX machine has a UPS.
- Production servers run UNIX. What's this Linux you are talking about?
- Disk space is expensive. No one will pay for unused disk space.
- RAM is expensive. As such, it can be quickly flushed to disk.
- No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
Times have changed. Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the marketplace. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.
Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers. Applications that don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.
Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.
If you try doing some of the harder data intensive problems, anything that doesn't generate machine code can really suck. But to give you an idea of some of the obstacles, take a simple algorithm:
for each FileName in BigListOfFileNames {
    OpenFile(FileName);
    for i = 1 to NumberOfElementsInFile {
        for j = i to NumberOfElementsInFile {
            ComputeCovariance(i, j);
        } // for j
    } // for i
} // for each FileName
SumAllCovarianceCalcsFromAllRuns();
Apologies for the pseudo-code, but even a simple numerical routine like this can become a very complex computational nightmare. There is no guarantee that the covariance matrix fits in RAM. For speed, you want to distribute each file across the network. It is necessary to organize, as best as possible, the RAM used in the routine such that locality is preserved to minimize page faults. Appropriate use of the MMX instructions is desirable. It is also a good idea to ensure the compiler isn't adding function calls in unexpected places, like if you have an IsFinite() function call inside the ComputeCovariance routine. Additionally, the MMX optimizations may only become obvious if the ComputeCovariance routine is pulled out of the function and placed in the main body of the loop, so it can be easily parallelized.
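For what it's worth, the shape of the pseudo-code can be sketched in plain Python as per-file partial sums that are merged at the end. This is only an illustration of the structure, not the tuned routine discussed here: the function names are made up, rows are assumed to be lists of n floats, and the sum-of-products formula used is the simple textbook one, which can lose precision on badly scaled data.

```python
def accumulate(rows, n):
    """Partial sums for one file: row count, column sums, upper-triangle products."""
    count = 0
    sums = [0.0] * n
    cross = [[0.0] * n for _ in range(n)]
    for row in rows:
        count += 1
        for i in range(n):
            xi = row[i]
            sums[i] += xi
            for j in range(i, n):          # j starts at i, as in the pseudo-code
                cross[i][j] += xi * row[j]
    return count, sums, cross


def merge(a, b):
    """Combine the partial sums from two files (or two machines)."""
    n = len(a[1])
    return (a[0] + b[0],
            [a[1][i] + b[1][i] for i in range(n)],
            [[a[2][i][j] + b[2][i][j] for j in range(n)] for i in range(n)])


def covariance(total):
    """Sample covariance (upper triangle only) from the merged sums."""
    count, sums, cross = total
    n = len(sums)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            cov[i][j] = (cross[i][j] - sums[i] * sums[j] / count) / (count - 1)
    return cov
```

Because merge() only adds numbers, each file can be processed on whichever machine holds it, which is what makes the problem easy to parallelize on paper. Every issue listed above (RAM, locality, hidden function calls) still applies to the inner loop.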
Even trivial looking statements, like OpenFile, are complex. For instance, the OpenFile routine should preferentially process files stored on the local computer first, and then process files sitting on other computers elsewhere on the network. To completely confuse the issue, the path names for the same file are different in Windows, based on whether the file is located locally or on the network. (Pathname translation isn't as much of a problem on Linux.) Finally, for speed, and when the covariance matrix is small, it is desirable to have OpenFile open the file as a memory mapped file, and then have a background process touch all the pages so they are paged into memory. The foreground process can then happily process the data in the file without waiting for page faults. On the other hand, if the covariance matrix doesn't fit in RAM, then you don't want to have OpenFile read ahead, because the hard drive will be busy doing other work instead.
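A minimal sketch of the memory-mapped variant is below. Note the swap: instead of a background process touching every page, it asks the kernel to read ahead with madvise(MADV_WILLNEED), which is a different mechanism with a similar goal. It assumes Python 3.8 or later on a platform that exposes that constant. The name open_mapped and the read_ahead flag are made up for illustration.

```python
import mmap


def open_mapped(path, read_ahead):
    """Map a non-empty file read-only; optionally hint the kernel to page it in.

    read_ahead should be False when the covariance matrix does not fit in
    RAM, so the disk is not kept busy fetching pages that will be evicted.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap keeps its own duplicate of the
        # descriptor, so the mapping stays valid after the file is closed.
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if read_ahead and hasattr(mapped, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mapped.madvise(mmap.MADV_WILLNEED)   # advisory only; the kernel may ignore it
    return mapped
```

The caller can then slice the returned object like a bytes buffer. Whether the hint actually hides the page-fault latency depends on the kernel and the storage, so it is worth measuring rather than assuming.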
Additionally, it is convenient to do some of the operations with software transactional memory (STM). That adds another layer of library complexity.
From a practical point of view, even the few lines of code above can quickly become a parallelization and optimization nightmare when applied to cluster scale computing. Incidentally, those lines of code were one of the rate limiting steps in a data processing application that I was working on, so they are worth clustering. But clustering only makes sense if the inner loop is tight. Otherwise, the hand tuned C code clobbers Java code for speed.
I don't really mean to pick on Java and .NET so hard, but it is simply amazing how slow code can get if you don't ever look at an assembly printout of what you told the computer to do. It really isn't hard to have 6 or 8 function calls per inner loop of the above code, especially if you don't very carefully keep track of what you just told the compiler to do.
I don't think you can blame the choice of Serenity on Joss Whedon fans. NASA simply had a rather lame list of names, and Serenity was the best by far. The other choices were: Earthrise (9%), Legacy (13%), and Venture (8%).
I just couldn't bring myself to vote for anything other than Serenity or Colbert. Colbert is better than Legacy, NASA's second best choice. I really don't want mankind's Legacy to be a room on a space station. Surely we can come up with something better than that...
Fans of the BBC show, "Yes, Minister", will have a different perspective. They will argue that the choice was a conjuring trick. Someone at NASA loved Serenity and rigged the contest for it to win. Although, I'm not sure any /. readers will remember the political ins and outs of Yes, Minister.
I think I agree with you, BUT... don't fall into the old trap: If ten machines can do the job in 1 month, 1 machine can do the job in 10 months. But it doesn't necessarily follow that if one machine can do the job in 10 months, 10 machines can do the job in 1 month.
Unfortunately, this effect just makes the programmer's job worse. It means that if he can only get the complexity estimate to within a factor of 100 for CPU usage, by the time Amdahl's law is done, his estimate will only be good to within a factor of 1000. To me, this screams, if you really need multi-core capability, you probably need a cluster too.
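Amdahl's law itself is a one-liner, sketched below. The 90% parallel fraction is a made-up number purely for illustration. The point is that ten machines give noticeably less than a 10:1 speedup as soon as any part of the job stays serial.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: overall speedup when only part of the job parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical job that is 90% parallelizable.
for workers in (2, 4, 10, 100):
    print(workers, "workers:", round(amdahl_speedup(0.9, workers), 2))
```

With a 10% serial part, the speedup can never reach 10:1 no matter how many machines are added.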
How likely is it that if a programmer shows a user some code, and the feedback is the code is too slow, that the user will be satisfied with a 2:1 or a 4:1 speedup?
HP's/Intel's EPIC idea (which is now Itanium) wasn't stupid, but it has a hard limitation on how far it scales (currently four instructions simultaneously).
I don't have a final solution quite yet (though I am working on it as a thought project), but the problem we need to solve is getting a new instruction set which is inherently capable of parallel operation, not on adding more cores and pushing the responsibility onto the programmers for multi-threading their programs.
The problem with very long instruction word (VLIW) architectures like EPIC and the Itanium is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for the video), and hard drive bandwidth and latency can dramatically affect system performance.
With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.
In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.
The other major downfall of the VLIW architecture is that it cannot emulate a short instruction word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm. Many applications contain them. Certain languages, like .NET and Java, use pseudo-interpreters/compilers. 80x86 emulation is a big deal, as the majority of the world's software is written for an 80x86 platform, which features a complex variable length instruction word. The long VLIW instructions are unable to decode either the short 80x86 instructions, or the Java JIT instruction set, quickly. Realistically, a VLIW instruction processor will be no quicker, on a per instruction basis, than an 80x86 processor, despite the fact the VLIW architecture is designed to execute 4 instructions simultaneously.
The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.
I think this problem will take longer than a year or two to solve. Modern computers are really fast. They solve simple problems, almost instantly. A side-effect of this, is that if you underestimate the computational power required for the problem at hand, then you are likely to be off by large amounts.
If you implemented an order n-squared algorithm, O(n^2), on a 6502 (Apple II), and n was larger than a few hundred, you were dead. Many programmers wouldn't even try implementing hard algorithms on the early Apple II computers. On the other hand, a modern processor might tolerate O(n^2) algorithms with n larger than 1000. Programmers can try solving much harder problems. However, the programmer's ability to estimate and deal with computational complexity has not changed since the early days of computers. Programmers use generalities. They use ranges: like n will be between 5 and 100, or n will be between 1000 and 100,000. With modern problems, n=1000 might mean the problem can be solved on a netbook, and n=100,000 might require a small multi-core cluster.
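To put a number on that range, here is a throwaway sketch of how the work grows between the two ends of "n will be between 1000 and 100,000" for an O(n^2) algorithm. The helper name relative_work is made up.

```python
def relative_work(n, n_ref, exponent=2):
    """How much more work an O(n^exponent) algorithm does at n than at n_ref."""
    return (n / n_ref) ** exponent


# A 100x larger n costs 10,000x the work for an O(n^2) algorithm.
print(relative_work(100_000, 1_000))
```

A job that takes one second at the low end of the range takes roughly three hours at the high end, all on the same code.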
There aren't many programming platforms out there that scale smoothly between applications deployed on a desktop, to applications deployed on a multi-core desktop, and then to clusters of multi-core desktops. Perhaps most worrying is that the new programming languages that are coming out are not particularly useful for intense data analysis. The big examples of this for me are: .NET and functional languages. .NET was deployed at about the same time multi-core chips showed up, and has minimal support for them. Functional languages may eventually be the solution, but for any numerically intensive application, tight loops of C code are much faster.
The other issue with multi-core chips, is that as a programmer, I have two solutions to making my code go faster:
1. Get out the assembly printouts and the profiler, and figure out why the processor is running slow. Doing this helps every user of the application, and works well with almost any of the serious compiled languages (C, C++). Sometimes, I can get a 10:1 speed improvement.(*) It doesn't work so well with Java, .NET, or many functional languages, because they use run-time compilers/interpreters and don't generate assembly code.
2. I recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speed up by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speed up, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.
The problem with both of the above approaches is that from a tools perspective, they are the worst choice for multi-core optimizations. Approach 1 will force me into using C and C++, which don't even handle threads really well. In particular, C and C++ lack an easy implementation of Software Transactional Memory, NUMA, and clusters. This means that approach 2 may require a complete software redesign, and possibly either a language change or a major change in the compilation environment. Either way, my days of fun loving Java and .NET code are coming to a sudden end.
I just don't think there is any easy way around it. The tools aren't yet available for easy implementation of fast code that scales between the single-core assumption and the multi-core assumption in a smooth manner.
Note: * - By default, many programmers don't take advantage of many features that may increase the speed of an algorithm. Special-purpose instruction set extensions, like MMX, can dramatically speed up certain loops. Sometimes loops contain a great deal of code that can be eliminated. Maybe a function call is present in a tight loop. Anti-virus software can dramatically affect system speed. Many little things can sometimes make big differences.
Make sure your keyboard driver is set correctly. My multi-lingual keyboard drivers take a little while to load. Consequently, I can enter a shell, type cd /etc/X11, and get something completely different. The problem is fairly easy to spot, as the slash characters come out as French A symbols.
Easter is less than one month away! There will be thousands of fake bunnies! with millions of fake eggs! Who will protect the citizens from the fake eggs?
How will we find the killer rabbit? What will the police do?
PS2 Those computer users saying an fsync will kill performance need to get cluebat applied to them by the nearest programmer. 1st. There will be no fsyncs of config files at startup once the KDE startup is fixed.
KDE isn't fixed right now. Additionally, KDE is not the only application that generates lots of write activity. I work with real-time systems, and write performance on data collection systems is important.
2nd. fsyncs on modern filesystems are pretty fast, ext3 is the rare exception to that norm; this will be non-noticable when you apply a settings change.
I did some benchmarks on the ext3 file system, the ext4 system without the patch, and the ext4 system with the patch. Code that followed the open(), write(), close() sequence was 76% faster than the code with fsync(). Code that followed the open(), write(), close(), rename() sequence was 28% faster than code that followed the open(), write(), fsync(), close(), rename() sequence. Additionally, the benchmarks were not significantly affected by which file system was used (ext3, ext4, or ext4 patched.) You can look up the spreadsheet and the discussion at the launchpad discussion.
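For anyone who wants to repeat this kind of comparison, a rough sketch of the two sequences is below. It is not the benchmark from the launchpad discussion, just the same open/write/close pattern with and without fsync(). The function names, file names, and sizes are all made up, and the result will vary a lot with the file system and hardware.

```python
import os
import time


def save(path, data, use_fsync):
    """open(), write(), optional fsync(), close()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)        # small payloads only; no partial-write loop
        if use_fsync:
            os.fsync(fd)          # block until the OS reports the data flushed
    finally:
        os.close(fd)


def time_saves(directory, count, data, use_fsync):
    """Seconds taken to save `count` small files into `directory`."""
    start = time.perf_counter()
    for i in range(count):
        save(os.path.join(directory, f"file{i}.dat"), data, use_fsync)
    return time.perf_counter() - start
```

Running time_saves() twice on the same scratch directory, once with use_fsync=False and once with True, gives the two numbers to compare.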
3rd. These types of programming errors are not the norm; I've graded first and second year computer science classes and each of the three major mistakes made would have lost you 20-30% of your score for the assignment.
Major Linux file backup utilities, like tar, gzip, and rsync, don't use fsync as part of normal operations. The only application of the three that uses fsync, tar, only uses it when verifying data is physically written to disk. In that situation, it writes the data, calls fsync, calls ioctl(FDFLUSH), and then reads the data back. Strictly speaking, that is the only way to make sure the file is written to disk, and is readable.
Finally, as Theodore Ts'o has pointed out, if you really want to make sure the file is saved to disk, you also have to fsync() the directory too. I have never seen anyone do that, as part of a normal file save. Most C programming textbooks simply have fopen, fwrite, fclose as the recommended way to save files. Calling fsync this often is unusual for most C programmers.
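For reference, the "fsync the file and the directory too" sequence looks roughly like the sketch below. It assumes a POSIX system where a directory can be opened read-only and passed to fsync(), as on Linux. The ".tmp" suffix and the name durable_save are made up for illustration.

```python
import os


def durable_save(path, data):
    """Write to a temp file, fsync it, rename over the target, fsync the directory."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"              # hypothetical temp-file naming
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # flush the file contents
    finally:
        os.close(fd)
    os.replace(tmp_path, path)            # atomic rename on POSIX
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                  # flush the directory entry itself
    finally:
        os.close(dir_fd)
```

That is two fsync() calls and a rename for every file saved, compared with the three-call fopen, fwrite, fclose sequence in the textbooks.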
I would hate to be in your programming class. You're enforcing programming standards that aren't followed by key Linux utilities, aren't in most textbooks, and aren't portable to non-Linux file systems.
If you require your students to fsync() the file and the directory, as part of a normal assignment, you are requiring them to do things that aren't done by any Linux utility out there. Further, if you are that paranoid, you better follow the example from the tar utility, and after the fsync completes, read all the data back to verify it was successfully written.
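The write, fsync, read-back idea can be sketched as below. It is not tar's code, and it skips the ioctl(FDFLUSH) step, so on most systems the read-back will be served from the page cache rather than from the physical media. The name save_and_verify is made up.

```python
import os


def save_and_verify(path, data):
    """Write data, fsync it, then read the file back and compare."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer to the OS
        os.fsync(f.fileno())      # ask the OS to push it to the device
    with open(path, "rb") as f:
        return f.read() == data   # likely answered from cache, not the disk
```

Even this much doubles the I/O for every save, which is part of why nobody does it outside of special cases like archive verification.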
If this is buggy code, then this must affect about every 'C' program ever written.
If this is about cases where fclose() does not get called because of a crash, then it is definitely an application bug.
You are correct on the first statement, and wrong on the second. This bug affects almost every 'C' program ever written. Essentially, POSIX allows for a successful fclose, even if the file has not been written to disk. This permits a file system to implement a write-back cache.
Many UNIX and Linux file systems will completely screw up if the system suddenly crashes before the data has been successfully written to disk. The complaint is that the Ext4 system had a bug that did this in a very egregious way, and this bug would likely cause serious data loss on any system that is not using a UPS. Ext3 was usually mounted with the "data=ordered" option. For most realistic scenarios, Ext3 will give data loss failures that a normal person would expect. Specifically, with Ext3, you might lose your most recent files. With Ext4, the complaint was that you can lose fairly old files. Some of the old UNIX file systems would become unreadable if the system crashed suddenly. The problems with Ext4 are a matter of scale.
To make matters worse, the fsync() remarks are incendiary. They would force modifications to almost every program on Linux. Your fopen, fwrite, fclose example is in almost every C programming textbook in existence. The fact they were made by an individual working on the Ext4 system didn't help things either. Saying "All applications should fsync() the file and the containing directory if they don't want data loss!", when your file system has a data loss bug, creates a sudden and severe reaction...
Calling fsync() excessively completely trashes system performance and usability. Essentially, operating systems have write back caches to speed code execution. fsync() disables the write back cache by writing data out immediately, and making your program wait while the flush happens. Modern computers can do activities that involve rapidly touching hundreds of files per second. Forcing each write to use an fsync() slows things down dramatically, and makes for a poor user experience.
To make matters worse, from a technical point of view, it is necessary for strict POSIX compliance to fsync() the file and then fsync() the containing directory. I have never seen a piece of normal application code that calls fsync() on the containing directory. Even common Linux utilities like rsync and gzip don't use fsync anymore. tar uses fsync in one special case: for file verification before calling ioctl(FDFLUSH). The documentation on tar is instructive:
/* Verifying an archive is meant to check if the physical media got it
correctly, so try to defeat clever in-memory buffering pertaining to
this particular media. On Linux, for example, the floppy drive would
not even be accessed for the whole verification.
The code was using fsync only when the ioctl is unavailable, but
Marty Leisner says that the ioctl does not work when not preceded by
fsync. So, until we know better, or maybe to please Marty, let's do it
the unbelievable way:-). */
In general, application writers are interested in making sure the file is readable. Unless you are really determined, and willing to go through the file verification like in the tar command, fsync() does little to guarantee a file will be readable at a later date. Under modern file systems, there are so many reasons why a file may become unreadable, and so few of them are fixed with fsync(), that one has to ask: Why bother with fsync()?
In fact, there are so few good reasons to use fsync(), that many applications have completely given up on fsync(). On Apple Macs running OS X, fsync() does not force the drive to flush its write cache; that takes a separate fcntl(F_FULLFSYNC) call. If you run NFS, fsync() will probably flush your data to the network, but not to the hard disk. If you are running a PC with a modern hard drive, the hard drive probably has a write back cache. As such, fsync() doesn't guarantee your data is physically on the disk. fsync() is disabled in laptop mode.
For most applications, using fsync() will only slow down your C code. It is useful for certain applications, like databases. Many other programming languages have no equivalent to fsync(). For most programs, fsync() is an infrequently used call, and is primarily used in special purpose libraries like databases.
I was recently discussing the music copyright situation with my music teacher. We were looking at the copyright notice on the music book. Essentially, they were only selling a copy of the sheet music. They weren't actually selling the rights to play the music. Effectively, they were almost creating a misrepresentation case, in that why would you sell a music book to students, without giving them permission to play the music?
You might want to check the copyright notices. You might find that not only are you not supposed to copy the music, you aren't supposed to play it either.
People really underestimate how quickly those embedded applications can add up. Linux in your smartphone, Linux in your car navigation system, Linux in your wireless router, Linux in your PlayStation, before long it is "Linux everywhere."
Yes, some of these applications might not be as visible as Windows on the desktop, but they do add up. Sooner or later, someone is going to come out with a statistic like Linux outsells Windows 2:1 !!! Everyone will be wondering how that is. It will be all the embedded applications.
I bet GCC has more running applications in the world than any other compiler. It is used as a back-end compiler for a great many embedded systems.
It all depends on the target country. Afghanistan and Iraq have constant Predator overflights. I expect the blimp will offer stationary surveillance over relatively unarmed or poorly armed countries. It might also be used for UN crisis zones, like Sudan and Somalia, or where the local government has largely broken down.
Alternatively, the blimp could be used to patrol U.S. air space. There is always the coast guard, border patrol, war on drugs, war on terrorism, war on crime, and even coastal rescue. A stationary surveillance platform might be really useful for those applications.
The main target of this platform might be here at home in America.
Novell should have already known the "don't be a Microsoft Partner" lesson. Novell owned WordPerfect when Windows 95 came out. Microsoft gave so much incorrect documentation to the WordPerfect developers, that the lawsuit was still going on 13 years later. In fact, the lawsuit was ongoing when Novell signed the Linux deal with Microsoft.
Having file system metadata not match file system data is a pretty big bug. Ext3 defaulted to having everything mounted such that the writes to the disk were "ordered", i.e. data=ordered. Ext4 does not force "ordered".
Userland can solve this problem by calling fsync() all over the place, like before every close. However, that completely defeats the purpose of having a buffered write-back file system. If the new rule is to change every userland program to force all data to be flushed to disk after every close, then we might as well mount the filesystem with "-o sync", and flush our performance down the toilet. (Pun intended.)
The problem here is no call exists to force writes to disk to be "ordered". fsync() is not a substitute for ordered writes to disk. There are just too many ways an application can get into trouble if writes to disk aren't ordered. Having situations where neither the backup file nor the new file is valid is just the beginning of the problems.
I write data acquisition applications that write lots of data in many files to disk. I don't care if my newest file is blank. This "bug" could mean that I have a pseudo-random number of blank files, and they might not even be ordered. My only solution, fsync(), will tank the application's performance by causing huge amounts of disk activity. fsync() is not a substitute for ordered writes.
The legal doctrine in common law countries is Force Majeure. If something sufficiently big happens, all bets are off.
The other business doctrine is that a big company shall not bankrupt the organizations selling their products:
No sales companies = No salesmen = No sales.
I think the complaint of the MCPs is that Microsoft is demanding payment for product the customer isn't paying for. Specifically, my impression is that Microsoft wants to be paid for the full 3 year contract (over 3 years), even though the customer that purchased the software went bankrupt after the first year. It's a good deal from Microsoft's point of view...
Business People tend to remember the company that pushed them into bankruptcy. They don't forgive and forget easily.
I can't see everyone "just switching" to Linux, but this could create much motivation to try. Survival in business is a strategic imperative. If someone threatens that survival, then business people tend to connect the dots, and adapt accordingly.
So true.
I think we should bring back segmented addressing while we are at it too. I was so much more productive when I had to use DS, SS, GS, FS, and CS. I miss them so. All those segments made the computer run so fast, too!
At my university, a professor still uses WordPerfect 4.2 for DOS running on a vintage TI PC from Circa 1984. It has an 8088 microprocessor, and runs MS-DOS 2.11 (I think). It is Texas Instrument's attempt copy of the original IBM PC. The computer was somewhat famous at the time for its high resolution graphics. Later, Hercules emulation and VGA took over in the PC world, but that took a long time.
The keys on the keyboard click loudly too. The professor bought a few extra PCs to work as spares. I think he also ordered some extra floppy disks, because those 5.25" disks are getting really hard to find.
The professor retired a few years back. I wouldn't want to be the person that makes him upgrade.
Ammonium Nitrate was also used in the Oklahoma city bombing. The yield was in excess of what some people might have expected too. The mixing and quality of the explosive varies its effectiveness, and hence blast radius.
I picked the Covariance as an example because it is an "embarrassingly parallel" problem.
The difference between your approach and my approach might relate to the difference between Computer Science and Computer Engineering. Try examining this problem from an engineering perspective.
For example, computing the covariance matrix is an O(n^2 * m) problem, requiring about 12*n^2 bytes of RAM for my application, and the data on disk occupies 32*n*m bytes. Assume disk speed is 100 MB/s, of which only 10% can be easily utilized, RAM is 2 GB/computer, and floating point speed is about 1E9/s given the constants from the O(n^2*m) calculation. This gives:
- If n=1E3, m=1E6, we need 12 MB of RAM, 32 GB of files, 0.9 hours disk time, and 0.3 hours CPU time. Disk is the bottleneck.
- If n=1E4, m=1E6, we need 1.2 GB of RAM, 320 GB of files, 9 hours disk time, and 27 hours CPU time. CPU and RAM are bottlenecks.
- If n=1E4, m=1E9, we need 1.2 GB of RAM, 320 TB of files, 9000 hours disk time, and 27000 hours CPU time. Supercomputer/cluster time.
The goal for this application was n=1E6, m=1E9, so obviously algorithmic improvements are required too.
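The back-of-envelope numbers above can be reproduced with a few lines of C. This is only a sketch: the struct and function names are made up for illustration, and the constants (12*n^2 bytes of RAM, 32*n*m bytes on disk, 10% of a 100 MB/s disk, 1E9 inner-loop steps per second) are the assumptions already stated above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper type: resource estimate for n variables, m samples. */
typedef struct {
    double ram_bytes;
    double file_bytes;
    double disk_hours;
    double cpu_hours;
} CovEstimate;

CovEstimate estimate_covariance_cost(double n, double m)
{
    assert(n > 0.0 && m > 0.0);

    const double usable_disk_rate = 100e6 * 0.10; /* bytes/s: 10% of 100 MB/s */
    const double cpu_rate = 1e9;                  /* inner-loop steps per second */

    CovEstimate e;
    e.ram_bytes  = 12.0 * n * n;
    e.file_bytes = 32.0 * n * m;
    e.disk_hours = e.file_bytes / usable_disk_rate / 3600.0;
    e.cpu_hours  = n * n * m / cpu_rate / 3600.0;
    return e;
}

/* Print one line of the table for a given problem size. */
void print_estimate(double n, double m)
{
    CovEstimate e = estimate_covariance_cost(n, m);
    printf("n=%g m=%g: RAM %.3g B, files %.3g B, disk %.3g h, CPU %.3g h\n",
           n, m, e.ram_bytes, e.file_bytes, e.disk_hours, e.cpu_hours);
}
```

Under the same assumptions, the RAM term alone for n=1E6 comes to 12*1E12 bytes, about 12 TB, which is why that target cannot be met by adding hardware alone.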
The problem with your approach is that it obfuscates what the real issues are. For really hard problems, you need to start at the bare metal and work up. Otherwise, you can code your multi-threaded application, watch the user load a real data set, and have the program bomb on out-of-memory before you even encounter the disk space and CPU speed issues.
In practice, a balance between the top-down (high-level abstraction down to machine code) and bottom-up (machine code to high level) approaches is needed. I just don't see anything even remotely on the horizon that lets a programmer do this. The newer programming languages are making performance problems very non-obvious. In one scenario above, with 2 GB of RAM/computer, we were both CPU and RAM limited, so adding more cores and more threads wouldn't help. Some programmers are shocked to discover that a single-threaded algorithm can outperform a multi-threaded algorithm by a large margin. Sometimes statically allocated variables outperform code using new and/or malloc by a large margin, because new and malloc are expensive calls and zero-initialized static memory is cheap. For any application, critical design trade-offs can become very non-obvious, very easily. Parallelization makes many of these issues much more complex. How are the languages helping us to understand these problems?
When you have less than 64K of RAM, and a processor that barely has a modern memory management unit, then some of these "extras" like Copy-On-Write appear as advanced features. Additionally, when your computer costs $500,000, you tend not to scrimp on stuff like a UPS.
Economics have changed much since the early days of UNIX. Many of the file system design principles still remain the same. Assumptions need to change with the times. Reasonable historical assumptions were:
- Every UNIX machine has a UPS.
- Production servers run UNIX. What's this Linux you are talking about?
- Disk space is expensive. No one will pay for unused disk space.
- RAM is expensive. As such, it can be quickly flushed to disk.
- No one has enough disk space, RAM, or disk bandwidth to experience a random fault rate of 1 part in 1 quadrillion (1E-15).
Times have changed. Linux is used on heavy servers now. UNIX (with deference to AIX and Solaris) is almost gone from the marketplace. RAM and disk space are cheap, so cheap that random data errors can be a big issue. A UPS can cost more than a hard drive, and sometimes more than the computer it is attached to. Disk capacities are huge.
Unfortunately, the file system designers haven't kept pace. The Ext4 bug was detected, reproduced, and ultimately solved for a group of desktop Ubuntu users. Linux is used in cheap embedded applications, like home NAS servers, that don't have a UPS. Linux isn't just a server O/S anymore. The way to design and optimize a file system needs to change too.
Additionally, even for servers, the times have changed, and this affects file systems. It used to be that accepting data loss was OK, since you would need to rebuild a server after a failure. Today, the disk arrays are so large, that if you attempted to restore the data from backups, it would take hours (sometimes days.) As such, capabilities like "snapshots" are becoming very important to servers. Server disk storage is increasingly bandwidth limited, and not disk size limited. Today, it is possible to have 1 TB of data on a single disk, while being unable to use that disk space effectively. Under many workloads, the users are capable of changing the data faster than a backup program can copy the data off the disk. In such a case, without a snapshot capability, it is impossible to make a valid backup.
After the Ext4 dataloss discussion, and the "Don't fear the fsync()" posts, I don't want to hear about Ext4, fsync(), or data loss again.
If you try doing some of the harder data intensive problems, anything that doesn't generate machine code can really suck. But to give you an idea of some of the obstacles, take a simple algorithm:
for each FileName in BigListOfFileNames {
    OpenFile(FileName);
    for i = 1 to NumberOfElementsInFile {
        for j = i to NumberOfElementsInFile {
            ComputeCovariance(i, j);
        } // for j
    } // for i
} // for each FileName
SumAllCovarianceCalcsFromAllRuns();
Apologies for the pseudo-code, but even a simple numerical routine like this can become a very complex computational nightmare. There is no guarantee that the covariance matrix fits in RAM. For speed, you want to distribute each file across the network. It is necessary to organize, as best as possible, the RAM used in the routine such that locality is preserved to minimize page faults. Appropriate use of the MMX instructions is desirable. It is also a good idea to ensure the compiler isn't adding function calls in unexpected places, such as an IsFinite() function call inside the ComputeCovariance routine. Additionally, the MMX optimizations may only become obvious if the ComputeCovariance routine is pulled out of the function and placed in the main body of the loop, so it can be easily parallelized.
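As a rough illustration of what the inner loops might look like once ComputeCovariance is inlined, here is a minimal C sketch. It is not the routine from my application: the function names are hypothetical, it uses the textbook sum-of-products formula (which can lose precision on badly scaled data), and it assumes the n-by-n accumulator fits in RAM. It only fills the upper triangle, matching the "for j = i" loop above.

```c
#include <assert.h>
#include <stddef.h>

/* Add one sample (a row of n values) to the running sums.
 * sum has n entries; cross is an n*n row-major array of which only
 * the upper triangle (j >= i) is used. Both must start zeroed. */
void cov_accumulate(const double *row, size_t n, double *sum, double *cross)
{
    assert(row != NULL && sum != NULL && cross != NULL);
    for (size_t i = 0; i < n; i++) {
        const double xi = row[i];
        double *c = cross + i * n; /* row i of the accumulator */
        sum[i] += xi;
        for (size_t j = i; j < n; j++)
            c[j] += xi * row[j];   /* no function call in the inner loop */
    }
}

/* Sample covariance of variables i and j (i <= j) after m samples. */
double cov_finish(const double *sum, const double *cross,
                  size_t n, size_t m, size_t i, size_t j)
{
    assert(m > 1 && i <= j && j < n);
    return (cross[i * n + j] - sum[i] * sum[j] / (double)m) / (double)(m - 1);
}
```

Because the accumulators are plain sums, the partial results from different files or different machines can simply be added together before calling cov_finish, which is the role SumAllCovarianceCalcsFromAllRuns plays in the pseudo-code.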
Even trivial-looking statements, like OpenFile, are complex. For instance, the OpenFile routine should preferentially process files stored on the local computer first, and then process files sitting on other computers elsewhere on the network. To completely confuse the issue, the path names for the same file are different in Windows, based on whether the file is located locally or on the network. (Pathname translation isn't as much of a problem on Linux.) Finally, for speed, and when the covariance matrix is small, it is desirable to have OpenFile open the file as a memory-mapped file, and then have a background process touch all the pages so they are paged into memory. The foreground process can then happily process the data in the file without waiting for page faults. On the other hand, if the covariance matrix doesn't fit in RAM, then you don't want to have OpenFile read ahead, because the hard drive will be busy doing other work instead.
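A minimal POSIX sketch of the memory-mapped variant follows. Note the substitution: instead of a background process touching every page, it asks the kernel to read ahead with posix_madvise(POSIX_MADV_WILLNEED), which is only a hint. The function name and the prefetch flag are hypothetical, and the local-versus-network file ordering is not handled here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only.
 * If prefetch is nonzero, hint the kernel to start paging it in.
 * Returns NULL on failure; release the mapping with munmap(). */
const void *map_input_file(const char *path, size_t *len_out, int prefetch)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    if (prefetch)
        (void)posix_madvise(p, len, POSIX_MADV_WILLNEED); /* advisory only */

    *len_out = len;
    return p;
}
```

Passing prefetch=0 covers the case where the covariance matrix doesn't fit in RAM and read-ahead would only compete for the disk.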
Additionally, it is convenient to do some of the operations with software transactional memory (STM). That adds another layer of library complexity.
From a practical point of view, even the few lines of code above can quickly become a parallelization and optimization nightmare when applied to cluster-scale computing. Incidentally, those lines of code were one of the rate-limiting steps in a data processing application that I was working on, so they are worth clustering. But clustering only makes sense if the inner loop is tight. Otherwise, the hand-tuned C code clobbers Java code for speed.
I don't really mean to pick on Java and .NET so hard, but it is simply amazing how slow code can get if you don't ever look at an assembly printout of what you told the computer to do. It really isn't hard to have 6 or 8 function calls per inner loop of the above code, especially if you don't very carefully keep track of what you just told the compiler to do.
I don't think you can blame the choice of Serenity on Joss Whedon fans. NASA simply had a rather lame list of names, and Serenity was the best by far. The other choices were: Earthrise (9%), Legacy (13%), and Venture (8%).
I just couldn't bring myself to vote for anything other than Serenity or Colbert. Colbert is better than Legacy, NASA's second best choice. I really don't want mankind's Legacy to be a room on a space station. Surely we can come up with something better than that ...
Fans of the BBC show, "Yes, Minister", will have a different perspective. They will argue that the choice was a conjuring trick. Someone at NASA loved Serenity and rigged the contest for it to win. Although, I'm not sure any /. readers will remember the political ins and outs of Yes, Minister.
Unfortunately, this effect just makes the programmer's job worse. It means that if he can only get the complexity estimate to within a factor of 100 for CPU usage, by the time Amdahl's law is done, his estimate will only be good within a factor of 1000. To me, this screams, if you really need multi-core capability, you probably need a cluster too.
How likely is it that if a programmer shows a user some code, and the feedback is the code is too slow, that the user will be satisfied with a 2:1 or a 4:1 speedup?
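For reference, Amdahl's law itself is short enough to write down as code. This is just the standard formula; the function name is made up, and the 90% figure in the note below is an illustrative assumption, not a measurement.

```c
#include <assert.h>

/* Amdahl's law: overall speedup on `cores` cores when a fraction
 * `parallel` (0..1) of the single-core run time can be parallelized. */
double amdahl_speedup(double parallel, double cores)
{
    assert(parallel >= 0.0 && parallel <= 1.0 && cores >= 1.0);
    return 1.0 / ((1.0 - parallel) + parallel / cores);
}
```

If 90% of the run time parallelizes, amdahl_speedup(0.9, 4.0) comes out a little over 3, and no number of cores gets it to 10. That is the kind of 2:1 to 4:1 gain the question above is asking about.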
The problem with very long instruction word (VLIW) architectures like EPIC and the Itanium is that the main speed limitations in today's computers are bandwidth and latency. Memory bandwidth and latency can be the dominant performance driver in a modern processor. At a system level, network, I/O (particularly for the video), and hard drive bandwidth and latency can dramatically affect system performance.
With a VLIW processor, you are taking many small instruction words, and gathering them together into a smaller number of much larger instruction words. This never pays off. Essentially, it is impossible to always use all of the larger instruction words. Even with a normal super-scalar processor, it is almost impossible to get every functional unit on the chip to do something simultaneously. The same problem applies with VLIW processors. Most of the time, a program is only exercising a specific area of the chip. With VLIW, this means that many bits in the instruction word will go unused much of the time.
In and of itself, wasting bits in an instruction word isn't a big deal. Modern processors can move large amounts of memory simultaneously, and it is handy to be able to link different sections of the instruction word to independent functional blocks inside the processor. The problem is the longer instruction words use memory bandwidth every time they are read. Worse, the longer instruction words take up more space in the processor's cache memory. This either requires a larger cache, increasing the processor cost, or it increases latency, as it translates into fewer cache hits. It is no accident the Itanium is both expensive and has an unusually large on-chip cache.
The other major downfall of the VLIW architecture is that it cannot emulate a short instruction word processor quickly. This is a problem both for interpreters and for 80x86 emulation. Interpreters are a very popular application paradigm. Many applications contain them. Certain platforms, like .NET and Java, use pseudo-interpreters/compilers. 80x86 emulation is a big deal, as the majority of the world's software is written for an 80x86 platform, which features a complex variable-length instruction word. The long VLIW instructions are unable to decode either the short 80x86 instructions, or the Java JIT instruction set, quickly. Realistically, a VLIW instruction processor will be no quicker, on a per-instruction basis, than an 80x86 processor, despite the fact the VLIW architecture is designed to execute 4 instructions simultaneously.
The memory bandwidth problem, and the fact that VLIW processors don't lend themselves to interpreters, really slows down the usefulness of the platform.
I think this problem will take longer than a year or two to solve. Modern computers are really fast. They solve simple problems almost instantly. A side effect of this is that if you underestimate the computational power required for the problem at hand, then you are likely to be off by large amounts.
If you implemented an order n-squared algorithm, O(n^2), on a 6502 (Apple II) and n was larger than a few hundred, you were dead. Many programmers wouldn't even try implementing hard algorithms on the early Apple II computers. On the other hand, a modern processor might tolerate O(n^2) algorithms with n larger than 1000. Programmers can try solving much harder problems. However, the programmer's ability to estimate and deal with computational complexity has not changed since the early days of computers. Programmers use generalities. They use ranges: like n will be between 5 and 100, or n will be between 1000 and 100,000. With modern problems, n=1000 might mean the problem can be solved on a netbook, and n=100,000 might require a small multi-core cluster.
There aren't many programming platforms out there that scale smoothly from applications deployed on a desktop, to applications deployed on a multi-core desktop, and then to clusters of multi-core desktops. Perhaps most worrying is that the new programming languages that are coming out are not particularly useful for intense data analysis. The big examples of this for me are .NET and functional languages. .NET was deployed at about the same time multi-core chips showed up, and has minimal support for them. Functional languages may eventually be the solution, but for any numerically intensive application, tight loops of C code are much faster.
The other issue with multi-core chips is that as a programmer, I have two solutions to making my code go faster:
1. Get out the assembly printouts and the profiler, and figure out why the processor is running slow. Doing this helps every user of the application, and works well with almost any of the serious compiled languages (C, C++). Sometimes, I can get a 10:1 speed improvement.(*) It doesn't work so well with Java, .NET, or many functional languages, because they use run-time compilers/interpreters and don't generate assembly code.
2. Recode for a cluster. Why stop at a multi-core computer? If I can get a 2:1 to 10:1 speedup by writing better code, then why stop at a dual or quad core? The application might require a 100:1 speedup, and that means more computers. If I have a really nasty problem, chances are that 100 cores are required, not just 2 or 8. Multi-core processors are nice, because they reduce cluster size and cost, but a cluster will likely be required.
The problem with both of the above approaches is that from a tools perspective, they are the worst choice for multi-core optimizations. Approach 1 will force me into using C and C++, which don't even handle threads really well. In particular, C and C++ lack an easy implementation of Software Transactional Memory, NUMA, and clusters. This means that approach 2 may require a complete software redesign, and possibly either a language change or a major change in the compilation environment. Either way, my days of fun-loving Java and .NET code are coming to a sudden end.
I just don't think there is any easy way around it. The tools aren't yet available for easy implementation of fast code that scales between the single-core assumption and the multi-core assumption in a smooth manner.
Note: * - By default, many programmers don't take advantage of many features that may increase the speed of an algorithm. Special-purpose instruction sets, like MMX, can dramatically speed up certain loops. Sometimes loops contain a great deal of code that can be eliminated. Maybe a function call is present in a tight loop. Anti-virus software can dramatically affect system speed. Many little things can sometimes make big differences.
Make sure your keyboard driver is set correctly. My multi-lingual keyboard drivers take a little while to load. Consequently, I can enter a shell, type cd /etc/X11, and get something completely different. The problem is fairly easy to spot, as the slash characters come out as French A symbols.
Easter is less than one month away! There will be thousands of fake bunnies! with millions of fake eggs! Who will protect the citizens from the fake eggs?
How will we find the killer rabbit? What will the police do?
KDE isn't fixed right now. Additionally, KDE is not the only application that generates lots of write activity. I work with real-time systems, and write performance on data collection systems is important.
I did some benchmarks on the ext3 file system, the ext4 system without the patch, and the ext4 system with the patch. Code that followed the open(), write(), close() sequence was 76% faster than the code with fsync(). Code that followed the open(), write(), close(), rename() sequence was 28% faster than code that followed the open(), write(), fsync(), close(), rename() sequence. Additionally, the benchmarks were not significantly affected by which file system was used (ext3, ext4, or ext4 patched). You can look up the spreadsheet and the discussion on Launchpad.
Major Linux file backup utilities, like tar, gzip, and rsync, don't use fsync as part of normal operations. Of the three, only tar uses fsync at all, and only when verifying that data is physically written to disk. In that situation, it writes the data, calls fsync, calls ioctl(FDFLUSH), and then reads the data back. Strictly speaking, that is the only way to make sure the file is written to disk, and is readable.
Finally, as Theodore Ts'o has pointed out, if you really want to make sure the file is saved to disk, you also have to fsync() the directory too. I have never seen anyone do that, as part of a normal file save. Most C programming textbooks simply have fopen, fwrite, fclose as the recommended way to save files. Calling fsync this often is unusual for most C programmers.
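For anyone who hasn't seen it spelled out, the "fsync the file and the directory" sequence looks roughly like this in C. It is a sketch under assumptions: the function names, the ".tmp" suffix, and the fixed-size path buffers are mine, EINTR handling is omitted, and whether the data actually reaches the platter still depends on the file system and the drive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes. Returns 0 on success. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: replace dir/name with buf, flushing both the
 * file and the containing directory. Returns 0 on success, -1 on error. */
int save_durably(const char *dir, const char *name, const char *buf, size_t len)
{
    char tmp[4096], dst[4096];
    assert(dir != NULL && name != NULL && buf != NULL);
    if (snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name) >= (int)sizeof tmp ||
        snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write_all(fd, buf, len) != 0 || fsync(fd) != 0) { /* flush file data */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp, dst) != 0) {
        unlink(tmp);
        return -1;
    }

    int dfd = open(dir, O_RDONLY); /* flush the directory entry too */
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc == 0 ? 0 : -1;
}
```

Compare that with the three-line fopen, fwrite, fclose version in the textbooks, and it is easy to see why almost nobody writes it this way for an ordinary file save.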
I would hate to be in your programming class. You're enforcing programming standards that aren't followed by key Linux utilities, aren't in most textbooks, and aren't portable to non-Linux file systems.
If you require your students to fsync() the file and the directory, as part of a normal assignment, you are requiring them to do things that aren't done by any Linux utility out there. Further, if you are that paranoid, you better follow the example from the tar utility, and after the fsync completes, read all the data back to verify it was successfully written.
You are correct on the first statement, and wrong on the second. This bug affects almost every 'C' program ever written. Essentially, POSIX allows for a successful fclose, even if the file has not been written to disk. This permits a file system to implement a write-back cache.
Many UNIX and Linux file systems will completely screw up if the system suddenly crashes before the data has been successfully written to disk. The complaint is that the Ext4 system had a bug that did this in a very egregious way, and this bug would likely cause serious data loss on any system that is not using a UPS. Ext3 was usually mounted with the "data=ordered" option. For most realistic scenarios, Ext3 will give data loss failures that a normal person would expect. Specifically, with Ext3, you might lose your most recent files. With Ext4, the complaint was that you can lose fairly old files. Some of the old UNIX file systems would become unreadable if the system crashed suddenly. The problems with Ext4 are a matter of scale.
To make matters worse, the fsync() remarks are incendiary. They would force modifications to almost every program on Linux. Your fopen, fwrite, fclose example is in almost every C programming textbook in existence. The fact they were made by an individual working on the Ext4 system didn't help things either. Saying "All applications should fsync() the file and the containing directory if they don't want data loss!", when your file system has a data loss bug, creates a sudden and severe reaction ...
Calling fsync() excessively completely trashes system performance and usability. Essentially, operating systems have write-back caches to speed code execution. fsync() bypasses the write-back cache by writing data out immediately, and making your program wait while the flush happens. Modern computers can do activities that involve rapidly touching hundreds of files per second. Forcing each write to use an fsync() slows things down dramatically, and makes for a poor user experience.
To make matters worse, from a technical point of view, it is necessary for strict POSIX compliance to fsync() the file and then fsync() the containing directory. I have never seen a piece of normal application code that fsync()s the containing directory. Even common Linux utilities like rsync and gzip don't use fsync anymore. tar uses fsync in one special case: for file verification, before calling ioctl(FDFLUSH). The documentation on tar is instructive:
The code was using fsync only when the ioctl is unavailable, but Marty Leisner says that the ioctl does not work when not preceded by fsync. So, until we know better, or maybe to please Marty, let's do it the unbelievable way
#if HAVE_FSYNC
fsync (archive);
#endif
#ifdef FDFLUSH
ioctl (archive, FDFLUSH);
#endif
In general, application writers are interested in making sure the file is readable. Unless you are really determined, and willing to go through the file verification like in the tar command, fsync() does little to guarantee a file will be readable at a later date. Under modern file systems, there are so many reasons why a file may become unreadable, and so few of them are fixed with fsync(), that one has to ask: Why bother with fsync()?
In fact, there are so few good reasons to use fsync(), that many applications have completely given up on fsync(). On Apple Macs running OS X, fsync() does not force the drive to flush its write cache; that takes a separate fcntl(F_FULLFSYNC) call. If you run NFS, fsync() will probably flush your data to the network, but not to the hard disk. If you are running a PC with a modern hard drive, the hard drive probably has a write-back cache. As such, fsync() doesn't guarantee your data is physically on the disk. fsync() is disabled in laptop mode.
For most applications, using fsync() will only slow down your C code. It is useful for certain applications, like databases. Many other programming languages have no equivalent to fsync(). For most programs, fsync() is an infrequently used call, and is primarily used in special purpose libraries like databases.
I was recently discussing the music copyright situation with my music teacher. We were looking at the copyright notice on the music book. Essentially, they were only selling a copy of the sheet music. They weren't actually selling the rights to play the music. Effectively, they were almost creating a misrepresentation case, in that why would you sell a music book to students, without giving them permission to play the music?
You might want to check the copyright notices. You might find that not only are you not supposed to copy the music, you aren't supposed to play it either.
People really underestimate how quickly those embedded applications can add up. Linux in your smartphone, Linux in your car navigation system, Linux in your wireless router, Linux in your PlayStation: before long it is "Linux everywhere."
Yes, some of these applications might not be as visible as Windows on the desktop, but they do add up. Sooner or later, someone is going to come out with a statistic like Linux outsells Windows 2:1 !!! Everyone will be wondering how that is. It will be all the embedded applications.
I bet GCC has more running applications in the world than any other compiler. It is used as a back-end compiler for a great many embedded systems.
It all depends on the target country. Afghanistan and Iraq have constant Predator overflights. I expect the blimp will offer stationary surveillance over relatively unarmed or poorly armed countries. It might also be used for UN crisis zones, like Sudan and Somalia, or where the local government has largely broken down.
Alternatively, the blimp could be used to patrol U.S. air space. There is always the coast guard, border patrol, war on drugs, war on terrorism, war on crime, and even coastal rescue. A stationary surveillance platform might be really useful for those applications.
The main target of this platform might be here at home in America.
Novell should have already known the "don't be a Microsoft Partner" lesson. Novell owned WordPerfect when Windows 95 came out. Microsoft gave so much incorrect documentation to the WordPerfect developers that the lawsuit was still going on 13 years later. In fact, the lawsuit was ongoing when Novell signed the Linux deal with Microsoft.
Having file system metadata not match file system data is a pretty big bug. Ext3 defaulted to having everything mounted such that the writes to the disk were "ordered", i.e., data=ordered. Ext4 does not force "ordered".
Userland can solve this problem by calling fsync() all over the place, like before every close. However, that completely defeats the purpose of having a buffered write-back file system. If the new rule is to change every userland program to force all data to be flushed to disk after every close, then we might as well mount the filesystem with "-o sync", and flush our performance down the toilet. (Pun intended.)
The problem here is that no call exists to force writes to disk to be "ordered". fsync() is not a substitute for ordered writes to disk. There are just too many ways an application can get into trouble if writes to disk aren't ordered. Having situations where neither the backup file nor the new file is valid is just the beginning of the problems.
I write data acquisition applications that write lots of data in many files to disk. I don't care if my newest file is blank. This "bug" could mean that I have a pseudo-random number of blank files, and they might not even be ordered. My only solution, fsync(), will tank the application's performance, by causing huge amounts of disk activity. fsync() is not a substitute for ordered writes.
The legal doctrine in common law countries is Force Majeure. If something sufficiently big happens, all bets are off.
The other business doctrine is that a big company shall not bankrupt the organizations selling their products:
No sales companies = No salesmen = No sales.
I think the complaint of the MCPs is that Microsoft is demanding payment for product the customer isn't paying for. Specifically, my impression is that Microsoft wants to be paid for the full 3-year contract (over 3 years), even though the customer that purchased the software went bankrupt after the first year. It's a good deal from Microsoft's point of view ...
Business people tend to remember the company that pushed them into bankruptcy. They don't forgive and forget easily.
I can't see everyone "just switching" to Linux, but this could create much motivation to try. Survival in business is a strategic imperative. If someone threatens that survival, then business people tend to connect the dots, and adapt accordingly.