Thanks for your comments. I agree with almost everything you said. The cost of context switching between threads is fairly small, but it can eat into your performance some. The trick is to get just enough threads to keep all the cores busy without getting into a thrashing state. As far as the data set is concerned: I try to break up the data into chunks that can each be operated on independently. I then make sure that each chunk is handled with thread-safe code and get the locking down to the smallest unit possible.
Using my relational table model, I put each column in its own data object. All the values are de-duped and stored along with 'like' values within blocks. If I need to add, update, or delete a value then all I need to lock is the block that contains that value. The only time a thread needs to block is when it wants to update a value within the same block as another thread is currently updating. The updates are very quick so the blocked thread doesn't stay blocked for long.
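To make the block-level locking idea a bit more concrete, here is a minimal Python sketch. It is not the actual system: the `Column` class, the fixed-size positional blocks, and the `block_size` parameter are all assumptions made up for illustration, and the de-duping and grouping of 'like' values is left out entirely.

```python
import threading

class Column:
    """Hypothetical column object: values split into blocks, one lock per block."""

    def __init__(self, values, block_size=4):
        self.block_size = block_size
        # Assumption: blocks are simple fixed-size slices of the column.
        self.blocks = [list(values[i:i + block_size])
                       for i in range(0, len(values), block_size)]
        self.locks = [threading.Lock() for _ in self.blocks]

    def update(self, row, value):
        block, offset = divmod(row, self.block_size)
        # Only the block holding this row is locked; other blocks stay free.
        with self.locks[block]:
            self.blocks[block][offset] = value

    def get(self, row):
        block, offset = divmod(row, self.block_size)
        with self.locks[block]:
            return self.blocks[block][offset]
```

Two threads updating rows 0 and 5 (with `block_size=4`) never wait on each other, while two threads updating rows 0 and 1 take turns on the same lock.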
I also only need to read from disk the columns needed to satisfy the query. For example, if I have a 100 column table with millions of rows, then a query such as "SELECT name, address, phone_number FROM table WHERE state = 'CA';" only needs to read in 4 columns from disk. The other 96 columns can stay out of memory. I still have some development to do and a lot of testing, but so far most queries are much faster than I can get Postgres or MySQL to do on the same data set.
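As a rough illustration of reading only the columns a query touches, here is a small Python sketch. The function names and the in-memory dict standing in for per-column files are assumptions for the example, not how the real system stores anything.

```python
def columns_needed(select_cols, where_cols):
    """Return the columns a query touches, in first-seen order, no duplicates."""
    seen = []
    for name in list(select_cols) + list(where_cols):
        if name not in seen:
            seen.append(name)
    return seen

def run_query(table, select_cols, where_col, where_value):
    # table: dict of column name -> list of values (stand-in for column files).
    needed = columns_needed(select_cols, [where_col])
    # Only the needed columns are "loaded"; the rest are never touched.
    loaded = {name: table[name] for name in needed}
    matches = [i for i, v in enumerate(loaded[where_col]) if v == where_value]
    return [tuple(loaded[c][i] for c in select_cols) for i in matches]
```

For the query above, `columns_needed(["name", "address", "phone_number"], ["state"])` gives the 4 columns, no matter how many others the table has.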
Yes. The 'obviously wrong thing' I am doing is running my software on a computer that happens to be running other things along with it. The operating system, various services, other applications....etc. I think I addressed that issue.
Running on a server with a Xeon processor is definitely a use case, but it isn't the only one. Plenty of large data sets are on end user devices. Why should fast processing be reserved only for servers? Data is growing everywhere.
I am building my own general-purpose data management system that handles both structured and unstructured data (e.g. databases and files) so I guess that would qualify as a 'data science application'. I assume you are referring to things like memory-only databases. I have looked at a few of them, but I wouldn't consider myself an expert on any of them. Things resident in memory will always be faster than stuff stored on disk (whether HDD or SSD). The problem is that data sets tend to grow very large. Memory is much cheaper than it used to be, but it is still about 100x as expensive as disk. What we really need is a system that can process a query by reading the minimal amount required off disk and storing as much of it as it can in memory to speed up subsequent queries. Nothing revolutionary with that idea, but systems typically have not done a great job trying to implement it. I have taken a whole different approach to solving this issue and so far it is working very well. Still under development, but it is getting close to being finished enough for general use.
I bought a 32 GB flash thumb drive the other day for about $10 retail. I assume Apple gets it cheap buying in bulk. No reason why anything as expensive as the iPhone should come with anything less than 64 GB.
I definitely look at the number of cores before allocating threads. My system can even have multiple threads processing the same column. It works great for tables with 10 million+ rows in them. As far as your snipe about 'real programmers'....I'm sure you must be right. That must be why the same query using a PostgreSQL database takes over 2 minutes to complete while mine finishes in about 20 seconds. So...I guess never mind if speed isn't important to you.
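Here is a minimal Python sketch of checking the core count before allocating threads, and of several threads sharing one column. The helper names (`worker_count`, `split_rows`, `parallel_sum`) and the `reserve` parameter are made up for the example. Also note that in CPython the global interpreter lock keeps pure-Python number crunching from actually running in parallel, so this only shows the structure, not the speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def worker_count(n_tasks, reserve=0):
    """Pick a thread count: no more than the cores available, at least 1."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(n_tasks, cores - reserve))

def split_rows(n_rows, parts):
    """Split range(n_rows) into `parts` contiguous, nearly equal ranges."""
    base, extra = divmod(n_rows, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def parallel_sum(column):
    """Sum one column with several threads, each taking a slice of rows."""
    workers = worker_count(len(column))
    slices = split_rows(len(column), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda rows: sum(column[i] for i in rows), slices)
        return sum(partials)
```

Passing `reserve=1` would leave a core free for the operating system and whatever else is running on the machine.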
It's important to note that a single process that has 10 threads running on a 10 core processor is not necessarily 10 times faster than the same process running a single thread on a single core processor. It will probably be much faster since your threads don't have to wait as long sharing the processor with other application threads, but you won't see a perfect multiplier effect. It takes time to spin up those threads, to schedule them, and to swap threads in and out of the core running them. Example: I took a data processing function that took about 10 minutes to complete using a single thread. I split it up so that 10 threads worked on it at the same time. Instead of taking only 1 minute to complete, it took just over 2 minutes. Still much faster, but not perfect.
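The arithmetic behind that example can be written out in a few lines of Python. I am rounding "just over 2 minutes" down to 2 for simplicity, and the Amdahl's law part is only one possible model of where the time went. It treats the shortfall as a serial portion of the work rather than thread overhead, so take the result as a rough guide.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run was."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, workers):
    """Fraction of the ideal (perfect multiplier) speedup actually achieved."""
    return speedup(t_serial, t_parallel) / workers

def amdahl_parallel_fraction(s, workers):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1 - 1 / s) / (1 - 1 / workers)

s = speedup(10, 2)                    # about 5x with 10 threads
e = efficiency(10, 2, 10)             # about half of the ideal 10x
p = amdahl_parallel_fraction(s, 10)   # roughly 0.89 under Amdahl's model
```

So under those assumptions, even if about 89% of the work ran perfectly in parallel, 10 threads would still only get you about 5x.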
Every process must have at least one thread of execution, but a process can also have many different threads running through its code simultaneously. For example, a database might spin off a different process to handle each client request, or it might have a single process that spins off a separate thread every time a request is made. If you just had one process with one thread, then all your client requests would run serially. You would have to wait for the client in front of you to finish before your request would be processed.
People who might actually need something like this are those who are running a lot of different applications simultaneously or have individual apps that were programmed to do lots of processing in parallel. I am currently building a data management system that uses lots of threads to greatly speed up processing. The more cores are available, the faster I can process large data sets. With column based relational tables, I can assign a different thread to process each column separately. If there are 100 columns in a table, lots of threads are needed. The more threads that can run at the same time in the CPU, the faster the query will complete. These processors are not just for gaming.
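A simple Python sketch of handing each column to its own task might look like the following. The `scan_column` and `scan_table` names and the dict-of-lists table are assumptions for illustration only. The thread pool caps how many columns are processed at once, which matters when a table has 100 columns but the CPU has far fewer cores.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_column(name, values, predicate):
    """Return (column name, number of values matching the predicate)."""
    return name, sum(1 for v in values if predicate(v))

def scan_table(table, predicate, max_workers=None):
    """Scan every column as a separate task and collect the per-column counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scan_column, name, values, predicate)
                   for name, values in table.items()]
        return dict(f.result() for f in futures)
```

Each column is independent of the others here, so no locking is needed between the tasks.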
Long names may be great for making very descriptive names for your files and folders, but they also increase the time it takes to find stuff. When you want to find all of your .JPG photo files by traversing your file system looking for any files with that extension, be prepared to wait a long time if you have a lot of files, and even longer if your names are really long. It takes many more processor instructions to process each string looking for patterns or file extensions. If you only have a few thousand files, then you might not notice it, but put 10 million (or 100 million) files in one of your volumes and you certainly will.
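For reference, the kind of traversal I am talking about looks something like this in Python. It is just a sketch using the standard `os.walk`, and the `find_by_extension` name is made up for the example. Every file name in the volume gets a string comparison, which is where the per-name cost comes from.

```python
import os

def find_by_extension(root, ext):
    """Walk the tree under `root` and return paths whose names end with `ext`."""
    ext = ext.lower()
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # One string operation per file name, however long the name is.
            if name.lower().endswith(ext):
                matches.append(os.path.join(dirpath, name))
    return matches
```

Calling `find_by_extension("/photos", ".jpg")` (the path is only a placeholder) would match both `.jpg` and `.JPG` files.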
Everyone is doing it.
1) Turn on something by default.
2) Make you have to jump through a bunch of hoops in order to opt out or turn it off.
3) Charge you a lot of money if you fail to do so.
My phone company does it. My satellite company does it. My credit card company does it......
Welcome to the world of dynamic pricing. Once all the on-line retailers have a full profile of you and your spending habits, expect to see higher prices if you have a good income and don't clip coupons. It's kind of like appliance repairmen who jack up their estimates if you live in a nice house in a nice neighborhood.
I can't wait until we are all forced to watch a 30 second ad before our program will start. Want to run Microsoft Word, Skype, or even a third-party app? You must watch a commercial that you can't skip, first.
It was annoying enough when you pocket dialed your wife when you sat down with your phone in your pocket. Now you might have to explain to the FedEx driver that you didn't really mean to order $2000 worth of stuff.
Hillary Clinton did another self investigation of her private email server and found NO evidence of any classified documents (oops...I mean documents that were still clearly marked CLASSIFIED after her staff did some routine maintenance).
Millions of lines of code do not necessarily make a good software project. Which is better...a project that does X in 2 million lines of code...or a different project that also does X but only needs 500K lines of code to do it? In most cases the smaller code base is better, but not always. You can sometimes make a function do something in just a few lines, yet it takes you all day to figure out what those few lines mean. A bigger function with more lines might be much better because it is easier to follow, has more error checking, and is easier to maintain and update. I have also seen functions that were 2000+ lines long with GOTOs all over the place. It was a mess.
I did not mean to imply that every car is shot at 200,000 miles or that you can't breathe a little more life into an aging PC or iMac to get better performance for not a lot of money. I just meant that when all your components reach a certain age, some of us would just rather replace the whole thing than put a few more band-aids on it. When I was in college, I had an old car that I was constantly working on because I could not afford a better one. I made numerous trips to the junkyard to find a part that still had some useful life left in it, whenever something broke. I never knew when the car was going to break down next so I didn't try any really long trips in it.
I have built PCs from components I bought by shopping around the Internet and I have bought PCs that were built for me. In some cases, I built because I couldn't get the exact components I wanted in a pre-built model. In others it was to save some money. The money I saved did not come close to a 66% discount as you claim. I would be very surprised if anyone can build one for less than 80% of the cost of paying someone else to do it. (Unless of course, you found a way to build your own Apple products)
Yes, I know I could put more RAM in my 9 year old iMac and it would perform better. But just like a car with 200,000+ miles on it, there comes a time when replacing the battery/alternator/brakes/tires/etc. is just not worth it anymore.
Rules and laws are for little people. Not for anyone named Clinton.
Were you a news anchor in LA in the 80s?
Obligatory War Games reference.
It sounds cool, but how does it incorporate lasers and virtual reality while traveling at warp speed?
I refuse to believe that the best is behind us.
I guess you missed the part where it is a 9 year old iMac instead of a 4 year old MacBook Pro.