Programming Things I Wish I Knew Earlier
theodp writes "Raw intellect ain't always all it's cracked up to be, advises Ted Dziuba in his introduction to Programming Things I Wish I Knew Earlier, so don't be too stubborn to learn the things that can save you from the headaches of over-engineering. Here's some sample how-to-avoid-over-complicating-things advice: 'If Linux can do it, you shouldn't. Don't use Hadoop MapReduce until you have a solid reason why xargs won't solve your problem. Don't implement your own lockservice when Linux's advisory file locking works just fine. Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job. Modern Linux distributions are capable of a lot, and most hard problems are already solved for you. You just need to know where to look.' Any cautionary tips you'd like to share from your own experience?"
Put enough comments in your code so that five years from now you (and others) can remember what you indented the code to do. Remember that comments are not for describing what the code technically does (that is what the code is for), comments are for what the code is intended to do. Try and comment the decisions you made when developing the code, specifically why you took the approach you did and why you didn't use other options.
The truth is that the "hard" way of doing things is often more fun, because you have the challenge of learning a new tool or API. Plus sometimes it's actually easier in the long run because you've engineered a solution for the outer bounds conditions of scalability, so if your application takes off, it can handle the load.
I guess the real issue is that you have to engineer a "good enough" solution rather than a "worst case" solution.
I do not fail; I succeed at finding out what does not work.
Don't ask for advice about programming on slashdot unless you have a pile of salt grains ready.
Am I part of the core demographic for Swedish Fish?
Sometimes it's easier and faster to code from scratch than it is to use off-the shelf software - especially in the age of "frameworks".
In that train of thought, its often better to toss and rewrite (or write new programs) than it is to extend existing programs.
It's easier to implement a whole new framework than it is to convince your boss that writing anew is actually faster.
A database is not a bitbucket. Re-building basic database functionality in an external app is not a good idea. Applications, frameworks, languages come and go; data remains forever [1]. Business logic is part of the database. If you find yourself adding more and more "application servers" to get performance than you have a fundamental problem with your architecture (and probably a fundamental misunderstanding of how databases work). While it is not impossible to learn and implement good data management/database development practices using Microsoft tools, such a result is seldom seen in the wild.
sPh
[1] Per Tom Kyte of Oracle, whose first database job at the Department of Agriculture involved working with datasets stretching back to 1790.
If you are writing a program that touches more than two persistent data stores, it is too complicated.
I disagree. Is a program too complicated if it has 1. input, 2. output, and 3. logging? Is a program to prepare images for an online store too complicated if it reads 1. raw source images and 2. an overlay image and writes 3. finished images?
If Linux can do it, you shouldn't.
That'd be fine if we all ran Linux. But in an organization that already has to run Microsoft Access for other reasons, we have to take Windows into consideration. And I don't think Ted Dziuba was talking about just using Windows as a shell to run Linux in VirtualBox OSE either.
Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job.
Our programmer is far more experienced in Python than in bash, and if I felt like it, I could benchmark PIL against subprocess.Popen(['convert', ...]).
if the physical machine is not the bottleneck, do not split the work to multiple physical machines.
Yet PC game developers split a 4-player game across four PCs when one could do, and increasingly, PS3 and Xbox 360 game developers are following the same path.
It is far more efficient to buy your way out of a performance problem than it is to rewrite software. When running your app on commodity hardware, don't expect anything better than commodity performance.
If you are writing software to be used internally, sometimes springing for better hardware is worth it. But if you are writing software to distribute to the public, you can generally assume your customer has commodity hardware unless your software costs at least 1000 USD a seat.
I just RTFA. It isn't that good. These aren't many tips and they also don't seem to be too specialized. Most of them are already known or predictable. I can say that I didn't learn anything from TFA.
/.
I think that anything that reads "___ things to know about ____" or similar gets instant hits on
-1, Boring from me; hope it helps others.
Have you heard about SoylentNews?
And that's before you get into the really difficult stuff (that very few have managed to master) of getting a website that is easy to navigate and intuitive to use
politicians are like babies' nappies: they should both be changed regularly and for the same reasons
Don't program in C if Python will do the job.
But in a lot of cases, Python does not do the job. It doesn't do the job on iOS because Apple has explicitly banned everything but Objective-C++. It doesn't do the job on Xbox 360 or Windows Phone 7 because IronPython uses Reflection.Emit, and the version of .NET used by XNA doesn't support Reflection.Emit. And it doesn't do the job on Nintendo DS because the runtime uses up a lot of the available 4 MB of RAM.
Don't assume that, even six months from now, you're going to remember why you did things a certain way.
And the corollary: Don't assume you're going to be the one modifying the code a year or two from now.
Either way: Add comments liberally. Even if you're a conservative.
#DeleteChrome
I guess I'm an idiot, but.. err... did someone REALLY use mapreduce to solve an argument passing problem (the domain of xargs), or is the writing just shit?
xargs and its successor GNU parallel implement the "map" part of MapReduce.
Do not make things super-modular and generic unless they 100% have to be. In 99.9% of the projects no one, including yourself, will use your stupid dependency injection, and logging / access control can be done just fine without AOP. Don't layer patterns where there's no need. Aim for the simplest possible design that will work. Don't overemphasize extensibility and flexibility, unless you KNOW you will need it, follow the YAGNI principle (you ain't gonna need it).
"Modern Linux distributions are capable of a lot, and most hard problems are already solved for you. You just need to know where to look."
First off, I must say this piece says a lot about the Linux ecosystem. Specifically that this system's documentation is anemic at best. Why won't we have something like:
"What do you want to do?...with an associated answer...this kind of arrangement surely cannot hurt the Linux ecosystem.
You know, I find that as I get older, I am able to avoid overengineering things a lot better than when I was twenty something. There's nasty effect, though. I'm learning a lot less in depth about systems than I normally would.
Overengineering is terrible for a project, but it often is highly educational.
Visual Basic - Don't. Just don't.
It's always great to learn a language then have the company change it so drastically in the next version that all your knowledge of the language is useless. I don't believe it'll be the last time that happens either. I do know I will never bother to learn another MS programming language again.
Good luck to all you C# programmers when they switch to C#.NET, or whatever they call the next one. Hope you like reading!
This sentence no verb.
Some oversimplified philosphy, some good hints. Programmers and SysAdmins who do a lot of resource management eventually become managers. This isn't neccessarily a bad thing, as the world needs more managers with extensive experience with that which they are managing, and the respect of those people they are managing. It's true that it's silly to adopt some software, technology or process just because it's new. But Ted seems to be resistant to any change, which is not good either. The problem with "don't fix it if it ain't broke" reasoning is..what do you do when it eventually breaks? This is a mistake made by many in process control / automation eniveronments: failure of a part which is so obsolete that it has become difficult and expensive to obtain a replacement. Just try to find a new motherboard with an ISA bus these days. Or a composite monitor. The same thing can happen with software and the OS..where are you going to find a guy who knows enough about that old Kaypro which was running some COBOL software on CP/M, which controlled the electroplating machinery? This is why companies have lifecycle management, so that the pain of switching to newer software / hardware comes with predictable cost and timetables instead of sudden, possibly prolonged unavailability and expensive, awkward, band-aid fixes.
This flows into the idea of organizational amnesia, where important processes become lost. This is perhaps best illustrated by the US DoE forgetting how to make this secret substance called FOGBANK, which is a critical component of H-bombs. Upper management felt as though, because there was no need for additional H-bombs, the process was unimportant, and didn't take into account that H-bombs become (more) dangerous with advancing age, and eventually these needed to be replaced. It took considerable time and money to re-engineer FOGBANK.
These are both examples of failure to consider that all equipment wears out, and failure to plan for long-term needs.
Hey
I know these might sound odd but hear me out. Start by trying to rewrite the basic library's. make your own printf, strcpy. strlen etc..... Write copies of your own link list and tree storage methods and above all really really start to understand how memory works.
Another really really important thing that I have learned is stay FAR FAR away from OO programming until your really really comfortable in lower level languages. The reason is that to many students and beginners sit there trying to figure out why there variable started with value X and ended up with value Y only to find out that there object bashed some memory earlier on.
Basically just grab a good C compiler, I mean C COMPILER, not C++, not C#, not F# and start to learn how all the functions you use on a dailiy basis work, it will give you new insight into why and how you can quickly avoid and fix problems when and before they happen. It's also important to get a really really good handle on using a CLI over a GUI, Stay away from Visual Studio and other simular compilers. Use GCC and CC and make sure you look at how LD is working and understand how compilers do optimizations and improvements to your own code. This post is taking about grabbing tool to do Image processing and preform functions that have working solutions. However taking the time to see how the solutions work and why they work will give you good insight into not only great code design but great programming methods. It might seem odd for me to suggest to a beginner to try and rewrite strcpy or strcmp, but once you see how they really work you'll be far less likely to make the simple mistakes that can ground your program / project from working. It's the same say with with a beginner figuring out how malloc works and where memory gets taken from and put to, all of these suggestions are coming from the way I learned to program in C and other languages.
Feel free to throw any of these away or take any of them into your own programming adventure, but one thing is for sure. When you can figure out how the basic functions you use every day work it will save you hours and days of trouble shooting and leave you with a greater pallet of tools to use in the class room and on the job. I welcome and one who wants to add ideas to this post and attack it with there own view points.
No. Sorry, I strongly disagree.
C is like a swiss army knife that can be used to assemble a chainsaw, a sword a lightsaber, etc. You'll have to carve out all the pieces, file them down and put them together yourself, but the swiss army knife will help you a long the way.
But don't forget the bandaids.
Look, the PHB usually wants the code to run on HIS server for HIS use only. That's what he pays you for. Not to code it in the most cross-platform friendly language-du-jour and take 2 years to iron out all the bugs.
He doesn't give a shit if it'll run on the Xbox 360 or a Linux-ready Dead Badger.
And neither should you, unless you are some kind of anal retentive who spends all day arguing the merits of absolute versus relative positioning, fixed vs percentage tables, and worrying whether your code will run on every machine conceived in the next 50 years.
I have news for you, it won't.
Hell, I got an N900 6 months ago, and it's already EOL'd as far as updates to the OS are concerned.
I suppose I could wait till there's a port of Android or something, because the three coders on the project are doing a fine job, in less than a year I'll have the same functionality as a 3210.
I'm in a different boat from most commenters here, I think, because I am a scientist writing simulations; some simluations run a long time and create a lot of data which would be costly to reproduce, and what I wish someone had told me early on was that I should comment my *data files*, not just my code. Each file should include the exact parameters used to create it, an explanation of what each column represents, and preferably there should be a way of knowing what version of your simulation code was used to create it. A couple of times in grad school I had toss out months of data after I discovered a bug in my code, and didn't know when the bug showed up and which data was affected by it.
(I'd welcome other advice from simulationists too; I've never had an advisor who was particularly programming-savvy, even though programming was always a large part of my research, and so I always had to make it up as I went along.)
we could not figure out whether the author was an incredibly elaborate troll or just a run-of-the-mill idiot.
Reading this comment of his reminds me of something I read recently:
Physicists stand on each other's shoulders. Engineers dig each other's graves.
I've never understood why so many software developers feel the need to disparage one another in an attempt to prove their intelligence/superiority. There are plenty of tough problems out there and we all can learn something from one another, no? I've definitely been guilty of this in my tech career but lately I'm wondering more and more, why does the person who has a different solution always have to be an "idiot?" Why isn't he/she just someone who has a different take on solving this particular problem?
Now, I'm not saying that engineers do this more than any other group but out of all of my friends (some of whom are doctors, lawyers, teachers, etc.) it certainly seems like a more common event among software developers.
Sadly, if you're working in Java then you will frequently be forced to write the same 8 damn lines over 70 times, and there will be literally no way to avoid it unless you resort to automatic code generation, which is a fancy name for using a separate program to do the copying-and-pasting for you.
Put everything in version control. Everything. EVERYTHING!
Well. You could skip /home, but I know a roll back of /etc has saved me a couple of times on config upgrades.
Remember that once code is deleted, you can't get it back. However, version control changes that. Version control is one of the most vital tools for anyone developing/working with a computer.
Oh and git rocks and stuff :)
Penguins can be fascists too
I wish I'd know about LISP 25 years ago. Stupid people told me it was "for processing lists." If only I'd known better. Functional programming gives you wings and a jet engine.
I wish I hadn't paid too much attention to people with limited imaginations. Just because they're older, have more money and shout louder doesn't mean they are clever or wise.
C++ is way over-rated but it's worth knowing because it's so widely used. Don't let it detract you from mastering C and learning scripting languages. Understanding object-oriented design is more important than knowing the latest trendy language.
Objective-C.
Just because software is Free/Open doesn't mean it's "cheap" and poor quality. I could have saved myself 2-3 years there.
Ignore Windows and it will ignore you.
Stick Men
Nice.
It must have been something you assimilated. . . .
A few rules of thumb for a startup environment:
1. Don't overengineer! Overengineering wastes time on things that may never be used. Features should be customer driven.
2. Functions and methods should be as small as possible. You should make it an obsession to split methods and functions into the smallest possible components. Only then can you have good code reuse. Don't start thinking I will split it when I need it, you never will!
3. Never ever reinvent the wheel. Reinventing things that exist is overengineering.
4. Don't optimize ahead of time. When I say that I don't mean don't use a hash table instead of an array where it makes sense. I mean don't try to avoid exception handling or function calls or other minor optimizations. If it has an impact on readability don't do it. Optimization always comes last. Often you'll find there are only 1 or 2 "hotspots" in your code. If you spend time optimizing these "hotspots" after your application is built thats when you'll get the best return on your investment. Another gotcha with optimization is using technologies that can't deliver the level of performance you expect. You should test to make sure the underlying components you plan to use will perform as expected before you start coding.
5. Don't cram as much code in a single statement as possible. Every compiler I know about today will produce identical code whether it's one statement or 5 statements. It makes it hard to read so don't do it!
6. Allocate time for testing. No one writes perfect code.You want to give a good impression to your customer so don't skip this step.
7. Make unit testing an obsession. Always add unit tests for new code, it reveals errors in your code. When you find a bug in your code add a unit test to test for it. If in the future someone decides to rewrite some function or method you wrote because it's not elegant enough they will not reintroduce old bugs.
8. Don't rewrite code if possible. Refectoring is almost always easier and less error prone.
If you are writing a program that touches more than two persistent data stores, it is too complicated.
Others have already mentioned cases where multiple datastores make sense. A trivial example: One database to handle user data, another to handle blobs (image conversions, etc) -- bonus if the second store can do its own conversions; a third to handle logging -- that's already three, and that's before we start considering things like RESTful services, which can function as intelligent datastores of their own...
If Linux can do it, you shouldn't.
Unless you're not on Linux. And, specifically:
Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job.
If you're doing something that truly works as a shell script, and isn't part of a larger app, I agree. However, PIL likely performs better, and it removes the shell as an issue -- if you thought SQL injection was bad, wait till you have people exploiting your shell commands. You can do it safely, but why would you bother, when you've got libraries that accept Python (or Perl, or Ruby) native arguments, rather than forcing you to deal with commandline arguments? Why do you want to check return values, when you can have these native libraries throw exceptions?
Parallelize When You Have To, Not When You Want To
If you don't at least think about parallelization in the planning stage, it's going to be painful later on. It's easy to build a shared-nothing, stateless architecture and run it in a single-threaded way. It's hard to build a stateful web service with huge, heavyweight sessions, and then make it run on even two application servers in the future. Possible, but awkward, to say the least.
For example, if you are doing web crawling, and you have not saturated the pipe to the internet, then it is not worth your time to use more servers.
...unless, maybe, it's CPU-bound? And this is odd to mention in a section about parallelization -- wouldn't slow servers be a prime candidate for some sort of parallelization, even on a single machine, even if it's evented?
If you have a process running and you want it to be restarted automatically if it crashes, use Upstart.
Cool, but it looks like Upstart is becoming a Maslow's Hammer for this guy. Tools like Nagios, Monit, and God exist for a reason -- one such reason is knowing when and why your processes are dying even if they're spread across a cluster.
NoSQL is NotWorthIt
People who have read my other posts likely know where I stand on this, but...
Redis, even though it's an in-memory database, has a virtual memory feature, where you can cap the amount of RAM it uses and have it spill the data over to disk. So, I threw 75GB of data at it, giving it a healthy amount of physical memory to keep hot keys in...
So you found out an in-memory database wasn't suitable when you have far more data than physical memory? Great test, there.
Redis was an unknown quantity...
Maybe so, but that wasn't terribly hard to guess.
Yes, maybe things could have been different if I used Cassandra or MongoDB...
So maybe you should've benchmarked a NoSQL database which is actually designed to solve the problem you're trying to solve? Just a thought.
especially if something like PostgreSQL can do the same job.
If PostgreSQL could do the same job, the current generation of NoSQL databases wouldn't have been invented. Unless something's changed, PostgreSQL can't scale beyond a single machine for writes, unless you deliberately shard at the application layer, which would violate his rule about multiple datastores, wouldn't it?
It seems like the attitude is to no
Don't thank God, thank a doctor!
Don't do image processing work with PIL unless you have proven that command-line ImageMagick won't do the job.
I think the worst mistake I made as Mr. XEmacs was attempting to unify our graphics support to call ImageMagick libraries instead of the custom stuff we were using (and later restored when ImageMagick was backed out).
Does it work any better now? The last time I looked at display(1) a couple of years ago, it still wasn't close to long lost and patent challenged xv(1) that got shut down by the GIF patent war.
So, where exactly did you get that information from and why do you think it is real?
I'm not, no. But some people have, and they may be reading this right now. How do you know you're not addressing the next Bill Gates?
Anyway, your advice is way off. The people with the most _social_ success are raging egotists and shameless self-promoters. There's right ways and wrong ways to do it, and the wrong ways lead to you being written off as a pompous ass -- but not being such will never lead to social success.
Hell, I got an N900 6 months ago, and it's already EOL'd as far as updates to the OS are concerned.
I suppose I could wait till there's a port of Android or something, because the three coders on the project are doing a fine job, in less than a year I'll have the same functionality as a 3210.
Or you can read a little farther and see that upgrades are already available: http://en.wikipedia.org/wiki/MeeGo
The code in question (9 lines actually) is:
Ironically, making your code less object oriented (ie. screw encapsulation) fixes this problem.