App Developers Spend Too Much Time Debugging Errors in Production Systems (betanews.com)
According to a new study, 43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production. BetaNews adds: The survey carried out by ClusterHQ found that a quarter of respondents report encountering bugs discovered in production one or more times per week. Respondents were also asked to identify the most common causes of bugs. These were: inability to fully recreate production environments in testing (33 percent), interdependence on external systems that makes integration testing difficult (27 percent), and testing against unrealistic data before moving into production (26 percent). When asked to identify the environment in which bugs are most costly to fix, 62 percent selected production as the most expensive stage of app development to fix errors, followed by development (18 percent), staging (seven percent), QA (seven percent) and testing (six percent).
43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production
That seems like an odd metric, but it doesn't surprise me. Production support has always been expensive, especially if you can't create a full production-like environment with real-world data and stupid users to test with.
This is due to finance cheaping out and not allowing the purchase of an exact "test" system to work on. Also, the rush to production is often more important than checking to be sure it all works.
That said, it's all a risk/reward thing. Maybe it's often better to screw up production here and there than to spend tons of money and time on testing. It all depends on whether you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?
Ninjas don't carry tic tacs
How is "management telling people to put it into production as soon as the basic functionality works" not one of the common causes of bugs? At almost every job I've worked at, QA and Engineering would say "We need this much time to test and fix bugs before launch", and management would say "Too bad! Sales already told someone we're launching tomorrow, so we're going live with whatever we have then!"
It isn't the lack of a good test environment, or good test data, it's being told by management that you aren't going to have any time to test...
I spent some time working QA on a carrier-level system that was being developed for what was at the time Cingular. The biggest problem was that the investors who propped up the company wanted it to ship as soon as absolutely possible, so the company could go from a money-sink to a money-producer for them. Our investor was some heir to a fortune that was made in chemicals back in the day; he didn't really know anything about the technology of telco-grade communications systems. He was ill-qualified to even know if his money was going toward producing something functional.
The idea (basically take in anything, process it for meaning, and then turn around and convert and resend, or else store and notify) was a good one, and at the time there wasn't really anything else on the market doing this at carrier grade. The problem was that, while the central core of the product was reasonably well written, so many input/output daemons and filters were just garbage. The rush to get the product making money meant it shipped well before it was ready, and in the end it became the only sale that the company had.
A couple of years later the whole product/project could've been had for something like $200,000. They'd sold the only production system for more like $1,000,000.
Do not look into laser with remaining eye.
I did a six-month contract as a software tester intern after college, where I came across a crash bug on the test server that I could reproduce 100% of the time. My supervisor could not reproduce the bug, and approved the patch for the production server. The production server crashed immediately from the patch. Engineers determined that a major code rewrite was required to fix the underlying problem. The production server was offline for three days and cost the company $250K in lost revenue. My contract wasn't renewed, one-third of the division got laid off after I left, and further budget cuts doomed the project. As for my supervisor, he got promoted into management.
That brings back memories:
Me: "It works for me"
Production: "It gives me this error"
Me: "Can you show me the data"
Prod: "It was in Missouri's data for 2014"
Me: "It still works. Can you show me a screenprint of your data?"
Prod: "I'm using this dataset"
Me: "I don't have access to that (expletive unsaid) dataset. Can you show me a (more unsaid stuff) screenprint??"
Prod: *mumbles something about privacy*
Me: *thinks about shooting someone*
That's not the most prevalent issue. The main issue is the malpractice of Agile methodologies. What happens when you jam a 2-week task into a 1-week time box? Corners get cut in the code, the unit tests, and the QA test plans, and technical debt accrues, creating unpredictable results when someone changes brittle code in the future. Most companies are not interested in investing in REAL environments and continuous delivery pipelines with:
All of this takes a lot of effort, and you don't get it for free running around like a chicken with your head cut off. Ignore it and you reap what you sow, especially in larger-scale software efforts.
We'll make great pets
I wrote an awesome testing program that resolves the problem of differences between test and production, but I can't get it to run in a production environment.
She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
We get hung up on developer costs but never on rework and fix costs. There is constant pressure to deliver untested features to make sales but never much accounting for customers who will walk at the first opportunity or sales which get cancelled due to bugs.
And it has never changed. Waterfall, Six Sigma, kanban, agile, rapid prototyping, DevOps, etc. have not made a difference. I have seen no improvement at all over close to 30 years. And people wonder why I drink.
putting the 'B' in LGBTQ+
I have never seen a methodology survive its first contact with sales.
putting the 'B' in LGBTQ+
It all depends if you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?
For the Mars mission:
a) about 186mph
b) no
http://www.space.com/34472-exo...
If you post it, they will read.
Even if you can reproduce all of the hardware exactly, you are never going to get the same kinds of results that putting software in the hands of real users will get you.
There are different kinds of bugs, which is why you have different kinds of systems and testing environments.
A dev should have an isolated environment in which to test the various parts. Each part should have a sufficient emulation of the external parts it depends on to allow its own unit and functional testing. From there, several parts should be integrated at a time to do functional and integration testing, eventually building up to the entire system being fully integrated and using emulated externals (e.g., external auth emulation) so the system can itself run in isolation. This gets to 95% of the issues.
Next comes scalability, for which the operations team should provide environments sufficient for scale testing, so stuff can be tested at sufficient scale before it hits production.
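For what it's worth, the "emulated externals" idea can be sketched in a few lines. This is only a minimal illustration, assuming the application reaches the auth service through a small interface; `AuthClient`, `FakeAuthClient`, and `handle_request` are hypothetical names, not anything from a real system.

```python
from typing import Protocol


class AuthClient(Protocol):
    """Hypothetical interface the application code depends on."""

    def verify(self, token: str) -> bool: ...


class FakeAuthClient:
    """In-memory stand-in for the external auth service (assumption)."""

    def __init__(self, valid_tokens: set[str]) -> None:
        self._valid_tokens = set(valid_tokens)

    def verify(self, token: str) -> bool:
        # No network call: the fake only checks a local set of tokens.
        return token in self._valid_tokens


def handle_request(auth: AuthClient, token: str, payload: str) -> str:
    """Toy handler: rejects unknown tokens, echoes the payload otherwise."""
    if not auth.verify(token):
        return "403 forbidden"
    return f"200 ok: {payload}"


# The part under test runs in isolation against the emulated external.
fake = FakeAuthClient({"good-token"})
ok_response = handle_request(fake, "good-token", "hello")
denied_response = handle_request(fake, "bad-token", "hello")
```

Because the handler only sees the interface, the same code path can later be pointed at the real service for integration testing.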
Now, that doesn't mean you won't end up with issues in production, but that it should be a rare thing for that to happen. In those rare cases you may have to test in production, but that should be the exception, not the rule.
Too often we don't invest in all the different levels of testing because (a) devs are lazy, and (b) management cheaps out. However, doing all the layers of testing will be cheaper in the end, since things will be caught earlier, when they are cheaper and faster to fix.
Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
I've run into that as well. I pointed out that the problem might be specific to the personal information in that record, and suggested obfuscating the personal information, but not the other data, so the case could be handed over for reproduction. This will usually pinpoint the cause; if the error still cannot be reproduced, it is most likely tied to the specific personal data that was obfuscated.
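A rough sketch of that kind of obfuscation, assuming the records are plain dictionaries and that you already know which fields are personal (the field names and the `obfuscate` helper here are made up for illustration):

```python
import hashlib

# Hypothetical set of fields treated as personal data.
PERSONAL_FIELDS = {"name", "ssn", "email"}


def obfuscate(record: dict, personal_fields=PERSONAL_FIELDS) -> dict:
    """Return a copy with personal fields replaced by a stable digest."""
    masked = {}
    for key, value in record.items():
        if key in personal_fields:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            # Same input always maps to the same placeholder, so joins
            # and duplicate checks still behave as they did in production.
            masked[key] = "masked-" + digest[:12]
        else:
            masked[key] = value
    return masked


record = {"name": "Jane Doe", "ssn": "000-00-0000", "state": "MO", "year": 2014}
safe = obfuscate(record)
```

Note that a plain hash of a low-entropy value (like an SSN) can be brute-forced, so this is a debugging aid under that assumption, not real anonymization.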
Where I used to work (a big telco software firm whose software generates 80% of the phone bills in the US), we had a simple solution to the problem of testing to scale.
We had two identical setups, one for production and one for staging. After UAT was almost over, we would deploy to staging and then continue UAT on staging with real-world data until the day of cutover (using Oracle Active-Passive to keep both in sync for the production data while not copying UAT data over to prod).
On cutover day we would change the network switch to point to the new setup and run scripts to delete the data created by UAT.
The nice part was that the old prod setup (a bank of 8 servers with 4 quad-core CPUs each) was now our backup machine. We would switch it to passive and continue to keep it in sync with prod for at least 7 days. If something horrible went wrong with the new setup, changing back to the earlier prod machine was a network switch flip. The scripts were a little more difficult this time around, especially if the software bug had messed up the data, but it was still easy.
Once the new production was stable, the old prod was used as staging for the next prod.
What this meant is that we did UAT on machines with a config identical to the prod machines. It solved a lot of issues, and since we also used the machines as the prod backup during cutover, the cost was taken from the operations budget and not the testing budget.
Our system test and UAT environments were almost as good as prod, but not quite, and most testing and UAT was done there. But the last batch of UAT on the big iron gave good confidence and made cutover day a lot less stressful than it used to be.
**Life is too short to be serious**
I think it's not time, money, quality.
The iron triangle is time, money, scope. You can increase or decrease one by changing the other 2. But if you try to reduce one without changing the other 2, the iron triangle breaks open and the magic smoke inside the triangle, which is quality, escapes. Once it escapes, you can't get it back in even if you close the triangle.
**Life is too short to be serious**
Design? Testing? This is the Scrum way!!!! We only have requirements and code, and documentation is for pussies.
**Life is too short to be serious**
Slow down your production process so you have time to catch them?
That causes end users to choose a competitor's software with tolerable defects over your unfinished vaporware.
Do without the fancies so there are fewer vulnerability points?
That causes end users who rely on "the fancies" to choose a competitor's software that offers "the fancies".
App developer here.
Something is missing here: namely, we spend more time debugging issues found in production because they get reported. Almost every app nowadays has a crash logger that reports all crashes. Libraries like Twitter's Crashlytics are awesome like that. You get all crashes reported to you, including a ring buffer of the last 100 log messages. It's really, really awesome, and I've solved problems in production that wouldn't ever be found normally.
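If you haven't seen one, the ring buffer part is simple to picture. This is a generic sketch using Python's `collections.deque`, not Crashlytics' actual API; `CrashLogBuffer` and its methods are hypothetical names.

```python
from collections import deque


class CrashLogBuffer:
    """Keeps only the most recent log messages (hypothetical helper)."""

    def __init__(self, capacity: int = 100) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._messages = deque(maxlen=capacity)

    def log(self, message: str) -> None:
        self._messages.append(message)

    def snapshot(self) -> list:
        """Return the retained messages, oldest first, for a crash report."""
        return list(self._messages)


buffer = CrashLogBuffer(capacity=100)
for i in range(250):
    buffer.log(f"event {i}")

# After 250 messages, only "event 150" through "event 249" remain.
report = buffer.snapshot()
```

Memory use stays bounded no matter how long the app runs, which is why this shape suits a crash logger.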
They've been doing it for years, and I find it fascinating how easy it is to rebuff most of the claims. But I think it shows the industry is just really poor at executing it and ends up with Fragile instead.
Change is certain; progress is not obligatory.