App Developers Spend Too Much Time Debugging Errors in Production Systems (betanews.com)

← Back to Stories (view on slashdot.org)

App Developers Spend Too Much Time Debugging Errors in Production Systems (betanews.com)

Posted by msmash on Thursday November 3, 2016 @08:00AM from the not-sure-if-bad-thing dept.

According to a new study, 43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production. BetaNews adds: The survey carried out by ClusterHQ found that a quarter of respondents report encountering bugs discovered in production one or more times per week. Respondents were also asked to identify the most common causes of bugs. These were, inability to fully recreate production environments in testing (33 percent), interdependence on external systems that makes integration testing difficult (27 percent) and testing against unrealistic data before moving into production (26 percent). When asked to identify the environment in which bugs are most costly to fix, 62 percent selected production as the most expensive stage of app development to fix errors, followed by development (18 percent), staging (seven percent), QA (seven percent) and testing (six percent).

3 of 167 comments (clear)

Min score:

Reason:

Sort:

Been there, done that as an intern... by __aaclcg7560 · 2016-11-03 08:23 · Score: 4, Interesting

I did a six-month contract as an software tester internship after college, where I came across a crash bug on the test server that I could reproduced 100% of the time. My supervisor could not reproduced the bug, and approved the patch for production server. The production server crashed immediately from the patch. Engineers determined that a major code rewrite was required to fix the underlying problem. The production server was offline for three days and cost the company $250K in lost revenue. My contract wasn't renewed, one-third of the division got laid off after I left, and further budget cuts doomed the project. As for my supervisor, he got promoted into management.
Re:inability to fully recreate production environm by zifn4b · 2016-11-03 08:52 · Score: 3, Interesting
That's not the most prevalent issue. The main issue is the malpractice of Agile methodologies. What happens when you jam a 2 week task into a 1 week time box? Corners get cut in the code, the unit tests, QA test plans and technical debt accrues creating unpredictable results when someone changes brittle code in the future. Most companies are not interested investing in REAL environments and continuous delivery pipelines with:
- - Adequate infrastructure
- - Adequate workstation and tools
- - Adequate product training
- - Reasonable time to do the work
- - Reasonably well-defined work
- - Development best practices: code reviews, unit tests, testing in general (yes dev's it's also your responsibility to test, you don't just throw your crap over the wall)
- - Automatic builds either nightly or on commit with automatic unit and integration tests using Bamboo/Jenkins/whatever, perhaps even usage of source control at all!
- - Investment in some type of test case database like TestRail or Zephyr so you actually know what your software is expected to do and it can actually evolve over time. This can replace traditional test plans that people put in Confluence that become stale almost immediately and lose value.
- - Good documentation
All of this takes a lot of effort and you don't get it for free running around like a chicken with your head cut-off. Ignore it and you reap what you sow especially in larger scale software efforts.
--
We'll make great pets
Re:inability to fully recreate production environm by ghoul · 2016-11-03 10:51 · Score: 4, Interesting

Where I used to work - big telco software firm whose software generates 80% of the phone bills in the US we had a simple solution to the problem of testing to scale.
We had two identical setups one for production and one for staging. After UAT was almost over we would deploy to staging and then continue UAT on the staging with real world data till the day of cutover (Use Oracle Active-Passive to keep both in sync for the production data while not copying over UAT data to prod)
On cutover day we would change the network switch to now point to the new setup and run scripts to delete the dat created by UAT.
The nice part was now the Prod setup (a bank of 8 servers with 4 quad core CPUS each) was now our backup machine. We would switch it to passive and continue to keep it in sync with prod for at least 7 days. If something horrible went wrong with the new setup. Changing back to the earlier prod machine was a network switch flip. The scripts were a little more difficult this time over especially if the software bug had messed up the data but it was still easy.
Once a production was stable the old prod was now used as staging for the next prod.
What this meant is we did UAT on machines with identical config as the prod machines . It solved a lot of issues and since we also used the machines as the prod backup machine during cutover the cost was taken from the operations budget and not the testing budget.
Our System test and UAT environments were almost as good but not as good as prod and most testing and UAT was done there but the last batch of UAT on the big iron gave good confidence and made cutover day a lot less stressfull than it used to be.

--
**Life is too short to be serious**