Slashdot Mirror


App Developers Spend Too Much Time Debugging Errors in Production Systems (betanews.com)

According to a new study, 43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production. BetaNews adds: The survey carried out by ClusterHQ found that a quarter of respondents report encountering bugs discovered in production one or more times per week. Respondents were also asked to identify the most common causes of bugs. These were: inability to fully recreate production environments in testing (33 percent), interdependence on external systems that makes integration testing difficult (27 percent), and testing against unrealistic data before moving into production (26 percent). When asked to identify the environment in which bugs are most costly to fix, 62 percent selected production as the most expensive stage of app development at which to fix errors, followed by development (18 percent), staging (seven percent), QA (seven percent) and testing (six percent).

12 of 167 comments

  1. No surprise by tomhath · · Score: 3, Insightful

    43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production

    That seems like an odd metric, but it doesn't surprise me. Production support has always been expensive, especially if you can't create a full production-like environment with real-world data and stupid users to test with.

  2. inability to fully recreate production environment by MooseTick · · Score: 5, Insightful

    This is due to finance cheaping out and not allowing the purchase of an exact "test" system to work on. Also, the rush to production is often more important than checking to be sure it all works.

    That said, it's all a risk/reward thing. Maybe it's often better to screw up production here and there than to spend tons of money and time on testing. It all depends on whether you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?

  3. Most common causes of bugs? by Anonymous Coward · · Score: 4, Insightful

    How is "management telling people to put it into production as soon as the basic functionality works" not one of the common causes of bugs? At almost every job I've worked at, QA and Engineering would say "We need this much time to test and fix bugs before launch", and management would say "Too bad! Sales already told someone we're launching tomorrow, so we're going live with whatever we have then!"

    It isn't the lack of a good test environment, or good test data, it's being told by management that you aren't going to have any time to test...

  4. Been there, done that as an intern... by __aaclcg7560 · · Score: 4, Interesting

    I did a six-month contract as a software tester intern after college, where I came across a crash bug on the test server that I could reproduce 100% of the time. My supervisor could not reproduce the bug and approved the patch for the production server. The production server crashed immediately after the patch went in. Engineers determined that a major code rewrite was required to fix the underlying problem. The production server was offline for three days and cost the company $250K in lost revenue. My contract wasn't renewed, one-third of the division got laid off after I left, and further budget cuts doomed the project. As for my supervisor, he got promoted into management.

    1. Re:Been there, done that as an intern... by Altrag · · Score: 3, Informative

      That's not always as easy as it sounds. If there were data conversions involved, for example, the previous stable build may not even run anymore and would require restoring everything from backup, which may well be a many-hour project in itself (and may take time away from fixing the issue if it's a small-to-mid-size company that recycles people into multiple roles; programmer/IT services is a frequent combination at the best of times). Then, just as you're done, the fix is completed and you have to turn around and re-convert.

      Never mind the fun of the programmers telling you "it'll just be another 2 hours" for 18 hours straight, because issues in software tend to branch out in ways that nobody thinks about or remembers and can't include in their estimates until their nose is already in the code and it's looking them in the face.

  5. Re:Never can though by Cro Magnon · · Score: 3, Informative

    That brings back memories:

    Me: "It works for me"
    Production: "It gives me this error"
    Me: "Can you show me the data?"
    Prod: "It was in Missouri's data for 2014"
    Me: "It still works. Can you show me a screenprint of your data?"
    Prod: "I'm using this dataset"
    Me: "I don't have access to that (expletive unsaid) dataset. Can you show me a (more unsaid stuff) screenprint??"
    Prod: *mumbles something about privacy*
    Me: *thinks about shooting someone*

    --
    Slow down, cowboy! It has been 4 hours since you last posted. You must wait another few hours.
  6. Re:inability to fully recreate production environm by zifn4b · · Score: 3, Interesting

    That's not the most prevalent issue. The main issue is the malpractice of Agile methodologies. What happens when you jam a 2-week task into a 1-week time box? Corners get cut in the code, the unit tests, and the QA test plans, and technical debt accrues, creating unpredictable results when someone changes brittle code in the future. Most companies are not interested in investing in REAL environments and continuous delivery pipelines with:

    • Adequate infrastructure
    • Adequate workstations and tools
    • Adequate product training
    • Reasonable time to do the work
    • Reasonably well-defined work
    • Development best practices: code reviews, unit tests, testing in general (yes, devs, it's also your responsibility to test; you don't just throw your crap over the wall)
    • Automatic builds, either nightly or on commit, with automatic unit and integration tests using Bamboo/Jenkins/whatever, perhaps even the use of source control at all!
    • Investment in some type of test case database like TestRail or Zephyr, so you actually know what your software is expected to do and it can evolve over time. This can replace traditional test plans that people put in Confluence, which become stale almost immediately and lose value.
    • Good documentation

    All of this takes a lot of effort, and you don't get it for free running around like a chicken with its head cut off. Ignore it and you reap what you sow, especially in larger-scale software efforts.

    --
    We'll make great pets
  7. I've solved this problem by Maxo-Texas · · Score: 3, Funny

    I wrote an awesome testing program that resolves the problem of differences between test and production, but I can't get it to run in a production environment.

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  8. Re:inability to fully recreate production environm by plopez · · Score: 3, Insightful

    I have never seen a methodology survive its first contact with sales.

    --
    putting the 'B' in LGBTQ+
  9. Re:inability to fully recreate production environm by ghoul · · Score: 4, Interesting

    Where I used to work, a big telco software firm whose software generates 80% of the phone bills in the US, we had a simple solution to the problem of testing at scale.

    We had two identical setups, one for production and one for staging. When UAT was nearly over, we would deploy to staging and then continue UAT on staging with real-world data until the day of cutover (we used Oracle Active-Passive to keep the production data in sync on both, while not copying UAT data over to prod).

    On cutover day we would flip the network switch to point to the new setup and run scripts to delete the data created by UAT.

    The nice part was that the old prod setup (a bank of 8 servers with 4 quad-core CPUs each) now became our backup machine. We would switch it to passive and keep it in sync with prod for at least 7 days. If something horrible went wrong with the new setup, changing back to the earlier prod machine was a network switch flip. The scripts were a little more difficult going in that direction, especially if the software bug had messed up the data, but it was still easy.

    Once the new production was stable, the old prod was used as staging for the next release.

    This meant we did UAT on machines with the same config as the prod machines. It solved a lot of issues, and since we also used the machines as the prod backup during cutover, the cost came out of the operations budget and not the testing budget.

    Our system test and UAT environments were almost, but not quite, as good as prod, and most testing and UAT was done there, but the last batch of UAT on the big iron gave good confidence and made cutover day a lot less stressful than it used to be.
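    The rotation described above is what is commonly called a blue-green deployment: two identical environments, a switch that decides which one is live, and the previous live box kept in sync as a rollback target. As a rough illustration only, here is a toy Python model of it. The class and method names, the "uat:" tagging of test records, and the in-memory "replication" are all hypothetical stand-ins for the real Oracle replication, network switch, and cleanup scripts.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    name: str
    role: str                                  # "production", "staging" or "standby"
    records: list[str] = field(default_factory=list)


class TwoEnvironmentRotation:
    """Toy model (hypothetical) of an identical prod/staging pair.

    Assumptions: records are plain strings, UAT records are tagged with a
    "uat:" prefix, and replication simply copies every live write to the
    passive box.
    """

    def __init__(self, prod: Environment, staging: Environment) -> None:
        self.envs = {prod.name: prod, staging.name: staging}
        self.live = prod.name                  # where the "network switch" points

    @property
    def passive(self) -> Environment:
        return next(e for n, e in self.envs.items() if n != self.live)

    def write_production(self, record: str) -> None:
        # Active-passive: live writes are replicated to the passive box.
        self.envs[self.live].records.append(record)
        self.passive.records.append(record)

    def write_uat(self, record: str) -> None:
        # UAT data lands only on the passive box and is never copied to prod.
        self.passive.records.append("uat:" + record)

    def cut_over(self) -> None:
        new_prod = self.passive
        # Cleanup step: drop data created during UAT before going live.
        new_prod.records = [r for r in new_prod.records if not r.startswith("uat:")]
        self.envs[self.live].role = "standby"  # old prod becomes the backup
        new_prod.role = "production"
        self.live = new_prod.name              # "flip the network switch"

    def roll_back(self) -> None:
        # Reverting is just pointing the switch back at the standby box.
        old_prod = self.passive
        self.envs[self.live].role = "staging"
        old_prod.role = "production"
        self.live = old_prod.name

    def promote_standby_to_staging(self) -> None:
        # Once the new prod is stable, the standby box is reused as staging.
        self.passive.role = "staging"
```

    Because replication always flows from whichever box is live to the other one, the same two methods cover both the pre-cutover UAT period and the post-cutover backup period; only the `live` pointer changes.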

    --
    **Life is too short to be serious**
  10. Re:Exponential... by ghoul · · Score: 3, Funny

    Design? Testing? This is the Scrum way !!!! We only have requirements and code and documentation is for pussies.

    --
    **Life is too short to be serious**
  11. Something is missing here by cerberusss · · Score: 4, Informative

    App developer here.

    Something is missing here; namely we spend more time debugging issues found in production, because they get reported. Almost every app nowadays has a crash logger that reports all crashes. Libraries like Twitter's Crashlytics are awesome like that. You get all crashes reported to you, including a ring buffer of the last 100 log messages. It's really, really awesome and I've solved problems in production that wouldn't ever be found normally.
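    The "last 100 log messages" feature is essentially a fixed-size ring buffer that gets attached to the crash report. Below is a minimal sketch of that general idea using Python's standard `logging` module; it is not how Crashlytics itself works, and `RingBufferHandler`, `install_crash_reporter`, and the `send_report` callback are hypothetical names. A `collections.deque` with `maxlen` discards its oldest entry once full, which gives the ring-buffer behavior for free.

```python
import logging
import sys
import traceback
from collections import deque


class RingBufferHandler(logging.Handler):
    """Hypothetical handler: keeps only the most recent `capacity` log lines."""

    def __init__(self, capacity: int = 100) -> None:
        super().__init__()
        # A deque with maxlen silently drops its oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(self.format(record))

    def snapshot(self) -> list:
        """Return the buffered lines, oldest first."""
        return list(self.buffer)


def install_crash_reporter(handler: RingBufferHandler, send_report) -> None:
    """Hypothetical hook: on an uncaught exception, pass the traceback plus the
    recent log lines to `send_report`, then defer to the previous hook."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc, tb):
        send_report({
            "exception": "".join(traceback.format_exception(exc_type, exc, tb)),
            "recent_logs": handler.snapshot(),
        })
        previous_hook(exc_type, exc, tb)

    sys.excepthook = hook
```

    In use, you would add a `RingBufferHandler` to your application's logger, call `install_crash_reporter` once at startup with a function that uploads the report, and then any uncaught exception ships with the last 100 log lines that led up to it. Memory use stays bounded no matter how long the app runs.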

    --
    8 of 13 people found this answer helpful. Did you?