Slashdot Mirror


App Developers Spend Too Much Time Debugging Errors in Production Systems (betanews.com)

According to a new study, 43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production. BetaNews adds: The survey carried out by ClusterHQ found that a quarter of respondents report encountering bugs discovered in production one or more times per week. Respondents were also asked to identify the most common causes of bugs. These were the inability to fully recreate production environments in testing (33 percent), interdependence on external systems that makes integration testing difficult (27 percent) and testing against unrealistic data before moving into production (26 percent). When asked to identify the environment in which bugs are most costly to fix, 62 percent selected production as the most expensive stage of app development to fix errors, followed by development (18 percent), staging (seven percent), QA (seven percent) and testing (six percent).

24 of 167 comments

  1. No surprise by tomhath · · Score: 3, Insightful

    43 percent of app developers spend between 10 and 25 percent of their time debugging application errors discovered in production

    That seems like an odd metric, but it doesn't surprise me. Production support has always been expensive. Especially if you can't create a full production-like environment with real world data and stupid users to test with.

    1. Re:No surprise by lgw · · Score: 2

      As a (fairly large) devops team, we probably spend 1/3 of our time in "production support", from bug investigation to various kinds of automation. It never seems to end.

      One thing we don't do though, unless there's no other way, is "debug in production". If anything seems off, you roll back, no question. (And if it's not a recent deployment, you should know that very quickly, too, from logs and metrics.) Figuring out exactly what went wrong can wait; reverting the change before it becomes a customer-visible outage is everything.

      The only reason to be mucking about in prod in any sort of interactive way is when a change does lasting damage before you can revert it, and you're scrambling to fix the damage. That's the worst place to be. Fortunately, that's pretty rare here.

      Trying to get meaningful testing before production for us is a matter of cleverness with mocking or simulation of one kind or another. It would be far too expensive to test at some representative scale. You test what you can, but that's limited. Which means we spend a lot of time on early warning systems and automation for gradual deployment with automated rollback. We don't quite spend as much time on that crap as we do on the actual product, but it's close.
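
      A minimal sketch of gradual deployment with automated rollback, in Python. Everything here is an assumption made for illustration: the stage percentages, the 1% error threshold, and the `error_rate_at` hook, which stands in for whatever metrics system a team actually has.

```python
from typing import Callable, Sequence


def gradual_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> tuple[str, int]:
    """Widen a deployment stage by stage; roll back on the first bad reading.

    stages: percentages of traffic sent to the new build, in increasing order.
    error_rate_at: hypothetical hook returning the observed error rate
        (0.0 to 1.0) once the given percentage is live.
    Returns ("deployed", last_stage) or ("rolled_back", failing_stage).
    """
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Revert first; investigating the cause can wait.
            return ("rolled_back", percent)
    return ("deployed", stages[-1] if stages else 0)


# A healthy build reaches 100%; a build that fails at 10% never gets to 50%.
healthy = gradual_rollout([1, 10, 50, 100], lambda percent: 0.001)
broken = gradual_rollout([1, 10, 50, 100], lambda p: 0.2 if p >= 10 else 0.001)
```

      The point of the design is that the decision to revert is made by the loop, not by a person reading dashboards, so a bad change is contained at the smallest stage where it shows up.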

      --
      Socialism: a lie told by totalitarians and believed by fools.
    2. Re:No surprise by Tesen · · Score: 2

      Most developers seem to forget that companies don't make software so it is comfortable for the developers. Most developers argue with the Product Managers that certain new features should not be done, because they are too much of a hassle or go against a policy in the development team. Most developers forget that the company makes the software to sell, and have a profit in the process. Most developers should do their job faster and with better quality and stop arguing with their Product Managers.

      Geez I hope that was parody.

    3. Re:No surprise by I4ko · · Score: 2

      I am not trolling, I just work in a dysfunctional organization - just this morning I finished a conversation with my development team, who refused to implement a feature and a small change in the QA process that is required by regulation, because it was inconvenient for them and adds 5 minutes to each end-to-end test case.
      Without that feature, there is no certification, without certification there is no go to market, without go to market there is no sales, without sales there is no income for the company and by extension no salary for said developers.
      That is what I mean when I say that we don't make the software just to be comfortable for developers, we actually make the software to sell and profit.

  2. inability to fully recreate production environment by MooseTick · · Score: 5, Insightful

    This is due to finance cheaping out and not allowing the purchase of an exact "test" system to work on. Also, the rush to production is often more important than checking to be sure it all works.

    That said, it's all a risk/reward thing. Maybe it's often better to screw up production here and there than to spend tons of money and time on testing. It all depends if you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?

  3. Most common causes of bugs? by Anonymous Coward · · Score: 4, Insightful

    How is "management telling people to put it into production as soon as the basic functionality works" not one of the common causes of bugs? At almost every job I've worked at, QA and Engineering would say "We need this much time to test and fix bugs before launch", and management would say "Too bad! Sales already told someone we're launching tomorrow, so we're going live with whatever we have then!"

    It isn't the lack of a good test environment, or good test data, it's being told by management that you aren't going to have any time to test...

  4. Re:inability to fully recreate production environm by TWX · · Score: 2

    I spent some time working QA on a carrier-level system that was being developed for what was at the time Cingular. The biggest problem is that the investors that propped up the company wanted it to ship absolutely as soon as possible, so the company could go from a money-sink to a money-producer for them. Our investor was some heir to a fortune that was made in chemicals back in the day; he didn't really know anything about the technology of telco-grade communications systems. He was ill-qualified to even know if his money was going to producing something functional.

    The idea (basically take in anything, process it for meaning, and then turn around and convert and resend or else store and notify) was a good one and at the time there wasn't really anything else on the market doing this at carrier-grade. The problem was, while the central core of the product was reasonably well written, so many input/output daemons and filters were just garbage. The rush to get the product making money meant it shipped well before it was ready, and in the end it became the only sale that the company had.

    A couple of years later the whole product/project could've been had for something like $200,000. They'd sold the only production system for more like $1,000,000.

    --
    Do not look into laser with remaining eye.
  5. Been there, done that as an intern... by __aaclcg7560 · · Score: 4, Interesting

    I did a six-month contract as a software tester intern after college, where I came across a crash bug on the test server that I could reproduce 100% of the time. My supervisor could not reproduce the bug, and approved the patch for the production server. The production server crashed immediately from the patch. Engineers determined that a major code rewrite was required to fix the underlying problem. The production server was offline for three days and cost the company $250K in lost revenue. My contract wasn't renewed, one-third of the division got laid off after I left, and further budget cuts doomed the project. As for my supervisor, he got promoted into management.

    1. Re:Been there, done that as an intern... by __aaclcg7560 · · Score: 2

      This, like every other "I was an intern who saved the world" story, has more than meets the eye.

      I work in IT. I save the world every day.

      Like, why didn't they simply revert to the previous stable build?

      It was a virtual world. Going back wasn't an option.

    2. Re:Been there, done that as an intern... by Altrag · · Score: 3, Informative

      That's not always as easy as it sounds. If there were data conversions involved, for example, the previous stable build may not even run anymore and would require restoring everything from backup, which may well be a many-many-hour project in itself -- and possibly taking time away from fixing the issue if it was a small-to-mid size company that recycles people into multiple roles (and programmer/IT services is a frequent combination at the best of times.) Just in time to turn around and have to re-convert as soon as you're done because the fix has been completed.

      Never mind the fun of the programmers telling you "it'll just be another 2 hours" for 18 hours straight because issues in software tend to branch out in ways that nobody thinks about/remembers and can't include in their estimates until their nose is already in the code and it's looking them in the face.

  6. Re:Never can though by Cro+Magnon · · Score: 3, Informative

    That brings back memories:

    Me: "It works for me"
    Production: "It gives me this error"
    Me: "Can you show me the data"
    Prod: "It was in Missouri's data for 2014"
    Me: "It still works. Can you show me a screenprint of your data?"
    Prod: "I'm using this dataset"
    Me: "I don't have access to that (expletive unsaid) dataset. Can you show me a (more unsaid stuff) screenprint??"
    Prod: *mumbles something about privacy*
    Me: *thinks about shooting someone*

    --
    Slow down, cowboy! It has been 4 hours since you last posted. You must wait another few hours.
  7. Re:inability to fully recreate production environm by zifn4b · · Score: 3, Interesting

    That's not the most prevalent issue. The main issue is the malpractice of Agile methodologies. What happens when you jam a 2 week task into a 1 week time box? Corners get cut in the code, the unit tests, QA test plans and technical debt accrues creating unpredictable results when someone changes brittle code in the future. Most companies are not interested in investing in REAL environments and continuous delivery pipelines with:

    • Adequate infrastructure
    • Adequate workstation and tools
    • Adequate product training
    • Reasonable time to do the work
    • Reasonably well-defined work
    • Development best practices: code reviews, unit tests, testing in general (yes devs, it's also your responsibility to test, you don't just throw your crap over the wall)
    • Automatic builds either nightly or on commit with automatic unit and integration tests using Bamboo/Jenkins/whatever, perhaps even usage of source control at all!
    • Investment in some type of test case database like TestRail or Zephyr so you actually know what your software is expected to do and it can actually evolve over time. This can replace traditional test plans that people put in Confluence that become stale almost immediately and lose value.
    • Good documentation

    All of this takes a lot of effort and you don't get it for free running around like a chicken with your head cut off. Ignore it and you reap what you sow, especially in larger scale software efforts.

    --
    We'll make great pets
  8. I've solved this problem by Maxo-Texas · · Score: 3, Funny

    I wrote an awesome testing program that resolves the problem of differences between test and production but I can't get it to run in a production environment.

    --
    She was like chocolate when she drank... semi-sweet at first and then increasingly bitter.
  9. The problem is bad accounting practices. by plopez · · Score: 2

    We get hung up on developer costs but never on rework and fix costs. There is constant pressure to deliver untested features to make sales but never much accounting for customers who will walk at the first opportunity or sales which get cancelled due to bugs.

    And it has never changed. Waterfall, Six Sigma, kanban, agile, rapid prototyping, devops, etc. have not made a difference. I have seen no improvement at all over close to 30 years. And people wonder why I drink.

    --
    putting the 'B' in LGBTQ+
  10. Re:inability to fully recreate production environm by plopez · · Score: 3, Insightful

    I have never seen a methodology survive its first contact with sales.

    --
    putting the 'B' in LGBTQ+
  11. Re:inability to fully recreate production environm by jlowery · · Score: 2

    It all depends if you're building software for a web site or a Mars mission. What is the impact of a failure, and is it recoverable?

    For the Mars mission:
    a) about 186mph
    b) no

    http://www.space.com/34472-exo...

    --
    If you post it, they will read.
  12. Re:Never can though by TemporalBeing · · Score: 2

    Even if you can reproduce all of the hardware exactly, you are never going to get the same kinds of results that putting software in the hands of real users will get you.

    There are different kinds of bugs, which is why you have different kinds of systems and testing environments.

    A dev should be able to have an isolated environment in which to be able to test the various parts. Each part should be able to have a sufficient emulation of external parts to be able to have its own unit and functional testing. From there, several parts should be integrated at a time to do functional and integration testing, eventually building up to the entire system being fully integrated and using emulated externals (e.g. external auth emulation) so the system can itself run in isolation. This gets to 95% of the issues.
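
    As a rough sketch of what "external auth emulation" could look like, here is an in-memory stand-in in Python. The `authenticate()` interface, the `ReportService` part and the credentials are all made up for illustration; a real emulator would mirror whatever the actual external system exposes.

```python
class FakeAuthService:
    """In-memory stand-in for an external auth system (hypothetical interface)."""

    def __init__(self, users: dict[str, str]) -> None:
        self._users = dict(users)

    def authenticate(self, username: str, password: str) -> bool:
        # Unknown users and wrong passwords are both rejected.
        return self._users.get(username) == password


class ReportService:
    """Part under test; it only depends on something with authenticate()."""

    def __init__(self, auth) -> None:
        self._auth = auth

    def fetch_report(self, username: str, password: str) -> str:
        if not self._auth.authenticate(username, password):
            raise PermissionError("invalid credentials")
        return f"report for {username}"


# The part runs in isolation: no network, no real auth server.
auth = FakeAuthService({"alice": "s3cret"})
service = ReportService(auth)
report = service.fetch_report("alice", "s3cret")
```

    Because the part takes its auth dependency as a constructor argument, the same code can be wired to the fake in unit tests and to the real system once it is integrated.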

    From here is scalability - for which the operations team should be providing environments sufficient to do the scaling testing so stuff can be tested at sufficient scale before it hits production.

    Now, that doesn't mean you won't end up with issues in production, but that it should be a rare thing for that to happen. In those rare cases you may have to test in production, but that should be the exception, not the rule.

    Too often we don't invest in all the different levels of testing b/c (a) devs are lazy, and (b) management cheaps out. However, doing all the layers of testing will be cheaper in the end since things will be caught earlier where it's cheaper and faster to fix.

    --
    Truth is like the sun. You can shut it out for a time, but it ain't goin' away. - Elvis Presley (source: imdb.com)
  13. Re:Never can though by rholtzjr · · Score: 2

    I've run into that as well. My comment at the time was that the bug may be unique to the personal information pertaining to that "person". I suggested obfuscating the personal information, but not the other data, and then trying to reproduce. This will usually pinpoint the cause; if the error still cannot be reproduced, it is most likely attributable to the specific personal data that was obfuscated.
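
    A sketch of that kind of obfuscation in Python, assuming records arrive as dictionaries. The field names and the sample row are invented for the example.

```python
import hashlib

PERSONAL_FIELDS = {"name", "ssn", "email"}  # hypothetical field names


def obfuscate(record: dict, personal_fields=PERSONAL_FIELDS) -> dict:
    """Replace personal values with a stable token; keep other fields as-is."""
    masked = {}
    for key, value in record.items():
        if key in personal_fields:
            # Same input always maps to the same token, so duplicates and
            # joins between records still line up after masking.
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[key] = "x" + digest[:12]
        else:
            masked[key] = value
    return masked


row = {"name": "Jane Doe", "ssn": "123-45-6789", "state": "MO", "year": 2014}
safe_row = obfuscate(row)
```

    Note that an unsalted hash of something as guessable as an SSN is not real anonymization; this only shows the shape of the idea. It also changes the length and character set of the masked values, so a bug triggered by, say, an apostrophe in a name will disappear, which is exactly the "attributable to the personal data" case.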

  14. Re:inability to fully recreate production environm by ghoul · · Score: 4, Interesting

    Where I used to work (a big telco software firm whose software generates 80% of the phone bills in the US), we had a simple solution to the problem of testing to scale.

    We had two identical setups, one for production and one for staging. After UAT was almost over we would deploy to staging and then continue UAT on staging with real-world data until the day of cutover. (Use Oracle Active-Passive to keep both in sync for the production data while not copying UAT data over to prod.)

    On cutover day we would change the network switch to now point to the new setup and run scripts to delete the data created by UAT.

    The nice part was that the old prod setup (a bank of 8 servers with 4 quad-core CPUs each) was now our backup machine. We would switch it to passive and continue to keep it in sync with prod for at least 7 days. If something horrible went wrong with the new setup, changing back to the earlier prod machine was a network switch flip. The scripts were a little more difficult this time, especially if the software bug had messed up the data, but it was still easy.

    Once the new production was stable, the old prod was used as staging for the next prod.

    What this meant is we did UAT on machines with identical config as the prod machines. It solved a lot of issues and since we also used the machines as the prod backup machine during cutover the cost was taken from the operations budget and not the testing budget.

    Our system test and UAT environments were almost as good as prod, but not quite, and most testing and UAT was done there. Still, the last batch of UAT on the big iron gave good confidence and made cutover day a lot less stressful than it used to be.
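
    The role swap can be sketched as a tiny model in Python. The class and cluster names are invented, and the sketch deliberately leaves out the hard parts described above (keeping the databases in sync and cleaning up UAT data); it only shows why rollback is the same operation as cutover.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    release: str


class Cutover:
    """Tracks which of two identical clusters takes live traffic."""

    def __init__(self, live: Cluster, standby: Cluster) -> None:
        self.live = live
        self.standby = standby

    def stage(self, release: str) -> None:
        # Deploy the new release to the standby cluster for the last UAT round.
        self.standby.release = release

    def switch(self) -> None:
        # The "network switch flip": the two clusters trade roles.
        self.live, self.standby = self.standby, self.live

    rollback = switch  # flipping back is the same operation


site = Cutover(live=Cluster("A", "v1"), standby=Cluster("B", "v1"))
site.stage("v2")
site.switch()
after_cutover = (site.live.name, site.live.release)
site.rollback()
after_rollback = (site.live.name, site.live.release)
```

    After the switch, cluster A is untouched and still on the old release, which is what makes the flip back cheap.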

    --
    **Life is too short to be serious**
  15. Re:Time/Money/Quality competition by ghoul · · Score: 2

    I think it's not time, money, quality.

    The iron triangle is time, money, scope. You can increase or decrease one by changing the other 2. But if you try to reduce one without changing the other 2, the iron triangle breaks open and the magic smoke which is quality inside the triangle escapes, and once it escapes you can't get it back in even if you close the triangle.

    --
    **Life is too short to be serious**
  16. Re:Exponential... by ghoul · · Score: 3, Funny

    Design? Testing? This is the Scrum way !!!! We only have requirements and code and documentation is for pussies.

    --
    **Life is too short to be serious**
  17. And lose the bulk of sales to your competitors by tepples · · Score: 2

    Slow down your production process so you have time to catch them?

    That causes end users to choose a competitor's software with tolerable defects over your unfinished vaporware.

    Do without the fancies so there are less vulnerability points?

    That causes end users who rely on "the fancies" to choose a competitor's software that offers "the fancies".

  18. Something is missing here by cerberusss · · Score: 4, Informative

    App developer here.

    Something is missing here; namely we spend more time debugging issues found in production, because they get reported. Almost every app nowadays has a crash logger that reports all crashes. Libraries like Twitter's Crashlytics are awesome like that. You get all crashes reported to you, including a ring buffer of the last 100 log messages. It's really, really awesome and I've solved problems in production that wouldn't ever be found normally.
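
    For anyone who hasn't seen the ring-buffer idea, here is a minimal sketch using Python's standard `logging` and `collections.deque`. This is not how Crashlytics is implemented or what its API looks like; the handler class and logger name are made up for the example.

```python
import logging
from collections import deque


class RingBufferHandler(logging.Handler):
    """Keeps only the most recent `capacity` formatted log messages."""

    def __init__(self, capacity: int = 100) -> None:
        super().__init__()
        # A deque with maxlen silently drops the oldest entry when full.
        self.buffer = deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(self.format(record))


logger = logging.getLogger("app")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example quiet on the console
handler = RingBufferHandler(capacity=100)
logger.addHandler(handler)

for i in range(250):
    logger.info("step %d", i)

# On a crash, this is what would be attached to the report.
recent = list(handler.buffer)
```

    Memory use stays bounded no matter how long the app runs, and the report still carries the messages leading up to the crash.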

    --
    8 of 13 people found this answer helpful. Did you?
  19. Re:inability to fully recreate production environm by Ash-Fox · · Score: 2

    Why is it that since lately people who obviously have no clue about agile methods are bashing them on /.?

    They've been doing it for years, I find it fascinating how easy it is to rebuff most of the claims. But, I think it shows the industry is just really poor at executing it and end up with Fragile instead.

    --
    Change is certain; progress is not obligatory.