Dev vs. Ops: The State of Accountability (overops.com)
Here's an analysis by OverOps on how shared accountability affects the delivery of reliable software in a DevOps environment, and on some of the top challenges teams face when it comes to building and maintaining quality applications. Conclusion from the report [PDF], which relies on a survey of over 2,000 IT professionals around the globe: At the center of this DevOps adoption chaos is the evolving relationship between development and operations. Many organizations are already taking a shared approach to accountability for application health; however, they still lack the tools and application visibility needed to know who is ultimately responsible for addressing and fixing each issue. As the lines between these two teams continue to blur, organizations will need to focus on adopting tools that deepen visibility into their applications. Clarifying ownership of applications and services, and avoiding the "multiple owners = no owner" syndrome, is crucial for even the most bleeding-edge organizations.
The "Dev vs. Ops: State of Accountability" survey revealed that as more organizations begin the transition to DevOps workflows, defining roles and processes becomes more difficult and more important. Furthermore, businesses of all sizes are building and releasing new code and application features faster than ever before, which adds additional pressure across the entire software delivery supply chain. Organizations going through the DevOps transformation are more likely to face visibility challenges that make it difficult to maintain or improve application quality and reliability.
IT and the "movement disease".
I am sort of tired of this constant "revolution" garbage that surrounds the IT industry in general. I work in this shit industry, I am well paid for what I do, but one thing is always certain... It will always suck because everyone in charge of IT came from college where stupid is the only thing being taught when it comes to computer science.
There is never anything innovative being done. By the time I am done listening to a sales pitch, I realize I have heard all of this shit before: it's the same shit product emulating another shit product, surrounded by proprietary technology that works like shit, just in a different shitty way.
There is also the problem that every industry has... 20% of the folks do 80% of the work. Do you know what else tends to happen? 20% of the people are the only ones that know what to do or what is going on. Do you know what else? IT is not a meritocracy either... it is still the same brown-nosing, ass-licking, who-you-know path to success, like every other department. Those in the know are constantly assaulted by their lesser skilled and less capable "co-workers". Those in the know are constantly waiting for some other knob in a different department to do their own damn job. And all of this while management keeps not getting a fucking clue and piling on more and more work, to the point that more than 50% of projects fail because they never get the proper amount of time and money dedicated to them.
These bullshit "cultural DevOps, ITIL, Agile, Waterfall, blah blah blah" movements are all stupid ideas people keep coming up with to address the problem of an industry that is riddled with incompetent management trying to rule over an incompetent group of pseudo-intellectual nerds that know far less than they let on. And that is another problem as well... people hate IT personnel that do not sound overconfident. It is practically a requirement for IT pros to act like they know every fucking thing there is to know, and yet those of us at the top know different. We are all running around trying to figure out every little fucking thing on the fly, because experience has taught us to just roll up our fucking sleeves and work it out... regardless of whichever newfangled fucking "operation ideology" someone pushes.
Devops the concept is great; "devops" the buzzword is just adding more complexity with little or no benefit. The idea of devops is that the devs design their projects such that they maintain the operations themselves. This leads to the desirable outcome of a moral feedback loop that forces the devs to actually care. Without it, devs just throw the responsibility to operate over the wall, which divorces them from the consequences of poor design.
Without devops, devs tend to produce projects that are brittle and need to be micromanaged. Why would devs care if ops gets called at 1am on a Saturday? If the devs are getting called in the middle of the night, they'll start to care.
Ops should be limited to maintaining infrastructure services like VMs and creating tools that allow devs to deploy to prod on their own in a controlled way.
Ops should be limited to maintaining infrastructure services like VMs and creating tools that allow devs to deploy to prod on their own in a controlled way.
That's a valid approach for sure. What I've found in that case, however, is that taking Ops out of the loop and relying completely on developers for operational response to an incident in production yields poor results. Put simply, it's not their core competency, and most don't take to it well (although some do!). It's a very different type of pressure to which they're unaccustomed.
In my experience, a hybrid approach is best, where Ops personnel quarterback a production incident (or Support, assuming it's a mature org), and escalate as needed to development.
YMMV
Docker is getting more and more heavyweight, i.e. becoming full-blown VMs (though in the strict sense of the word they always were). Now part of the problem I see is that projects end up with containers scattered around like jackstraws, like the "DLL hell" many experienced in the past, or plug-in hell. All Docker does is allow the complexity to take a different form. Also, stateless containers in my experience are pretty useless. When working on back-end "heavy lifting" applications, somewhere you need to maintain state, and Docker and stateless containers, by definition, cannot do that.
Cloud is just putting the application somewhere else and paying by the cpu cycle. No different than before, it just makes it opaque as to what is going on and who is responsible. There's really not much new under the sun and good ideas are basically reinvented over time. Mainframe == cloud hosted apps on your mobi or browser. Nothing to see here, move along...
putting the 'B' in LGBTQ+
Dev Ops is an example of the willing, led by the unknowing, doing the impossible for the ungrateful.
Our Dev Ops team has adopted the slogan, "Delivering Yesterday's Technology Tomorrow!"
Just cruising through this digital world at 33 1/3 rpm...
I was our company's monitoring department and was checking systems and applications and it fits this question quite well, and you know what, IT SUCKS!
Between management that does not give two f**ks, developers who don't understand infrastructure, and systems administrators who cannot manage applications, no one wants to be on the hook for anything. Just TRYING to get them to fix issues without pointing a finger is a nightmare. It is like being the IRS: you never get a call from them saying you did a good job.
Everyone is afraid of looking bad because management, which does not understand IT or process, falls back on politics to address issues, and everyone else is afraid to make a move that may get them into trouble.
DevOps and Agile crap will not fix broken management.
Unless your project is very small or your management is stellar this just builds a road to disaster.
Developers who do not feel the pain of their actions are not incentivized to correct issues. In the beginning, there may be a few bright-eyed engineers who take the messages to heart, but eventually that gets lost in the sea of priorities. A number of things start to happen which can increase page count. It could be a hard-to-find bug that only produces a few escalations, but combined with a number of others like it, the issues can severely add up. Escalations keep increasing but don't get attention, because individually they are not severe enough and, well, the operators can handle those.
Perhaps the largest issue, only hinted at, is the brittle nature that tends to evolve. It may take the form of circular dependencies or poorly considered dependency trees. Oh, such as taking a dependency on a tier 2 service for your tier 1 service, or perhaps no one discussed the volume of traffic they would be placing on this other service. Those types of failures start to creep in when you disconnect yourself from your platform and multiple specialized teams start making changes to a cohesive whole. The reality, despite everyone's hopes, is that living in the service keeps you connected to the details. Personally, I witnessed this type of degradation occur so severely that I was likely one of maybe three people who understood the entire platform in an organization of hundreds.
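The two failure shapes named above (a tier 1 service leaning on a tier 2 service, and circular dependencies) are mechanical enough to check for. Here is a minimal sketch, assuming you can export a map of service tiers and a map of who calls whom; the service names and both function names are made up for illustration and do not come from any particular tool.

```python
def find_tier_inversions(tiers, deps):
    """Return (service, dependency) pairs where a service relies on a less critical tier.

    tiers: service name -> tier number (1 is the most critical).
    deps:  service name -> list of services it calls.
    """
    return sorted(
        (svc, dep)
        for svc, targets in deps.items()
        for dep in targets
        if tiers[dep] > tiers[svc]
    )


def find_cycle(deps):
    """Return one dependency cycle as a list of services, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we have looped back to it
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical platform: checkout (tier 1) calls reports (tier 2), which calls it back.
tiers = {"checkout": 1, "auth": 1, "reports": 2}
deps = {"checkout": ["auth", "reports"], "auth": [], "reports": ["checkout"]}
inversions = find_tier_inversions(tiers, deps)
cycle = find_cycle(deps)
```

A check like this only catches the structural problems. It says nothing about the traffic volume question, which still needs an actual conversation between the teams.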
The last service I helped build and design was built with the concept of "I don't want to be paged in the middle of the night." We didn't get it to that point by passing the micro issues to an oncall. No, the other teams did that, and many of our pages are the result of those teams' poor implementations or of them shifting work onto our successful team.
I've seen whole operations teams nearly up and quit after development teams were completely disconnected from their projects and ops shouldered the maintenance. It usually goes that way, or you get some very battered and dedicated people who eventually trickle out until the linchpin fails.
Pain is a great motivator to fix problems or at the very worst fix their poor alarms.
"You should always go to other people's funerals; otherwise, they won't come to yours." -- Yogi Berra
When I first heard about devops it was in the Ruby web dev community ~2007, and it referred to a role that was a liaison between the programmers and the sysadmins. Their job was to understand the dictates of the BOFH, and to help the programmers find a way to implement what they wanted in a way that was consistent with all the organizational rules. Mostly security or technology restrictions.
Then whoever decided to pay to hype it changed it to mean some sort of management service, a type of telemetry for the PHB to measure the team. It doesn't seem to even be a role on the team anymore, but instead a service where you pay consultants to spy on your team. I guess the idea is to externalize the resulting hate? It doesn't seem to be working.
Sysadmins mostly work at Amazon and Google and Microsoft now. So, devops is largely now the process of automating cloud services.
Bullshit. I've never known a dev not to care about maintainability and stability of their service. What devops causes is burnout for developers, as people who never signed up for 24/7 on call get forced into it, and people who know absolutely nothing about being a sysadmin get forced to do it, and do it half-assed as a result.
It's one thing to have devops at a small startup, where you literally can't afford more people and need to wear multiple hats. At anything bigger, it's a complete joke: you either have 1 person on the team doing 100% ops anyway, or you have a bunch of unqualified people hacking at it and doing it badly. The main cause of 1 am pages on Saturday is having devops instead of ops.
I still have more fans than freaks. WTF is wrong with you people?
DevOps is pretty cool when done correctly, where infrastructure is fully automated to the point where you can deploy new servers with the latest code and security fixes with just a mouse click.
Of course, most organizations don't have the resources or technical skill to pull that off or maintain it correctly. Worse yet, some of those same organizations also tend to be staffed with clueless managers who think that "DevOps" means that they can hire a junior developer out of college to replace their senior systems administrator. These same people then wonder in amazement when they are the victims of a major security breach six months later.
I've always been interested in all aspects of the lifetime of software services. I am part of a small team in my company that deals with one-off custom projects, typically driven by large contracts with SLAs. We get district, state, and national level requests. We need to deal with many aspects of these demands. Our company does not specialize in contracted work; the work is all value add and has to be treated as throwaway work, but at the same time we cannot train up Ops or any other team. Due to all of the requirements, it is impractical to get too many people involved.
We analyse, design, implement, deploy, and operate our custom services as a 5 person team, in a multi-billion dollar company with hundreds of software engineers. We have tens of millions of users who tend to hit the myriad of services hard at roughly the same few times every year, and a low constant rate throughout the year. The systems need to be stable, performant, easy to configure, easy to diagnose, and difficult to not understand. It must lead you to doing the correct thing. Make it easy to do the right thing and difficult to do the wrong thing.
Our team has several years of backlog, primarily in tooling enhancements to better support ourselves. Our biggest pain point is that we're dependent on nearly every other service in the company. This means our services break when other services are not working to spec, assuming we're graced with a spec. We constantly deal with undocumented, unreliable systems that are liable to change without warning. When we design a system, we go over nearly every possible case and either handle it or make it blatantly obvious what the issue is.
I get peeved when I hear of someone from ops putting in 60 hour weeks because the general services constantly need to be hand-tuned under a seemingly chaotic load pattern because of the.... "intricate"... inter-service dependencies and unreliable performance characteristics. I just want to tear out someone's throat when I ask "how many concurrent requests should I be making to your service and how long should we wait before considering it a timeout?" and get "Measure it and see what's fastest. If you're getting timeouts, try increasing your timeout setting and see if that helps. If you're still having performance issues, open a defect." Yeah.. please find another line of work, you code monkey. The best part is when someone responds like this, we almost invariably overload their services and ops yells at us. And boy do I hate making ops' life any harder than it needs to be.
We quickly found that we needed to add multiple forms of rate limiting to our services. Engineering "load tests" their services and they claim everything is fine, but for some reason we blow their crap out of the water. It's been an interesting exercise in system design to make sure our systems are very gentle with the snowflake services, because they seemingly spontaneously go from steady-ish response times to timing out every request for a minute or so without warning. Generally feels like garbage collection issues, but we never know.
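For anyone wondering what "being gentle" with a downstream service can look like, one common form of client-side rate limiting is a token bucket. This is a minimal sketch, not what the team above actually runs; the class name, the parameters, and the injectable clock are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Client-side rate limiter: allows short bursts, caps the sustained call rate."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens added back per second
        self.capacity = float(capacity)   # maximum burst size
        self.tokens = float(capacity)     # start with a full bucket
        self.clock = clock                # injectable so the limiter can be tested
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

    def try_acquire(self, cost=1.0):
        """Return True and spend tokens if a call may go out now, else False."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage sketch with a fake clock: 2 calls per second, bursts of at most 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]   # third call is refused
fake_now[0] = 0.5                                  # half a second later
after_wait = bucket.try_acquire()                  # one token has been refilled
```

The caller decides what to do on a refusal: sleep and retry, queue the work, or shed it. A limiter like this caps the request rate only. The "how many concurrent requests" question is a separate limit, usually handled with a semaphore or a bounded worker pool alongside an explicit timeout.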
I found out ops has to manually scale many of the AWS services up and down because of certain negative scaling characteristics. No AWS auto-scaling here. Adding more nodes to many of the clusters barely gives any increase in throughput, and response time jitter increases as the number of nodes goes up. The main benefit of adding more nodes is that there are more instances to buffer events. Other services expect an event to be "serviced" in a reasonable time, so pulling an event out of the event queue makes it look like it has been. Of course, if one of these nodes goes down while holding onto all of these events, the events are effectively lost without lots of manual intervention. And each additional node increases the risk of race conditions, for which they have created hundreds of scripts that walk the data during off hours to make educated guesses on how to fix data in an unknown state. But whatever, it's ops' issue now, right?
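The lost-event failure described above comes from treating "pulled off the queue" as "done". The usual alternative is to have the queue keep ownership of an event until the consumer explicitly acknowledges it, and to hand it out again if the acknowledgement never arrives. Below is a toy in-memory sketch of that idea, assuming a simple lease timeout; the class and its methods are hypothetical and are not the API of any real queue service.

```python
import time


class LeasedQueue:
    """Toy in-memory queue: a received event stays owned by the queue until acked."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock         # injectable so expiry can be tested
        self.pending = []          # (event_id, event) waiting for a consumer
        self.in_flight = {}        # event_id -> (event, lease deadline)
        self.next_id = 0

    def publish(self, event):
        self.pending.append((self.next_id, event))
        self.next_id += 1

    def _requeue_expired(self):
        # A consumer that crashed never acks, so its lease simply runs out.
        now = self.clock()
        expired = [eid for eid, (_, deadline) in self.in_flight.items() if deadline <= now]
        for eid in expired:
            event, _ = self.in_flight.pop(eid)
            self.pending.append((eid, event))

    def receive(self):
        """Lease the next event to a consumer; returns (event_id, event) or None."""
        self._requeue_expired()
        if not self.pending:
            return None
        eid, event = self.pending.pop(0)
        self.in_flight[eid] = (event, self.clock() + self.lease_seconds)
        return eid, event

    def ack(self, event_id):
        """Consumer confirms the event was fully processed; only now is it dropped."""
        return self.in_flight.pop(event_id, None) is not None


# Usage sketch: a node takes an event, dies without acking, and the event comes back.
fake_now = [0.0]
queue = LeasedQueue(lease_seconds=30, clock=lambda: fake_now[0])
queue.publish("order-1")
first = queue.receive()        # leased to a node that then goes down
fake_now[0] = 31.0             # lease expires with no ack
second = queue.receive()       # the same event is delivered again
acked = queue.ack(second[0])
```

The trade-off is that an event can be delivered more than once, so handlers have to be idempotent. That is still a better starting point than off-hours scripts guessing at data in an unknown state.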