Bridging the DevOps gap with Tools & Culture

Hi there, Jevgeni here. I was recently on the road showing the latest version of LiveRebel, our app release automation suite designed to make IT Ops teams happy, and it occurred to me that there are really only two important metrics that we are trying to improve with LiveRebel:

  1. Time to release. Time it takes from commit to production release.

  2. Failure rate. Instances of failure in production that reach users.
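To make the two metrics concrete, here is a minimal sketch of how they could be computed. The class and method names are made up for this illustration, and treating failure rate as "failed releases divided by total releases" is my assumption; you may prefer to count user-visible incidents instead.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for the two metrics; not part of any product API. */
final class ReleaseMetrics {

    /** Time to release: elapsed time from commit to production release. */
    static Duration timeToRelease(Instant committedAt, Instant releasedAt) {
        return Duration.between(committedAt, releasedAt);
    }

    /** Failure rate, assumed here to be failed releases per total releases. */
    static double failureRate(int failedReleases, int totalReleases) {
        if (totalReleases <= 0) {
            throw new IllegalArgumentException("totalReleases must be positive");
        }
        return (double) failedReleases / totalReleases;
    }
}
```

Tracking both numbers per release, rather than per quarter, makes it easier to see whether shorter release cycles really go hand in hand with fewer failures.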

It would seem that these are at odds with each other, but as many in the DevOps movement have noticed, they are actually very much aligned and can even reinforce each other. More on that later.

There are multiple challenges to bringing either of those goals to fruition: some are cultural, while others can be solved with proper tooling. Next, we will look at some of the key issues from both a tooling and a culture perspective and see if they can be helped.

Why bring releases into the hands of developers?

Over the years, I’ve had many conversations with developers, and every so often I need to repress the urge to roll my eyes. Here are some typical things I hear that cause my involuntary eye-roll:

  • “We don’t need the test environment, it runs OK on my laptop.”

  • “Hey, just give me root access to production server, I’ll fix my app there.”

  • “C’mon, there is an ancient, two-week-old version of lib-foo installed, here’s his GitHub account, just clone a fresh release and compile it straight onto the server!”

  • “We’ve been refactoring the database for the last three months, here are the scripts to update it.”

  • “So here is the release procedure described. Note that p47 on page 12 works only on my Windows machine, I don’t know how to do it on RedHat. What, we have Debian? Oh.”

Wat?! All these statements come from a simple fact: developers are too isolated from release processes and considerations. They often don’t understand that releasing software is just as hard a technical problem as creating it–you need to worry about infrastructure provisioning, network configuration, capacity, security and performance, all of which are specialized and complex areas of skill.

The way to bridge the gap is to bring releases closer to developers. This requires that there is a test environment which the developer team is responsible for. The test environment must represent the production environment as closely as possible, so that all release procedures can be tested there, to weed out any possible failures without impacting users.

LiveRebel brings release management to developers

LiveRebel goes further to bring releases closer to developers than one would traditionally expect. First of all, deliverables are decoupled from their intended deployment environments. LiveRebel manages environment properties and substitutes placeholders within deliverables during deployments. As a result, a single release deliverable can be deployed, unchanged, onto multiple environments, like test, staging, production, or different customer environments.
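The idea of substituting placeholders per environment is easy to sketch in Java. To be clear, this is an illustration of the concept and not how LiveRebel implements it: the `${name}` syntax, the `PlaceholderResolver` class and the property names are all assumptions made for the example (it needs Java 9 or later).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: fills ${name} placeholders from per-environment properties. */
final class PlaceholderResolver {

    // Assumed placeholder syntax: ${some.property}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String template, Map<String, String> environment) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            String key = matcher.group(1);
            String value = environment.get(key);
            if (value == null) {
                // Fail loudly, so a missing property shows up in test, not in production.
                throw new IllegalStateException("No value for placeholder: " + key);
            }
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

The same template, say `jdbc:postgresql://${db.host}/${db.name}`, can then be resolved once with the test properties and once with the production properties, while the deliverable itself stays byte-for-byte identical.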

Secondly, code, database and configuration changes are bundled into a single deliverable. Any environment the deliverable is deployed to is configured to correspond to the deliverable version. Exactly the same release actions are run in every single environment.

Together, these features mean that releases become essentially testable and exposed to developers. If a developer commits the code, but forgets to include database or configuration changes, the release will fail in the test environment instead of failing in production. Sweet!

This makes developers more exposed to release mechanics and encourages them to take a more proactive and meaningful role in the release process–after all, if the app fails in production, the ensuing code frenzy will be on the dev side. Plus, this also increases trust from the operations side, as now they can see that releases were actually tested by developers, and not run for the first time in production.

Why you’ll always need excellent testing and recovery in production…

Operations folks are not superhuman either. They often assume that testing should be done by the QA team and that actually releasing is the process of enacting the change, one that should not extend into testing too. However, even after the most rigorous prior testing in other environments, things can still fail in production, so releases should be treated as a test, not just an event.

To treat releases as a test we need to do two things: 1) run tests against the changed system and 2) have a recovery plan in case the tests fail. A simple way of “testing” is a gradual rollout, exposing a small portion of users at a time to the changes. Although effective in limiting the scope of failures, it still exposes users to failure and should be the last line of defense, not the first choice.
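For reference, a gradual rollout usually boils down to deciding, per user, whether they see the new version. Here is a minimal sketch, assuming users are identified by a string ID and bucketed by hash; the `GradualRollout` class is hypothetical and only illustrates the technique.

```java
/** Hypothetical sketch: exposes a fixed percentage of users to a new version. */
final class GradualRollout {

    private final int percentage; // share of users exposed, 0..100

    GradualRollout(int percentage) {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be between 0 and 100");
        }
        this.percentage = percentage;
    }

    /** The same user always lands in the same bucket, so their experience is stable. */
    boolean isExposed(String userId) {
        // floorMod avoids negative buckets for negative hash codes.
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentage;
    }
}
```

Because the bucket depends only on the user ID, raising the percentage from 10 to 50 keeps the first group exposed and simply adds more users to it.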

A key issue with running tests in production is finding a good moment to do so without exposing the potentially dangerous changes to the users. There needs to be some kind of “staging” of releases, where tests can be run after the update is applied, but before it’s exposed to the users.

Another issue is that although it’s relatively easy to automate a deployment (just write an enormous script), it’s much harder to automate recovery. So even if tests fail, it isn’t trivial to stop the deployment and roll back the changes. Often folks will skip recovery altogether, fixing things and rolling out the fixed version instead, but if the failure is caused by misconfiguration it can be hard to identify and fix it quickly.

LiveRebel solves these issues with two key features. The first is release staging – during any release there is a phase when the application is updated (on that particular server), but not yet available to the users. This is a perfect moment to run integration tests (e.g. Selenium-based) to smoke test the deployment. If tests or other activities take too long for users to wait for them, rolling updates can be used, isolating and updating a few servers at a time without any user impact.
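The rolling update idea can be sketched as a loop over batches of servers. This is a simplified illustration and not LiveRebel's mechanism: the `RollingUpdate` class is hypothetical, and the `updateAndSmokeTest` callback stands in for whatever updates a server and runs the smoke tests against it while it is out of rotation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: update servers a few at a time, stopping on the first failure. */
final class RollingUpdate {

    /** Splits the servers into consecutive batches of at most batchSize. */
    static List<List<String>> batches(List<String> servers, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < servers.size(); start += batchSize) {
            int end = Math.min(start + batchSize, servers.size());
            result.add(new ArrayList<>(servers.subList(start, end)));
        }
        return result;
    }

    /** Returns the servers whose whole batch was updated and passed its smoke tests. */
    static List<String> run(List<String> servers, int batchSize,
                            Predicate<String> updateAndSmokeTest) {
        List<String> completed = new ArrayList<>();
        for (List<String> batch : batches(servers, batchSize)) {
            // Assumption: the batch has been taken out of rotation before this point.
            for (String server : batch) {
                if (!updateAndSmokeTest.test(server)) {
                    return completed; // stop; later batches keep the old version
                }
            }
            completed.addAll(batch); // batch passed, so it can go back into rotation
        }
        return completed;
    }
}
```

Stopping at the first failing batch means that, at worst, one batch is out of rotation while the remaining servers keep serving the old version.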

The second is automated recovery. Any change you make – code, database or configuration – can be automatically reverted by LiveRebel. And any failure during the release, including test failure, will result in a full rollback to the previous stable state.

Together with the previously described features, this makes releases truly testable, all the way from development to production, and provides several lines of defense against failures in production.

Not only that, but thanks to pervasive testing and rolling updates, releases can be done during the day as often as you like. In fact, smaller releases actually decrease the risk further, as fewer things change in each one and thus there is less chance that something breaks.
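A common way to get this kind of all-or-nothing behavior is to pair every release action with a revert action and to undo the completed ones in reverse order when something fails. The sketch below shows that pattern only; the `ReversibleRelease` class and its `Step` interface are hypothetical and are not LiveRebel's API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: apply steps in order, undo them in reverse on any failure. */
final class ReversibleRelease {

    /** One release action (code, database, configuration, test) and its undo. */
    interface Step {
        void apply() throws Exception;
        void revert();
    }

    /** Returns true if every step succeeded, false if the release was rolled back. */
    static boolean run(List<Step> steps) {
        Deque<Step> applied = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.apply();
                applied.push(step);
            } catch (Exception failure) {
                // Assumption: a step that throws has left nothing behind to undo,
                // so only the previously completed steps are reverted.
                while (!applied.isEmpty()) {
                    applied.pop().revert();
                }
                return false;
            }
        }
        return true;
    }
}
```

If a smoke test is modeled as just another step that throws on failure, a failing test triggers exactly the same rollback path as a failing database script.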

How to incorporate manual changes

One of the biggest issues in creating a good process around releases is the necessity of manual changes. Some are too small or too hard to automate (e.g. opening a port in the firewall), and some are too long-running (e.g. large database migrations). Even in a fully-automated infrastructure based on virtualization and configuration management, there will be some actions that have to be done manually or semi-manually (read: with a highly customized script).

It may seem that such actions make an automated approach to testing impossible, as manual actions will be able to bypass the automated checkpoints. In reality, a good automated process should encapsulate any manual actions by adding checkpoints. This means that before any dependent action is executed, it should run a test verifying that the preconditions are met.

LiveRebel solves that with release scripts. For example, suppose a long-running database migration is under way and releases are created that depend on various stages of it. To ensure that a given version can actually be deployed successfully, it is prudent to run a test against the database from the “init” release script. If that test fails, the release is rolled back harmlessly.
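A checkpoint of this kind is essentially a list of named conditions that must all hold before the release continues. Here is a minimal sketch of that idea; the `PreconditionGate` class is hypothetical, and the checks in the usage note below are placeholders for real queries against the database or the network.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: verifies that manual prerequisites are in place before a release. */
final class PreconditionGate {

    // LinkedHashMap keeps the checks in the order they were registered.
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    PreconditionGate require(String description, BooleanSupplier check) {
        checks.put(description, check);
        return this;
    }

    /** Returns the descriptions of all unmet preconditions; empty means safe to proceed. */
    List<String> unmet() {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            if (!entry.getValue().getAsBoolean()) {
                failures.add(entry.getKey());
            }
        }
        return failures;
    }
}
```

A release script could register something like `require("schema version is at least 42", ...)` and abort when `unmet()` is not empty, which turns a forgotten manual step into a clear message instead of a broken deployment.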

Forcing process and communication

Unfortunately, one last issue remains unsolved. Devs and Ops may be happier, but that doesn’t mean that they talk to each other. There are many things that need to be communicated for releases to happen. Devs need to hand off the deliverable, and a release schedule needs to be agreed on. Ops need advance notice about the releases to plan them into the change management schedule. The rest of the business must be warned about the possible risks and coming improvements.

This process is often dysfunctional, with things either being forgotten or omitted, or buried under endless paperwork. More often than not, there is no clarity on what changes are coming, or when they are coming. Furthermore, there is no simple way to identify the deployed release version.

Automation is the best means of solving process and communication issues. With automation, the process that pushes commits through the test pipeline, right into production, can carry all the necessary information that needs to be relayed. We described that approach previously in our Rebel Labs report Pragmatic Continuous Delivery with Jenkins, Nexus and LiveRebel, where the gist is to use a CI tool (like Jenkins or Bamboo) to literally push changes with all accompanying information towards production.

LiveRebel forces process automation and communication in several ways:

  • The Servers dashboard allows anyone to see at a glance what is currently deployed in any and all environments.
  • The “all-in-one” deliverable replaces communication on database and environment updates.
  • Tests validate and communicate assumptions.
  • Version tagging restricts the environments that a version can be deployed to and thus puts the Dev, QA and Business approval into Ops’ hands.
  • Seamless integration with CI servers like Jenkins helps build an automated pipeline with all the process and communication encoded along with the deliverable.
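To illustrate the version tagging point from the list above: the rule "a version may only go to environments whose required approvals it carries" fits in a few lines. The `DeploymentPolicy` class and the tag names used with it are made up for this sketch and say nothing about how LiveRebel stores or checks tags.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a version may only be deployed where its tags allow it. */
final class DeploymentPolicy {

    private final Map<String, Set<String>> requiredTagsByEnvironment;

    DeploymentPolicy(Map<String, Set<String>> requiredTagsByEnvironment) {
        this.requiredTagsByEnvironment = requiredTagsByEnvironment;
    }

    boolean mayDeploy(Set<String> versionTags, String environment) {
        Set<String> required = requiredTagsByEnvironment.get(environment);
        if (required == null) {
            return false; // unknown environments are refused by default
        }
        return versionTags.containsAll(required);
    }
}
```

With production requiring, say, both a `qa-approved` and a `business-approved` tag, Ops no longer need to chase sign-offs by email: a version without the tags simply cannot be deployed there.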

But for this to work, a cultural shift from the traditional way of doing things has to occur. I know, scary :)

A bit about culture, tooling & metrics

A common misconception is that DevOps is about tooling. The culture of communication and trust is crucial to success and no tools can replace that. That said, tools can help facilitate the change in culture, if they force or suggest actions that help establish trust, communication and best practices. This is what we set out to do with LiveRebel, and many customers now enjoy not just improved tooling, but also cultural progress towards greater and more effective collaboration.

Tools should help to build the cultural bridge between Dev and Ops–assuming that it will somehow just happen is a losing bet.

Together, those changes significantly improve both metrics – improving your time to release and ideally lowering your failure rate. The greater the trust in tooling and process, the more often releases can be done and the less risk they carry. These improvements then ripple across the business as a whole to help things run better (e.g. lower costs, more agility & competitiveness).

And for that, we can all be thankful :)
