In this blog posting, part of our continuing series on DevOps, we explore solution support strategies. There are several solution support (help desk) strategies, which can be combined, that you may choose to adopt. These options are:
- Online information. A very common “self serve” support strategy is to develop and maintain online assets such as frequently asked questions (FAQ) pages, training videos, and user manuals to name a few. This enables end users to potentially support themselves, although it suffers from the TAGRI (They Ain’t Gonna Read It) syndrome.
- Online discussion forums. Many organizations choose to implement internal discussion forums so that their end users can help each other in learning how to use their systems. This is effectively a collaborative self-serve support option for end users. The primary advantage is that your “power users”, or in some cases members of the development team, will come to the aid of other users who are struggling with an issue. A potential disadvantage is that you risk your discussion forum becoming a complaints forum if problems aren’t addressed in a timely manner.
- Asynchronous support. With asynchronous support strategies an end user will put in a request for help and then sometime later somebody gets back to them with help (hopefully). Common ways to implement asynchronous support include implementing a standard support email or a support request page/screen. It is common in many organizations to put a service level agreement (SLA) in place putting limits on how long people will need to wait for help.
- Synchronous support. With synchronous support strategies end users are put in contact with support people (who may even be one of the application developers) in real-time. This is often done via online chat software, video conferencing, or telephone calls. The key advantage of synchronous support is responsiveness. However, synchronous support can be expensive to operate and potentially frustrating for end users, particularly when the support desk function is outsourced to people following scripts.
- Support alerts. With this strategy your solution itself detects serious problems affecting end users, such as a data source or a service/component being unavailable. When such an event occurs, and the solution isn’t able to swiftly recover, the end user is informed of the problem and presented with a “Would you like help?” option. If yes, they are put in direct contact with an appropriate support person who then helps them in real-time. This is part of your solution’s self-recovery process.
- Developer-led support. This strategy has development teams performing the support services for their own solutions and was described previously in DevOps Strategies: Development.
In the next installment in our DevOps series we will describe release management strategies.
In addition to the general DevOps strategies and development-focused DevOps strategies we’ve described previously, there are also several technical strategies that support the operations aspects of DevOps:
- Solution monitoring. As the name suggests, this is the operational practice of monitoring running solutions and applications once they are in production. Technology infrastructure platforms such as operating systems, application servers, and communication services often provide monitoring capabilities that can be leveraged by monitoring tools (such as Microsoft Management Console, IBM Tivoli Monitoring, and jManage). However, for monitoring application-specific functionality, such as what user interface (UI) features are being used by given types of users, instrumentation that is compliant with your organization’s monitoring infrastructure will need to be built into the applications. Development teams need to be aware of this operational requirement or, better yet, have access to a framework that makes it straightforward to provide such instrumentation.
- Standard platforms. Software development practices, such as continuous deployment and initial architecture envisioning, are enabled by consistency within your operational infrastructure. It is much easier to deploy to a handful of standard hardware configurations than it is to a myriad of unique ones. It is also easier to deploy when there are consistent versions of infrastructure software (e.g. operating systems, databases, middleware, and so on) deployed across your environment. For example, all instances of your Oracle DB are the same version; you don’t have several different versions installed in various places. Furthermore, it is much easier to make architecture decisions when there is consistency of infrastructure software packages in the first place. For example, you standardize on Linux for your server operating system; you don’t also have Windows, z/OS, and others in production (and if you do, you’re actively retiring them).
- Deployment testing. After a solution, or an update to a component of your operational infrastructure, has been deployed you should run a quick set of tests to verify that the deployment was successful. Were the right versions of the files installed where they need to be? And were they deployed to all appropriate servers? Were database transformations applied successfully? Did the appropriate announcements, if any, get sent out? Did the overall deployment process run within the desired time frame?
- Automated deployment. Deployments should be automated, not manual. This increases the consistency of your deployments and supports the practice of continuous deployment. Part of your automation effort should be to support both self-recovery and self-testing as native aspects of your deployment strategy.
- Support environments. Anyone doing solution support, even if it is the development team itself, is likely to need an environment in which they can reproduce problems that end users experience. There are several options available to you:
- Production. In some cases your production environment is sufficient, although many regulatory regimes, particularly life-critical and financial-critical ones, will not allow this.
- Pre-production test sandbox. Some support teams will find that they can use their pre-production test environment to try to simulate production problems. The advantage is that you don’t put production at risk when trying to reproduce problems. The disadvantage is that the test environment will be different from production and as a result you may not be able to simulate all reported problems.
- Support sandbox. Some organizations choose to have a specific environment set up to enable support staff to simulate production problems. This strategy has the same tradeoffs as using a pre-production test sandbox plus the additional cost and maintenance associated with yet another environment.
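The deployment testing questions in the list above (right file versions, on all servers, within the desired time frame) can be sketched as a small automated check. This is a minimal illustration under stated assumptions, not a definitive implementation: the `DeploymentReport` structure, the `verify_deployment` function, and the shape of the manifest and per-server data are all hypothetical, and a real check would gather the deployed versions from your own infrastructure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentReport:
    """Outcome of a post-deployment check (hypothetical structure)."""
    problems: tuple[str, ...]

    @property
    def ok(self) -> bool:
        return not self.problems


def verify_deployment(
    expected_versions: dict[str, str],
    deployed: dict[str, dict[str, str]],
    elapsed_seconds: float,
    max_seconds: float,
) -> DeploymentReport:
    """Compare what each server reports against the release manifest.

    expected_versions maps file path -> version from the release manifest.
    deployed maps server name -> {file path -> version found on that server}.
    """
    problems = []
    for server, files in sorted(deployed.items()):
        for path, version in sorted(expected_versions.items()):
            actual = files.get(path)
            if actual is None:
                problems.append(f"{server}: {path} is missing")
            elif actual != version:
                problems.append(f"{server}: {path} is {actual}, expected {version}")
    # Did the overall deployment run within the desired time frame?
    if elapsed_seconds > max_seconds:
        problems.append(
            f"deployment took {elapsed_seconds:.0f}s, limit is {max_seconds:.0f}s"
        )
    return DeploymentReport(tuple(problems))
```

A deployment pipeline could call `verify_deployment` as its final step and roll back, or alert an operator, whenever `report.ok` is false.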
In the next blog posting in this DevOps series we will explore solution support strategies.
There are several disaster mitigation strategies that IT departments may choose to adopt:
- Disaster planning. Disciplined organizations will plan for operational disasters. Potential disasters include servers going down, network connectivity going down, power outages, failed solution deployments, failed infrastructure deployments, natural disasters such as fires and floods, terrorist attacks, and many more. This planning will include identification of potential problems, identification of strategies to address those problems, and putting mechanisms in place to hopefully mitigate the disasters. Potential strategies to address these disasters include building solutions that self-test and self-recover, building redundancies into your operational infrastructure, having disaster procedures in place, and practicing those procedures in simulated disasters.
- Scheduled disaster simulation. It is one thing to have disaster mitigation plans in place; it is another to know whether they actually work. Disciplined organizations will run through disaster scenarios to verify how well their mitigation strategies work in practice. For example, to test whether your power outage emergency plan works you would purposely simulate a power outage at one of your data centers and then work through your recovery plan. Like fire drills, these simulations should be done on a regular basis so that staff members build up the “body memory” required to act swiftly and appropriately in an emergency. The advantage of a scheduled disaster simulation is that you knowingly run it at a time when it will have minimal impact on your stakeholders. A disadvantage, at least when people are informed of the simulation ahead of time, is that they are mentally prepared for it rather than caught unaware, so you don’t simulate the real level of stress that people would be under during an actual emergency.
- Random disaster simulation. Very disciplined organizations will implement a service within their operational environment that causes problems such as server or service outages at random times. An example of this is Chaos Monkey, a tool created by Netflix to run against its Amazon Web Services (AWS) infrastructure; similar functionality is being implemented within many organizations now. The Chaos Monkey injects random problems into production to verify that the IT operations organization is capable of overcoming them. This is done to verify that your solutions really are able to automatically recover from problems and, failing that, that operators are at least alerted to the problem.
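The idea behind random disaster simulation can be illustrated with a toy, in-memory sketch: stop one randomly chosen service, then check that the recovery step either restores it or alerts an operator. Everything here is an assumption made for illustration (the `Service` class, `inject_random_outage`, and `recover_or_alert` are hypothetical names); it is not how Chaos Monkey or any particular tool is implemented.

```python
import random


class Service:
    """Stand-in for a monitored service (hypothetical, in-memory only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True

    def stop(self) -> None:
        self.running = False

    def restart(self) -> None:
        self.running = True


def inject_random_outage(services: list[Service], rng: random.Random) -> Service:
    """Stop one randomly chosen running service and return it."""
    candidates = [s for s in services if s.running]
    if not candidates:
        raise RuntimeError("no running services to disrupt")
    victim = rng.choice(candidates)
    victim.stop()
    return victim


def recover_or_alert(services: list[Service], alerts: list[str]) -> None:
    """Try to restart stopped services; record an alert if a restart fails."""
    for service in services:
        if not service.running:
            service.restart()
            if not service.running:
                alerts.append(f"{service.name} could not be restarted")
```

Passing in a seeded `random.Random` keeps a simulation run reproducible, which helps when you want to replay the exact outage that exposed a weakness.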
As you would expect, truly disciplined organizations have adopted all of these strategies.
There are several teaming strategies that you can choose to adopt when it comes to getting development professionals and operations professionals to work together. Starting with the least effective and working our way to the most effective, they are:
- Production hand-off. When a development team releases a solution into production the operations team takes on the responsibility for running and supporting the solution. At this point the development team is often disbanded or moves on to another effort. A sustainment team of one or more developers may be formed to perform maintenance updates as needed over time, or the responsibility to do this work is given to an existing sustainment team. The advantage of this approach is that your organization no longer has to fund the full development team moving forward. However, you risk losing the knowledge and expertise of the team that is required to maintain and evolve the solution over time. This can be particularly problematic when there are high-severity defects to be fixed.
- Warranty period. With this strategy the development team commits to fixing critical defects for a pre-defined period of time after the solution is released into production. For example, a development team may be required to fix any severity 1 or severity 2 defects free of charge for the first thirty days following a production release. Warranty periods are often combined with the production hand-off strategy to reduce the risks associated with it. Warranty periods are also common when development teams are funded via a fixed-price funding model or in outsourcing situations because the stakeholders typically want to ensure that they received the level of quality that they paid for.
- Production support. In enterprise environments most application development teams are working on new releases of a solution that already exists in production. Not only will they be working on the new release, they will also have the responsibility of addressing serious production problems that are escalated to them. The development team will often be referred to as “level three support” for the application because they will be the third (and last) team to be involved with fixing critical production problems. The primary advantage is that production emergencies associated with a specific solution are often resolved by the most qualified people – the actual developers of that solution. Another advantage is that it gives developers an appreciation of the kinds of things that occur in production, providing them with learning opportunities to improve the way that they design solutions in the first place. A potentially significant disadvantage is that the need to fix production emergencies will distract the development team from working on new functionality.
- Developer-led operations. This strategy turns up the dial on production support by having the development team be responsible for operating and supporting their own solution. This is often referred to as “you build it you run it”. This strategy has the benefits that it focuses the team on ensuring that their solution is easy to operate and support and it ensures that the most qualified people are the ones evolving the solution. However, this strategy results in Scrum teams producing silo solutions running on disparate platforms – luckily DAD teams are enterprise aware and include someone in the role of architecture owner who will guide the team in avoiding this very sort of architecture mistake. Another common strategy is to include someone with strong operations experience in your team. A developer-led operations strategy also runs the risk of varying levels of support quality as some teams will be better than others at this. Once again, teams that are enterprise aware will be following common guidelines and will reach out to other teams for help in improving their approach.
Of the four approaches listed above, the only one that is clearly a DevOps strategy is developer-led operations. The production support strategy is definitely a step in the right direction and is often seen as sufficient in many enterprises. If this is the case in your organization we recommend that you experiment with the developer-led operations strategy on a few teams to see how well it works for you. We suspect that you’ll be pleasantly surprised.
In the next blog in this series we will explore disaster mitigation strategies.
- Canary tests. A canary test is a small experiment where new functionality is deployed to a subset of end users so you can determine whether that functionality is of interest to them. This in turn provides insight to the development team as to the true potential value of the functionality (if any). For example, an e-commerce company might believe that a new feature where people can buy two related items at a discount will help to increase sales. At the same time they fear this could decrease overall revenue. So they decide to run a canary test where 5% of their customers are provided this functionality for a two-week period. Sales and revenue are tracked and compared against customers not given access to this new functionality. If a new feature successfully passes a canary test it is then made available to a wider range of end users (you may choose to run several rounds of canary tests before finally deploying the functionality to all users). You can think of canary testing as an extreme form of pilot testing.
- Split tests. A split test, also known as an A/B test, is an experiment where two or more options are run in parallel so that their effectiveness can be compared. For example, a bank may identify three different screen design strategies to transfer funds between two accounts via an automated teller machine (ATM). Instead of holding endless meetings, focus groups, or modelling sessions the bank instead decides to implement all three strategies and put them into production in parallel. When I use an ATM I’m always presented with strategy A, when you login you always get strategy B, and so on. Because the ATM solution is instrumented to track important usage metrics the bank is able to determine which of the three strategies is most effective. After the split test is completed the winning strategy is made available to all users of ATMs.
- Automated regression testing. Agile software developers are said to be “quality infected” because of their focus on writing quality code and their desire to test as often and early as possible. As a result, automated regression testing is a common practice adopted by agile teams, which is sometimes extended to test-first approaches such as test-driven development (TDD) and behavior-driven development (BDD). The regression test suite(s) may address function testing, performance testing, system integration testing (SIT), acceptance testing, and many more categories of tests. Because agile teams commonly run their automated test suites many times a day, and because they fix any problems they find right away, they enjoy higher levels of quality than teams that don’t. Because some tests can take a long time to run, in particular load/stress tests and performance tests, a team will often choose to have several test suites running at different cadences (e.g. some tests run at every code check-in, some at scheduled times each day, some once every evening, some over the weekend, and so on). This greater focus on quality is good news for operations staff who insist that a solution must be of sufficient quality before approving its release into production.
- Continuous integration (CI). Continuous integration (CI) is the discipline of building and validating a project automatically whenever a file is checked into your configuration management (CM) system. As you see in the following diagram, validation can occur via several strategies such as automated regression testing and even static or dynamic code and schema analysis. CI enables developers to develop a high-quality working solution safely in small, regular steps by providing immediate feedback on code defects.
- Continuous deployment (CD). Continuous deployment extends the practice of continuous integration. With continuous deployment, when your integration is successful in one sandbox your changes are automatically promoted to the next sandbox. The CI strategy running in that environment automatically integrates your solution there because of the updated source files. As you can see in the following diagram this automatic promotion continues until the point where any changes must be verified by a person, typically at the transition point between development and operations. Having said that, advanced teams are now automatically deploying into production as well. Continuous deployment enables development teams to reduce the time between a new feature being identified and being deployed into production. It enables the business to be more responsive. However, when development teams aren’t sufficiently disciplined continuous deployment can increase operational risk by increasing the potential for defects to be introduced into production. Successful continuous deployment in an enterprise environment requires an effective continuous integration strategy in place in all sandboxes.
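The canary and split tests described in the list above both need a stable way to decide which users see which option, so that the same person gets the same experience every time. One common approach, sketched here as an assumption rather than as the method used by any particular product, is to hash the user identifier into a bucket. The function names, the experiment-name parameter, and the identifier format are hypothetical.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_canary(user_id: str, experiment: str, percent: int) -> bool:
    """True for roughly `percent` percent of users, always the same ones."""
    return bucket(user_id, experiment) < percent


def split_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Assign a user to one of several variants, stable across sessions."""
    return variants[bucket(user_id, experiment, len(variants))]
```

For the e-commerce example, a call such as `in_canary(customer_id, "bundle-discount", 5)` would select about 5% of customers, and for the ATM example `split_variant(card_id, "transfer-screen", ["A", "B", "C"])` would always show a given cardholder the same screen design. Including the experiment name in the hash keeps one experiment's groups independent of another's.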
There are also several common operations-friendly features that developers with a Disciplined DevOps mindset will choose to build into their solutions:
- Feature access control. To support experimentation strategies such as canary tests and split tests it must be possible to limit end user access to certain features. This strategy must be easy to configure and deploy; a common approach is to have XML-based configuration files, read into memory, that contain the meta-data required to drive an access control framework.
- Monitoring instrumentation. Developers with a Disciplined DevOps mindset will build instrumentation functionality – logging and better yet real-time alerts – into their solutions. The purpose is to enable monitoring, in (near) real-time, of their systems when they are operating in production. This is important to the people responsible for keeping the solution running, to people supporting the solution, to people responsible for debugging and fixing any problems, and to your operational intelligence efforts. Monitoring instrumentation enables canary tests and split tests in that it provides the data required to determine the effectiveness of the feature or strategy under test.
- Feature toggles. A feature toggle is effectively a software switch that allows you to turn features on (and off) when appropriate. A common strategy is to turn on a collection of related functionality that provides a value stream, often described by an epic or use case, all at once when end users are ready to accept it. Feature toggles are also used to turn off individual features when it’s discovered that the feature isn’t performing well (perhaps the new functionality isn’t found to be useful by end users, perhaps it results in lower sales, …). Another benefit of feature toggles is that they enable you to test and deploy functionality into production on an incremental basis.
- Self-testing. One strategy to make a solution more robust, and thus easier to operate, is to make it self-testing. The basic idea is that each component of a solution includes basic tests to validate that it can properly run while in production. For example, an application server may run basic tests at startup such as verifying the version of the operating system or of frameworks that it relies on. While the server is running it might regularly check to see if other components that it relies on, such as data sources and middleware services, are available. When a problem is detected it minimally should be logged, better yet an alert should be posted if intervention by a person is required, and even better yet the solution should try to recover from the problem.
- Self-recovery. When a system runs into a problem it should do its best to automatically recover and continue on as before. For example, if the system detects that a data source is no longer available it should try to restart that data service. If that fails, it should record change transactions where possible and then process them when the data service becomes available again. A good example of this is an ATM. When ATMs lose their connection to a bank’s financial processing system they will continue on for a period of time independently albeit with limited functionality. They will allow people to withdraw money from their accounts, perhaps putting a limit on the amount withdrawn to limit potential problems with overdrawn accounts. People will still be able to deposit money but will not be able to get a current balance or see a statement of recent transactions. Self-recovery functionality provides a better experience to end users and reduces the operational burden on your organization.
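The self-recovery behaviour described above, recording change transactions while a data source is unavailable and processing them once it returns, is often called store-and-forward. The following is a minimal sketch under stated assumptions: the `StoreAndForwardWriter` class is hypothetical, the `send` callable stands in for whatever call reaches your real data source, and an outage is assumed to surface as a `ConnectionError`. A production version would also persist the queue so that buffered transactions survive a restart.

```python
from collections import deque
from typing import Callable


class StoreAndForwardWriter:
    """Buffers transactions while the data source is down, replays them later."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send  # hypothetical call into the real data source
        self._pending: deque[dict] = deque()

    def write(self, transaction: dict) -> bool:
        """Queue the transaction, then try to deliver everything in order.

        Returns True if the queue was fully delivered, False if the data
        source is still unavailable and transactions remain buffered.
        """
        self._pending.append(transaction)
        return self.flush()

    def flush(self) -> bool:
        """Replay buffered transactions in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # still down: keep everything queued
            self._pending.popleft()
        return True

    @property
    def pending_count(self) -> int:
        return len(self._pending)
```

Delivering from the front of the queue and removing an item only after a successful send preserves transaction order, which matters for cases like the ATM example where withdrawals and deposits must be applied in sequence.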
Now that we have overviewed a collection of development practices and implementation features, in the next blog posting in this series we will explore strategies that streamline your operations efforts.
Recently we published a blog which provided a Definition for Disciplined DevOps. In that posting we worked through the potential scope of DevOps to help gain a better understanding of what DevOps is all about. One point that we made was that there was no consistent definition of DevOps in the industry due to various reasons. In this posting we explore those reasons.
We see several key forces in the current marketplace which make it difficult to settle on a common definition:
- Specialized IT practitioners. Many IT professionals still tend to specialize – someone will choose to focus on being a programmer, an operations engineer, an enterprise architect, a database administrator (DBA) and so on. As a result they tend to see the world through the lens of their speciality. Programmers will focus on the software development aspects of DevOps, operations engineers the operations aspects of DevOps, enterprise architects on the long-term planning and modelling aspects, and DBAs on the data management aspects. Few people are looking at the overall “big picture”.
- Agilists are focused on continuous delivery. Right now agile and lean developers are investing a lot of effort to figure out continuous delivery practices so as to streamline the regular deployment of value into production. Advanced teams are releasing daily if not several times a day due to adoption of practices such as automated regression testing, continuous integration (CI), and continuous delivery (CD). As a result most of the DevOps discussion in these communities focuses on these topics, sometimes straying into other practices such as canary testing, feature toggles, and production monitoring frameworks. Clearly important techniques, but still not covering the full potential range of DevOps. These practices and more are described later in this article.
- Operations professionals are often frustrated. Many operations groups are overwhelmed already with the rate of updates being foisted upon them by development teams. This is often exacerbated by the inconsistent use of technologies – the impact of the lack of enterprise awareness within undisciplined development teams is largely felt by the operations group who needs to support the plethora of technology platforms used by the full range of development teams. Worse yet, the internal operations processes are often based on heavy implementations of ITIL or ITSM and have yet to be streamlined so that operations engineers are in a better position to collaborate with development teams.
- Tool vendors have limited offerings. Because each vendor’s tools cover only part of DevOps, the DevOps messaging from tool vendors will focus on just the aspects of DevOps supported by their tools, narrowing the discussion to what they have on offer. Yes, tools are important, but they are only part of the DevOps picture. Even if there were a vendor with a full range of tools, and if they actually interoperated smoothly (yes ALM vendors, we’re referring to you), you would still need to understand how to use those tools effectively. To paraphrase an old saying – A fool with a DevOps tool is still a fool.
- Service vendors have limited offerings. Similar to the issues surrounding tool vendors, service vendors are also making great claims about their deep expertise in DevOps. Upon examination you will often find, like the tool vendors, their definition of DevOps will focus on whatever they can currently support.
- Tool vendors treat DevOps as a marketing buzzword. To be blunt, many vendors have taken their existing products, and started marketing them as DevOps products (regardless of how well those products actually support DevOps practices). Granted, these products may have been very good at supporting traditional ways of working, but when it comes to supporting DevOps they prove to be rather clunky even though they may have added a few new features.
- The DevOps=Cloud vision. There is a lot of rhetoric, particularly coming from Cloud vendors, about how cloud-based tooling and deployment environments are critical to success in DevOps. Yes, having a cloud-based infrastructure clearly enables many DevOps practices and given the choice we prefer to work in an environment which leverages cloud-based technologies whenever appropriate. But, that doesn’t mean that the cloud is a prerequisite for doing DevOps.
The point is that there are several contributing factors to the lack of agreement within our industry as to what DevOps means in practice. The implication is that when someone is giving you advice about DevOps you need to understand the scope of what they’re actually discussing. Another way to understand what DevOps is and how it may apply to your organization is to explore the various DevOps strategies and practices available to you, which we’re doing in other blog postings. Please see our first post in that series which overviews General DevOps Strategies.
In a previous blog posting we overviewed the concept of Disciplined DevOps, which is the streamlining of IT solution development and IT operations activities, as well as supporting enterprise activities. In this blog posting we begin to overview strategies that support DevOps. This posting overviews general strategies, and future postings will describe development, operations, release management, data management, and enterprise architecture strategies.
There are several “general” strategies that support DevOps:
- Collaborative work. A fundamental philosophy of DevOps is that developers, operations staff, and support people must work closely together on a regular basis. An implication is that they must see one another as important stakeholders and actively seek to work together. A common practice within the agile community is “onsite customer,” adopted from Extreme Programming (XP), which motivates agile developers to work closely with the business. Disciplined agilists take this one step further with the practice of active stakeholder participation, which says that developers should work closely with all of their stakeholders, including operations and support staff–not just business stakeholders. This is a two-way street: Operations and support staff must also be willing to work closely with developers.
- Automated dashboards. The practice of using automated dashboards is called IT intelligence, effectively the application of business intelligence (BI) strategies for IT. There are two aspects to this, development intelligence and operational intelligence. Development intelligence requires the use of development tools that are instrumented to generate metrics; for example, your configuration management (CM) tools already record who checked in what and when they did it. Continuous integration tools could similarly record when a build occurred, how many tests ran, how long the tests ran, whether the build was successful, how many tests were successful, and so on. This sort of raw data can then be analyzed and displayed in automated dashboards. Operational intelligence is an aspect of application monitoring discussed previously. With automated dashboards, an organization’s overall metrics overhead can be dramatically reduced (although not completely eliminated because not everything can be automated). Automated dashboards provide real-time insight to an organization’s governance teams.
- Integrated configuration management. With an integrated approach to configuration management (CM), development teams not only apply CM at the solution level as is customary, they also consider production configuration issues between their solution and the rest of your organization’s infrastructure. This can be a major change for some developers because they’re often used to thinking about CM only in terms of the solution they are currently working on. In a DevOps environment, developers need to be enterprise aware and look at the bigger picture. How will their solution work with and take advantage of other assets in production? Will other assets leverage the solution being developed? The implication is that development teams will need to understand, and manage, the full range of dependencies for their product. Integrated configuration management enables operations staff to understand the potential impact of a new release, thereby making it easy to decide when to allow the new release to occur.
- Integrated change management. From an IT perspective, change management is the act of ensuring successful and meaningful evolution of the IT infrastructure to better support the overall organization. This is tricky enough at a project-team level because many technologies, and even versions of similar technologies, will be used in the development of a single solution. Because DevOps brings the enterprise-level issues associated with operations into the mix, an integrated change management strategy can be far more complex, due to the need to consider a large number of solutions running and interacting in production simultaneously. With integrated change management, development teams must work closely with operations teams to understand the implications of any technology changes at an organization level. This approach depends on the earlier practices of active stakeholder participation, integrated configuration management, and automated testing.
- Training, education, and mentoring. As you would expect, people will need help to learn and adopt your DevOps strategies.
- Continuous improvement. Disciplined agile teams strive to learn from their experiences as well as from others so that they can continuously improve the way that they work together, including how they approach DevOps.
- One team. An important aspect of the DevOps mindset is shifting away from a “them and us mindset” to an “us mindset.” We all work together as a single, streamlined team. An extreme form of this is the “you build it, you run it” philosophy where there are no separate development, operations, data administration teams but instead product teams who are responsible for the entire lifecycle of a product.
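As an illustration of the development intelligence described under automated dashboards, the raw data that a continuous integration tool records (whether a build succeeded, how many tests ran and passed, how long they took) can be rolled up into a few dashboard figures. This is only a sketch: the `BuildRecord` fields and the `summarize` function are hypothetical, and a real dashboard would read these records from your CI tool rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    """One CI build as a tool might record it (hypothetical fields)."""
    succeeded: bool
    tests_run: int
    tests_passed: int
    test_seconds: float


def summarize(builds: list[BuildRecord]) -> dict[str, float]:
    """Roll raw build records up into figures suitable for a dashboard."""
    if not builds:
        return {
            "builds": 0,
            "build_success_rate": 0.0,
            "test_pass_rate": 0.0,
            "avg_test_seconds": 0.0,
        }
    total_tests = sum(b.tests_run for b in builds)
    total_passed = sum(b.tests_passed for b in builds)
    return {
        "builds": len(builds),
        "build_success_rate": sum(b.succeeded for b in builds) / len(builds),
        # Guard against builds that ran no tests at all.
        "test_pass_rate": total_passed / total_tests if total_tests else 0.0,
        "avg_test_seconds": sum(b.test_seconds for b in builds) / len(builds),
    }
```

Because the figures are computed from data the tools already capture, nobody has to collect them by hand, which is where the reduction in metrics overhead comes from.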
Our next blog posting in this series will overview development-oriented strategies.