What is Cloud Operations?
Cloud Operations (Cloud Ops) is a broad term for the procedures, tools and practices used to run IT services in cloud environments. When an organization builds out a Cloud Ops function, teams are typically structured around specialized disciplines. The skill sets and responsibilities of these teams can be mapped differently, depending on the nature of the organization and its phase of growth.
Roles and teams naturally evolve as an organization matures. It is not uncommon for a company that is starting out to have a single small cloud team that wears many hats. Certain specialties might be scaled up during the early build-out, with the focus shifting to other roles as the company and its products/services move into a run/maintain/improve/optimize phase.
The various roles and disciplines under this blanket term share a set of common principles and values. When thinking about how businesses are organized, especially technology/engineering divisions, I often map it to The Spine Model - a framework that an organization I worked in used for a number of years, and one that is still foundational to how I operate. Some of the values, principles and processes mentioned here draw on this framework (with needs and values being more strategic/abstract, and principles and processes more tactical/pragmatic).
Developers, DevOps & CloudOps
All disciplines share some common principles. In the technology / software engineering industry, teams or roles are often tagged as "DevOps team" or "DevOps Engineer". In the organizations where I have developed Cloud Ops teams, I have avoided using these terms for teams, roles or positions, based on how I understood the concept when I was introduced to it around 2010: it is a methodology or framework (or philosophy) of leveraging technology and automation, in a collaborative way, to reduce handovers and delays in delivery between software/product engineers (developers) and operational teams. The "Dev" refers to developers and the "Ops" to operations teams. I have always found slight irony in labeling an Ops team "DevOps".
CloudOps teams/individuals and developers collaborate to build systems, sharing responsibility for delivering better software faster within an engineering org. And so the principles of DevOps are core to CloudOps, while also being a pillar for software engineers. A first-to-mind question for adopters of DevOps is often "how can we allow each role to deliver on their requirements autonomously, reducing dependencies between teams and roles?"
A typical example of how software was built and delivered before (or without) DevOps: developers write code and check it into a repo; operations builds it (or devs build it and send it to ops); and then ops deploys it to servers, including config updates. This consumes engineering resources and reduces transparency. With DevOps, a CloudOps team would focus on building a CI/CD solution aligned to the SDLC, with feedback loops so devs can see the result of builds, and then allow devs, release managers or other business roles to progress software updates through environments.
This touches on some of the common principles of an organization embracing DevOps, which are also cornerstones of CloudOps teams:
- Shared Responsibility: delivering software to the customer is the joint responsibility of operations teams and developers. It is the opposite of hard boundaries between writing code and delivering it, and looks to get rid of "throwing it over the wall" and "it worked on my machine" rhetoric.
- Collaboration & Communication: cultivate a no-blame environment where success is collective. Transparency is promoted (seeing how systems work, and seeing what work is planned and being delivered).
- Automate [repetitive processes]: the previous two principles are somewhat abstract and cultural; automation is tangible. Having someone run a process manually for a few hours every week is less efficient than having someone develop a system that runs the process. Automation is often most prevalent in CI/CD and the SDLC. It can be framed as a by-product of a collaborative mindset and of sharing responsibility for delivery.
- Continuous Improvement: Kaizen is a wonderful philosophy of constantly evaluating processes and tools and iteratively improving them (here, the delivery process).
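The self-service promotion of software through environments described earlier in this section can be sketched as a small gate. This is only an illustration under assumptions: the environment names, the `Release` type and the `promote` function are hypothetical, and a real CI/CD system would enforce these rules in its own pipeline configuration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical environment order; real pipelines define their own.
ENVIRONMENTS = ["dev", "staging", "prod"]


@dataclass
class Release:
    version: str
    build_passed: bool  # feedback from the CI build, visible to developers
    deployed_to: List[str] = field(default_factory=list)


def promote(release: Release, target: str) -> List[str]:
    """Deploy to `target` only if the build passed and the previous
    environment in the chain already has this release."""
    if target not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {target}")
    if not release.build_passed:
        raise RuntimeError("build failed; nothing to promote")
    index = ENVIRONMENTS.index(target)
    if index > 0 and ENVIRONMENTS[index - 1] not in release.deployed_to:
        previous = ENVIRONMENTS[index - 1]
        raise RuntimeError(f"{release.version} must reach {previous} first")
    if target not in release.deployed_to:
        release.deployed_to.append(target)
    return release.deployed_to
```

The point of the gate is that whoever triggers the promotion (a developer, a release manager) does not need an ops engineer in the loop; the rules are encoded once and applied automatically.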
Core CloudOps or Cloud Engineering
The primary focus of Cloud Engineering (CloudEng) is the building, running and governance of cloud-based infrastructure. Effective CloudOps teams develop infrastructure as code and follow the SDLC to some extent. They ensure availability, visibility, governance, security, cost management and infrastructure automation for the cloud.
For small organizations / startups, this will often be the first cloud operations role. These engineers are also key stakeholders in defining and delivering the CI/CD system and the monitoring and alerting, and are involved in incident response (at both the infrastructure AND service layers).
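The core idea behind infrastructure as code mentioned above is that resources are declared as data and a tool works out the changes needed to reach that state. The following is a minimal sketch of that comparison step only; the `plan` function and the resource names are hypothetical and do not represent any particular tool's API.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]  # resource name -> declared settings


def plan(desired: Resources, actual: Resources) -> List[Tuple[str, str]]:
    """Compare declared resources with what exists and list the changes
    needed to make reality match the declaration."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name))
        elif name not in desired:
            changes.append(("delete", name))
        elif desired[name] != actual[name]:
            changes.append(("update", name))
    return changes
```

Because the desired state lives in version control, a change to infrastructure can go through the same review and pipeline steps as application code.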
Site Reliability Engineering (SRE)
The concept of SRE was developed at Google in the 2000s. Its original description was "what happens when you ask a software engineer to design an operations team". The purpose of SRE is to ensure services run reliably: that they stay available, performant and secure. Google has published some great SRE books, especially the workbook, that help in creating an effective SRE function.
SRE works in the layer where service meets infrastructure. Key areas owned by SRE are observability, monitoring, alerting and incident response / on-call. In keeping with DevOps principles, SREs work with developers to define observability standards (e.g. ensuring software/services expose metrics, and setting logging and tracing standards) and to implement monitoring systems (centralized log aggregation, dashboarding and reporting). They help define Service Level Objectives (SLOs, the internal reliability targets that sit behind SLAs) and the alerting and response processes for when SLOs are not met.
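One common way to turn an SLO into an alerting signal is an error budget: the number of failures the target allows over a window. The sketch below assumes a simple request-based availability SLO; the `Slo` type, the function names and the 25% alert threshold are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget_remaining(slo: Slo, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent in the window.

    1.0 means no budget used; 0.0 means exhausted; negative means
    the SLO has been missed.
    """
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures == 0:
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def should_alert(slo: Slo, total: int, failed: int,
                 threshold: float = 0.25) -> bool:
    """Alert once less than `threshold` of the budget is left (assumed policy)."""
    return error_budget_remaining(slo, total, failed) < threshold
```

For example, with a 99.9% target and 100,000 requests in the window, roughly 100 failures are allowed, so 50 failures leaves about half of the budget.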
An SRE team can often take on responsibilities in the CI/CD pipeline (taking over some aspects from Cloud Eng), bringing a more operational mindset to how services are deployed and run (e.g. reliability monitoring during deploys, automated rollbacks). Developing SRE in an organization often starts with a core / central SRE team. There is also a lot of value in having mature SREs embedded in dev teams.
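Reliability monitoring during a deploy with an automated rollback can be sketched as a progressive rollout loop. This is a simplified illustration: `progressive_rollout`, the traffic steps and the `error_rate_at` callback are hypothetical stand-ins for whatever the deployment tooling and monitoring system actually provide.

```python
from typing import Callable, Sequence, Tuple


def progressive_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Shift traffic to the new version step by step.

    `error_rate_at(percent)` is assumed to return the observed error rate
    after `percent` of traffic has been moved. The rollout stops and reports
    a rollback as soon as that rate exceeds `max_error_rate`.
    Returns the outcome and the last traffic percentage that was healthy.
    """
    last_healthy = 0
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", last_healthy)
        last_healthy = percent
    return ("promoted", last_healthy)
```

In practice the threshold would usually be derived from the service's SLO rather than chosen ad hoc.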
Strong SRE teams usually have software development / architecture experience. Some common "unreliability" issues I have seen in the service layer come from less experienced developers: not considering that their code will run on many machines rather than on one machine as it does on their computer, and so not handling concurrency well; not giving much thought to connection management / connection pooling; and writing poor retry logic. SREs who are familiar with patterns like retry backoff windows and circuit breakers can help identify the causes of such issues, preferably far left in the delivery cycle (by being included in code reviews) rather than discovering them in prod.
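For readers unfamiliar with the two patterns named above, here is a minimal sketch of retry with capped exponential backoff and jitter, and of a basic circuit breaker. All names, defaults and thresholds are illustrative assumptions; production code would normally retry only on errors known to be transient and would need thread safety.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on any exception, waiting a random time up to a
    cap that doubles after each failed attempt (full jitter)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries
    raise AssertionError("unreachable")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call
    once `reset_timeout` seconds have passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one call through as a trial.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

Backoff protects a struggling dependency from a flood of immediate retries, while the breaker stops callers from waiting on a dependency that is known to be down.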
Platform Engineering
I have seen a number of engineering organizations fall into the trap of developers thinking "we write product code, cloud ops handles the automation", along with other common DevOps anti-patterns. With Cloud Engineering delivering a lot of automation for infrastructure, anything automated can sometimes fall back to them if there is a lack of automation skills in the developer teams. And so (again, in more mature teams), a Platform Engineering function may be created that focuses on developing automation for the developer environment and delivering an internal developer platform. This could be tooling that helps replicate a prod-like local environment, or libraries that ensure consistency / alignment with standards - for example, an SDK that handles authentication for multiple components, or provides a standardized metrics and logging implementation based on the standards laid out by SRE.
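As an illustration of the kind of shared library described above, the sketch below shows a small logging helper that every service could import so that log lines share one structure. The field names and the `get_logger` helper are assumptions made for this example; an actual platform SDK would follow whatever standard the SRE team has defined.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,  # hypothetical standard field
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger pre-configured with the shared JSON format.

    `stream` defaults to stderr, as with logging.StreamHandler.
    """
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if not logger.handlers:  # configure only once per service name
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger
```

A service would call `get_logger("orders")` once at startup and log as usual; the consistent fields are what make centralized log aggregation and dashboards straightforward to build.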