Friday, 3 February 2023

Implementing Security Governance at Scale in AWS Organizations

Introduction

Many organizations establish security policies and compliance requirements, but translating those requirements into consistent technical controls across a growing AWS environment can be challenging.

AWS provides the building blocks through Organizations, Control Tower, Service Control Policies (SCPs), Security Hub, and other native services. However, these services alone do not define a security operating model. Organizations still need a framework that connects governance requirements to technical implementation and ongoing compliance monitoring.

This article outlines a practical security governance model built on AWS Control Tower and AWS Organizations. The model focuses on three key objectives:

Prevent non-compliant configurations from being deployed.
Detect security and compliance drift when it occurs.
Automatically remediate high-risk violations.

The result is a repeatable approach that allows cloud environments to remain aligned with business security policies and standards while maintaining the flexibility expected from modern cloud platforms.

From Policy to Implementation

Every security program begins with policies and standards.

Policies are typically owned by governance, risk, compliance, or security teams and define organizational expectations. Examples might include:

Approved AWS regions
Identity and access management requirements
Logging and audit requirements
Vulnerability management expectations
Network security requirements

The cloud platform team is then responsible for implementing technical controls that enforce those requirements.

To bridge this gap, I advocate maintaining a formal Security Model document. This document serves as the implementation specification for cloud security controls.

A typical purpose statement might read:

This document details specific configurations in the AWS Organization that implement the controls for the organization's multi-account AWS environment, ensuring alignment with security policies and standards. The configuration establishes preventative, detective, and remediative mechanisms to maintain confidentiality, integrity, and availability while meeting customer and industry expectations.

The document should define:

Purpose and scope
Applicable policies and standards
Stakeholders / Review Board members (owner, approvers)
Preventative controls
Detective controls
Remediative controls
Exceptions and compensating controls

Importantly, ownership of the security model should reside with a stakeholder group that periodically reviews and updates the implementation. In some organizations this may be a dedicated governance or compliance function. In others, security governance may be embedded directly within cloud platform engineering.

Establishing the Foundation with AWS Control Tower

The foundation of this model is an AWS Landing Zone implemented using AWS Control Tower.

Control Tower provides:

AWS Organizations management
Organizational Units (OUs)
Centralized identity integration
Audit account
Log Archive account
Standardized account provisioning

This creates a consistent multi-account architecture where controls can be deployed uniformly. While Control Tower provides the foundation, most organizations quickly require controls beyond those available through native guardrails.

This is where Customizations for AWS Control Tower (CfCT) becomes a critical component.

CfCT allows organizations to define a deployment manifest that can:

Deploy Service Control Policies (SCPs)
Deploy resources through CloudFormation stack sets across accounts and regions.
Maintain consistent configuration across the organization

In effect, CfCT becomes the mechanism that transforms governance requirements into enforceable technical controls.

Layer 1: Preventative Controls

Preventative controls stop non-compliant actions before they occur.

Restricting AWS Regions

One of the most effective organizational controls is restricting resource deployment to approved AWS regions.

For many organizations there is no business requirement to operate globally. Allowing unrestricted deployment increases attack surface and can create regulatory concerns around data residency.

A common approach is implementing an SCP that:

Denies actions in non-approved regions
Excludes global AWS services such as IAM and Route 53
Allows operations only in explicitly approved regions

This immediately prevents accidental or unauthorized deployment outside approved geographic boundaries.
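As a rough illustration of the SCP described above, the policy document can be generated from a list of approved regions. This is a minimal sketch rather than a production policy: the region list and the set of exempted global-service actions are hypothetical placeholders, and a real exemption list is usually longer and should come from your own security model.

```python
import json

# Hypothetical values for illustration only; substitute the regions and
# global-service exemptions approved in your own security model.
APPROVED_REGIONS = ["us-east-1", "us-west-2"]
GLOBAL_SERVICE_ACTIONS = ["iam:*", "route53:*", "organizations:*", "sts:*"]


def build_region_restriction_scp(approved_regions, exempt_actions):
    """Return an SCP document that denies requests outside approved regions."""
    if not approved_regions:
        raise ValueError("at least one approved region is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                # NotAction exempts global services from the regional deny.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            }
        ],
    }


scp = build_region_restriction_scp(APPROVED_REGIONS, GLOBAL_SERVICE_ACTIONS)
print(json.dumps(scp, indent=2))
```

The printed JSON could then be stored alongside the CfCT manifest so that the approved-region list lives in version control.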

Eliminating Root User Activity

Another foundational control is preventing use of AWS account root users.

In a Control Tower environment, users should authenticate through IAM Identity Center or an approved identity provider. Administrative access should be granted through federated roles with least-privilege permissions.

An SCP can deny actions when the principal matches:

arn:aws:iam::*:root

This effectively removes root user activity from normal operations while preserving emergency break-glass procedures when required.

The result is a significantly reduced risk profile and improved accountability through centralized identity management.
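The root-user restriction can be sketched in the same way. The statement below is an assumption about how such a policy might look, matching the principal pattern shown above through the `aws:PrincipalArn` condition key. Note that SCPs do not apply to the organization's management account, so root access there needs separate protection.

```python
import json


def build_deny_root_scp():
    """Return an SCP document that denies all actions by account root users."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRootUser",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    # Matches the root principal of any member account.
                    "StringLike": {"aws:PrincipalArn": ["arn:aws:iam::*:root"]}
                },
            }
        ],
    }


root_scp = build_deny_root_scp()
print(json.dumps(root_scp, indent=2))
```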

Layer 2: Detective Controls

Preventative controls are important, but they cannot address every scenario. Organizations also need visibility into compliance posture and configuration drift.

Creating a Central Security Services Account

In several AWS Organizations, I have introduced a dedicated Security Services account alongside the standard Control Tower Audit and Log Archive accounts.

This account becomes the centralized management plane for security tooling.

Typical deployments include:

AWS Security Hub
Amazon GuardDuty
AWS Firewall Manager
Additional monitoring integrations (Macie, Inspector, etc)

The objective is to provide a single operational view of security posture across all AWS accounts. This configuration can complement a third-party SIEM, or act as a SIEM if an external option is not available.

Cross-Account Security Management

Using CfCT, standardized CloudFormation deployments can establish:

Security administration roles
Cross-account execution roles
Security Hub configuration
GuardDuty enrollment
Supporting automation

A common pattern includes:

SecurityServicesRole in the Security Services account.
SecurityServicesExecutionRole deployed to member accounts.
Automated role assumption for centralized administration.

This creates secure delegated administration without requiring direct access to every account.

Automated Account Enrollment

As organizations scale, manual onboarding becomes unsustainable.

A lightweight Lambda function can monitor AWS Organizations for newly created accounts and automatically:

Send Security Hub invitations from Security Services
Accept invitations in the target account and region.
Configure delegated administration
Enable required security services

This ensures newly provisioned accounts immediately inherit the organization's security baseline.
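A minimal sketch of the enrollment logic follows. It makes several assumptions that are not from the original implementation: the trigger is a Control Tower `CreateManagedAccount` lifecycle event delivered through EventBridge, Security Hub is already enabled in the new account (for example through a stack set), and `admin_client` and `member_client` are hypothetical factories that return boto3-style Security Hub clients for a region using the cross-account roles described earlier.

```python
def new_account_id(event):
    """Return the account ID from a successful CreateManagedAccount event, else None."""
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateManagedAccount":
        return None
    status = detail.get("serviceEventDetails", {}).get("createManagedAccountStatus", {})
    if status.get("state") != "SUCCEEDED":
        return None
    return status.get("account", {}).get("accountId")


def enroll_in_security_hub(account_id, email, admin_account_id, regions,
                           admin_client, member_client):
    """Invite the account from Security Services and accept in each region.

    admin_client(region) and member_client(region) are hypothetical factories
    returning Security Hub clients with the appropriate assumed-role credentials.
    Returns the list of regions where the invitation was accepted.
    """
    enrolled = []
    for region in regions:
        admin = admin_client(region)
        admin.create_members(AccountDetails=[{"AccountId": account_id, "Email": email}])
        admin.invite_members(AccountIds=[account_id])

        member = member_client(region)
        for invitation in member.list_invitations().get("Invitations", []):
            # Only accept invitations sent by the Security Services account.
            if invitation["AccountId"] == admin_account_id:
                member.accept_administrator_invitation(
                    AdministratorId=admin_account_id,
                    InvitationId=invitation["InvitationId"],
                )
                enrolled.append(region)
                break
    return enrolled
```

Passing the clients in as factories keeps the role-assumption details out of the enrollment logic and makes the function easy to exercise with stand-in clients before it touches real accounts.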

Layer 3: Remediative Controls

The final layer addresses a reality of cloud operations: non-compliant resources will occasionally be created despite preventative controls.

Rather than relying solely on manual intervention, organizations can implement automated remediation.

Moving Beyond Detection

Security Hub provides excellent visibility into security findings. However, detection alone often creates operational overhead. For frequently recurring findings, remediation can be automated.

Consider a common example:

> A developer creates a security group allowing SSH or RDP access from: 0.0.0.0/0

This violates controls found in multiple standards including CIS Benchmarks and AWS Foundational Security Best Practices.

Security Hub will detect the issue, but detection alone leaves a window of exposure.

Auto-Compliant Controls

An alternative approach is to treat selected findings as self-healing controls.

The workflow becomes:

Security Hub generates a finding.
EventBridge receives the event.
A remediation Lambda is invoked.
The Lambda modifies the resource.
Compliance is restored automatically.

For example:

Remove unrestricted SSH ingress rules.
Remove unrestricted RDP ingress rules.
Restore required logging settings.
Re-enable security monitoring services.

The resource remains available, but the non-compliant configuration is corrected within seconds. This approach significantly reduces operational risk while minimizing the burden on security teams.
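The core decision inside such a remediation Lambda can be sketched as a pure function. This is an illustrative assumption rather than the original implementation: it takes ingress rules in the shape returned by the EC2 `describe_security_groups` API and returns only the world-open entries on rules that cover SSH (22) or RDP (3389). Treating IPv6 `::/0` and all-protocol rules as in scope is a design choice made here, not something the workflow above specifies.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}
ADMIN_PORTS = (22, 3389)  # SSH and RDP


def covers_admin_port(permission):
    """True if the rule's protocol and port range include SSH or RDP."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":  # all protocols and ports
        return True
    if protocol != "tcp":
        return False
    low, high = permission.get("FromPort"), permission.get("ToPort")
    if low is None or high is None:
        return False
    return any(low <= port <= high for port in ADMIN_PORTS)


def open_admin_ingress(ip_permissions):
    """Return the ingress entries to revoke, leaving restricted CIDRs untouched."""
    to_revoke = []
    for permission in ip_permissions:
        if not covers_admin_port(permission):
            continue
        v4 = [{"CidrIp": r["CidrIp"]} for r in permission.get("IpRanges", [])
              if r.get("CidrIp") in OPEN_CIDRS]
        v6 = [{"CidrIpv6": r["CidrIpv6"]} for r in permission.get("Ipv6Ranges", [])
              if r.get("CidrIpv6") in OPEN_CIDRS]
        if not v4 and not v6:
            continue
        # Keep the original protocol and port range so the entry matches the rule.
        rule = {"IpProtocol": permission["IpProtocol"]}
        for key in ("FromPort", "ToPort"):
            if key in permission:
                rule[key] = permission[key]
        if v4:
            rule["IpRanges"] = v4
        if v6:
            rule["Ipv6Ranges"] = v6
        to_revoke.append(rule)
    return to_revoke
```

In a Lambda, the returned list could be passed to the EC2 client's `revoke_security_group_ingress(GroupId=..., IpPermissions=...)` call. A rule with a wide port range is revoked as a whole, so teams may prefer to alert rather than auto-remediate in that case.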

Managing Standards and Compensating Controls

A common misconception is that security standards should be treated as immutable checklists and as the sole, complete source of truth for controls.

Standards such as:

CIS Benchmarks
AWS Foundational Security Best Practices (FSBP)
PCI DSS
NIST frameworks

should be viewed as inputs into the security model rather than requirements that must be adopted verbatim. Organizations frequently implement compensating controls that provide equivalent or stronger protection.

For example, AWS FSBP control IAM.6 recommends hardware MFA on root accounts.

If the organization has implemented an SCP that blocks all root account activity, a governance review board may determine that the risk is already mitigated and approve disabling the finding.

Such decisions should be formally documented, reviewed, and approved as part of the security model governance process.

The objective is not perfect compliance with every benchmark. The objective is effective risk reduction aligned to business requirements.

Closing Thoughts

Effective cloud security governance is not achieved through policies alone, nor through tooling alone. It requires a deliberate connection between governance requirements and technical implementation.

By combining AWS Control Tower, AWS Organizations, Customizations for AWS Control Tower, Security Hub, and automated remediation patterns, organizations can establish a security framework that continuously enforces policy, monitors compliance, and corrects drift at scale.

The most successful implementations treat security as a living system rather than a static configuration. Policies evolve, risks change, and cloud platforms grow. A well-defined security model, reviewed by the appropriate stakeholders and implemented through automation, provides a sustainable mechanism for maintaining security and compliance across the AWS environment.

Sunday, 20 November 2022

Building an Engineering Incident Response Program: A Practical Maturity Model for Cloud Operations

In building cloud operations and SRE capability within regulated financial environments, one of the most impactful investments I made was the development of an engineering incident response program.

At several organizations where I have worked on financial advisory platforms (including Jemstep and intelliflo), reliability was not simply an operational concern; it was a business and compliance requirement. As we modernized the cloud operating model, it became clear that incident response could not remain an informal or reactive discipline. It needed to evolve into a structured and proactive capability: one that aligned engineering execution, operational risk, and business expectations.

What follows is a pragmatic maturity model for building such a program—based on real-world implementation, continuous refinement, and operational learning.

1. Incident Response as a Maturity Journey, Not a One-Time Implementation

A common misconception is that incident response is a program you “implement” and then operationalize. In practice, it is better understood as a continuously evolving maturity model.

At a high level, the lifecycle looks like this:

Initial program definition (designing structure and expectations)
Operationalization (embedding into engineering and support workflows)
Continuous validation (testing assumptions through real and simulated incidents)
Continuous improvement (refining based on outcomes and organizational change)

The key insight is that the program is never “finished.” Instead, it stabilizes into a governed system of processes that is continuously stress-tested and improved.

2. Core Incident Lifecycle: From Detection to Resolution

While maturity evolves over time, the operational incident flow remains consistent:

Detect → Alert → Triage → Respond → Resolve → Learn

Detection & Alerting

Detection is driven by observability systems and defined SLO thresholds. In practice, this includes telemetry from platforms such as Prometheus, alert routing via Alertmanager, and log aggregation through tools such as Splunk.

Alerts are not just technical signals—they represent defined breaches or risks against expected service behavior.

Triage and Severity Classification

Once an alert is triggered, the first critical function is triage and severity classification.

We used a formal severity model:

SEV 1–SEV 4 classification
Severity determined by:
- Scope of impact (number of users or systems affected)
- Functional impact (degradation vs full outage)
- Alignment to SLOs derived from customer SLAs

This distinction is important: severity is not purely technical—it is anchored in customer impact and contractual expectations.
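To make the classification concrete, the three inputs above can be combined in a small function. The thresholds here are hypothetical and chosen only for illustration; in practice they would be derived from the SLOs and customer SLAs in force.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    affected_fraction: float  # share of users or systems affected, 0.0 to 1.0
    full_outage: bool         # True for a full outage, False for degradation
    slo_breached: bool        # True if an SLO derived from a customer SLA is breached


def classify_severity(impact):
    """Return a severity from 1 (most severe) to 4, using illustrative thresholds."""
    if not 0.0 <= impact.affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0.0 and 1.0")
    if impact.full_outage and impact.affected_fraction >= 0.5:
        return 1
    if impact.slo_breached and (impact.full_outage or impact.affected_fraction >= 0.25):
        return 2
    if impact.slo_breached or impact.affected_fraction >= 0.05:
        return 3
    return 4
```

Encoding the model this way keeps triage decisions consistent between responders, while the thresholds themselves remain a governance decision.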

3. On-Call Model and Cross-Functional Response

Once severity is established, incidents are routed through an on-call structure designed to ensure rapid engagement across functional boundaries.

A typical rotation spans:

Infrastructure and platform engineering
Software engineering teams
Support and operations functions
Business stakeholders for high-severity incidents

We leveraged tooling such as PagerDuty and Amazon Incident Response capabilities to coordinate escalation, paging, and structured response workflows.

The intent is not simply to “wake people up,” but to ensure the right expertise is engaged quickly, based on the nature of the incident.

A key learning here is that effective incident response is inherently cross-functional. Engineering alone cannot fully resolve high-severity business-impacting incidents without operational and stakeholder alignment.

4. Designing the Program: From Structure to Operating Rhythm

When establishing these programs at various organizations, the initial focus was typically not tooling but structure.

The foundational design included:

A defined severity model aligned to SLOs
Clear escalation paths and ownership
On-call expectations and rotation structure
Incident communication protocols
Post-incident review standards

Once established, the program was embedded into the operating rhythm of engineering teams.

However, the most important shift was recognizing that design alone is insufficient. The program only becomes effective when it is continuously exercised and challenged.

5. Continuous Improvement Through Tabletop Exercises and Real Incidents

A critical component of maturity is the continuous validation of the incident response system itself.

We implemented regular tabletop exercises designed to simulate realistic incident scenarios. These exercises served two purposes:

Validate operational readiness
Expose gaps in process, tooling, or communication

Importantly, these were not “compliance exercises.” They were structured learning events, which we referred to as tabletop exercises. Each exercise recreated an incident scenario, based either on a previous real incident or on a hypothetical incident we had identified, without actually impacting production systems.

Over time, these exercises revealed systemic improvements:

Clarification of escalation paths
Refinement of severity definitions
Improved cross-team coordination
Faster alignment between technical and business stakeholders

In addition to simulations, real incidents served as the most valuable feedback loop.

6. Blameless Postmortems as a Learning System

One of the most important cultural components of the program was the adoption of blameless postmortems.

The intent is simple but powerful: incidents are treated as system outcomes, not individual failures.

Postmortems focused on:

What happened
Why it happened (systemically, not personally)
What signals were missed or unclear
What process or architectural change prevents recurrence

This approach shifts the organization from reactive firefighting to structured learning. Over time, it creates a compounding effect where each incident strengthens system resilience.

7. Measuring Success: From Activity to Operational Outcomes

A mature incident response program is not measured by how often it is used, but by how effectively it reduces impact and improves recovery.

Key metrics included:

MTTA (Mean Time to Acknowledge) – how quickly incidents are recognized and owned
MTTR (Mean Time to Resolve) – how quickly service is restored
Root cause analysis quality and completion rate
Trend analysis across incident frequency and severity

These metrics were aggregated over time to identify systemic patterns rather than isolated events. I have seen many incident response frameworks stagnate or operate ineffectively because they lacked a retrospective lookback loop to learn and improve. Preventative measures are so much more valuable than reactive ones.

Importantly, metrics were always interpreted in context of SLOs. The goal was not optimization of individual numbers, but improvement of service reliability and customer experience.
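As a small worked example, MTTA and MTTR can be computed from incident timestamps. This sketch assumes both are measured from the moment of detection, which is one common convention. The `Incident` record is a hypothetical shape, not the output of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_time_to_acknowledge(incidents):
    """Average time from detection to acknowledgement."""
    if not incidents:
        raise ValueError("no incidents to aggregate")
    return timedelta(seconds=mean(
        (i.acknowledged_at - i.detected_at).total_seconds() for i in incidents))


def mean_time_to_resolve(incidents):
    """Average time from detection to resolution."""
    if not incidents:
        raise ValueError("no incidents to aggregate")
    return timedelta(seconds=mean(
        (i.resolved_at - i.detected_at).total_seconds() for i in incidents))
```

Grouping incidents by severity or by service before calling these functions gives the trend view described above rather than a single headline number.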

8. Operational Maturity: What Changes Over Time

As the program matures, several shifts typically occur:

From reactive response → proactive detection and prevention
From informal coordination → structured cross-functional execution
From hero-driven recovery → repeatable systems-based resolution
From isolated incidents → trend-based reliability engineering

In regulated environments, this maturity is especially important. Incident response becomes not just an engineering capability, but part of the organization’s operational risk posture.

Closing Perspective

Building an incident response program is not a tooling exercise or a procedural checklist. It is a deliberate act of operational design—one that connects engineering execution, business continuity, and organizational risk management.

An important realization was that incident response maturity is not achieved through documentation alone. It is achieved through repetition, validation, and continuous refinement of both systems and behaviors.

For organizations developing similar capabilities, the key takeaway is this:

The value of an incident response program is not defined by how it behaves in steady state, but by how effectively it adapts under stress—and how quickly it learns from that stress to become better.

That is the essence of operational maturity in cloud engineering.

Saturday, 15 August 2020

Building Trust in the Cloud: Engineering Compliance into a SOC 2–Certified Financial Advisory Platform

In many conversations about cloud architecture, compliance is often treated as a parallel track—something handled by policy teams, auditors, or documentation exercises that sit adjacent to engineering work. In practice, especially in regulated financial environments, that separation doesn’t hold.

Compliance becomes part of the system design itself—but it is also a shared responsibility that spans multiple layers of the organization.

During my time as a Principal Cloud Architect supporting a cloud-based, multi-tenant SaaS advisory platform serving Registered Investment Advisors (RIAs), broker-dealers, and institutional financial firms, I worked within an environment governed by SOC 2 expectations and broader financial industry regulatory scrutiny. While cloud architecture played a central role in enabling operational controls, it was only one part of a broader ecosystem. Software engineering practices, product development standards, and organizational processes (such as HR-managed personnel controls and access lifecycle governance) collectively contributed to the overall compliance posture.

The platform operates as a trusted advisory system for financial institutions, where availability, confidentiality, and operational integrity are not abstract goals—they are foundational requirements of the product.

Over time, one principle became increasingly clear: compliance is not something any single function “owns.” It is something that must be continuously expressed through how the system behaves, how teams operate, and how organizational controls align across technical and non-technical domains.

Within that broader context, cloud architecture plays a critical role in enabling and enforcing many of the operational controls—but it succeeds only when it is aligned with disciplined engineering practices and supporting organizational processes.

The Environment: Where Trust Is a Product Requirement

The platform is a cloud-native, multi-tenant SaaS advisory system used by regulated financial institutions. Our customers are not simply evaluating functionality—they are evaluating trust.

That trust has multiple dimensions:

Can the system remain available when demand spikes?
Can sensitive financial data remain isolated and protected across tenants?
Can every action within the system be traced, explained, and audited?
Can operational risks be demonstrated as being actively managed rather than assumed?

SOC 2 provides the assurance framework for answering these questions, but it does not define how to build the system. That responsibility sits squarely with architecture and engineering leadership.

In that sense, SOC 2 was not a checklist—it was a set of expectations that had to be embedded into the operating model of the platform.

Turning Controls into System Behavior

One of the earliest design shifts was recognizing that compliance controls are only meaningful when they are observable in system behavior.

A written policy might state that access is restricted. A system design ensures that access is enforced, logged, and reviewable by default. A document might describe incident response procedures. A mature system ensures incidents are detected, escalated, and recorded automatically with traceable evidence.

The goal was to move from interpretation-based compliance to behavior-based compliance.

A guiding principle emerged:

If a control cannot be observed, measured, and validated continuously, it is not operationally real.

This principle influenced every major architectural decision—from infrastructure design to deployment pipelines to operational monitoring.

Designing for SOC 2 Principles in Practice

SOC 2 is often summarized in terms of trust principles such as Security, Availability, and Confidentiality. In practice, each of these translates into concrete architectural patterns and operational expectations.

Availability: Designing Systems That Must Stay Online

For a financial advisory platform, availability is not a convenience metric—it is a trust requirement. Customers expect consistent access to critical advisory capabilities, even under infrastructure stress or partial system failure.

This required architectural decisions around:

High availability across cloud availability zones (or regions). Our disaster scenarios included "What if California slid into the Pacific?" and "What if a meteor hit the East Coast?"
Resilient service design with failure isolation and "reducing blast radius"
Disaster recovery planning with defined recovery objectives (RTO/RPO)
Regular disaster recovery testing as a normal operational cadence, not an exceptional exercise

Importantly, resilience was not treated as a theoretical target. It was validated through repeatable operational testing, ensuring that recovery behavior was known rather than assumed.
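One way to make recovery behavior "known rather than assumed" is to record each disaster recovery test and compare it against the stated objectives. The sketch below is illustrative only; the record types and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of data loss


@dataclass(frozen=True)
class DrTestResult:
    time_to_restore: timedelta   # measured during the DR test
    data_loss_window: timedelta  # age of the newest data recovered


def evaluate_dr_test(objective, result):
    """Report whether a DR test met its recovery objectives."""
    return {
        "rto_met": result.time_to_restore <= objective.rto,
        "rpo_met": result.data_loss_window <= objective.rpo,
    }
```

Keeping these results over successive tests also produces an evidence trail that can be shown during an audit.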

Confidentiality: Protecting Data in a Multi-Tenant System

Multi-tenancy introduced a fundamental architectural constraint: data isolation was not optional, and it had to be consistently enforced across all system layers.

Confidentiality was implemented through:

Strong tenant isolation patterns at the application and data layers, primarily the data layer. Per-tenant cipher keys for data encryption are a great example of a strong control. Tenant identifiers built into application logic are often also suitable for isolation, but may be perceived as a weaker mechanism.
Encryption of data at rest and in transit
Controlled identity and access management with least-privilege principles
Segmented administrative access paths for sensitive operations

The key design intent was consistency. Controls needed to behave the same way across environments, services, and deployment states. Any deviation introduced both operational risk and audit complexity.
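The idea of per-tenant cipher keys can be illustrated with a key-derivation sketch. This is not the platform's actual mechanism, which is not described here. It shows one possible approach, deriving a distinct key per tenant from a master secret with HKDF (RFC 5869) built from the standard library. The salt label and function names are hypothetical, and in production this role is usually played by a managed key service.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(master_key, salt, info, length=32):
    """Derive `length` bytes from master_key using HKDF with SHA-256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key length is too large for HKDF")
    # Extract: concentrate the input key material into a pseudorandom key.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks bound to the context string `info`.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def tenant_key(master_key, tenant_id):
    """Return a 32-byte key that is unique to the given tenant identifier."""
    return hkdf_sha256(master_key, salt=b"tenant-key-derivation",
                       info=tenant_id.encode("utf-8"))
```

Because each tenant's data is encrypted under a different key, a defect in application-level tenant filtering does not by itself expose another tenant's plaintext.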

Security Monitoring: Making the System Observable

Security and compliance are only as strong as the visibility into system behavior.

This led to a strong emphasis on observability as a control mechanism:

Centralized logging, metrics, and alerting using tooling such as Prometheus, Loki, Splunk, CloudWatch, and similar platforms depending on environment constraints
System-level monitoring designed to detect abnormal patterns and operational degradation
Structured incident response workflows that ensured traceability from detection through resolution. A developed Incident Response Program should be considered a prerequisite.
Post-incident reviews that generated durable audit artifacts

Over time, observability became more than an operational tool—it became the primary mechanism through which control effectiveness was demonstrated.

Integrity: Ensuring Systems Behave as Intended

Integrity in a regulated system is not just about data correctness—it is about ensuring that change is controlled, reproducible, and auditable.

This was achieved through:

CI/CD pipelines with controlled promotion and review processes
Infrastructure-as-code using tools such as Terraform and CloudFormation
Configuration management practices using systems like Ansible, Chef, and Puppet to enforce consistency
Version-controlled system changes that ensured environments could be reproduced and audited

The effect was to remove ambiguity from system state. What existed in production was always traceable to a defined, reviewable change process.

Building a Governance Model That Scales with Engineering

As the platform matured, it became necessary to move beyond individual controls and toward a structured governance model that could scale with system complexity.

This internal governance framework was designed to operationalize SOC 2 requirements across engineering teams and services.

Rather than treating compliance as a set of external expectations, the model embedded it into:

Standardized control definitions mapped directly to system behavior
Reusable architectural patterns across services and environments
Continuous validation of controls through monitoring and automation
Reduction of manual effort required during audit cycles

A key design objective was consistency: once a control was defined, it had to behave the same way everywhere it was applied.

While this framework remained internal and was never externally published, its intent was simple—make compliance a natural outcome of how systems are built and operated, rather than a separate workflow.

Continuity Across Regions and Growth Phases

My experience working with the US-based platform from both South Africa and the United States highlighted an important reality: regulatory expectations in financial systems do not meaningfully change with geography, but operational complexity often does.

This reinforced the importance of portability in governance:

Controls needed to apply consistently across distributed teams
Operational expectations had to remain stable across organizational transitions
Architecture patterns had to support scale without weakening compliance guarantees

In effect, the governance model needed to function as a platform capability, not a local team practice.

Audit Reality: When Compliance Holds Under Scrutiny

A particularly meaningful milestone in this environment was achieving SOC 1 audit readiness followed by SOC 2 continuous monitoring, resulting in certification with no additional requirements, findings, or remediation requests on first pass.

While audits are often viewed as the goal, this outcome reflected something more significant: the controls were not being validated for the first time by auditors. They had already been operating as part of the system.

In other words, the audit did not introduce new expectations—it confirmed existing behavior.

That distinction is important. It reflects a maturity model where compliance is not “prepared for” but already embedded in production reality.

Lessons Learned: What Matters in Regulated Cloud Architecture

Several enduring lessons emerged from building and operating in this environment:

First, compliance is most effective when it is designed into systems early. Retrofitting controls after systems are built introduces friction and inconsistency.

Second, observability is not just an operational tool—it is a compliance enabler. If system behavior cannot be observed, it cannot be governed.

Third, automation is essential. Manual processes do not scale to the level of assurance required in regulated environments.

Finally, governance is most effective when it is invisible in daily operations. The strongest compliance frameworks are the ones engineers do not need to “think about” in order to follow—they are simply how the system works.

Closing: The Cloud Architect as a Designer of Trust

In regulated financial environments, cloud architecture is not just about performance, scalability, or cost optimization. It is about designing systems that institutions can trust with sensitive data and critical financial workflows.

However, that trust is not created by cloud architecture alone. It emerges from a shared responsibility model that spans cloud infrastructure, software engineering practices, product design decisions, and supporting organizational controls such as access management and personnel-based governance. Each layer contributes to the overall assurance posture, and misalignment in any one area can weaken the system as a whole.

This shifts the role of a cloud architect in a meaningful way.

It is no longer just about building infrastructure. It is about designing systems where trust, compliance, and operational reliability are emergent properties of coordinated design across engineering and business functions.

In that sense, compliance is not the end goal. It is the outcome of well-aligned systems—technical and organizational—that behave predictably, transparently, and reliably under real-world conditions.

And in regulated financial platforms, that consistency of behavior across all contributing functions is ultimately what defines trust.

Friday, 20 March 2020

Principles and Disciplines of Cloud Operations

What is Cloud Operations

Cloud Operations is a broad term describing procedures, tools and practices for running IT services in cloud environments. When building out a Cloud Ops function within an organization, teams are typically built out in line with specialized disciplines. The skill-set and requirements of these teams can be mapped differently, depending on the nature of the organization and the phase of growth.

Roles and teams naturally evolve as an organization matures. It is not uncommon to have a single cloud team that wears many hats, often a small team in companies that are starting out. Certain specialties might be scaled up during early build-out, with a shift in focus to other roles as the company and its products/services move more into a run / maintain / improve / optimize phase.

The various roles and disciplines under this blanket term share a set of common principles and values. When thinking about how businesses are organized, especially technology and engineering divisions, I often map it to The Spine Model, a framework that an organization I worked in used for a number of years and that remains foundational to how I operate. Some of the values, principles and processes mentioned here draw on this framework (with needs and values being more strategic/abstract, and principles and processes becoming more tactical/pragmatic).

Developers, DevOps & CloudOps

All disciplines share some common principles. In the technology / software engineering industry, teams or roles are often tagged as "DevOps team" or "DevOps Engineer". In the organizations where I have developed Cloud Ops teams, I have avoided using these terms for teams, roles or positions, based on how I understood the concept when I was introduced to it around 2010: it is a methodology or framework (or philosophy) of leveraging technology and automation to reduce handovers and delays in delivery between software/product engineers (developers) and operational teams in a collaborative way. The "Dev" refers to developers and the "Ops" to operations teams. I have always found slight irony in labeling an Ops team "DevOps".

CloudOps teams/individuals and developers collaborate to build systems, with shared responsibility for delivering better software faster within an engineering org. And so, the principles of DevOps are core to CloudOps, while also being a pillar for software engineers. A first-to-mind question for adopters of DevOps is often "how can we allow each role to deliver on their requirements autonomously, reducing dependencies between teams and roles?".

A typical example of how software was built and deployed before or without DevOps would be developers writing code and checking it into a repo; operations building it (or devs building it and sending it to ops); and then ops deploying it to servers, including config updates. This consumes engineering resources and reduces transparency. With DevOps, a CloudOps team would focus on building a CI/CD solution aligned to the SDLC, with feedback loops so devs can see the result of builds, and then allow devs, release managers, or other business roles to progress software updates through environments.

This has touched on some of the common principles of an organization embracing DevOps, which are also cornerstones of CloudOps teams:

Shared Responsibility: it is the responsibility of both operations teams and developers to deliver software to the customer. This is the opposite of hard boundaries between writing code and delivering it, and aims to get rid of "throwing it over the wall" and "it worked on my machine" rhetoric.
Collaboration & Communication: these cultivate a no-blame environment. Success is collective. Transparency is promoted (seeing how systems work, seeing what work is planned and being delivered).
Automate [repetitive processes]: the previous two principles are somewhat abstract and cultural; automation is tangible. Having someone run a process manually for a few hours every week is less optimal than having someone develop a system that runs the process. Automation is often most prevalent in CI/CD and the SDLC. It can be framed as a side effect or outcome of a collaborative mindset and of the shared responsibility for delivery.
Continuous Improvement: Kaizen is a wonderful philosophy of constantly evaluating processes and tools and iteratively improving (the delivery process).

Core CloudOps or Cloud Engineering

The primary focus of Cloud Engineering (CloudEng) is the building, running and governance of cloud-based infrastructure. Effective CloudOps teams develop infrastructure as code and follow the SDLC to some extent. They ensure availability, visibility, governance, security, cost management and infrastructure automation for the cloud.

For small organizations / startups, this will often be the first cloud operations role. The people in it are also key stakeholders in defining and delivering the CI/CD system and monitoring and alerting, and are involved in incident response (at both the infrastructure AND service layers).

Site Reliability Engineering (SRE)

The concept of SRE was developed by Google in the 2000s. Its original definition was "what happens when you ask a software engineer to design an operations team". The purpose of SRE is to ensure services run reliably: that they stay available, performant and secure. Google has published some great SRE books, especially the workbook, that help create an effective SRE function.

SRE works in the layer where service meets infrastructure. Key areas owned by SRE are observability, monitoring, alerting and incident response / on-call. In keeping with DevOps principles, SREs work with developers to define observability standards (e.g. ensuring software/services expose metrics, logging standards, tracing) and to implement monitoring systems (centralized log aggregation, dashboarding and reporting). They help define Service Level Objectives (the internal representation of SLAs) and the alerting and response processes for when SLOs are not met.
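To make the link between an SLO and alerting concrete, here is a minimal sketch of an error-budget check. It assumes an availability-style SLO measured as successful requests over total requests; the names SloWindow, error_budget_remaining and should_alert, and the 25% alert threshold, are hypothetical choices for illustration and not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if window.total_requests == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def should_alert(window: SloWindow, slo_target: float, threshold: float = 0.25) -> bool:
    """Alert once less than `threshold` of the budget remains (assumed policy)."""
    return error_budget_remaining(window, slo_target) < threshold
```

With a 99.9% target and one million requests, the budget is roughly 1,000 failed requests: 500 failures leaves about half of it, while 900 failures leaves about a tenth and would trip the assumed threshold.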

An SRE team can often take on responsibilities in CI/CD (taking over some aspects from Cloud Eng), bringing a more operational mindset to how services are deployed and run (e.g. reliability monitoring during deploys, automated rollbacks). Developing SRE in an organization often starts with a core / central SRE team. There is also a lot of value in having mature SREs embedded in dev teams.
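As a rough illustration of reliability monitoring during a deploy, the sketch below compares the error rate of a newly deployed version against the version it replaces and decides whether to roll back. The function name, the tolerance multiplier, the minimum request count and the floor are all assumptions made for this example; a real pipeline would take these from its own metrics and policy.

```python
def should_roll_back(baseline_error_rate: float, canary_errors: int, canary_requests: int,
                     tolerance: float = 2.0, min_requests: int = 100,
                     floor: float = 0.001) -> bool:
    """Decide whether a new release looks less reliable than the one it replaces."""
    if canary_requests < min_requests:
        return False  # too little traffic to judge either way
    canary_rate = canary_errors / canary_requests
    # The floor stops a near-zero baseline from making a single error trigger a rollback.
    allowed_rate = max(baseline_error_rate * tolerance, floor)
    return canary_rate > allowed_rate
```

A deploy step could call this periodically while traffic shifts to the new version, and trigger the rollback path as soon as it returns True.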

Strong SRE teams usually have software development / architecture experience. Some common "unreliability" issues I have seen in the service layer come from less experienced developers not considering that their code will not run on just one machine, as it does on their computer, and so not handling concurrency well; giving little thought to connection management / connection pooling; or writing poor retry logic. SREs who are familiar with patterns like retry backoff windows and circuit breakers in software code can help identify the causes of such issues, preferably far left in the delivery cycle (for example, by being included in code reviews) rather than discovering them in prod.
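For readers less familiar with the two patterns named above, here is a minimal sketch of each using only the Python standard library. The class and function names (CircuitBreaker, CircuitOpenError, retry_with_backoff) and all default values are hypothetical, chosen for illustration; production services would more likely use an established library and tune the thresholds to the dependency being called.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and a call is rejected without being attempted."""


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cooldown elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Double the delay ceiling each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The jitter spreads retries from many clients over time so they do not all hit a recovering dependency at once, and the breaker stops callers from waiting on a dependency that is known to be failing. The clock and sleep parameters are injectable mainly so the behavior can be tested without real delays.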

Platform Engineering

I have seen a number of engineering organizations fall into the trap of developers thinking "we write product code, cloud ops handles the automation", along with other common DevOps anti-patterns. With Cloud Engineering delivering a lot of automation for infrastructure, anything automated can sometimes fall back to them if there is a lack of automation skills in the developer teams. And so (again, in more mature teams), a Platform Engineering function may be created that focuses on developing automation for the developer environment and delivering an internal developer platform. This could be tooling that helps replicate a prod-like local environment, or libraries that ensure consistency / alignment with standards - for example, an SDK that handles authentication for multiple components, or provides a standardized metrics and logging implementation based on the standards laid out by SRE.
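As a small example of the kind of library a platform team might ship, the sketch below wraps Python's standard logging module so that every service emits log lines in the same JSON shape. The field names and the get_logger helper are hypothetical and stand in for whatever logging standard the SRE team has actually defined.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of field names."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_logger(service: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger using the shared JSON format; safe to call more than once."""
    logger = logging.getLogger(service)
    # Only attach a handler the first time, so repeated calls do not duplicate output.
    if not any(isinstance(h.formatter, JsonFormatter) for h in logger.handlers):
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # avoid a second, unformatted copy via the root logger
    return logger
```

A service would then call get_logger("payments") once at startup and log as usual; because the format lives in the shared library, a change to the standard becomes a library upgrade rather than an edit in every codebase.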

Closing Thoughts

As cloud adoption continues to mature, the principles and disciplines outlined here should not be viewed as a one-time checklist, but as an evolving foundation for how cloud platforms are designed, operated, and improved over time. Successful cloud environments are built on continuous alignment between architecture, operations, automation, and governance—each reinforcing the other as systems scale and change. Ultimately, organizations that treat these disciplines as living practices rather than static rules will be best positioned to deliver resilient, efficient, and adaptable cloud platforms that can meet both current and future demands.