DevOps
Definition
DevOps is the practice — equal parts cultural and technical — of collapsing the wall between the people who write software (Dev) and the people who run it in production (Ops), so that the same team owns a service end-to-end and ships changes in small, automated, low-risk increments. It is not a job title, not a tool, and not a stage in a pipeline; it is the operating model that produces continuous integration, continuous delivery, observability-driven incident response, and infrastructure-as-code. Reach for the DevOps framing whenever the bottleneck between “we have a code change” and “it’s serving traffic” is human handoff rather than engineering complexity.
Why it matters
Before DevOps became the default, software organizations were structured around quarterly releases and a hard handoff between development and operations. That model produced large, risky deployments, long lead times for trivial fixes, and a brittle understanding of how systems behaved in production. The evidence against it is substantial: the annual State of DevOps Report from DORA (DevOps Research and Assessment, now part of Google, and a widely cited source of industry data) has repeatedly found that “elite” performers (those who deploy multiple times per day, with lead times under an hour, change-failure rates under 15%, and time-to-restore under an hour) also report better organizational outcomes than “low” performers. Those four signals (lead time, deployment frequency, change-failure rate, time to restore) are now widely used engineering KPIs, precisely because they measure how well a team has adopted the DevOps loop.
The cost of not working this way compounds. Without trunk-based development and automated tests, every release is a research project. Without infrastructure-as-code, every environment is a snowflake and every incident is a forensics exercise. Without observability, your “is it broken?” answer comes from customer tickets. The DevOps model is what makes a small team able to operate a system that would have needed a war room ten years ago.
DevOps overlaps with — but is not the same as — three adjacent disciplines that frequently get conflated:
- SRE (Site Reliability Engineering) is Google’s prescriptive answer to “how do we operationalize the DevOps culture?”: explicit error budgets, SLI/SLO contracts, a hard cap on operational toil, blameless postmortems, and defined on-call rotations. SRE is one implementation of DevOps; DevOps is the broader cultural envelope.
- Platform Engineering is the practice of treating internal developer tools (CI templates, IaC modules, golden paths, internal developer platforms / IDPs) as products with their own users — application engineers — and their own roadmap. It is how a mature DevOps org scales beyond a single team.
- DevSecOps is DevOps with security shifted left into the same pipeline: SAST, dependency scanning, secret scanning, policy-as-code, and supply-chain attestation as gates inside the build, not afterthoughts inside a quarterly review.
How it works
DevOps in practice is a loop — Plan → Code → Build → Test → Release → Deploy → Operate → Observe → back to Plan — implemented as a single automated pipeline that takes a commit and either lands it in production or rejects it with a clear failure. The pipeline is the artifact; the team’s effectiveness is largely a function of how short, how fast, and how trustworthy that pipeline is.
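The "land it or reject it with a clear failure" behaviour can be sketched in a few lines of Python. This is a minimal illustration, not a real CI system: the stage names and the `run_pipeline` helper are hypothetical, and each stage is assumed to be a callable that returns `True` on success.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PipelineResult:
    accepted: bool               # True only if every stage passed
    failed_stage: Optional[str]  # name of the first failing stage, if any
    completed: List[str]         # stages that passed before the failure


def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> PipelineResult:
    """Run stages in order and stop at the first one that returns False."""
    completed: List[str] = []
    for name, check in stages:
        if not check():
            return PipelineResult(False, name, completed)
        completed.append(name)
    return PipelineResult(True, None, completed)


# Hypothetical stage checks; real ones would invoke linters, test runners, etc.
stages = [
    ("lint", lambda: True),
    ("unit-tests", lambda: True),
    ("build", lambda: True),
    ("integration-tests", lambda: False),  # simulated failure
    ("deploy", lambda: True),
]

result = run_pipeline(stages)
print(result.accepted, result.failed_stage, result.completed)
```

Stopping at the first failure is the design choice that keeps the failure message unambiguous: the commit is rejected by exactly one named stage, and nothing after it runs.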
The pipeline (CI/CD)
Continuous Integration (CI) is the practice of merging every developer’s work into a shared trunk many times a day, with an automated build + test gate. The signal is “is trunk green?” — and the gate must be fast enough (typically <10 minutes) that engineers run it on every change. Continuous Delivery (CD) extends that gate to produce a deployable artifact at every green commit, ready to be released to production with one click. Continuous Deployment (the more aggressive variant) removes the click — every green commit lands in production automatically, gated only by progressive-rollout signals.
A modern CI/CD pipeline has these stages, in roughly this order: source-control trigger → linters and formatters → unit tests → build (compile, container image) → integration tests → security scans (SAST, dependency-CVE, secret-leak) → artifact publish → infrastructure-as-code plan → staging deploy → smoke tests → progressive rollout to production (canary → percentage rollouts → 100%) → post-deploy monitoring and auto-rollback on SLO regression.
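The last stages of that list, progressive rollout with auto-rollback, can be sketched as a loop over traffic percentages. This is a simplified model under stated assumptions: `progressive_rollout` is a hypothetical helper, the error-rate threshold stands in for a real SLO check, and the observed error rates are made-up sample data.

```python
from typing import Callable, List


def progressive_rollout(
    steps: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> dict:
    """Advance through traffic percentages; roll back on the first regression."""
    last_good = 0
    for percent in steps:
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            # Regression detected: stop and report where the rollout failed.
            return {"status": "rolled_back", "failed_at": percent, "last_good": last_good}
        last_good = percent
    return {"status": "promoted", "failed_at": None, "last_good": last_good}


# Hypothetical observations: the new version degrades once it takes 50% of traffic.
observed_error_rates = {1: 0.001, 10: 0.002, 50: 0.030, 100: 0.030}

outcome = progressive_rollout(
    steps=[1, 10, 50, 100],
    error_rate_at=lambda pct: observed_error_rates[pct],
    max_error_rate=0.01,
)
print(outcome)
```

A real rollout controller would also wait for a soak period at each step and compare the canary against a baseline rather than a fixed threshold; both are omitted here for brevity.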
Infrastructure as Code (IaC)
Production environments are no longer mutated by hand. The cluster, the database, the networking, the IAM policies — all declared in version-controlled source (Terraform, Bicep, CloudFormation, Pulumi, Kubernetes manifests), reviewed by PR, applied by the pipeline. The reasons are familiar by now: reproducibility, drift detection, audit trail, blast-radius limits via PR review, and the ability to rebuild a region from a tag.
Two patterns dominate. Declarative, push-based IaC (Terraform, Pulumi) describes desired state and lets the tool compute the diff against current state when the pipeline runs it; this is the dominant pattern outside Kubernetes. GitOps (Argo CD, Flux) treats a Git repository as the source of truth for cluster state and runs a reconciler inside the cluster that continuously pulls and converges; this is the dominant pattern inside Kubernetes.
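Both patterns rest on the same core idea: compare desired state with current state and act only on the difference. The sketch below illustrates that idea with plain dictionaries. It is not how Terraform, Argo CD, or Flux are implemented; the `plan` and `reconcile` functions and the resource names are hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> resource configuration


def plan(desired: State, current: State) -> List[Tuple[str, str]]:
    """Compute the actions needed to move current state to desired state."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: State, current: State) -> State:
    """Apply the plan to a copy of current state and return the converged state."""
    converged = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = desired[name]
    return converged


desired = {"web": {"replicas": 3}, "db": {"size": "large"}}
current = {"web": {"replicas": 2}, "cache": {"size": "small"}}

print(plan(desired, current))
converged = reconcile(desired, current)
print(plan(desired, converged))  # an empty plan means no drift remains
```

In the pipeline-applied model, `plan` runs once per change and a human reviews it; in the GitOps model, something like `reconcile` runs in a loop inside the cluster.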
Observability and incident response
You can’t operate what you can’t see. The three pillars are metrics (numeric time-series such as request rate, error rate, and latency), logs (structured event records), and traces (causal chains across services). They are emitted by the application and infrastructure, ingested into a managed store (Datadog, Honeycomb, Grafana Cloud, Azure Monitor, CloudWatch), and surfaced through dashboards and alerts. Service Level Indicators (SLIs) are the specific signals that matter for users; Service Level Objectives (SLOs) are the targets (“99.9% of requests under 300 ms over a 30-day window”); the error budget is 1 - SLO (0.1% of requests in that example) and is the unit of currency between “ship faster” and “stabilize what we have.” When a service has burned its error budget, the team’s error-budget policy throttles or freezes feature deploys until the budget recovers.
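The error-budget arithmetic is small enough to work through directly. The sketch below assumes a request-based SLO of 99.9%; the request counts are invented sample numbers and the function names are hypothetical.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = error_budget(slo) * total_requests
    return 1.0 - bad_requests / allowed_bad


slo = 0.999

# Time-based view: a 30-day window has 43,200 minutes.
window_minutes = 30 * 24 * 60
print(round(error_budget(slo) * window_minutes, 1))  # 43.2 minutes of allowed downtime

# Request-based view: 10M requests so far, 4,000 of them bad.
remaining = budget_remaining(slo, total_requests=10_000_000, bad_requests=4_000)
print(round(remaining, 2))  # 0.6, so 60% of the budget is left

# A simple policy hook: freeze feature deploys once the budget is gone.
deploys_allowed = remaining > 0
print(deploys_allowed)  # True
```

Tying `deploys_allowed` to the pipeline is one way to make the budget binding rather than advisory.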
Incident response is blameless, follows a defined on-call rotation, and produces a post-incident review that focuses on systemic root cause rather than individual error. Toil — repetitive, manual, automatable operational work — is tracked and capped (Google’s SRE handbook puts the limit at 50% of an SRE’s time); when toil exceeds the cap, engineering work to automate it takes priority over feature work.
The DORA four
The shared scoreboard the whole industry now uses, originating in DORA’s research and now embedded in tools like GitHub Insights, Azure DevOps Analytics, and most APM vendors’ “delivery” dashboards:
- Deployment Frequency — how often does code reach production? Elite: multiple per day. Low: less than once a month.
- Lead Time for Changes — commit-to-production wall-clock time. Elite: under one hour. Low: over six months.
- Change-Failure Rate — percent of deploys that cause a degraded service. Elite: under 15%. Low: over 45%.
- Time to Restore Service — wall-clock time from incident detection to recovery. Elite: under one hour. Low: more than six months.
Pair them: high deployment frequency with low change-failure rate is the only healthy combination. High frequency with high failure is reckless; low frequency with low failure is brittle (the rare bad release is catastrophic).
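As a rough illustration, all four metrics can be derived from a list of deployment records. The `Deployment` record shape and the sample data below are assumptions made for this sketch; real tooling would pull these timestamps from source control, the deploy system, and the incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    # None means the deploy caused no incident; otherwise the time from
    # incident detection to recovery.
    time_to_restore: Optional[timedelta] = None


def dora_metrics(deploys: List[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a window of deployments."""
    if not deploys:
        raise ValueError("need at least one deployment")
    restores = [d.time_to_restore for d in deploys if d.time_to_restore is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(restores) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }


t0 = datetime(2024, 1, 1, 9, 0)
day = timedelta(days=1)
deploys = [
    Deployment(t0, t0 + timedelta(minutes=30)),
    Deployment(t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=40)),
    Deployment(t0 + day, t0 + day + timedelta(minutes=20), timedelta(minutes=45)),
    Deployment(t0 + day + timedelta(hours=3), t0 + day + timedelta(hours=3, minutes=50)),
]

metrics = dora_metrics(deploys, window_days=2)
for name, value in metrics.items():
    print(name, value)
```

Reporting the metrics together, rather than one at a time, is what makes the pairing above visible: a rising deploy frequency only counts as progress if the change-failure rate stays flat or falls.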
Common pitfalls
- Calling a tool “DevOps” doesn’t make a team DevOps. Adopting Jenkins/GitHub Actions/Azure Pipelines without changing how teams are structured produces “DevOps theater” — a pipeline, but no shared ownership of production.
- Hand-off cultures masquerading as DevOps. A separate “DevOps team” that owns the pipeline is just operations renamed; the model breaks because the dev team has no incentive to make their code operable. The fix is “you build it, you run it” — application teams carry their own pager.
- Long-running feature branches. Branches that live for weeks defeat CI’s purpose; trunk goes stale, merges become research projects, and the change-failure rate climbs. Use feature flags + trunk-based development instead.
- Manual production changes. A `kubectl apply -f` from a laptop, an `az` command run by an SRE during an incident: any change not committed to source breaks drift detection and audit. Use break-glass procedures with explicit, time-bounded, audited access; don’t normalize bypassing the pipeline.
- Coverage as a quality proxy. 95% line coverage with no assertions is worse than 60% coverage with meaningful tests. Track defect-escape rate (bugs found post-deploy / total bugs) instead.
- SLOs nobody acts on. An SLO that the team doesn’t respect when it’s burning — by halting feature work and prioritizing stability — is theater. Wire the SLO to deployment freezes automatically.
- Pipelines that run for an hour. A 60-minute CI loop kills the rapid-feedback property that makes CI worth doing. Aim for <10 minutes for the trunk-blocking path; push slower checks to nightly or pre-release.
- Security as a gate instead of a service. A “security review” that adds two weeks at the end of every change incentivizes hiding work from security. Shift left: scanners run on every PR, the security team owns the scanners, vulnerabilities flow as PRs into the dev team’s backlog.
- Hero culture as on-call. A team where one person handles every page is one resignation away from a crisis. Spread on-call across a minimum-sized rotation; if you don’t have one, fix the staffing, not the rotation.
- Ignoring lead-time for non-code changes. A pipeline that ships application code in 10 minutes but requires a ticket and a four-day wait for a new IAM role is not a 10-minute pipeline — it’s a four-day pipeline. Bring the slow parts into the same model.
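Two of the measurements in this list are simple enough to compute directly: the defect-escape rate from the coverage pitfall and the effective lead time from the last one. The sketch below is illustrative only; the function names and sample numbers are hypothetical, and the lead-time model assumes the prerequisites of a change proceed in parallel, so the slowest one dominates.

```python
from datetime import timedelta
from typing import Dict


def defect_escape_rate(found_post_deploy: int, found_pre_deploy: int) -> float:
    """Bugs found after deploy divided by all bugs found."""
    total = found_post_deploy + found_pre_deploy
    if total == 0:
        return 0.0  # no bugs recorded, so nothing escaped
    return found_post_deploy / total


def effective_lead_time(paths: Dict[str, timedelta]) -> timedelta:
    """Lead time of a change is set by its slowest prerequisite."""
    return max(paths.values())


# 6 bugs reached production, 34 were caught before deploy.
print(defect_escape_rate(found_post_deploy=6, found_pre_deploy=34))  # 0.15

# The code path is fast, but the change also needs a new IAM role.
lead_time = effective_lead_time({
    "application code": timedelta(minutes=10),
    "IAM role ticket": timedelta(days=4),
})
print(lead_time)  # 4 days, 0:00:00
```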
Where to go next
Concrete cheat sheets for the toolchain DevOps practitioners reach for every day:
- /sections/linux/gh: GitHub CLI; PR review, GitHub Actions workflow management, releases, and the supply-chain attestation surface.
- /sections/linux/git: the version-control substrate every other DevOps practice sits on top of; trunk-based development requires fluent Git.
- /sections/linux/az: Azure CLI; `az repos`, `az pipelines`, and `az devops` cover Azure DevOps end-to-end (PR policies, build-and-release pipelines, variable groups, service connections).
- /sections/zos/zowe: Zowe; the same pipeline-driven mental model applied to z/OS, demonstrating that DevOps is platform-agnostic.
Concept neighbours worth reading alongside this one:
- /concepts/cloud — DevOps practices were born in the cloud era and assume API-driven infrastructure; the two concepts co-evolved.
- /concepts/api — every CI/CD pipeline is, mechanically, a sequence of API calls against source control, the cloud, registries, and the observability backend.
Sources
References consulted while writing this concept page.
- DORA research / State of DevOps — Source for the four delivery metrics (deployment frequency, lead time, change-failure rate, time to restore) and the “elite vs. low performer” thresholds in The DORA four.
- Google SRE Book — Authoritative reference for SLI/SLO/error-budget, blameless postmortems, the 50% toil cap, and the “SRE is one implementation of DevOps” framing.
- Google SRE Workbook — How SRE relates to DevOps — Direct source for the SRE-prescribes-how / DevOps-describes-what distinction used in Why it matters.
- Atlassian — SRE vs DevOps — Cross-vendor framing of the “culture vs engineering discipline” split between DevOps and SRE.
- Atlassian — CI vs CD vs Continuous Deployment — Authoritative breakdown of the three terms used in The pipeline.
- OpenGitOps principles: Reference for the GitOps pattern (Git as source of truth, in-cluster reconciler) as distinguished from pipeline-applied declarative IaC.