Stop Shipping Snowflakes: A Five‑Stage Maturity Model for Cloud Networking Teams
“Infrastructure that isn’t repeatable isn’t reliable.”
The fastest way to rack up incident fatigue and a ballooning cloud bill is to treat every VPC, subnet, and firewall rule as a special snowflake. After coaching dozens of scale‑ups and two Fortune‑100s, I’ve found the same pattern: teams progress through predictable stages on their way to a fully automated, intent‑based network. Mapping where you sit today—and what it takes to level up—cuts months of thrash.
Below is an opinionated five‑stage maturity model, tuned specifically for cloud networking in 2025. It borrows org‑design cues from the famous Spotify Squad model and performance benchmarks from Google’s 2024 DORA State of DevOps report.
TL;DR Table
| Stage | Nickname | Change Velocity | Blast‑Radius Control | Automation Ratio | Typical Headcount* |
|---|---|---|---|---|---|
| 1 | Fire‑Fighting | <1 network change/week | None (global prod/shared) | 0‑10 % | 2 engineers / 10 VPCs |
| 2 | Script‑Curious | 5‑10 changes/week | Basic IAM separation | 10‑40 % (scripts) | 2 engineers / 25 VPCs |
| 3 | Module‑Driven | 20‑100 changes/week | Per‑env accounts + infra tests | 40‑80 % (IaC pipelines) | 1 engineer / 80 VPCs |
| 4 | Platform‑as‑Product | 100‑500 changes/day | Policy‑as‑Code + SLO gates | 80‑95 % (self‑service) | 1 engineer / 250 VPCs |
| 5 | Autonomous | Continuous | Dynamic intent + AI drift fix | 95‑100 % | 1 engineer / 500 VPCs |
*Headcount = dedicated networking specialists, not counting app engineers. Your mileage will vary with cloud provider complexity and compliance overhead.
Stage 1 – Fire‑Fighting
Symptoms
- Everything lives in the default VPC.
- Changes flow through Jira tickets that pop your VPN open at 2 a.m.
- Nobody knows which security group is safe to delete, so you never delete anything.
- Monitoring = `ping` and CloudWatch console graphs.
Risks
- One bad `0.0.0.0/0` rule wipes out prod.
- Mean time to recovery (MTTR) measured in hours because playbooks are tribal lore.
Next Steps
- Tag every resource with `owner`, `purpose`, and `env` (see the sketch after this list).
- Create a single source‑of‑truth spreadsheet (yes, really) for IP ranges.
- Commit to no console change without a follow‑up pull request within 24 h.
Stage 2 – Script‑Curious
What’s Different
- Bash/Python scripts turn common tasks (new subnet, new route) into one‑liners.
- Simple CI job (`terraform plan` → Slack), but applies are still manual.
- Shared test account for experiments.
Org Design
- One Networking squad (4‑6 people) that owns the scripts and weekly releases.
- Borrow the Squad Health‑Check retro—track happiness, ease of release, pain of on‑call.
Moving On
- Replace snowflake scripts with opinionated Terraform modules (one possible call is sketched below).
- Enforce peer review on every plan.
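What “opinionated” means in practice: the caller states a handful of business facts and the module decides subnet layout, routing, and NAT. A sketch of such a call; the module source, name, and inputs are illustrative, not a published module:

```hcl
# Sketch: one module call replaces a per-team snowflake script.
# Source URL, module name, and inputs are illustrative only.
module "payments_vpc" {
  source = "git::https://git.example.com/network/modules.git//app-vpc?ref=v1.4.0"

  name       = "payments-prod"
  cidr_block = "10.42.0.0/20"
  env        = "prod"
  owner      = "payments-squad"

  # Subnet sizing, route tables, and NAT placement are the module's
  # opinion, not the caller's.
}
```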
Stage 3 – Module‑Driven
Key Capabilities
- Infrastructure as Code (IaC) is the default. 100 % of routes, security groups, and NAT gateways live in Git.
- Golden Blueprints: one module for each archetype (public‑private VPC, private‑only, PCI‑isolated, etc.).
- Validation: `terraform validate`, `opa test`, and `cfn_nag` run in CI; plans auto‑abort on policy violations (a complementary blueprint‑level guard is sketched after this list).
- Blast Radius: Environments split across dedicated accounts; no direct SSH—only SSM.
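The heavyweight checks (OPA, cfn_nag) run in CI, but a golden blueprint can also fail fast on its own inputs. A minimal sketch of an input guard inside a hypothetical blueprint module; the 10.0.0.0/8 corporate supernet is an assumption:

```hcl
# Sketch: reject CIDRs outside the corporate allocation before the plan
# ever reaches the policy engine. The 10.0.0.0/8 supernet is assumed.
variable "cidr_block" {
  type        = string
  description = "VPC CIDR; must sit inside the corporate 10.0.0.0/8 range"

  validation {
    # Must parse as a valid IPv4 CIDR and sit inside 10.0.0.0/8.
    condition     = can(cidrhost(var.cidr_block, 0)) && startswith(var.cidr_block, "10.")
    error_message = "cidr_block must be a valid IPv4 CIDR inside 10.0.0.0/8."
  }
}
```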
Metrics
- Change failure rate: < 5 % (DORA Medium performers).
- Deploy frequency: 1–10 times/day across all networks.
People
- Two cross‑functional squads: Network Core (modules, tests) and Network Enablement (support, PR reviews).
Stage 4 – Platform‑as‑Product
What “Good” Looks Like
- Self‑Service Portal (Backstage or internal UI) where app teams request a VPC and get it in <10 min.
- Policy‑as‑Code (OPA/Conftest) blocks non‑compliant CIDRs before they reach the cloud API.
- Zero‑Trust Baseline: Every subnet has an egress firewall rule set; east‑west traffic is authenticated via mTLS (one possible rule set is sketched after this list).
- Network SLOs published—for example, inter‑AZ latency < 1.5 ms at the 99.99th percentile and packet loss < 0.01 % at P90.
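One concrete reading of the egress baseline, using an AWS security group (NACLs or a managed network firewall are the other common homes for this rule set); the proxy CIDR and names are illustrative:

```hcl
# Sketch: a restrictive egress rule set; the only path out of the
# workload's network interfaces is HTTPS to the inspection proxy.
resource "aws_security_group" "baseline_egress" {
  name        = "baseline-egress"
  description = "Zero-trust egress baseline"
  vpc_id      = var.vpc_id # supplied by the blueprint

  egress {
    description = "HTTPS to the inspection proxy only (illustrative CIDR)"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.8.0/24"]
  }
}
```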
Automation
- 90 %+ of changes merged by app squads themselves.
- Production promotes automatically once canary traffic passes synthetic checks.
Org Chart
- A Network Platform Tribe (3‑4 squads: UX, Policy, Reliability, FinOps).
- Quarterly health‑checks drive the backlog.
Stage 5 – Autonomous (Intent‑Based)
North‑Star Capabilities
- Intent Language: Engineers declare “Service A needs 5 Gbps to Service B with 30 ms RTT”; the system chooses pathing (a hypothetical declaration is sketched after this list).
- Closed‑Loop Compliance: eBPF sensors stream flow logs to a policy engine that reconciles drift in near‑real time.
- AI‑Driven Optimisation: Reinforcement learning dials BBRv2/QUIC parameters to hit latency budgets and carbon budgets.
- Carbon‑Aware Routing: Traffic shifts to regions with lower kg CO₂/kWh when latency impact < 10 ms.
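No mainstream provider exposes an intent resource like this today, so the block below is hypothetical HCL meant only to show the shape such a declaration could take; every attribute name is invented:

```hcl
# Hypothetical intent declaration (no real provider behind this sketch).
# The engineer states the outcome; the platform picks paths and QoS,
# then reconciles drift continuously.
resource "network_intent" "a_to_b" {
  from_service = "service-a"
  to_service   = "service-b"

  bandwidth_gbps = 5
  max_rtt_ms     = 30

  mtls_required = true
  carbon_aware  = true # allow re-routing when the latency budget permits
}
```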
Operating Model
- One squad keeps the platform alive; the rest of the org treats the network as electricity.
- On‑call pages/week: < 0.2 per engineer (best‑in‑class DORAⁱ).
ⁱSee State of DevOps 2024 for benchmarks of elite performers.
How to Use the Model
- Run a Workshop: Print the TL;DR table. Let every squad place sticky notes where they think they are. Disagreements reveal blind spots.
- Pick Two Gaps: Trying to jump two stages at once fails 80 % of the time. Close the ugliest gap first (often blast‑radius).
- Automate Backlog Intake: Every time a human touches the console, open a Jira ticket labelled `manual‑network‑change`. That list becomes your roadmap.
- Re‑score Quarterly: Use the Squad Health‑Check → feed metrics into leadership review.
Further Reading
- Henrik Kniberg & Anders Ivarsson, “Scaling Agile @ Spotify” (2012) – the original Squad paper.
- Spotify Engineering, “Squad Health Check Model” (2014).
- Google Cloud, State of DevOps Report (2024).
- IBM, “Cloud Maturity Models” (2023) – three‑phase CMM overview.
- Oracle – Mattoon, Hensle & Baty, “Cloud Computing Maturity Model – Guiding Success with Cloud Capabilities” (2011).