Stop Shipping Snowflakes: A Five‑Stage Maturity Model for Cloud Networking Teams
“Infrastructure that isn’t repeatable isn’t reliable.”
The fastest way to rack up incident fatigue and a ballooning cloud bill is to treat every VPC, subnet, and firewall rule as a special snowflake. After coaching dozens of scale‑ups and two Fortune‑100s, I’ve found the same pattern: teams progress through predictable stages on their way to a fully automated, intent‑based network. Mapping where you sit today—and what it takes to level up—cuts months of thrash.
Below is an opinionated five‑stage maturity model, tuned specifically for cloud networking in 2025. It borrows org‑design cues from the famous Spotify Squad model and performance benchmarks from Google’s 2024 DORA State of DevOps report.
TL;DR Table
| Stage | Nickname | Change Velocity | Blast‑Radius Control | Automation Ratio | Typical Headcount* |
|---|---|---|---|---|---|
| 1 | Fire‑Fighting | <1 network change/week | None (global prod/shared) | 0‑10 % | 2 engineers / 10 VPCs |
| 2 | Script‑Curious | 5‑10 changes/week | Basic IAM separation | 10‑40 % (scripts) | 2 engineers / 25 VPCs |
| 3 | Module‑Driven | 20‑100 changes/week | Per‑env accounts + infra tests | 40‑80 % (IaC pipelines) | 1 engineer / 80 VPCs |
| 4 | Platform‑as‑Product | 100‑500 changes/day | Policy‑as‑Code + SLO gates | 80‑95 % (self‑service) | 1 engineer / 250 VPCs |
| 5 | Autonomous | Continuous | Dynamic intent + AI drift fix | 95‑100 % | 1 engineer / 500 VPCs |
*Headcount = dedicated networking specialists, not counting app engineers. Your mileage will vary with cloud provider complexity and compliance overhead.
Stage 1 – Fire‑Fighting
Symptoms
- Everything lives in the default VPC.
- Changes flow through Jira tickets that pop your VPN open at 2 a.m.
- Nobody knows which security group is safe to delete, so you never delete anything.
- Monitoring = `ping` and CloudWatch console graphs.
Risks
- One bad `0.0.0.0/0` rule wipes out prod.
- Mean time to recovery (MTTR) measured in hours because playbooks are tribal lore.
Next Steps
- Tag every resource with `owner`, `purpose`, and `env` (see the sketch after this list).
- Create a single source‑of‑truth spreadsheet (yes, really) for IP ranges.
- Commit to no console change without a follow‑up pull request within 24 h.
Stage 2 – Script‑Curious
What’s Different
- Bash/Python scripts turn common tasks (new subnet, new route) into one‑liners.
- Simple CI job (`terraform plan` → Slack), but applies are still manual.
- Shared test account for experiments.
Org Design
- One Networking squad (4‑6 people) that owns the scripts and weekly releases.
- Borrow the Squad Health‑Check retro—track happiness, ease of release, pain of on‑call.
Moving On
- Replace snowflake scripts with opinionated Terraform modules (one possible call is sketched below).
- Enforce peer review on every plan.
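What “opinionated” means in practice: the caller states a handful of business facts and the module decides subnet layout, routing, and NAT. A sketch of such a call; the module source, name, and inputs are illustrative, not a published module:

```hcl
# Sketch: one module call replaces a per-team snowflake script.
# Source URL, module name, and inputs are illustrative only.
module "payments_vpc" {
  source = "git::https://git.example.com/network/modules.git//app-vpc?ref=v1.4.0"

  name       = "payments-prod"
  cidr_block = "10.42.0.0/20"
  env        = "prod"
  owner      = "payments-squad"

  # Subnet sizing, route tables, and NAT placement are the module's
  # opinion, not the caller's.
}
```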
Stage 3 – Module‑Driven
Key Capabilities
- Infrastructure as Code (IaC) is the default. 100 % of routes, security groups, and NAT gateways live in Git.
- Golden Blueprints: one module for each archetype (public‑private VPC, private‑only, PCI‑isolated, etc.).
- Validation: `terraform validate`, `opa test`, and `cfn_nag` run in CI; plans auto‑abort on policy violations (a complementary blueprint‑level guard is sketched after this list).
- Blast Radius: Environments split across dedicated accounts; no direct SSH—only SSM.
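The heavyweight checks (OPA, cfn_nag) run in CI, but a golden blueprint can also fail fast on its own inputs. A minimal sketch of an input guard inside a hypothetical blueprint module; the 10.0.0.0/8 corporate supernet is an assumption:

```hcl
# Sketch: reject CIDRs outside the corporate allocation before the plan
# ever reaches the policy engine. The 10.0.0.0/8 supernet is assumed.
variable "cidr_block" {
  type        = string
  description = "VPC CIDR; must sit inside the corporate 10.0.0.0/8 range"

  validation {
    # Must parse as a valid IPv4 CIDR and sit inside 10.0.0.0/8.
    condition     = can(cidrhost(var.cidr_block, 0)) && startswith(var.cidr_block, "10.")
    error_message = "cidr_block must be a valid IPv4 CIDR inside 10.0.0.0/8."
  }
}
```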
Metrics
- Change failure rate: < 5 % (DORA Medium performers).
- Deploy frequency: 1–10 times/day across all networks.
People
- Two cross‑functional squads: Network Core (modules, tests) and Network Enablement (support, PR reviews).
Stage 4 – Platform‑as‑Product
What “Good” Looks Like
- Self‑Service Portal (Backstage or internal UI) where app teams request a VPC and get it in <10 min.
- Policy‑as‑Code (OPA/Conftest) blocks non‑compliant CIDRs before they reach the cloud API.
- Zero‑Trust Baseline: Every subnet has an egress firewall rule set; east‑west traffic is authenticated via mTLS (one possible rule set is sketched after this list).
- Network SLOs published—for example, inter‑AZ latency < 1.5 ms at the 99.99th percentile and packet loss < 0.01 % at P90.
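One concrete reading of the egress baseline, using an AWS security group (NACLs or a managed network firewall are the other common homes for this rule set); the proxy CIDR and names are illustrative:

```hcl
# Sketch: a restrictive egress rule set; the only path out of the
# workload's network interfaces is HTTPS to the inspection proxy.
resource "aws_security_group" "baseline_egress" {
  name        = "baseline-egress"
  description = "Zero-trust egress baseline"
  vpc_id      = var.vpc_id # supplied by the blueprint

  egress {
    description = "HTTPS to the inspection proxy only (illustrative CIDR)"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.8.0/24"]
  }
}
```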
Automation
- 90 %+ of changes merged by app squads themselves.
- Production promotes automatically once canary traffic passes synthetic checks.
Org Chart
- A Network Platform Tribe (3‑4 squads: UX, Policy, Reliability, FinOps).
- Quarterly health‑checks drive the backlog.
Stage 5 – Autonomous (Intent‑Based)
North‑Star Capabilities
- Intent Language: Engineers declare “Service A needs 5 Gbps to Service B with 30 ms RTT”; the system chooses pathing (a hypothetical declaration is sketched after this list).
- Closed‑Loop Compliance: eBPF sensors stream flow logs to a policy engine that reconciles drift in near‑real time.
- AI‑Driven Optimisation: Reinforcement learning dials BBRv2/QUIC parameters to hit latency budgets and carbon budgets.
- Carbon‑Aware Routing: Traffic shifts to regions with lower kg CO₂/kWh when latency impact < 10 ms.
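No mainstream provider exposes an intent resource like this today, so the block below is hypothetical HCL meant only to show the shape such a declaration could take; every attribute name is invented:

```hcl
# Hypothetical intent declaration (no real provider behind this sketch).
# The engineer states the outcome; the platform picks paths and QoS,
# then reconciles drift continuously.
resource "network_intent" "a_to_b" {
  from_service = "service-a"
  to_service   = "service-b"

  bandwidth_gbps = 5
  max_rtt_ms     = 30

  mtls_required = true
  carbon_aware  = true # allow re-routing when the latency budget permits
}
```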
Operating Model
- One squad keeps the platform alive; the rest of the org treats the network as electricity.
- On‑call pages/week: < 0.2 per engineer (best‑in‑class DORAⁱ).
ⁱSee State of DevOps 2024 for benchmarks of elite performers.
How to Use the Model
- Run a Workshop: Print the TL;DR table. Let every squad place sticky notes where they think they are. Disagreements reveal blind spots.
- Pick Two Gaps: Trying to jump two stages at once fails 80 % of the time. Close the ugliest gap first (often blast‑radius).
- Automate Backlog Intake: Every time a human touches the console, open a Jira ticket labelled `manual‑network‑change`. That list becomes your roadmap.
- Re‑score Quarterly: Use the Squad Health‑Check → feed metrics into leadership review.
Further Reading
- Henrik Kniberg & Anders Ivarsson, “Scaling Agile @ Spotify” (2012) – the original Squad paper.
- Spotify Engineering, “Squad Health Check Model” (2014).
- Google Cloud, State of DevOps Report (2024).
- IBM, “Cloud Maturity Models” (2023) – three‑phase CMM overview.
- Oracle – Mattoon, Hensle & Baty, “Cloud Computing Maturity Model – Guiding Success with Cloud Capabilities” (2011).