Cloud NAT Timeouts Demystified

Cloud NAT gateways keep your private subnets hidden while letting them talk to the outside world. They look trivial at first glance—just a managed SNAT middlebox—but the devil is in the state tables, the idle timers, and the finite pool of source ports. This walkthrough dissects how the big three clouds translate packets, when they drop flows, and how to keep your long-lived sockets alive.
1. What actually happens on the wire
Spin up a VM in a private subnet, curl ifconfig.me, and the NAT gateway rewrites 10.0.0.5:55832 into 203.0.113.17:41206, adds the mapping to an internal conn-track bucket, then relays the reply back. No packets ever route to your instance directly. The mapping lives until one of two things happens:
- the connection closes (FIN/RST)
- the mapping’s idle timer expires
Everything after that point is implementation detail—and that detail is why your WebSocket sometimes dies at 5 minutes for no apparent reason.
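You can watch both halves of that rewrite from the instance itself: the private source address is visible locally, while the translated address is only visible to whoever receives the packet. A quick sanity check, assuming curl and iproute2 are on the box:
# The instance only knows its private address
ip -4 -o addr show scope global        # e.g. 10.0.0.5
ss -tn state established               # local side of each flow: 10.0.0.5:<ephemeral port>
# The remote end only sees the gateway's public address
curl -s ifconfig.me                    # e.g. 203.0.113.17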
2. Idle timers by provider
AWS hard-codes a single value. If a flow is silent for 350 seconds, the gateway wipes the mapping and answers the next outbound packet with an RST — no config knob, no mercy.
GCP defaults are friendlier but still finite: 30 s for UDP, 30 s for half-open TCP, 20 minutes for established TCP. All are tunable with gcloud compute routers nats update.
Azure starts at a 4-minute idle timeout for both TCP and UDP. TCP can stretch up to 120 minutes; UDP stays fixed. The slider lives in az network nat gateway update --idle-timeout.
Keep those numbers in your head. They explain 90 % of mysterious disconnects.
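Rather than trusting the defaults, you can read back what a given gateway is actually configured with. A sketch for the two tunable providers, with my-nat, my-router, rg1, and nat1 as placeholder names:
# GCP: prints the configured idle timeouts alongside the rest of the NAT config
gcloud compute routers nats describe my-nat --router=my-router --region=us-east1
# Azure: the configured value shows up as idleTimeoutInMinutes
az network nat gateway show --resource-group rg1 --name nat1 --query idleTimeoutInMinutes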
3. Lab: prove the timeout
# On a host with a public IP, anywhere on the internet
nc -l 9999 &
# On a private-subnet EC2 instance behind an AWS NAT Gateway
nc $PUBLIC_IP 9999   # $PUBLIC_IP = the listener host's address
# now wait six minutes, type a key, watch the RST
The IdleTimeoutCount metric in CloudWatch increments right after the RST lands. Same test on GCP shows the flow vanish at 20 minutes; on Azure it dies at 4 minutes unless you cranked the knob.
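To watch the counter move without opening the console, poll CloudWatch directly; the gateway ID and time window below are placeholders:
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway --metric-name IdleTimeoutCount \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
  --period 300 --statistics Sum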
4. Port exhaustion is the next cliff
Each SNAT mapping claims one source port from the gateway’s pool. Heavy microservice stacks can open thousands of outbound sockets per node, quickly burning through the default 64 k × public-IP pool. When that pool dries up:
- AWS drops new connections (PacketsDropCount)
- GCP black-holes SYNs (compute.googleapis.com/nat/port_usage)
- Azure returns ICMP unreachable and logs a SnatPortExhausted event
Mitigations (commands sketched after this list):
- Spread flows across more gateway IPs (all providers support multiple).
- On GCP, turn on dynamic port allocation and raise --min-ports-per-vm.
- On Azure, bump --public-ip-addresses or shard subnets.
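A rough sketch of the last two mitigations; the NAT, router, region, resource-group, and public-IP names are placeholders:
# GCP: busy VMs borrow ports as needed instead of holding a fixed slice
gcloud compute routers nats update my-nat --router=my-router --region=us-east1 \
  --enable-dynamic-port-allocation --min-ports-per-vm=128 --max-ports-per-vm=4096
# Azure: every extra public IP adds roughly 64 k SNAT ports to the pool
az network nat gateway update --resource-group rg1 --name nat1 \
  --public-ip-addresses pip1 pip2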
5. Keep-alive vs. tune-the-timer
AWS gives you no dial, so you must generate traffic more often than every 350 s. TCP keep-alives every 4 minutes do the trick:
sysctl -w net.ipv4.tcp_keepalive_time=240
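That sysctl only matters for sockets that opt in; a minimal sketch of the companion knobs, with assumed values rather than AWS guidance:
sysctl -w net.ipv4.tcp_keepalive_intvl=60    # once the 240 s timer fires, re-probe every 60 s
sysctl -w net.ipv4.tcp_keepalive_probes=5    # give up after 5 unanswered probes
# These apply only to sockets opened with SO_KEEPALIVE; most clients and servers
# (gRPC, Postgres, nginx upstreams) expose that as an explicit option you must enable.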
GCP and Azure let you increase the idle timeout instead, but note the trade-off: ports stay busy longer, raising exhaustion risk. Production setups often combine a modest bump (e.g., 10 min) with lightweight probes.
gcloud compute routers nats update my-nat \
  --router=my-router --region=us-east1 \
  --tcp-established-idle-timeout=600
az network nat gateway update \
--resource-group rg1 --name nat1 --idle-timeout 10
6. Sizing cheat sheet
Required ports ≈ peak_concurrent_flows × average retransmission factor
Total ports = 64 k × external_IPs
Keep required ≤ total × 0.8 for headroom. If not, add IPs or trim keep-alive timers.
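A worked example with made-up numbers (150 k peak flows, a 1.2× retransmission factor, roughly 64 k usable ports per IP), just to show the arithmetic:
awk 'BEGIN {
  required = 150000 * 1.2               # peak_concurrent_flows x retransmission factor
  per_ip   = 64512                      # usable source ports per external IP
  ips      = required / (per_ip * 0.8)  # keep required <= total x 0.8
  printf "required=%d ports, need %d external IPs\n", required, (ips == int(ips)) ? ips : int(ips) + 1
}'
# prints: required=180000 ports, need 4 external IPs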
7. Signals to watch
- AWS: IdleTimeoutCount, PacketsDropCount in CloudWatch
- GCP: nat/port_usage, nat/connection_count, audit logs for timeout edits
- Azure: SnatConnectionCount, SnatPortUsage, TcpResetCount
Graph them in Grafana; alert when either usage or drops trend upward week-over-week.
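When in doubt about the exact counter names your platform exposes, enumerate them instead of guessing; the Azure resource ID below is a placeholder:
az monitor metrics list-definitions \
  --resource /subscriptions/<sub-id>/resourceGroups/rg1/providers/Microsoft.Network/natGateways/nat1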
8. Takeaways
Cloud NAT gateways are convenient, but they are stateful firewalls with small brains and smaller port pools. Know the idle timers, tune what you can, and emit keep-alives when you can’t. Monitor the drop counters before your first users do.
Routing is simple. State is where outages hide. Map the timeout landscape now, rule the box later.