Cloud NAT Timeouts Demystified

Cloud NAT gateways keep your private subnets hidden while letting them talk to the outside world. They look trivial at first glance—just a managed SNAT middlebox—but the devil is in the state tables, the idle timers, and the finite pool of source ports. This walkthrough dissects how the big three clouds translate packets, when they drop flows, and how to keep your long-lived sockets alive.

1. What actually happens on the wire

Spin up a VM in a private subnet, curl ifconfig.me, and the NAT gateway rewrites the source 10.0.0.5:55832 to 203.0.113.17:41206, records the mapping in its connection-tracking table, then translates the reply back to the original address. No unsolicited inbound packet ever reaches your instance. The mapping lives until one of two things happens:

  • the connection closes (FIN/RST)
  • the mapping’s idle timer expires

Everything after that point is implementation detail, and that detail is why your WebSocket sometimes dies after a few minutes of silence for no apparent reason.
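The mapping lifecycle described above can be modelled in a few lines. This is a toy sketch, not how any provider implements its gateway: the class name ToyNat, its methods, and the explicit `now` argument are all made up for illustration, and a real NAT keys on the full 5-tuple rather than just the private endpoint.

```python
class ToyNat:
    """Toy SNAT table: (private_ip, private_port) -> public port, with an idle timer."""

    def __init__(self, public_ip, idle_timeout_s, first_port=1024, last_port=65535):
        self.public_ip = public_ip
        self.idle_timeout_s = idle_timeout_s
        # Reversed range so that pop() hands out the lowest free port first.
        self.free_ports = list(range(last_port, first_port - 1, -1))
        self.table = {}  # (private_ip, private_port) -> [public_port, last_seen]

    def outbound(self, private_ip, private_port, now):
        """Translate an outbound packet, creating or refreshing the mapping."""
        self.expire(now)
        key = (private_ip, private_port)
        if key not in self.table:
            if not self.free_ports:
                raise RuntimeError("port pool exhausted")
            self.table[key] = [self.free_ports.pop(), now]
        self.table[key][1] = now
        return (self.public_ip, self.table[key][0])

    def inbound(self, public_port, now):
        """Return the private endpoint for a reply, or None if no live mapping exists."""
        self.expire(now)
        for key, entry in self.table.items():
            if entry[0] == public_port:
                entry[1] = now  # traffic in either direction resets the idle timer
                return key
        return None

    def close(self, private_ip, private_port):
        """FIN/RST: drop the mapping and return its port to the pool."""
        entry = self.table.pop((private_ip, private_port), None)
        if entry is not None:
            self.free_ports.append(entry[0])

    def expire(self, now):
        """Drop every mapping that has been silent for at least idle_timeout_s."""
        stale = [k for k, e in self.table.items() if now - e[1] >= self.idle_timeout_s]
        for key in stale:
            self.close(*key)


if __name__ == "__main__":
    nat = ToyNat("203.0.113.17", idle_timeout_s=350)
    _, port = nat.outbound("10.0.0.5", 55832, now=0)
    print(nat.inbound(port, now=10))   # ('10.0.0.5', 55832)
    print(nat.inbound(port, now=400))  # None: 390 s of silence expired the mapping
```

Passing the clock in as `now` keeps the model deterministic, which makes the idle-expiry behaviour easy to see without waiting six minutes.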

2. Idle timers by provider

AWS hard-codes a single value. If a flow is silent for 350 seconds, the gateway drops the mapping, and the next packet the instance sends on that flow gets an RST back. No config knob, no mercy.

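GCP defaults are friendlier but still finite: 30 s for UDP, 30 s for transitory (half-open or closing) TCP, 20 minutes (1,200 s) for established TCP. All are tunable with gcloud compute routers nats update (--udp-idle-timeout, --tcp-transitory-idle-timeout, --tcp-established-idle-timeout).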

Azure starts at a 4-minute idle timeout for both TCP and UDP. TCP can stretch up to 120 minutes; UDP stays fixed. The slider lives in az network nat gateway update --idle-timeout, which takes a value in minutes.

Keep those numbers in your head. They explain most mysterious disconnects.
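If you would rather keep the numbers in code than in your head, a small lookup works. This is only a sketch: the table holds the defaults quoted in this section, while the function name survives_silence and the 0.8 safety margin are assumptions for illustration. Only TCP is listed for AWS.

```python
# Default idle timeouts in seconds, as quoted in this article.
DEFAULT_IDLE_TIMEOUT_S = {
    ("aws", "tcp"): 350,
    ("gcp", "udp"): 30,
    ("gcp", "tcp_transitory"): 30,
    ("gcp", "tcp"): 20 * 60,
    ("azure", "tcp"): 4 * 60,
    ("azure", "udp"): 4 * 60,
}


def survives_silence(provider, protocol, silence_s, margin=0.8):
    """True if a flow that stays silent for silence_s seconds should outlive
    the default idle timer, keeping a safety margin below the limit."""
    limit = DEFAULT_IDLE_TIMEOUT_S[(provider, protocol)]
    return silence_s < limit * margin


if __name__ == "__main__":
    print(survives_silence("aws", "tcp", 240))    # True: 240 s is under 0.8 * 350 s
    print(survives_silence("azure", "tcp", 240))  # False: 240 s is the whole default
```

The margin exists because a keep-alive that fires exactly at the limit is racing the gateway; firing comfortably earlier avoids the race.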

3. Lab: prove the timeout

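# On a host with a public IP (the listener must sit outside the NAT;
# a NAT gateway never accepts inbound connections)
nc -l 9999        # some netcat builds need: nc -l -p 9999
# On a private-subnet EC2 instance behind the AWS NAT Gateway
nc $PUBLIC_IP 9999
# now wait six minutes, type a line on the private instance, watch the RST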

The IdleTimeoutCount metric in CloudWatch counts the flow when it goes idle at 350 s, so it ticks up before your keystroke triggers the RST. The same test on GCP shows the flow vanish at 20 minutes; on Azure it dies at 4 minutes unless you raised the timeout.
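A scripted version of the same check, run from the private instance toward a listener outside the NAT, might look like the sketch below. It assumes the remote side echoes bytes back (plain nc -l does not, so you would need a hypothetical echo service on the far end); the function name probe_idle and its return labels are made up for illustration.

```python
import socket
import time


def probe_idle(host, port, idle_s, timeout_s=5.0):
    """Connect, stay silent for idle_s seconds, then check whether the flow still works.

    Returns "alive" if the peer echoes the probe byte, "reset" if the connection
    was reset or closed, and "timeout" if nothing came back within timeout_s.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        time.sleep(idle_s)  # let the NAT mapping sit idle
        try:
            sock.sendall(b"x")
            data = sock.recv(1)
        except (ConnectionResetError, BrokenPipeError):
            return "reset"
        except socket.timeout:
            return "timeout"
        return "alive" if data == b"x" else "reset"
```

Calling probe_idle(host, 9999, idle_s=360) behind an AWS NAT Gateway should come back "reset" if the 350 s timer behaves as described; a gateway that silently drops the mapping would show up as "timeout" instead.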

4. Port exhaustion is the next cliff

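Each SNAT mapping claims one source port from the gateway’s pool. Heavy microservice stacks can open thousands of outbound sockets per node, quickly burning through the pool of roughly 64 k ports per public IP (64,512 usable on GCP and Azure; AWS allows 55,000 concurrent connections per IP to any single destination). When that pool dries up: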

  • AWS drops new connections (ErrorPortAllocation rises)
  • GCP drops the outbound packets (compute.googleapis.com/nat/dropped_sent_packets_count with reason OUT_OF_RESOURCES; nat/port_usage shows it coming)
  • Azure fails new outbound connections (SNATConnectionCount with connection state Failed)

Mitigations:

  1. Spread flows across more gateway IPs (all providers support multiple).
  2. On GCP, turn on dynamic port allocation (--enable-dynamic-port-allocation) and raise --min-ports-per-vm and --max-ports-per-vm.
  3. On Azure bump --public-ip-addresses or shard subnets.

5. Keep-alive vs. tune-the-timer

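AWS gives you no dial, so every flow must carry traffic more often than once every 350 s. TCP keep-alives at 4 min do the trick, provided the application enables SO_KEEPALIVE on its sockets (the sysctl only sets the default timer for sockets that opt in):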

sysctl -w net.ipv4.tcp_keepalive_time=240
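The sysctl above changes the system-wide default, but it only applies to sockets that have SO_KEEPALIVE switched on. A per-socket sketch in Python follows; the helper name enable_keepalive and the 30 s / 4-probe values are assumptions, and the TCP_KEEP* option names are the Linux ones, so the code skips any that the platform does not expose.

```python
import socket


def enable_keepalive(sock, idle_s=240, interval_s=30, probes=4):
    """Turn on TCP keep-alive for one socket and, where available, set its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # First probe after idle_s of silence, then every interval_s,
    # giving up after `probes` unanswered probes.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)  # missing on some platforms
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Setting the timers per socket keeps the change local to the connections that cross the NAT, rather than altering keep-alive behaviour for every process on the host.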

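GCP and Azure let you change the idle timeout instead, but note the trade-off: the longer the timeout, the longer ports stay busy, raising exhaustion risk. Production setups often settle on a moderate value (e.g., 10 min, which is a bump from Azure’s 4-minute default but a cut from GCP’s 20-minute default) and pair it with lightweight probes.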

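# my-router and us-central1 are placeholders for your Cloud Router and region
gcloud compute routers nats update my-nat \
    --router=my-router --region=us-central1 \
    --tcp-established-idle-timeout=600s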

az network nat gateway update \
    --resource-group rg1 --name nat1 --idle-timeout 10

6. Sizing cheat sheet

Required ports ≈ peak_concurrent_flows × retry factor (retried connections open new mappings; retransmissions inside a flow reuse its port)
Total ports   = 64 k × external_IPs

Keep required ≤ total × 0.8 for headroom. If not, add IPs or shorten idle timeouts so dead flows release their ports sooner.
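The cheat sheet turns into a one-function calculator. This is a rough sketch under the article's round numbers: overhead_factor stands in for the multiplier in the formula above, 64,000 is the "64 k" figure rather than any provider's exact usable count, and the function name nat_ips_needed is made up.

```python
import math

PORTS_PER_IP = 64_000  # the article's round "64 k"; exact usable counts vary by provider


def nat_ips_needed(peak_concurrent_flows, overhead_factor=1.0,
                   headroom=0.8, ports_per_ip=PORTS_PER_IP):
    """Smallest number of external IPs that keeps required ports <= total * headroom."""
    required = peak_concurrent_flows * overhead_factor
    usable_per_ip = ports_per_ip * headroom
    return max(1, math.ceil(required / usable_per_ip))


if __name__ == "__main__":
    # 100,000 flows with a 1.2x multiplier need 120,000 ports;
    # each IP offers 51,200 at 80 % headroom, so three IPs are required.
    print(nat_ips_needed(100_000, overhead_factor=1.2))  # 3
```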

7. Signals to watch

  • AWS: IdleTimeoutCount, ErrorPortAllocation, PacketsDropCount in CloudWatch
  • GCP: nat/port_usage, nat/open_connections, nat/dropped_sent_packets_count, audit logs for timeout edits
  • Azure: SNATConnectionCount (split by connection state), TotalConnectionCount, PacketDropCount

Graph them in Grafana; alert when either usage or drops trend upward week-over-week.
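The week-over-week rule can be prototyped before you wire it into an alerting system. The sketch below assumes one sample per day and a 10 % threshold; both are arbitrary choices for illustration, as are the function names.

```python
def week_over_week_change(daily_values):
    """Fractional change between the last 7 daily samples and the 7 before them."""
    if len(daily_values) < 14:
        raise ValueError("need at least 14 daily samples")
    previous = sum(daily_values[-14:-7])
    current = sum(daily_values[-7:])
    if previous == 0:
        # No baseline: any activity at all counts as an unbounded increase.
        return float("inf") if current > 0 else 0.0
    return (current - previous) / previous


def should_alert(daily_values, threshold=0.10):
    """True when the metric grew by more than `threshold` week over week."""
    return week_over_week_change(daily_values) > threshold


if __name__ == "__main__":
    port_usage = [10] * 7 + [12] * 7   # hypothetical daily peak port usage
    print(should_alert(port_usage))    # True: usage rose 20 % week over week
```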

8. Takeaways

Cloud NAT gateways are convenient, but they are stateful firewalls with small brains and smaller port pools. Know the idle timers, tune what you can, and emit keep-alives when you can’t. Monitor the drop counters before your first users do.

Routing is simple. State is where outages hide. Map the timeout landscape now, rule the box later.
