Stop Losing Money on Spot Instance Interruptions in EKS

The Hidden Cost of Spot Interruptions

Spot Instances can save you up to 90% compared to On-Demand pricing. But there's a catch: AWS can reclaim them with only a two-minute warning. Without proper handling, this leads to:

- Failed deployments mid-rollout
- Dropped connections and user-facing errors
- Lost in-flight jobs and batch processing failures
- Cascading failures when multiple nodes get reclaimed

Real Impact: One client was forgoing ~$4,200/month in potential Spot savings because fear of interruptions kept them on On-Demand. The actual solution took 2 hours to implement.

The Graceful Interruption Architecture

The key is building a system that treats interruptions as expected events, not emergencies. Here's the stack:

1. Karpenter for Intelligent Provisioning

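Karpenter replaces Cluster Autoscaler. Instead of scaling fixed node groups, it launches instances directly and can choose from many instance types on each launch, which is what makes Spot diversification practical. Key configuration: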

# Karpenter NodePool with Spot diversity
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-diverse
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: 
            - m5.large
            - m5.xlarge
            - m5a.large
            - m5a.xlarge
            - m6i.large
            - m6i.xlarge  # Diversify!
      nodeClassRef:
        name: default
  disruption:
    # In v1beta1, consolidateAfter is only accepted with
    # consolidationPolicy: WhenEmpty, so it is omitted here.
    consolidationPolicy: WhenUnderutilized

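Why this works: Each instance type in each Availability Zone is its own Spot capacity pool, and AWS reclaims capacity pool by pool. Spreading nodes across several pools makes it far less likely that all of them are reclaimed at once.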
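Listing instance types by hand gets stale as new generations ship. As a minimal sketch of the same idea, the NodePool below selects types by attribute instead, assuming the AWS provider's well-known labels (karpenter.k8s.aws/instance-category, instance-generation, instance-size). The NodePool name and the specific category, generation, and size values are illustrative assumptions, not taken from the setup described above.

```yaml
# Sketch: let Karpenter pick from every matching type instead of a fixed list.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-diverse-by-attribute   # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # General-purpose families only (m5, m5a, m6i, ...).
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m"]
        # Generation 5 and newer.
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
        # Same sizes as the explicit list.
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge"]
      nodeClassRef:
        name: default
```

Adding "c" or "r" to the category list widens the set of pools further, at the cost of a less uniform CPU-to-memory ratio across nodes.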

2. AWS Node Termination Handler

Running as a DaemonSet, the handler watches for the 2-minute interruption notice on each node, then cordons and drains that node:

# Install via Helm
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableRebalanceMonitoring=true \
  --set enableScheduledEventDraining=true

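Critical: The handler cordons the node and evicts pods through the Eviction API, so PodDisruptionBudgets are honored. One caveat: Karpenter can also handle Spot interruptions itself when it is given an interruption queue, and the Karpenter documentation advises against running the Node Termination Handler alongside that feature. Let exactly one of them own the drain.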

3. PodDisruptionBudgets (PDBs)

PDBs cap how many pods a drain may evict at once, so a minimum number of replicas stays available during disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server

Rule of thumb: Set minAvailable to N-1 where N is your replica count.
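A hard-coded minAvailable has to be updated whenever the replica count changes. As a sketch of the alternative mentioned in the comment above, the same budget can be written with maxUnavailable, which expresses "N-1" without naming N. A PDB accepts either minAvailable or maxUnavailable, not both, so this would replace the earlier manifest rather than sit next to it.

```yaml
# Equivalent to minAvailable: N-1 for any replica count N.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1   # at most one api-server pod evicted at a time
  selector:
    matchLabels:
      app: api-server
```

With 3 replicas this behaves the same as minAvailable: 2, and it keeps working unchanged if an autoscaler later moves the replica count.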

The Results

60-70% cost reduction vs On-Demand
99.9% uptime maintained
<30s pod migration during interruption

Quick Implementation Checklist

Deploy AWS Node Termination Handler as DaemonSet
Configure Karpenter with 5+ instance types for Spot diversity
Set PodDisruptionBudgets for all critical workloads
Set terminationGracePeriodSeconds on every pod so shutdown finishes inside the 2-minute notice
Test with Spot interruption simulation (AWS FIS)
Monitor with Karpenter metrics + CloudWatch
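For the terminationGracePeriodSeconds item, a minimal Deployment sketch might look like the following. The image name, the 60-second grace period, and the 10-second preStop sleep are assumptions chosen for illustration; the only firm constraint from the discussion above is that the whole shutdown has to fit inside the 2-minute Spot notice. The preStop command also assumes the container image ships a sleep binary.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server          # matches the PDB selector
  template:
    metadata:
      labels:
        app: api-server
    spec:
      # Total time the pod gets to stop, including the preStop hook.
      # Kept well under the 120s Spot interruption notice.
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: example.com/api-server:1.0   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Short pause so load balancers stop routing to this pod
                # before the process receives SIGTERM.
                command: ["sleep", "10"]
```

The grace period covers the preStop hook plus the application's own SIGTERM handling, so size it as the sum of the two with some headroom.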