Setting up Over-Provisioning
To implement over-provisioning effectively, it's a best practice to create appropriate PriorityClass resources for your applications. Let's begin by creating a global default priority class using the globalDefault: true field. This default PriorityClass will be assigned to any Pods that don't specify a priorityClassName.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default
value: 0
globalDefault: true
description: "Default Priority class."
Next, we'll create a PriorityClass specifically for the pause pods used in over-provisioning, with a priority value of -1.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: pause-pods
value: -1
globalDefault: false
description: "Priority class used by pause-pods for overprovisioning."
Pause pods play a crucial role in ensuring that enough spare nodes are available for the amount of over-provisioning your environment needs. Keep in mind the --max-size parameter in the Auto Scaling group (ASG) of the EKS node group: the Cluster Autoscaler won't increase the number of nodes beyond the maximum specified there.
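If you want to confirm that limit before adding headroom, you can inspect the node group's ASG with the AWS CLI. This is only a sketch: it assumes the CLI is configured for your account and uses a placeholder for the ASG name.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <nodegroup-asg-name> \
  --query 'AutoScalingGroups[].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}'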
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-pods
  namespace: other
spec:
  replicas: 2
  selector:
    matchLabels:
      run: pause-pods
  template:
    metadata:
      labels:
        run: pause-pods
    spec:
      priorityClassName: pause-pods
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause
          resources:
            requests:
              memory: "6.5Gi"
In this scenario, each of the two pause pod replicas requests 6.5Gi of memory. Since an m5.large instance provides 8 GiB of memory, each pause pod consumes almost an entire node once system reservations are accounted for, resulting in two "spare" worker nodes being available at all times.
Let's apply these updates to our cluster.
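For example, assuming the three manifests above were saved locally (the filenames here are illustrative), they can be applied with kubectl:
kubectl apply -f priorityclass-default.yaml \
  -f priorityclass-pause-pods.yaml \
  -f pause-pods-deployment.yaml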
priorityclass.scheduling.k8s.io/default created
priorityclass.scheduling.k8s.io/pause-pods created
deployment.apps/pause-pods created
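If you want to double-check the priority classes, you can list them by name:
kubectl get priorityclass default pause-pods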
Once this process completes, the two pause pods will be running.
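One way to check is by using the namespace and label from the Deployment above:
kubectl get pods -n other -l run=pause-pods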
NAME READY STATUS RESTARTS AGE
pause-pods-7f7669b6d7-v27sl 1/1 Running 0 5m6s
pause-pods-7f7669b6d7-v7hqv 1/1 Running 0 5m6s
We can now observe that additional nodes have been provisioned by the Cluster Autoscaler.
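Listing the nodes shows the extra capacity (node names and ages will differ in your cluster):
kubectl get nodes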
NAME STATUS ROLES AGE VERSION
ip-10-42-10-159.us-west-2.compute.internal Ready <none> 3d v1.31-eks-036c24b
ip-10-42-10-111.us-west-2.compute.internal Ready <none> 33s v1.31-eks-036c24b
ip-10-42-10-133.us-west-2.compute.internal Ready <none> 33s v1.31-eks-036c24b
ip-10-42-11-143.us-west-2.compute.internal Ready <none> 3d v1.31-eks-036c24b
ip-10-42-11-81.us-west-2.compute.internal Ready <none> 3d v1.31-eks-036c24b
ip-10-42-12-152.us-west-2.compute.internal Ready <none> 3m11s v1.31-eks-036c24b
These two additional nodes are not running any workloads other than our pause pods. Because the pause pods use a priority of -1, lower than the default of 0, the scheduler will preempt (evict) them as soon as "real" workloads are scheduled and need the capacity.
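To see that behavior, any workload that relies on the default priority class will do. Here is a minimal sketch; the name, namespace, image, and resource request are all illustrative:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app        # illustrative name
  namespace: other        # illustrative; any namespace works
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      # No priorityClassName is set, so the "default" class (value 0) applies,
      # which outranks the pause pods (value -1) and can preempt them when
      # the cluster runs out of free capacity.
      containers:
        - name: app
          image: nginx          # illustrative image
          resources:
            requests:
              memory: "1Gi"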