Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet

Author: Mayank Kumar (Salesforce)

Kubernetes StatefulSets, since their introduction in 1.5 and becoming stable in 1.9, have been widely used to run stateful applications. They provide stable pod identity, persistent per pod storage and ordered graceful deployment, scaling and rolling updates. You can think of StatefulSet as the atomic building block for running complex stateful applications. As the use of Kubernetes has grown, so has the number of scenarios requiring StatefulSets. Many of these scenarios, require faster rolling updates than the currently supported one-pod-at-a-time updates, in the case where you're using the OrderedReady Pod management policy for a StatefulSet.

Here are some examples:

  • I am using a StatefulSet to orchestrate a multi-instance, cache based application where the size of the cache is large. The cache starts cold and requires some siginificant amount of time before the container can start. There could be more initial startup tasks that are required. A RollingUpdate on this StatefulSet would take a lot of time before the application is fully updated. If the StatefulSet supported updating more than one pod at a time, it would result in a much faster update.

  • My stateful application is composed of leaders and followers or one writer and multiple readers. I have multiple readers or followers and my application can tolerate multiple pods going down at the same time. I want to update this application more than one pod at a time so that i get the new updates rolled out quickly, especially if the number of instances of my application are large. Note that my application still requires unique identity per pod.

In order to support such scenarios, Kubernetes 1.24 includes a new alpha feature to help. Before you can use the new feature you must enable the MaxUnavailableStatefulSet feature flag. Once you enable that, you can specify a new field called maxUnavailable, part of the spec for a StatefulSet. For example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady  # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.8
        imagePullPolicy: IfNotPresent
        name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2 # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate

If you enable the new feature and you don't specify a value for maxUnavailable in a StatefulSet, Kubernetes applies a default maxUnavailable: 1. This matches the behavior you would see if you don't enable the new feature.

I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that has 5 replicas, with maxUnavailable set to 2 and partition set to 0.

I can trigger a rolling update by changing the image to k8s.gcr.io/nginx-slim:0.9. Once I initiate the rolling update, I can watch the pods update 2 at a time as the current value of maxUnavailable is 2. The below output shows a span of time and is not complete. The maxUnavailable can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding up to the nearest integer.

kubectl get pods --watch 
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          85s
web-1   1/1     Running   0          2m6s
web-2   1/1     Running   0          106s
web-3   1/1     Running   0          2m47s
web-4   1/1     Running   0          2m27s
web-4   1/1     Terminating   0          5m43s ----> start terminating 4
web-3   1/1     Terminating   0          6m3s  ----> start terminating 3
web-3   0/1     Terminating   0          6m7s
web-3   0/1     Pending       0          0s
web-3   0/1     Pending       0          0s
web-4   0/1     Terminating   0          5m48s
web-4   0/1     Terminating   0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s  ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s

Note that as soon as the rolling update starts, both 4 and 3 (the two highest ordinal pods) start terminating at the same time. Pods with ordinal 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at the same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.

In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, the update starts at replica 4, then replica 3, then replica 2, and so on, one pod at a time. When going one pod at a time, its not possible for 3 to be running and ready before 4. When maxUnavailable is more than 1 (in the example scenario I set maxUnavailable to 2), it is possible that replica 3 becomes ready and running before replica 4 is ready—and that is ok. If you're a developer and you set maxUnavailable to more than 1, you should know that this outcome is possible and you must ensure that your application is able to handle such ordering issues that occur if any. When you set maxUnavailable greater than 1, the ordering is guaranteed in between each batch of pods being updated. That guarantee means that pods in update batch 2 (replicas 2 and 1) cannot start updating until the pods from batch 0 (replicas 4 and 3) are ready.

Although Kubernetes refers to these as replicas, your stateful application may have a different view and each pod of the StatefulSet may be holding completely different data than other pods. The important thing here is that updates to StatefulSets happen in batches, and you can now have a batch size larger than 1 (as an alpha feature).

Also note, that the above behavior is with podManagementPolicy: OrderedReady. If you defined a StatefulSet as podManagementPolicy: Parallel, not only maxUnavailable number of replicas are terminated at the same time; maxUnavailable number of replicas start in ContainerCreating phase at the same time as well. This is called bursting.

So, now you may have a lot of questions about:-

  • What is the behavior when you set podManagementPolicy: Parallel?
  • What is the behavior when partition to a value other than 0?

It might be better to try and see it for yourself. This is an alpha feature, and the Kubernetes contributors are looking for feedback on this feature. Did this help you achieve your stateful scenarios Did you find a bug or do you think the behavior as implemented is not intuitive or can break applications or catch them by surprise? Please open an issue to let us know.

Further reading and next steps