Running Highly Available Apps on Kubernetes
As Kubernetes becomes the de-facto standard for deploying applications, many of us are either running our applications on Kubernetes or trying to migrate to Kubernetes. In this blog post, I will go through some tips for running highly available applications on Kubernetes.
Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications. It is designed on the same principles that allow Google to run billions of containers a week, Kubernetes can scale without increasing your ops team.
In computing, the term availability is used to describe the period when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.
When running applications on Kubernetes, you can follow these points to make your applications highly available
- Don’t manage
Poddirectly. Instead, use workload resources that manage a set of pods on your behalf.
In Kubernetes, a
Pod represents a set of running containers on your cluster. Kubernetes pods have a defined lifecycle. For example, once a pod is running in your cluster then a critical fault on the node where that pod is running means that all the pods on that node fail. Kubernetes treats that level of failure as final: you would need to create a new
Pod to recover, even if the node later becomes healthy.
However, you don’t need to manage each
Pod directly. Instead, you can use workload resources that manage a set of pods on your behalf. These resources configure controllers that make sure the right number of the right kind of pods are running, to match the state you specified.
Kubernetes provides several built-in workload resources:
- Deployment and ReplicaSet:
Deploymentis a good fit for managing a stateless application workload on your cluster, where any
Deploymentis interchangeable and can be replaced if needed.
- StatefulSet: Let's you run one or more related Pods that do track state somehow.
PersistentVolumeto persist the data. This is useful for stateful applications which manage states for example Databases, Queues, etc.
- DaemonSet: Defines
Podsthat provide node-local facilities. It schedules a
DaemonSetonto every node which matches
- Job and CronJob: Define tasks that run to completion and then stop. Jobs represent one-off tasks, whereas
CronJobsrecur according to a schedule.
When using workload resources like
StatefulSet replicas value under
spec.replicas to more than one to make your application highly available.
RollingUpdate update strategy for workload resources.
Update strategy is used by Kubernetes to rollout new pods when you try to update existing workload resources (like Deployment, StatefuleSet, etc).
If you try to update all pods of a
Deployment at once without an update strategy, your application may become unavailable during the update.
Kubernetes allows you to configure an update strategy for your workload using
spec.updateStrategyfield DaemonSet and StatefuleSet
spec.strategyfield for Deployment
For your apps, set the
typeof the update strategy to the
RollingUpdate. when you set the
strategy type to
RollingUpdate , you can also specify
maxSurge to control the rolling update process.
Max Unavailable :
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down. The value cannot be 0 if
.spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if
MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion. At most one pod of the DaemonSet will be running on each node during the whole update process.
When a StatefulSet
.spec.updateStrategy.type is set to
RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.
3. Spread Pods across your cluster among failure domains such as regions, zones, and nodes.
The cluster admin also needs to make sure that there are enough nodes in the Kubernetes cluster and that each node has enough resources so that each new pod of the workload can be scheduled on different nodes.
The concept of the zone, and regions are not intrinsic to Kubernetes but it is more tied with the public cloud providers.
Kubernetes is designed so that a single Kubernetes cluster can run across multiple failure zones, typically where these zones fit within a logical grouping called a region. Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.
Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.
Distribute nodes across zones
In order you to make your application resilient to a zone failure, you first need to distribute your Kubernetes cluster nodes across multiple zones or regions. Kubernetes’ core does not create nodes for you; you need to do that yourself or use a tool such as the Cluster API to manage nodes on your behalf.
If your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better-expected availability, reducing the risk that a correlated failure affects your whole workload.
There are different methods that can be used for assigning pods to nodes
- Using node selectors with node labels
- Affinity and anti-affinity rules
For scheduling your pods to different failure domains, you need to use pod affinity with
topologyKey which value will be equal to the label of the failure domain.
There are two types of pod affinity:
requiredDuringSchedulingIgnoredDuringExecution: The scheduler can't schedule the Pod unless the rule is met. This functions like
nodeSelector, but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.
topologyKey is the key of node labels. If two Nodes are labeled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
- key: component
The above affinity rule will make sure that nodes are evenly distributed across the different zone.
If you use cloud provider controllers, labels related to zone, region, etc gets automatically added on each node, otherwise, these labels need to be added by cluster admin while provisioning new nodes for the Kubernetes cluster.
4. Handle voluntary disruptions using a pod disruption budget
There are two types of disruptions that can happen to your pods running on a Kubernetes cluster.
Involuntary disruptions are unavoidable hardware or system software errors. For example,
- A hardware failure of the physical machine backing the node
- Cluster administrator deletes VM (instance) by mistake
- Cloud provider or hypervisor failure makes VM disappear
- A kernel panic
- The node disappears from the cluster due to cluster network partition
- Eviction of a pod due to the node being out-of-resources.
Voluntary disruptions are disruptions caused by actions initiated by the application owner and cluster administrator.
Application Owner actions include
- Deleting the deployment or other controller that manages the pod
- Updating a deployment’s pod template causes a restart
- Directly deleting a pod (e.g. by accident)
Cluster administrator actions include:
- Draining a node for repair or upgrade.
- Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling ).
- Removing a pod from a node to permit something else to fit on that node.
These actions might be taken by the cluster administrator or the automation run by the cluster administrator or your cloud hosting provider.
Some ways to handle involuntary disruptions are
- Ensure your pod requests the resources it needs.
- Replicate your application for higher availability.
- When running replicated applications, spread applications across racks or across zones
Handling Voluntary disruptions
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
A PDB specifies the number of replicas that an application can tolerate having, relative to how many it is intended to have. For example, a Deployment that has a
.spec.replicas: 5 is supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then the Eviction API will allow voluntary disruption of one (but not two) pods at a time.
The group of pods that comprise the application is specified using a label selector, the same as the one used by the application’s controller (deployment, stateful-set, etc).
PodDisruptionBudget has three fields:
- A label selector
.spec.selectorspecifies the set of pods to which it applies. This field is required.
.spec.minAvailablewhich is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod.
minAvailablecan be either an absolute number or a percentage.
.spec.maxUnavailable(available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
You can specify only one of
minAvailable in a single
Example 1: With a
minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget's
Example 2: With a
minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
Example 3: With a
maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
Example 4: With a
maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.
- Involuntary disruptions cannot be prevented by PDBs; however, they do count against the budget.
- Cluster managers and hosting providers should use tools that respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.