The GKE Upgrade That Took Down Our Production Pods for 45 Minutes

The Cost of Complacency: A Kubernetes Node Pool Upgrade Gone Wrong

As we navigate the complex world of cloud computing, it’s easy to get caught up in the promise of scalability and reliability offered by platforms like Google Kubernetes Engine (GKE). But what happens when the automated processes we rely on to keep our applications running smoothly fail to deliver? In this article, we’ll explore a real-world example of a Kubernetes node pool upgrade that went awry, causing a 45-minute window of slow or failing requests to critical services. We’ll examine the factors that contributed to this issue and provide actionable advice on how to prevent similar problems in the future.

The Unexpected Upgrade Cycle

On a Tuesday morning during business hours, our alerting system fired, indicating that response times on our main API were spiking, and several pods were showing as not ready. The timing was strange, as we hadn’t deployed anything recently. It took us three minutes to realize that a node pool upgrade was in progress, with GKE cycling our nodes as part of an automatic minor version upgrade. The problem was that we hadn’t thought carefully enough about what this meant in practice.

GKE’s Default Upgrade Strategy

GKE’s default upgrade strategy is the surge upgrade: it adds a surge node to the pool, cordons and drains an existing node by evicting all of its pods, and lets the scheduler place them onto the remaining nodes and the new surge node. The process repeats until every node in the pool is on the new version. With the default surge of one (one extra node, zero nodes unavailable), it upgrades one node at a time, which sounds safe. However, we had not accounted for how surge upgrades interact with our pod topology.

The Interaction Between Surge Upgrades and Pod Topology

We had eight nodes in the pool, and most deployments ran with a replica count of two. Two replicas spread across eight nodes sounds fine, but when GKE drained a node, it evicted one replica and Kubernetes had to reschedule it. If the replacement pod took more than a few seconds to become ready, that deployment ran at half capacity for the whole window. For stateless API pods that each handled little traffic, losing half the replicas for thirty seconds was survivable. For two of our services, it wasn’t.
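
To make a drain like this less painful, one option is to run a third replica and tell the scheduler to spread replicas across nodes, so a single drain only ever touches one of them, and to give the pods a readiness probe so traffic only reaches a replacement once it can actually serve. The snippet below is a minimal sketch rather than our real manifest; the name example-api, the labels, the image, the port, and the /healthz path are all placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                         # hypothetical name, not our real service
spec:
  replicas: 3                               # a drain now removes a third of capacity instead of half
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread replicas across different nodes
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-api
      containers:
        - name: api
          image: example.com/example-api:1.0.0  # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:                       # traffic reaches a replacement pod only after this passes
            httpGet:
              path: /healthz                    # assumed health endpoint
              port: 8080
            periodSeconds: 5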

The Consequences of Inadequate Planning

We had not set up Pod Disruption Budgets (PDBs) on any of our workloads, which meant GKE’s node draining had no constraints from the workload side. It evicted pods freely, trusting that Kubernetes would reschedule them fast enough. For most pods, it was fast enough, but for those two critical services, it wasn’t. Setting up a PDB is straightforward, but we had not done so.
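
For reference, a minimal PodDisruptionBudget is only a few lines. The sketch below assumes the pods carry an app: example-api label (a placeholder, as above); with minAvailable: 1 and two replicas, the eviction API refuses to evict a pod whenever doing so would leave fewer than one replica running, and the drain retries until the budget allows it (GKE only forces the drain after a bounded timeout, on the order of an hour).

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-api-pdb          # hypothetical name
spec:
  minAvailable: 1                # with two replicas, at most one may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: example-api           # must match the labels on the pods you want to protect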

How to Prevent Similar Issues in the Future

To prevent similar problems, consider the following practices:

1. Understand Your Pod Topology

Take the time to understand how your pods are distributed across nodes. This will help you anticipate potential issues during upgrades.

2. Configure Pod Disruption Budgets

Set up PDBs so that a minimum number of replicas stays available at all times; a node drain then cannot evict a pod if doing so would drop the workload below that budget.

3. Choose the Right Upgrade Strategy

GKE’s surge upgrade is already a rolling process, but you can tune it (more surge nodes, zero unavailable nodes) or switch the node pool to the blue-green upgrade strategy, which brings up a full set of new nodes before draining the old ones.

4. Monitor Upgrade Progress

Keep a close eye on upgrade progress to detect any issues early on.

5. Test Your Application

Rehearse upgrades in a non-production cluster, for example by cordoning and draining a node while load tests run, to confirm the application stays available while pods are evicted and rescheduled.

6. Use Canary Deployments

Use canary deployments to roll out changes gradually, reducing the risk of downtime.
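
With plain Kubernetes objects, a simple form of this is a small canary Deployment whose pods share the Service’s selector label with the stable Deployment, so a fraction of traffic, roughly proportional to the replica counts, reaches the new version. The sketch below is illustrative; the names, labels, and image tag are placeholders, and it assumes the stable Deployment’s own selector also includes a track label (for example track: stable) so the two Deployments do not fight over each other’s pods.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api-canary       # hypothetical name
spec:
  replicas: 1                    # e.g. 1 canary pod next to 9 stable pods is roughly 10% of traffic
  selector:
    matchLabels:
      app: example-api
      track: canary
  template:
    metadata:
      labels:
        app: example-api         # shared with the stable Deployment, so the Service sends it traffic
        track: canary            # marks canary pods for monitoring and cleanup
    spec:
      containers:
        - name: api
          image: example.com/example-api:1.1.0-rc1   # candidate version, placeholder tag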

7. Implement Rollbacks

Implement rollbacks to quickly revert to a previous version in case of issues.

8. Use Kubernetes’ Built-in Features

Use Kubernetes’ built-in features, such as PDBs and rollout strategies, to manage upgrades.
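
For application rollouts (as opposed to node upgrades), the Deployment’s rolling update strategy plays a similar role to a PDB: it bounds how many pods may be missing while a new version rolls out. A minimal sketch, reusing the placeholder names from earlier:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0          # never drop below the desired replica count during a rollout
      maxSurge: 1                # start one extra pod before an old one is taken down
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: example.com/example-api:1.0.0   # placeholder image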

9. Automate Testing

Automate testing to ensure your application is stable during upgrades.

10. Use Monitoring Tools

Use monitoring tools to detect issues early on and respond quickly.

11. Document Your Setup

Document your setup, including your pod topology and PDB configurations.

12. Communicate with Your Team

Communicate with your team to ensure everyone is aware of upgrade schedules and potential issues.

13. Plan for Downtime

Plan for downtime during upgrades to minimize the impact on your users.

14. Use a Blue-Green Deployment

Use a blue-green deployment to switch between versions quickly and easily.
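
In plain Kubernetes terms, one common pattern is two otherwise identical Deployments whose pod templates carry version: blue and version: green labels, sitting behind a single Service; changing the Service’s selector cuts all traffic over in one step and can be flipped back just as quickly. A minimal sketch with placeholder names and ports:

apiVersion: v1
kind: Service
metadata:
  name: example-api
spec:
  selector:
    app: example-api
    version: blue                # change to "green" to cut over, and back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080           # assumes the pods listen on 8080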

15. Implement a Change Management Process

Implement a change management process to ensure that upgrades are carefully planned and executed.

16. Use a CI/CD Pipeline

Use a CI/CD pipeline to automate testing and deployment.

17. Monitor Your Cluster and Application Performance

Watch node and pod health alongside application latency and error rates during upgrades, so a drain that degrades a service is caught in minutes rather than reported by users.

18. Use a Load Balancer

Use a load balancer to distribute traffic evenly during upgrades.
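
On GKE, a Service of type LoadBalancer (or an Ingress in front of a ClusterIP Service) provisions a Google Cloud load balancer, and traffic is only routed to pods that are Ready, which is exactly why readiness probes matter while nodes are being drained. A minimal sketch with placeholder names:

apiVersion: v1
kind: Service
metadata:
  name: example-api-lb           # hypothetical name
spec:
  type: LoadBalancer             # GKE provisions a Google Cloud network load balancer for this Service
  selector:
    app: example-api
  ports:
    - port: 80
      targetPort: 8080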

19. Schedule Upgrades with Maintenance Windows

Configure maintenance windows so that automatic node upgrades run outside business hours, when briefly losing a replica has the least impact on users.

20. Implement a Feedback Loop

Collect data on how each upgrade went and feed it back into your replica counts, PDBs, and upgrade settings.

21. Document Your Upgrade Process

Document your upgrade process, including any issues that arose and how they were resolved, so the next upgrade cycle does not repeat the same surprises.

22. Communicate with Your Users

Keep your users informed about upgrade schedules and any impact they might see.

23. Review and Improve Your Upgrade Process

Review and improve your upgrade process regularly to ensure it is efficient and effective.

By following these practices, you can prevent similar issues and keep your Kubernetes upgrades smooth. Plan ahead, communicate with your team and your users, and monitor your cluster and application performance while nodes are cycling. With careful planning and execution, a routine node pool upgrade does not have to cost you 45 minutes of degraded service.
