Lab 7.3: Implementing High Availability

Related Lesson: Lesson 7.3: High Availability
Navigation: ← Previous Lab: RBAC | Module Overview | Next Lab: Performance →

Objectives

  • Enable leader election
  • Deploy multiple replicas
  • Configure resource limits
  • Test failover scenarios
  • Set up a Pod Disruption Budget

Prerequisites

  • Completion of Lab 7.2
  • Operator ready for deployment
  • Understanding of leader election

Exercise 1: Enable Leader Election

Kubebuilder’s generated cmd/main.go already supports leader election via the --leader-elect flag.

Task 1.1: Review Leader Election Code

# Navigate to your operator project
cd ~/postgres-operator

# Review the leader election setup in main.go
grep -A 20 "LeaderElection" cmd/main.go

You should see code like:

var enableLeaderElection bool
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
    "Enable leader election for controller manager.")

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    // ... other options ...
    LeaderElection:         enableLeaderElection,
    LeaderElectionID:       "your-operator-leader-election",
})
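
Controller-runtime also exposes the election timings as manager options. The defaults (15s LeaseDuration, 10s RenewDeadline, 2s RetryPeriod) suit most operators; below is a minimal sketch of tuning them in cmd/main.go for faster failover at the cost of more API-server traffic — treat the values as illustrative:

// import "time" alongside the existing imports
leaseDuration := 15 * time.Second // non-leaders treat the lease as expired after this long
renewDeadline := 10 * time.Second // the leader must renew within this window or it loses leadership
retryPeriod := 2 * time.Second    // how often candidates retry acquiring/renewing the lease

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    // ... other options ...
    LeaderElection:   enableLeaderElection,
    LeaderElectionID: "your-operator-leader-election",
    LeaseDuration:    &leaseDuration,
    RenewDeadline:    &renewDeadline,
    RetryPeriod:      &retryPeriod,
    // Release the lease on clean shutdown so a standby takes over immediately.
    // Only safe if the process exits right after the manager stops.
    LeaderElectionReleaseOnCancel: true,
})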

Task 1.2: Enable Leader Election in Deployment

Update config/manager/manager.yaml to add the --leader-elect flag:

spec:
  template:
    spec:
      containers:
      - name: manager
        args:
        - --leader-elect
        - --health-probe-bind-address=:8081

Task 1.3: Deploy and Verify

# For Docker: build the image, load it into kind, and deploy the operator
make docker-build IMG=postgres-operator:latest
kind load docker-image postgres-operator:latest --name k8s-operators-course
make deploy IMG=postgres-operator:latest

# For Podman: Build and Deploy operator - use localhost/ prefix to match the loaded image
make docker-build IMG=postgres-operator:latest CONTAINER_TOOL=podman
podman save localhost/postgres-operator:latest -o /tmp/postgres-operator.tar
kind load image-archive /tmp/postgres-operator.tar --name k8s-operators-course
rm /tmp/postgres-operator.tar
make deploy IMG=localhost/postgres-operator:latest

# Check for lease object
kubectl get lease -n postgres-operator-system

# Check logs for leader election
kubectl logs -n postgres-operator-system -l control-plane=controller-manager | grep -i "leader"
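
The exact wording varies with the controller-runtime version, but you should see client-go leader-election messages along these lines (the lease name comes from LeaderElectionID):

attempting to acquire leader lease postgres-operator-system/your-operator-leader-election...
successfully acquired lease postgres-operator-system/your-operator-leader-election

With multiple replicas (Exercise 2), only one pod logs the "successfully acquired" line; the others keep retrying until the lease frees up.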

Exercise 2: Deploy Multiple Replicas

Task 2.1: Update Deployment Replicas

Edit config/manager/manager.yaml to increase replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  replicas: 3  # Change from 1 to 3
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    spec:
      containers:
      - name: manager
        args:
        - --leader-elect  # Required for HA
        - --health-probe-bind-address=:8081
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
          requests:
            cpu: 10m
            memory: 64Mi
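
Replicas scheduled onto the same node share its fate, so real HA also means spreading them across nodes. A sketch using preferred pod anti-affinity (preferred rather than required, so it still schedules on a single-node kind cluster):

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname  # spread across nodes
              labelSelector:
                matchLabels:
                  control-plane: controller-manager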

Task 2.2: Deploy and Verify

# For Docker: build the image, load it into kind, and deploy the operator
make docker-build IMG=postgres-operator:latest
kind load docker-image postgres-operator:latest --name k8s-operators-course
make deploy IMG=postgres-operator:latest

# For Podman: Build and Deploy operator - use localhost/ prefix to match the loaded image
make docker-build IMG=postgres-operator:latest CONTAINER_TOOL=podman
podman save localhost/postgres-operator:latest -o /tmp/postgres-operator.tar
kind load image-archive /tmp/postgres-operator.tar --name k8s-operators-course
rm /tmp/postgres-operator.tar
make deploy IMG=localhost/postgres-operator:latest

# Check replicas
kubectl get deployment -n postgres-operator-system

# Check all pods are running
kubectl get pods -n postgres-operator-system -l control-plane=controller-manager

# Verify only one is leader (check logs)
for pod in $(kubectl get pods -n postgres-operator-system -l control-plane=controller-manager -o name); do
  echo "=== $pod ==="
  kubectl logs -n postgres-operator-system $pod | grep -i "leader" | tail -2
done

Exercise 3: Configure Resource Limits

Task 3.1: Set Resource Requests and Limits

Update deployment:

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
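
With requests below limits, these pods land in the Burstable QoS class, which affects their eviction order under node pressure. You can confirm the class from pod status:

kubectl get pods -n postgres-operator-system -l control-plane=controller-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'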

Task 3.2: Monitor Resource Usage

# Check resource usage
kubectl top pods -n postgres-operator-system -l control-plane=controller-manager

# Watch resource usage
watch kubectl top pods -n postgres-operator-system -l control-plane=controller-manager
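
kubectl top relies on metrics-server, which kind clusters do not ship by default. If the command errors, install it first (the --kubelet-insecure-tls patch is commonly needed on kind; adjust for your cluster):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'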

Exercise 4: Test Failover

Task 4.1: Identify Leader

# List all pods
kubectl get pods -n postgres-operator-system -l control-plane=controller-manager

# Find the lease and identify the leader
kubectl get lease -n postgres-operator-system -o yaml

# The holderIdentity field shows which pod is the leader
# Look for the pod name in the holderIdentity

# Check logs to confirm leader
LEADER_POD=$(kubectl get lease -n postgres-operator-system -o jsonpath='{.items[0].spec.holderIdentity}' | cut -d'_' -f1)
echo "Leader pod: $LEADER_POD"
kubectl logs -n postgres-operator-system $LEADER_POD | grep -i "acquired lease"

Task 4.2: Simulate Leader Failure

# Get the leader pod name
LEADER_POD=$(kubectl get lease -n postgres-operator-system -o jsonpath='{.items[0].spec.holderIdentity}' | cut -d'_' -f1)

# Delete the leader pod
kubectl delete pod -n postgres-operator-system $LEADER_POD

# Watch failover happen
watch kubectl get pods -n postgres-operator-system -l control-plane=controller-manager

# In another terminal, watch the lease
watch kubectl get lease -n postgres-operator-system -o jsonpath='{.items[0].spec.holderIdentity}'

# After a new leader is elected, verify reconciliation continues
kubectl logs -n postgres-operator-system -l control-plane=controller-manager --tail=20 | grep -i "reconcil"
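
With the default timings (15s lease duration, 2s retry period) and no lease release on shutdown, expect a standby to take over within roughly 2 to 15 seconds of the leader dying. A small polling loop to timestamp the handover, assuming the single lease from Exercise 1:

PREV=""
while true; do
  HOLDER=$(kubectl get lease -n postgres-operator-system -o jsonpath='{.items[0].spec.holderIdentity}')
  # Print a line only when the holder changes
  if [ "$HOLDER" != "$PREV" ]; then echo "$(date +%T) holder: $HOLDER"; fi
  PREV=$HOLDER
  sleep 1
done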

Exercise 5: Pod Disruption Budget

Task 5.1: Create PDB

Create config/manager/pdb.yaml:

cat > config/manager/pdb.yaml << 'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: controller-manager-pdb
  namespace: system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      control-plane: controller-manager
EOF

Add to config/manager/kustomization.yaml:

resources:
- manager.yaml
- pdb.yaml
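
Note that a PDB takes either minAvailable or maxUnavailable, never both. minAvailable: 2 is tied to the 3-replica count; the maxUnavailable form adapts automatically if you scale later. A sketch of the alternative spec:

spec:
  maxUnavailable: 1  # at most one pod may be evicted at a time, at any replica count
  selector:
    matchLabels:
      control-plane: controller-manager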

Task 5.2: Deploy and Test PDB

# For Docker: build the image, load it into kind, and deploy the operator
make docker-build IMG=postgres-operator:latest
kind load docker-image postgres-operator:latest --name k8s-operators-course
make deploy IMG=postgres-operator:latest

# For Podman: Build and Deploy operator - use localhost/ prefix to match the loaded image
make docker-build IMG=postgres-operator:latest CONTAINER_TOOL=podman
podman save localhost/postgres-operator:latest -o /tmp/postgres-operator.tar
kind load image-archive /tmp/postgres-operator.tar --name k8s-operators-course
rm /tmp/postgres-operator.tar
make deploy IMG=localhost/postgres-operator:latest

# Verify PDB is created
kubectl get pdb -n postgres-operator-system

# Check PDB status
kubectl describe pdb -n postgres-operator-system postgres-operator-controller-manager-pdb

Important: PDB only protects against voluntary disruptions (evictions), NOT direct kubectl delete pod commands!

Task 5.3: Observe PDB During a Rollout Restart

A convenient way to watch the disruption budget in action is kubectl rollout restart. Strictly speaking, rollouts do not go through the Eviction API (the deployment controller replaces pods directly), but the PDB status updates live as pods churn, which makes the accounting easy to observe:

# Check current PDB status - note ALLOWED DISRUPTIONS
kubectl get pdb -n postgres-operator-system

# Expected output:
# NAME                                        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# postgres-operator-controller-manager-pdb   2               N/A               1                     5m

# Trigger a rolling restart (paced by the Deployment's RollingUpdate strategy)
kubectl rollout restart deployment/postgres-operator-controller-manager -n postgres-operator-system

# Watch the rollout - at least 2 pods stay available throughout
kubectl get pods -n postgres-operator-system -l control-plane=controller-manager -w

# In another terminal, watch PDB status during rollout
watch kubectl get pdb -n postgres-operator-system

What you should observe:

  • Pods are replaced one at a time (not all at once)
  • ALLOWED DISRUPTIONS changes as pods are terminated/created
  • At least 2 pods remain Running throughout the rollout

Task 5.4: Understand How PDB Works with Rollouts

Let’s understand the math behind PDB:

# Check current state
kubectl get pdb -n postgres-operator-system

# Formula: ALLOWED DISRUPTIONS = currentHealthy - minAvailable
# With 3 healthy pods and minAvailable=2: 3 - 2 = 1 disruption allowed

Important: PDB does NOT block rollouts! The deployment controller deletes old pods directly rather than evicting them, so the one-at-a-time pacing comes from the RollingUpdate strategy; the PDB status simply tracks the changing pod count:

  1. Initial: 3 healthy pods, minAvailable=2, allowedDisruptions=1
  2. Rollout starts: a surge pod is created → 4 healthy
  3. allowedDisruptions = 4 - 2 = 2 → an old pod is terminated
  4. Now 3 healthy (2 old + 1 new), allowedDisruptions = 1
  5. Another surge pod is created → 4 healthy → another old pod terminated
  6. Repeat until all pods run the new revision

Throughout the rollout the PDB keeps guarding against concurrent evictions - a node drain at the same moment would still be limited by minAvailable.
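
To make that pacing explicit rather than relying on the 25% defaults (which work out to maxSurge=1, maxUnavailable=0 for 3 replicas), you could pin the strategy in config/manager/manager.yaml - a sketch:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # create at most one extra pod during the rollout
      maxUnavailable: 0  # never drop below the desired replica count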

What PDB Actually Protects Against

PDB protects against external disruptions, not deployment rollouts:

# PDB protects against these scenarios:

# 1. Node drain (cluster maintenance)
kubectl drain <node-name> --ignore-daemonsets
# The drain is blocked if evicting a pod would violate minAvailable

# 2. Cluster Autoscaler scale-down
# The autoscaler won't remove a node if doing so would violate a PDB

# 3. Manual or tool-driven calls to the Eviction API
# Anything that evicts pods (rather than deleting them) respects PDB

# Caution: kubelet node-pressure eviction does NOT go through the
# Eviction API, so a PDB cannot protect against resource-pressure kills
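
You can also exercise the Eviction API by hand and watch the budget enforce minAvailable. A sketch - the pod name is a placeholder; the eviction succeeds while ALLOWED DISRUPTIONS is above zero and is rejected with a 429 once the budget is exhausted:

POD=<one-of-the-controller-manager-pods>
kubectl create --raw "/api/v1/namespaces/postgres-operator-system/pods/${POD}/eviction" -f - <<EOF
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {"name": "${POD}", "namespace": "postgres-operator-system"}
}
EOF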

Observe Replacement Pacing During a Rollout

Watch a rollout to see pods replaced one at a time while the PDB's ALLOWED DISRUPTIONS tracks the churn:

# Ensure we have 3 replicas and minAvailable=2
kubectl scale deployment/postgres-operator-controller-manager -n postgres-operator-system --replicas=3
kubectl patch pdb postgres-operator-controller-manager-pdb -n postgres-operator-system \
  --type='json' -p='[{"op": "replace", "path": "/spec/minAvailable", "value": 2}]'

# Wait for stable state
sleep 10

# Watch pods during rollout - notice they're replaced ONE at a time
kubectl get pods -n postgres-operator-system -l control-plane=controller-manager -w &

# Trigger rollout
kubectl rollout restart deployment/postgres-operator-controller-manager -n postgres-operator-system

# Watch the rollout - pods replaced sequentially, not all at once
# Press Ctrl+C when done watching

Without a PDB, a voluntary disruption such as a node drain could evict several pods at once. With the PDB in place, at least minAvailable pods remain running through any eviction-driven disruption.

Understanding PDB Behavior

# Check current PDB status
kubectl get pdb -n postgres-operator-system

# The columns mean:
# MIN AVAILABLE: Minimum pods that must remain running
# ALLOWED DISRUPTIONS: How many pods can be evicted right now
#
# Formula: ALLOWED DISRUPTIONS = currentHealthy - minAvailable
# Example: 3 healthy - 2 minimum = 1 allowed disruption
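
The same numbers are exposed as status fields on the PDB object, if you want them scriptable:

kubectl get pdb -n postgres-operator-system postgres-operator-controller-manager-pdb \
  -o jsonpath='{.status.currentHealthy}/{.status.expectedPods} healthy, {.status.disruptionsAllowed} disruption(s) allowed{"\n"}'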

Why kubectl delete pod doesn’t respect PDB:

  • kubectl delete is a direct deletion, not an eviction
  • PDB only protects against the Eviction API, which is used by:
    • kubectl drain (node maintenance)
    • Cluster Autoscaler (scale-down)
    • the descheduler and other tools that evict pods via the API
  • Deployment rollouts also bypass PDB: the controller deletes pods directly, paced by its RollingUpdate strategy
  • Scheduler preemption honors PDBs only on a best-effort basis
  • In production, maintenance tooling goes through eviction, so PDB works as intended

Cleanup

# Undeploy operator
make undeploy

# Or scale down for testing
kubectl scale deployment -n postgres-operator-system postgres-operator-controller-manager --replicas=1

Lab Summary

In this lab, you:

  • Enabled leader election via --leader-elect flag
  • Deployed multiple replicas by updating config/manager/manager.yaml
  • Configured resource limits
  • Tested failover by deleting the leader pod
  • Set up a Pod Disruption Budget
  • Observed PDB status during a rollout and exercised the Eviction API

Key Learnings

  1. Leader election is enabled via a command-line flag in Kubebuilder projects
  2. Increase replicas in config/manager/manager.yaml for HA
  3. Use make deploy to apply all configurations
  4. Failover is automatic - standby pods acquire the lease
  5. PDB only protects against voluntary disruptions (evictions, not direct deletion)
  6. Use kubectl drain or the Eviction API to test PDB - NOT kubectl delete pod
  7. Health checks are pre-configured by kubebuilder

Solutions

Complete working solutions for this lab are available in the solutions directory.

Next Steps

Now let’s optimize performance!