Lesson 7.3: High Availability

Introduction

Production operators need to be highly available: they should keep working even if individual pods fail. This lesson covers leader election, running multiple replicas, failover handling, and resource management for high availability.

Theory: High Availability

High availability ensures an operator keeps functioning despite pod and node failures.

Why High Availability?

Reliability:

  • Operators manage critical workloads
  • Single point of failure is unacceptable
  • Redundancy prevents outages
  • Failover ensures continuity

Scalability:

  • Handle increased load
  • Distribute work across replicas
  • Scale horizontally
  • Performance under load

Resilience:

  • Survive pod failures
  • Survive node failures
  • Automatic recovery
  • Zero-downtime deployments

Leader Election

Why Leader Election?

  • Controllers must not conflict
  • Only one should reconcile at a time
  • Prevents duplicate work
  • Ensures consistency

How It Works:

  • Controllers compete for lease
  • Winner becomes leader
  • Leader reconciles resources
  • Others wait as standby

Failover:

  • Leader renews lease periodically
  • If leader fails, lease expires
  • Another controller acquires lease
  • New leader takes over

Resource Management

Resource Requests:

  • Guaranteed resources
  • Scheduler uses for placement
  • Ensures operator has resources

Resource Limits:

  • Maximum resources
  • Prevents resource exhaustion
  • Protects other workloads
  • Allows overcommitment (the sum of limits may exceed node capacity while requests still fit)

Understanding high availability helps you build reliable, production-ready operators.

High Availability Architecture

Here’s how HA works for operators:

graph TB
    OPERATOR[Operator Deployment]
    
    OPERATOR --> REPLICA1[Replica 1]
    OPERATOR --> REPLICA2[Replica 2]
    OPERATOR --> REPLICA3[Replica 3]
    
    REPLICA1 --> LEADER[Leader Election]
    REPLICA2 --> LEADER
    REPLICA3 --> LEADER
    
    LEADER --> ACTIVE[Active Controller]
    LEADER --> STANDBY[Standby Controllers]
    
    style LEADER fill:#90EE90
    style ACTIVE fill:#FFB6C1

Leader Election

How Leader Election Works

sequenceDiagram
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3
    participant API as API Server
    
    R1->>API: Acquire Lease
    API-->>R1: Lease Acquired
    R1->>R1: Become Leader
    
    R2->>API: Try Acquire Lease
    API-->>R2: Lease Already Held
    R2->>R2: Standby Mode
    
    R3->>API: Try Acquire Lease
    API-->>R3: Lease Already Held
    R3->>R3: Standby Mode
    
    Note over R1: Leader runs controller
    
    R1->>API: Renew Lease
    API-->>R1: Renewed
    
    Note over R1: If leader fails...
    R2->>API: Acquire Lease
    API-->>R2: Lease Acquired
    R2->>R2: Become Leader

Leader Election in Kubebuilder

Kubebuilder’s generated cmd/main.go already includes leader election support via command-line flags:

// In cmd/main.go (generated by kubebuilder)
var enableLeaderElection bool
flag.BoolVar(&enableLeaderElection, "leader-elect", false,
    "Enable leader election for controller manager. "+
    "Enabling this will ensure there is only one active controller manager.")

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme:                 scheme,
    Metrics:                metricsserver.Options{BindAddress: metricsAddr},
    HealthProbeBindAddress: probeAddr,
    LeaderElection:         enableLeaderElection,
    LeaderElectionID:       "postgres-operator.example.com",
    // LeaderElectionReleaseOnCancel defines if the leader should step down
    // when the Manager ends. This requires the binary to immediately end
    // when the Manager is stopped.
    LeaderElectionReleaseOnCancel: true,
})
if err != nil {
    setupLog.Error(err, "unable to start manager")
    os.Exit(1)
}

To enable leader election when running your operator:

# Run with leader election enabled
./manager --leader-elect=true

For production deployments, update config/manager/manager.yaml:

args:
- --leader-elect

Multiple Replicas

Kubebuilder Deployment Configuration

Update the replica count in config/manager/manager.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
  labels:
    control-plane: controller-manager
spec:
  replicas: 3  # Increase from 1 to 3 for HA
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      containers:
      - name: manager
        image: controller:latest
        args:
        - --leader-elect  # Enable leader election for HA

Or use kustomize patches in config/default/manager_config_patch.yaml.
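Such a patch might look like the following sketch. The filename and structure follow common kubebuilder scaffolding, but verify the patch is actually referenced from your project's config/default/kustomization.yaml:

```yaml
# config/default/manager_config_patch.yaml (illustrative sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  replicas: 3            # run three replicas for HA
  template:
    spec:
      containers:
      - name: manager
        args:
        - --leader-elect  # only one replica reconciles at a time
```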

Replica Coordination

graph TB
    REPLICAS[3 Replicas]
    
    REPLICAS --> LEADER[1 Leader]
    REPLICAS --> STANDBY[2 Standby]
    
    LEADER --> RECONCILE[Reconciles Resources]
    STANDBY --> WAIT[Wait for Leadership]
    
    LEADER --> FAIL[Leader Fails]
    FAIL --> ELECT[New Leader Elected]
    ELECT --> RECONCILE
    
    style LEADER fill:#90EE90
    style STANDBY fill:#FFE4B5

Failover Process

Failover Flow

flowchart TD
    NORMAL[Leader Running] --> FAILURE[Leader Fails]
    FAILURE --> DETECT[Lease Expires]
    DETECT --> ELECT[New Leader Elected]
    ELECT --> RESUME[Resume Reconciliation]
    RESUME --> NORMAL
    
    style FAILURE fill:#FFB6C1
    style ELECT fill:#90EE90

Handling Failover

// Leader election handles failover automatically.
// When the leader fails:
// 1. The lease expires (after LeaseDuration)
// 2. A standby replica acquires the lease
// 3. The new leader starts reconciling
// 4. Because reconciliation is level-based and idempotent,
//    any work interrupted by the failover is simply redone

Resource Management

Resource Limits

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

Resource Sizing

graph LR
    RESOURCES[Resources]
    
    RESOURCES --> SMALL[Small Operator<br/>100m CPU, 128Mi]
    RESOURCES --> MEDIUM[Medium Operator<br/>500m CPU, 512Mi]
    RESOURCES --> LARGE[Large Operator<br/>1000m CPU, 1Gi]
    
    style SMALL fill:#90EE90
    style MEDIUM fill:#FFE4B5
    style LARGE fill:#FFB6C1

Pod Disruption Budget

PDB Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-operator-pdb
spec:
  minAvailable: 2  # with 3 replicas, allows one voluntary disruption at a time
  selector:
    matchLabels:
      control-plane: controller-manager  # must match the operator pod labels

PDB Protection

graph TB
    PDB[Pod Disruption Budget]
    
    PDB --> MIN[Min Available: 2]
    PDB --> PROTECT[Protects Replicas]
    
    PROTECT --> DRAIN[Limits Evictions During Drain]
    PROTECT --> DELETE[Blocks Excess Voluntary Disruptions]
    
    style PDB fill:#90EE90

Health Checks

Kubebuilder’s generated cmd/main.go automatically sets up health endpoints:

// In cmd/main.go (generated by kubebuilder)
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
    setupLog.Error(err, "unable to set up health check")
    os.Exit(1)
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
    setupLog.Error(err, "unable to set up ready check")
    os.Exit(1)
}

Liveness and Readiness in Deployment

The health probes are already configured in config/manager/manager.yaml:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 20

readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10

Key Takeaways

  • Leader election is built into kubebuilder via --leader-elect flag
  • Multiple replicas provide redundancy (update config/manager/manager.yaml)
  • Failover is automatic with leader election
  • Health checks are pre-configured by kubebuilder (/healthz, /readyz)
  • Resource limits prevent resource exhaustion
  • Pod Disruption Budgets protect availability
  • Idempotent operations handle failover gracefully

Understanding for Building Operators

When implementing high availability with kubebuilder:

  • Add --leader-elect to deployment args in config/manager/manager.yaml
  • Increase replicas to 3 in the deployment
  • Health probes are already configured by kubebuilder
  • Set appropriate resource limits in the deployment
  • Add Pod Disruption Budgets in config/manager/
  • Ensure your reconciliation logic is idempotent
  • Use make deploy to deploy with HA configuration

References

Further Reading

  • Kubernetes Operators by Jason Dobies and Joshua Wood
  • Kubernetes: Up and Running by Kelsey Hightower, Brendan Burns, and Joe Beda
  • Kubernetes High Availability

Next Steps

Now that you understand high availability, let’s learn about performance optimization.