Lesson 6.4: Debugging and Observability

Navigation: ← Previous: Integration Testing

Introduction

Even with comprehensive tests, operators can fail in production. Debugging and observability are essential for understanding what’s happening, diagnosing issues, and ensuring operators run smoothly. This lesson covers debugging techniques and adding observability to operators.

Debugging Workflow

Here’s a typical debugging workflow:

flowchart TD
    ISSUE[Issue Reported] --> LOGS[Check Logs]
    LOGS --> METRICS[Check Metrics]
    METRICS --> EVENTS[Check Events]
    EVENTS --> DEBUG[Debug Locally]
    DEBUG --> FIX[Fix Issue]
    FIX --> TEST[Test Fix]
    TEST --> DEPLOY[Deploy]
    
    style ISSUE fill:#FFB6C1
    style FIX fill:#90EE90

Debugging with Delve

Setting Up Delve

# Install Delve
go install github.com/go-delve/delve/cmd/dlv@latest

# Run operator with Delve
dlv debug ./cmd/manager/main.go

Using Delve

sequenceDiagram
    participant Dev
    participant Delve
    participant Operator
    
    Dev->>Delve: Start Debug Session
    Delve->>Operator: Launch with Debugger
    Dev->>Delve: Set Breakpoint
    Dev->>Operator: Trigger Operation
    Operator->>Delve: Hit Breakpoint
    Delve->>Dev: Show State
    Dev->>Delve: Inspect Variables
    Dev->>Delve: Step Through Code
    Dev->>Delve: Continue

Example: Debugging Reconcile

// Set breakpoint in Reconcile function
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)
    
    // Breakpoint here
    db := &databasev1.Database{}
    if err := r.Get(ctx, req.NamespacedName, db); err != nil {
        return ctrl.Result{}, err
    }
    
    // Inspect db here
    log.Info("Reconciling", "name", db.Name, "spec", db.Spec)
    
    // Continue debugging...
}

Structured Logging

Adding Structured Logs

import (
    "sigs.k8s.io/controller-runtime/pkg/log"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
    // Use structured logging
    ctrl.SetLogger(zap.New(zap.UseDevMode(true)))
    
    // In controller
    log := log.FromContext(ctx)
    log.Info("Reconciling Database",
        "name", db.Name,
        "namespace", db.Namespace,
        "generation", db.Generation,
        "replicas", db.Spec.Replicas,
    )
    
    log.Error(err, "Failed to reconcile",
        "name", db.Name,
        "error", err.Error(),
    )
}

Log Levels

graph TB
    LOGS[Logging]
    
    LOGS --> DEBUG[Debug: Detailed Info]
    LOGS --> INFO[Info: Normal Operations]
    LOGS --> WARN[Warn: Warnings]
    LOGS --> ERROR[Error: Errors]
    
    style DEBUG fill:#90EE90
    style ERROR fill:#FFB6C1

Metrics with Prometheus

Exposing Metrics

import (
    "sigs.k8s.io/controller-runtime/pkg/metrics"
    "github.com/prometheus/client_golang/prometheus"
)

var (
    reconcileTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "database_reconcile_total",
            Help: "Total number of reconciliations",
        },
        []string{"result"}, // success, error
    )
    
    reconcileDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "database_reconcile_duration_seconds",
            Help: "Duration of reconciliations",
        },
        []string{"result"},
    )
)

func init() {
    metrics.Registry.MustRegister(reconcileTotal, reconcileDuration)
}

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    start := time.Now()
    defer func() {
        duration := time.Since(start).Seconds()
        result := "success"
        if err != nil {
            result = "error"
        }
        reconcileDuration.WithLabelValues(result).Observe(duration)
        reconcileTotal.WithLabelValues(result).Inc()
    }()
    
    // Reconciliation logic...
}

Kubernetes Events

Emitting Events

import (
    "k8s.io/client-go/tools/record"
)

type DatabaseReconciler struct {
    client.Client
    Scheme   *runtime.Scheme
    Recorder record.EventRecorder
}

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Emit event on success
    r.Recorder.Event(db, "Normal", "Reconciled", "Database reconciled successfully")
    
    // Emit event on error
    if err != nil {
        r.Recorder.Event(db, "Warning", "ReconcileFailed", err.Error())
    }
}

Event Flow

graph LR
    OPERATOR[Operator] --> EVENT[Emit Event]
    EVENT --> API[API Server]
    API --> STORAGE[Event Storage]
    STORAGE --> USER[User sees event]
    
    style OPERATOR fill:#FFB6C1
    style EVENT fill:#90EE90

Observability Stack

graph TB
    OPERATOR[Operator]
    
    OPERATOR --> LOGS[Logs]
    OPERATOR --> METRICS[Metrics]
    OPERATOR --> EVENTS[Events]
    OPERATOR --> TRACES[Traces]
    
    LOGS --> LOGGING[Logging System]
    METRICS --> PROMETHEUS[Prometheus]
    EVENTS --> KUBERNETES[Kubernetes]
    TRACES --> OTEL[OpenTelemetry]
    
    style OPERATOR fill:#FFB6C1
    style METRICS fill:#90EE90

Common Debugging Scenarios

Scenario 1: Reconcile Not Triggering

// Check if controller is running
kubectl get pods -l control-plane=controller-manager

// Check logs
kubectl logs -l control-plane=controller-manager

// Check if resource exists
kubectl get database test-db

// Check events
kubectl get events --field-selector involvedObject.name=test-db

Scenario 2: Resource Not Created

// Add detailed logging
log.Info("Creating StatefulSet",
    "name", statefulSet.Name,
    "namespace", statefulSet.Namespace,
    "spec", statefulSet.Spec,
)

// Check for errors
if err := r.Create(ctx, statefulSet); err != nil {
    log.Error(err, "Failed to create StatefulSet",
        "name", statefulSet.Name,
        "error", err.Error(),
    )
    return ctrl.Result{}, err
}

Scenario 3: Status Not Updating

// Verify status update
log.Info("Updating status",
    "phase", db.Status.Phase,
    "ready", db.Status.Ready,
)

if err := r.Status().Update(ctx, db); err != nil {
    log.Error(err, "Failed to update status")
    return ctrl.Result{}, err
}

// Verify update succeeded
log.Info("Status updated successfully")

Key Takeaways

Delve enables debugging operators with breakpoints
Structured logging provides context and traceability
Metrics expose operational data to Prometheus
Events communicate state changes to users
Observability stack combines logs, metrics, events, traces
Debug systematically using logs, metrics, and events
Add observability from the start
Use appropriate log levels (Debug, Info, Warn, Error)

Understanding for Building Operators

When debugging and adding observability:

Use Delve for local debugging
Add structured logging throughout
Expose metrics for monitoring
Emit events for user feedback
Use appropriate log levels
Debug systematically
Add observability early
Monitor in production

Lab 6.4: Adding Observability - Hands-on exercises for this lesson

References

Official Documentation

Next Steps

Congratulations! You’ve completed Module 6. You now understand:

Testing fundamentals and strategies
Unit testing with envtest
Integration testing with real clusters
Debugging and observability

In Module 7, you’ll learn about production deployment and best practices.