Lesson 6.4: Debugging and Observability
Introduction
Even with comprehensive tests, operators can fail in production. Debugging and observability are essential for understanding what’s happening, diagnosing issues, and ensuring operators run smoothly. This lesson covers debugging techniques and adding observability to operators.
Debugging Workflow
Here’s a typical debugging workflow:
flowchart TD
ISSUE[Issue Reported] --> LOGS[Check Logs]
LOGS --> METRICS[Check Metrics]
METRICS --> EVENTS[Check Events]
EVENTS --> DEBUG[Debug Locally]
DEBUG --> FIX[Fix Issue]
FIX --> TEST[Test Fix]
TEST --> DEPLOY[Deploy]
style ISSUE fill:#FFB6C1
style FIX fill:#90EE90
Debugging with Delve
Setting Up Delve
# Install Delve
go install github.com/go-delve/delve/cmd/dlv@latest
# Run operator with Delve
dlv debug ./cmd/manager/main.go
Using Delve
sequenceDiagram
participant Dev
participant Delve
participant Operator
Dev->>Delve: Start Debug Session
Delve->>Operator: Launch with Debugger
Dev->>Delve: Set Breakpoint
Dev->>Operator: Trigger Operation
Operator->>Delve: Hit Breakpoint
Delve->>Dev: Show State
Dev->>Delve: Inspect Variables
Dev->>Delve: Step Through Code
Dev->>Delve: Continue
Example: Debugging Reconcile
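// Set a breakpoint inside Reconcile, for example on the r.Get call
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Named logger so the variable does not shadow the log package
	logger := log.FromContext(ctx)
	// Breakpoint here: inspect req.NamespacedName
	db := &databasev1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		// A deleted object is not worth retrying, so NotFound is ignored
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Inspect db here
	logger.Info("Reconciling", "name", db.Name, "spec", db.Spec)
	// Continue debugging...
	return ctrl.Result{}, nil
}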
Structured Logging
Adding Structured Logs
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Use structured logging (development mode produces human-readable output)
	ctrl.SetLogger(zap.New(zap.UseDevMode(true)))
	// ... create and start the manager
}

// In the controller's Reconcile method
logger := log.FromContext(ctx)
logger.Info("Reconciling Database",
	"name", db.Name,
	"namespace", db.Namespace,
	"generation", db.Generation,
	"replicas", db.Spec.Replicas,
)
// Error already records err, so a separate "error" key is not needed
logger.Error(err, "Failed to reconcile", "name", db.Name)
Log Levels
graph TB
LOGS[Logging]
LOGS --> DEBUG[Debug: Detailed Info]
LOGS --> INFO[Info: Normal Operations]
LOGS --> WARN[Warn: Warnings]
LOGS --> ERROR[Error: Errors]
style DEBUG fill:#90EE90
style ERROR fill:#FFB6C1
Metrics with Prometheus
Exposing Metrics
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)
var (
reconcileTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "database_reconcile_total",
Help: "Total number of reconciliations",
},
[]string{"result"}, // success, error
)
reconcileDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "database_reconcile_duration_seconds",
Help: "Duration of reconciliations",
},
[]string{"result"},
)
)
func init() {
metrics.Registry.MustRegister(reconcileTotal, reconcileDuration)
}
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, err error) {
	start := time.Now()
	// err is a named return value, so the deferred func sees the final error
	defer func() {
		outcome := "success"
		if err != nil {
			outcome = "error"
		}
		reconcileDuration.WithLabelValues(outcome).Observe(time.Since(start).Seconds())
		reconcileTotal.WithLabelValues(outcome).Inc()
	}()
	// Reconciliation logic...
	return ctrl.Result{}, nil
}
Kubernetes Events
Emitting Events
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

type DatabaseReconciler struct {
	client.Client
	Scheme   *runtime.Scheme
	Recorder record.EventRecorder
}

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch db and run the reconciliation logic, which yields err ...

	// Emit a Warning event on error
	if err != nil {
		r.Recorder.Event(db, corev1.EventTypeWarning, "ReconcileFailed", err.Error())
		return ctrl.Result{}, err
	}

	// Emit a Normal event on success
	r.Recorder.Event(db, corev1.EventTypeNormal, "Reconciled", "Database reconciled successfully")
	return ctrl.Result{}, nil
}
Event Flow
graph LR
OPERATOR[Operator] --> EVENT[Emit Event]
EVENT --> API[API Server]
API --> STORAGE[Event Storage]
STORAGE --> USER[User sees event]
style OPERATOR fill:#FFB6C1
style EVENT fill:#90EE90
Observability Stack
graph TB
OPERATOR[Operator]
OPERATOR --> LOGS[Logs]
OPERATOR --> METRICS[Metrics]
OPERATOR --> EVENTS[Events]
OPERATOR --> TRACES[Traces]
LOGS --> LOGGING[Logging System]
METRICS --> PROMETHEUS[Prometheus]
EVENTS --> KUBERNETES[Kubernetes]
TRACES --> OTEL[OpenTelemetry]
style OPERATOR fill:#FFB6C1
style METRICS fill:#90EE90
Common Debugging Scenarios
Scenario 1: Reconcile Not Triggering
# Check if the controller is running
kubectl get pods -l control-plane=controller-manager
# Check logs
kubectl logs -l control-plane=controller-manager
# Check if the resource exists
kubectl get database test-db
# Check events
kubectl get events --field-selector involvedObject.name=test-db
Scenario 2: Resource Not Created
// Add detailed logging
log.Info("Creating StatefulSet",
	"name", statefulSet.Name,
	"namespace", statefulSet.Namespace,
	"spec", statefulSet.Spec,
)
// Check for errors (log.Error already records err, so no extra "error" key)
if err := r.Create(ctx, statefulSet); err != nil {
	log.Error(err, "Failed to create StatefulSet",
		"name", statefulSet.Name,
		"namespace", statefulSet.Namespace,
	)
	return ctrl.Result{}, err
}
Scenario 3: Status Not Updating
// Verify status update
log.Info("Updating status",
"phase", db.Status.Phase,
"ready", db.Status.Ready,
)
if err := r.Status().Update(ctx, db); err != nil {
log.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
// Verify update succeeded
log.Info("Status updated successfully")
Key Takeaways
- Delve enables debugging operators with breakpoints
- Structured logging provides context and traceability
- Metrics expose operational data to Prometheus
- Events communicate state changes to users
- Observability stack combines logs, metrics, events, traces
- Debug systematically using logs, metrics, and events
- Add observability from the start
- Use appropriate log levels (Debug, Info, Warn, Error)
Understanding for Building Operators
When debugging and adding observability:
- Use Delve for local debugging
- Add structured logging throughout
- Expose metrics for monitoring
- Emit events for user feedback
- Use appropriate log levels
- Debug systematically
- Add observability early
- Monitor in production
Related Lab
- Lab 6.4: Adding Observability - Hands-on exercises for this lesson
Next Steps
Congratulations! You’ve completed Module 6. You now understand:
- Testing fundamentals and strategies
- Unit testing with envtest
- Integration testing with real clusters
- Debugging and observability
In Module 7, you’ll learn about production deployment and best practices.