Skip to content

Monitoring & Metrics

Zelyo Operator exposes custom Prometheus metrics at the standard /metrics endpoint, giving you full observability into the operator's performance and security posture.

Prometheus Integration

Zelyo Operator's metrics are automatically available to any Prometheus instance scraping the operator pod. If you're using the kube-prometheus-stack, a ServiceMonitor is included in the Helm chart.

Quick Check

To see metrics locally while running the operator with make run:

curl http://localhost:8080/metrics | grep zelyo_operator_

Available Metrics

Controller Metrics

These tell you how your controllers are performing:

Metric Type Labels What It Tells You
zelyo_operator_controller_reconcile_total Counter controller, result How many reconciles happened, and whether they succeeded or failed
zelyo_operator_controller_reconcile_duration_seconds Histogram controller How long each reconcile takes (useful for spotting slowdowns)

Example queries:

# Reconcile error rate for SecurityPolicy controller
rate(zelyo_operator_controller_reconcile_total{controller="securitypolicy", result="error"}[5m])

# p99 reconcile latency
histogram_quantile(0.99, rate(zelyo_operator_controller_reconcile_duration_seconds_bucket{controller="securitypolicy"}[5m]))

Scanner Metrics

These track what the security scanners are finding:

Metric Type Labels What It Tells You
zelyo_operator_scanner_findings_total Counter scanner, severity Total findings by scanner type and severity
zelyo_operator_scanner_scan_duration_seconds Histogram scanner How long scans take to complete
zelyo_operator_scanner_resources_scanned_total Counter scanner How many resources (pods) have been scanned

Example queries:

# Critical findings per minute
rate(zelyo_operator_scanner_findings_total{severity="critical"}[5m])

# Total resources scanned in the last hour
increase(zelyo_operator_scanner_resources_scanned_total[1h])

Policy Metrics

These track the security posture of your policies:

Metric Type Labels What It Tells You
zelyo_operator_policy_violations Gauge policy, namespace, severity Current violation count per policy (goes up and down)
zelyo_operator_policy_phase_info Gauge policy, namespace, phase Current phase of each policy (Active, Error, etc.)

Example queries:

# All policies with violations
zelyo_operator_policy_violations > 0

# Policies in Error state
zelyo_operator_policy_phase_info{phase="Error"} == 1

ClusterScan Metrics

These track scheduled cluster-wide scans:

Metric Type Labels What It Tells You
zelyo_operator_clusterscan_completed_total Counter scan, namespace How many scans have completed
zelyo_operator_clusterscan_findings Gauge scan, namespace Findings from the most recent scan run

Cost Metrics

Metric Type Labels What It Tells You
zelyo_operator_cost_rightsizing_recommendations Gauge policy, namespace How many pods need resource limit adjustments

Grafana Dashboard

You can build a Grafana dashboard using these metrics. Here's a suggested layout:

Panel Ideas

Panel Query Visualization
Security Score 1 - (zelyo_operator_policy_violations / zelyo_operator_scanner_resources_scanned_total) Gauge (0-100%)
Critical Findings zelyo_operator_scanner_findings_total{severity="critical"} Stat (big number)
Violations Over Time sum(zelyo_operator_policy_violations) by (severity) Time series
Reconcile Latency histogram_quantile(0.95, ...) Time series
Scan Completion Rate rate(zelyo_operator_clusterscan_completed_total[24h]) Stat

Alerting Rules

Here are example Prometheus alerting rules for Zelyo Operator:

groups:
  - name: zelyo-operator
    rules:
      - alert: Zelyo OperatorCriticalViolations
        expr: zelyo_operator_scanner_findings_total{severity="critical"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical security violations detected"
          description: "{{ $labels.scanner }} found {{ $value }} critical findings"

      - alert: Zelyo OperatorReconcileErrors
        expr: rate(zelyo_operator_controller_reconcile_total{result="error"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Zelyo Operator controller {{ $labels.controller }} has high error rate"

      - alert: Zelyo OperatorSlowReconcile
        expr: histogram_quantile(0.99, rate(zelyo_operator_controller_reconcile_duration_seconds_bucket[5m])) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Zelyo Operator controller {{ $labels.controller }} reconcile taking >30s"