Architecture
Overview
Zelyo Operator is your Digital SRE and Security Engineer — a Kubernetes Operator built with Kubebuilder that autonomously observes, reasons about, and acts on security and reliability issues in your production clusters. It runs as a single deployment, continuously protecting your workloads while you focus on building features.
How It Works: Observe → Reason → Act
SecurityPolicy scans pods → Correlator groups signals → LLM diagnoses root cause → GitHub PR with fix
MonitoringPolicy watches → Anomaly detector fires → into a unified incident → or Slack/PagerDuty alert
ClusterScan evaluates CIS → Compliance framework maps → →
- You declare intent by creating CRDs like
SecurityPolicy, MonitoringPolicy, or ClusterScan - Zelyo Operator observes — scanning pods, watching restart rates, evaluating compliance
- The brain reasons — anomaly detector builds baselines, correlator groups events into incidents, LLM generates structured JSON fix plans
- The engine acts — remediation engine validates fixes, GitHub engine opens PRs, notifier routes alerts
- It never stops — continuous reconciliation catches new violations, drift, and anomalies
System Architecture
graph TB
subgraph "Kubernetes Cluster — Read-Only Access"
Pods[Running Pods]
Secrets[K8s Secrets]
NS[Namespaces]
Events[K8s Events]
Logs[Pod Logs & Metrics]
end
subgraph "Zelyo Operator — The Digital SRE"
subgraph "Observe (Controllers)"
SecPolCtrl["SecurityPolicy<br/>pod scanning"]
MonCtrl["MonitoringPolicy<br/>pod restart watch"]
ScanCtrl["ClusterScan<br/>scheduled compliance"]
CostCtrl["CostPolicy<br/>resource analysis"]
GitCtrl["GitOpsRepository<br/>repo discovery"]
end
subgraph "Reason (The Brain)"
AD["anomaly<br/>σ-deviation baselines"]
CE["correlator<br/>incident grouping"]
CF["compliance<br/>CIS/NIST/SOC2 mapping"]
LD["drift<br/>cluster vs Git diffing"]
LLM["llm<br/>structured JSON reasoning"]
end
subgraph "Act (Execution)"
RE["remediation<br/>risk-scored fix plans"]
GH["github<br/>JWT auth, PR lifecycle"]
NF["notifier<br/>dedup + rate limit"]
end
end
subgraph "External Integrations"
GitHub["Your GitOps Repo"]
Alerts["Slack · Teams · PagerDuty"]
Prometheus["Prometheus · Grafana"]
ArgoFlux["ArgoCD / Flux"]
end
Pods --> SecPolCtrl & ScanCtrl & CostCtrl
Events & Logs --> MonCtrl
Secrets --> GitCtrl
SecPolCtrl -->|findings| CE
MonCtrl -->|pod restarts| AD
ScanCtrl -->|findings| CF
AD -->|anomalies| CE
CF -->|violations| CE
CostCtrl -->|waste| CE
LD -->|drift| CE
CE -->|correlated incidents| LLM
LLM -->|JSON fix plan| RE
RE -->|Protect Mode| GH
RE -->|Audit Mode| NF
GH --> GitHub
GitHub --> ArgoFlux
NF --> Alerts
SecPolCtrl & MonCtrl --> Prometheus
The Digital SRE Brain (internal/)
The intelligence lives entirely within the internal/ packages. These form the autonomous pipeline that converts raw Kubernetes telemetry into actionable GitOps Pull Requests.
Observe Layer
| Package | What the Digital SRE Observes |
scanner | 8 pluggable scanners — RBAC, container security, images, PodSecurity, secrets, network, privilege escalation, resource limits |
monitor | Real-time Kubernetes resource watcher with event dispatch |
costoptimizer | Resource utilization analysis — idle workloads, rightsizing, spot readiness |
Reason Layer
| Package | How the Digital SRE Thinks |
anomaly | Statistical baseline engine — σ-deviation detection with sliding windows (1000 data points per metric) |
correlator | Time-windowed event grouping — merges security findings + anomalies + crashes into unified incidents |
compliance | Maps findings to CIS Kubernetes Benchmark controls (15 controls) with evidence attachment |
drift | Live drift detector — recursive object diffing across 9 resource types, shadow resource detection |
llm | Multi-provider LLM client — OpenRouter, OpenAI, Anthropic, Azure, Ollama with circuit breaker + retry |
Act Layer
| Package | How the Digital SRE Acts |
remediation | LLM-powered fix generation — structured JSON output, risk scoring (0-100), blast radius protection |
github | GitHub App engine — RS256 JWT auth, token caching, branch → commit → PR → label lifecycle (stdlib only) |
gitops | GitOps interface + ArgoCD/Flux/Kustomize/Helm source discovery |
notifier | Multi-channel delivery — Slack, Teams, PagerDuty, webhooks with severity filtering + deduplication |
Controllers — The Digital SRE's Responsibilities
| Controller | Observe | Reason | Act |
| SecurityPolicy | Scans pods for violations | Feeds findings → correlator | — |
| MonitoringPolicy | Watches pod restart counts | Feeds → anomaly detector → correlator | — |
| ClusterScan | Runs scheduled scans | Evaluates CIS compliance | Creates ScanReport CRs, emits ComplianceViolation events |
| RemediationPolicy | — | Queries correlator for open incidents | LLM plan → validates → opens GitOps PR |
| GitOpsRepository | Discovers repo structure | — | Provides Git context for remediation |
| CostPolicy | Analyzes resource utilization | Identifies waste | — |
| ZelyoConfig | — | — | Configures global settings |
Controller Lifecycle
Every controller follows the standard lifecycle pattern:
stateDiagram-v2
[*] --> Pending: Resource created
Pending --> Active: Validation passes
Pending --> Error: Validation fails
Pending --> Degraded: Partial validation
Active --> Active: Periodic re-reconcile
Error --> Active: Issue resolved
Degraded --> Active: Issue resolved
Active --> Error: Runtime failure
Scanner Engine
The scanner engine is pluggable — each scanner registers by rule type, and controllers look them up from a shared registry.
SecurityPolicy.spec.rules[].type → Registry.Get(type) → scanner.Scan(pods) → []Finding
Available Scanners
| Scanner | Rule Type | What It Checks |
| Container Security Context | container-security-context | runAsNonRoot, privileged, readOnlyRootFilesystem, allowPrivilegeEscalation |
| Resource Limits | resource-limits | Missing CPU/memory requests and limits |
| Image Pinning | image-vulnerability | :latest tags, missing digest pins |
| Pod Security | pod-security | hostNetwork, hostPID, hostIPC, hostPath, SYS_ADMIN, NET_RAW |
| Privilege Escalation | privilege-escalation | Root UID, auto-mounted tokens, unmasked /proc |
| Secrets Exposure | secrets-exposure | Hardcoded secrets in env vars, sensitive patterns |
| Network Policy | network-policy | Unlabeled pods, hostPort usage |
| RBAC Audit | rbac-audit | Default service account usage, admin-named SAs |
Status Conditions
Every resource uses Kubernetes-standard status conditions:
| Condition | Meaning |
Ready | Fully reconciled and operational |
SecretResolved | Referenced K8s Secret is accessible |
ScanCompleted | Security scan finished |
GitOpsConnected | GitOps repository available |
Prometheus Metrics
| Metric | Type | What It Tracks |
zelyo_operator_controller_reconcile_total | Counter | Reconcile operations per controller |
zelyo_operator_controller_reconcile_duration_seconds | Histogram | Reconcile latency |
zelyo_operator_scanner_findings_total | Counter | Findings by scanner and severity |
zelyo_operator_scanner_resources_scanned_total | Counter | Total resources scanned |
zelyo_operator_policy_violations | Gauge | Current violations per policy |
zelyo_operator_clusterscan_completed_total | Counter | Completed cluster scans |
zelyo_operator_clusterscan_findings | Gauge | Findings from last scan |
zelyo_operator_cost_rightsizing_recommendations | Gauge | Pending rightsizing recommendations |
Security Model
- Read-only cluster access: Only
get, list, watch verbs on cluster resources - No direct mutations: All fixes delivered as GitOps PRs, never applied directly
- API key isolation: LLM keys in Kubernetes Secrets, never logged
- Non-root container: UID 65532,
scratch image, read-only rootfs - Signed artifacts: Cosign-signed images with SBOM attestations
- Admission webhooks: Validates SecurityPolicy resources before persistence
Project Layout
zelyo-operator/
├── api/v1alpha1/ # CRD type definitions (9 types + conditions)
├── cmd/main.go # Entrypoint — wires controllers, brain, scanners
├── config/ # Kustomize manifests (CRDs, RBAC, webhook, samples)
├── internal/
│ ├── controller/ # 7 controllers (Observe → Reason → Act)
│ ├── scanner/ # 8 security scanners + registry
│ ├── anomaly/ # σ-deviation baseline engine
│ ├── correlator/ # Time-windowed incident correlation
│ ├── compliance/ # CIS/NIST/SOC2 framework mapping
│ ├── drift/ # Live cluster-vs-Git drift detection
│ ├── remediation/ # LLM-powered fix generation + risk scoring
│ ├── llm/ # Multi-provider LLM client + circuit breaker
│ ├── github/ # GitHub App engine (stdlib only)
│ ├── gitops/ # GitOps interface + source discovery
│ ├── notifier/ # Multi-channel notifications
│ ├── monitor/ # Real-time resource watcher
│ ├── conditions/ # Status condition helpers
│ ├── metrics/ # Prometheus metrics
│ └── webhook/ # Admission webhook
├── charts/ # Helm chart
├── test/ # E2E tests
└── docs/ # Documentation