A production-ready, AI-powered Kubernetes incident response platform that works with any tech stack.
KubeAI-Ops automatically detects issues in your Kubernetes cluster, analyzes root causes using Claude AI, and takes remediation actions - all while you sleep.
Traditional monitoring tells you something is wrong. KubeAI-Ops tells you why it's wrong and fixes it automatically.
```
Traditional Alerting                    KubeAI-Ops
--------------------                    ----------
Alert: "Pod CrashLoopBackOff"           Alert received
          |                                  |
          v                                  v
Page on-call engineer                   AI analyzes metrics + logs
          |                                  |
          v                                  v
SSH into cluster                        Root cause: "Memory leak in
          |                             user-service causing OOM kills.
          v                             Heap grew 300% in 2 hours."
Dig through logs                             |
          |                                  v
          v                             Auto-remediation: pod restarted,
Maybe find the issue                    deployment scaled, team notified
          |                                  |
          v                                  v
Manual fix                              Team reviews summary,
                                        not firefighting
```
KubeAI-Ops doesn't care what language your services are written in. As long as they expose:
| Requirement | Purpose | Example |
|---|---|---|
| `/metrics` endpoint | Prometheus scraping | `prom-client` (Node), `prometheus_client` (Python), `micrometer` (Java) |
| `/health` endpoint | Liveness probe | Return `{"status": "ok"}` |
| JSON logs to stdout | Log aggregation | `winston` (Node), `structlog` (Python), `logback` (Java) |
That's it. Three things, and your service is fully integrated with AI-powered incident response.
```javascript
const express = require('express');
const promClient = require('prom-client');

const app = express();

// 1. Add metrics
promClient.collectDefaultMetrics();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.send(await promClient.register.metrics());
});

// 2. Add health endpoint
app.get('/health', (req, res) => res.json({ status: 'ok' }));

// 3. Use JSON logging
const winston = require('winston');
const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});
```
```python
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# 1. Add metrics
app.mount("/metrics", make_asgi_app())

# 2. Add health endpoint
@app.get("/health")
def health():
    return {"status": "ok"}

# 3. Use JSON logging
import structlog
structlog.configure(
    processors=[structlog.processors.JSONRenderer()]
)
```
```go
import (
    "encoding/json"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.uber.org/zap"
)

// 1. Add metrics
http.Handle("/metrics", promhttp.Handler())

// 2. Add health endpoint
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
})

// 3. Use JSON logging
logger, _ := zap.NewProduction()
defer logger.Sync()
```
```yaml
# application.yml - that's literally it for Spring Boot
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  endpoint:
    health:
      show-details: always

# Add to pom.xml:
#   spring-boot-starter-actuator
#   micrometer-registry-prometheus
```
```rust
use actix_web::{get, App, HttpResponse, Responder};
use actix_web_prom::PrometheusMetrics;
use serde_json::json;

// 1. Add metrics (using actix-web-prom)
let prometheus = PrometheusMetrics::new("api", Some("/metrics"), None);
App::new().wrap(prometheus);

// 2. Add health endpoint
#[get("/health")]
async fn health() -> impl Responder {
    HttpResponse::Ok().json(json!({"status": "ok"}))
}

// 3. Use JSON logging (tracing + tracing-subscriber)
tracing_subscriber::fmt().json().init();
```
```bash
# Clone the repo
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops

# Start everything locally
./local-dev/setup.sh

# That's it! Access:
# - Your apps: http://localhost:8080
# - Grafana:   http://localhost:3000 (admin/admin)
# - Dashboard: http://localhost:5173
```
```bash
# 1. Configure AWS credentials
aws configure

# 2. Deploy infrastructure
cd terraform/environments/dev
terragrunt apply

# 3. Deploy platform
kubectl apply -k kubernetes/overlays/dev
kubectl apply -k argocd/install
```
```bash
# Install just the AI agent and observability stack
helm install kubeai-ops ./kubernetes/helm-charts/app-chart \
  --set aiAgent.enabled=true \
  --set observability.enabled=true \
  --set sampleApps.enabled=false
```
```
┌─────────────────────────────────────────────────────────────────────┐
│                        YOUR APPLICATIONS                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│  │ Node.js  │  │ Python   │  │ Go       │  │ Java     │  ...        │
│  │ Service  │  │ Service  │  │ Service  │  │ Service  │             │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘             │
│       │ /metrics    │ /metrics    │ /metrics    │ /metrics          │
│       │ /health     │ /health     │ /health     │ /health           │
│       │ JSON logs   │ JSON logs   │ JSON logs   │ JSON logs         │
└───────┼─────────────┼─────────────┼─────────────┼───────────────────┘
        │             │             │             │
        ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       KUBEAI-OPS PLATFORM                           │
│                                                                     │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐      │
│  │   PROMETHEUS    │  │      LOKI       │  │  ALERTMANAGER   │      │
│  │   (metrics)     │  │     (logs)      │  │    (alerts)     │      │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘      │
│           │                    │                    │               │
│           └────────────────────┼────────────────────┘               │
│                                ▼                                    │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                     AI INCIDENT AGENT                         │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │ AI ENGINES: Claude │ OpenAI │ Ollama │ Bedrock │ Mock   │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  │                                                               │  │
│  │  1. Receive alert (webhook)     5. Learn from resolution      │  │
│  │  2. Correlate metrics + logs    6. Notify via ChatOps         │  │
│  │  3. AI root cause analysis      7. Create tickets (Jira/GH)   │  │
│  │  4. Execute remediation         8. Escalate (PD/Opsgenie)     │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│  ┌─────────────────────────────┼───────────────────────────────┐    │
│  ▼             ▼               ▼              ▼            ▼   │    │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐│   │
│ │REMEDIATE │ │DASHBOARD │ │ CHATOPS  │ │ TICKETS  │ │ESCALATION││   │
│ │ Restart  │ │ Timeline │ │ Slack    │ │ Jira     │ │ PagerDuty││   │
│ │ Scale    │ │ Analytics│ │ Discord  │ │ GitHub   │ │ Opsgenie ││   │
│ │ Rollback │ │ Metrics  │ │ Teams    │ │ Issues   │ │          ││   │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘│   │
│                                                                 │   │
│ ┌────────────────────────────────────────────────────────────────┐  │
│ │   CLI: kubeai status │ diagnose │ incidents │ runbooks         │  │
│ └────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
```
| Component | Description |
|---|---|
| AI Incident Agent | Multi-engine AI (Claude, OpenAI, Ollama, Bedrock) for root cause analysis and auto-remediation |
| Incident Dashboard | Real-time SvelteKit UI with timeline, analytics, and remediation controls |
| CLI Tool | Full-featured CLI with diagnosis, runbooks, incident management, and shell mode |
| Observability Stack | Pre-configured Prometheus, Grafana dashboards, Loki, AlertManager |
| Engine | Description |
|---|---|
| Claude (Anthropic) | Production-ready with Claude 3.5 Sonnet/Opus support |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-3.5 Turbo support |
| Ollama | Local/self-hosted LLMs (Llama 3, Mistral, CodeLlama) |
| AWS Bedrock | Managed AI with Claude, Titan, Llama models |
| Mock Engine | Testing without API costs with configurable scenarios |
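All five engines plug into the same interface, so swapping backends is a config change. A minimal sketch of what such a plug-in abstraction can look like (class and method names here are illustrative, not the actual `base.py` API):

```python
from abc import ABC, abstractmethod

class AIEngine(ABC):
    """Common interface every backend (Claude, OpenAI, Ollama, ...) implements."""

    @abstractmethod
    def analyze(self, alert: dict, context: str) -> dict:
        """Return a root-cause analysis for the given alert."""

class MockEngine(AIEngine):
    """Deterministic engine for tests - no API calls, no cost."""

    def __init__(self, scenarios: dict):
        self.scenarios = scenarios  # trigger string -> canned analysis

    def analyze(self, alert: dict, context: str) -> dict:
        for trigger, analysis in self.scenarios.items():
            if trigger in alert.get("reason", ""):
                return analysis
        return {"root_cause": "unknown", "action": "escalate"}

engine = MockEngine({"OOMKilled": {"root_cause": "Memory leak", "action": "restart_pod"}})
print(engine.analyze({"reason": "OOMKilled"}, context=""))
# -> {'root_cause': 'Memory leak', 'action': 'restart_pod'}
```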
| Category | Integrations |
|---|---|
| ChatOps | Slack, Discord, Microsoft Teams - interactive incident response |
| Alerting | PagerDuty, Opsgenie - escalation and on-call management |
| Ticketing | Jira, GitHub Issues - automatic ticket creation |
| Monitoring | Datadog - metrics forwarding and enrichment |
| Component | Description |
|---|---|
| RBAC | Role-based access control (Admin, Operator, Viewer, Service) |
| OIDC/SSO | Enterprise SSO with any OIDC provider |
| API Keys | Service-to-service authentication |
| Privacy Controls | PII redaction, data retention policies, audit logging |
| OPA/Gatekeeper | Policy-as-code for Kubernetes admission control |
| Network Policies | Default-deny with service-specific rules |
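A simple way to picture the PII redaction step: regex passes that replace sensitive tokens before incident text reaches an external AI engine. This is an illustrative sketch, not the actual `pii_redactor.py` implementation, and real coverage would need many more patterns:

```python
import re

# Illustrative patterns only - a real redactor covers many more PII types
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace recognized PII with typed placeholders."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    return text

print(redact("user alice@example.com hit 10.0.0.12"))
# -> user [REDACTED_EMAIL] hit [REDACTED_IPV4]
```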
| Feature | Description |
|---|---|
| Incident Learning | Learns from past incidents to improve recommendations |
| Pattern Recognition | Identifies recurring issues and suggests preventive actions |
| Similarity Matching | Finds related past incidents for faster resolution |
| Feedback Loop | Operator feedback improves AI accuracy over time |
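Similarity matching can be as simple as comparing token overlap between incident summaries. The sketch below uses Jaccard similarity purely for illustration; the agent's actual matcher may use embeddings or another technique entirely:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap - a crude stand-in for embedding-based matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def most_similar(new_incident: str, history: list) -> str:
    """Return the past incident most similar to the new one."""
    return max(history, key=lambda past: jaccard(new_incident, past))

history = [
    "user-service OOMKilled memory leak heap growth",
    "db connection pool exhausted timeouts",
]
print(most_similar("user-service pod OOMKilled after heap growth", history))
# -> user-service OOMKilled memory leak heap growth
```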
| Component | Description |
|---|---|
| Terraform Modules | Production-ready AWS infrastructure (VPC, EKS, RDS, S3) |
| Kubernetes Manifests | Kustomize-based deployments for local/dev/staging/prod |
| ArgoCD Setup | GitOps-ready ApplicationSets and projects |
| CI/CD Pipelines | GitHub Actions for testing, building, deploying |
| Service | Purpose |
|---|---|
| API Gateway | FastAPI service demonstrating auth, rate limiting, circuit breaker |
| Order Service | CRUD service with PostgreSQL, events, proper error handling |
| Notification Service | Multi-channel notifications (email, SMS, webhook) |
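As a taste of what the gateway demonstrates, here is a circuit breaker reduced to its core: fail fast once a dependency has failed repeatedly, then probe again after a cooldown. Illustrative only; the API Gateway's real implementation may differ:

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `reset_after` seconds."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```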
```yaml
# ai-incident-agent/config/agent-config.yaml
ai_engine:
  backend: "claude"  # claude, openai, ollama, bedrock, mock

  claude:
    model: "claude-sonnet-4-20250514"
    api_key: "${ANTHROPIC_API_KEY}"

  openai:
    model: "gpt-4-turbo"
    api_key: "${OPENAI_API_KEY}"

  ollama:
    model: "llama3:8b"
    base_url: "http://ollama:11434"

  bedrock:
    model_id: "anthropic.claude-3-sonnet-20240229-v1:0"
    region: "us-east-1"

  mock:  # For testing without API costs
    response_delay_ms: 500
    mock_scenarios:
      - trigger: "OOMKilled"
        root_cause: "Memory leak in application"
        action: "restart_pod"

remediation:
  enabled: true
  auto_approve:
    - restart_pod        # Low risk - auto-approve
    - scale_replicas
  require_approval:
    - rollback_deployment  # Higher risk - require human approval
    - delete_pvc

# ChatOps integrations
chatops:
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_URL}"
    bot_token: "${SLACK_BOT_TOKEN}"
    channel: "#incidents"
    interactive: true  # Enable slash commands and buttons
  discord:
    enabled: false
    webhook_url: "${DISCORD_WEBHOOK_URL}"
    bot_token: "${DISCORD_BOT_TOKEN}"
  teams:
    enabled: false
    webhook_url: "${TEAMS_WEBHOOK_URL}"

# External integrations
integrations:
  pagerduty:
    enabled: true
    api_key: "${PAGERDUTY_API_KEY}"
    service_id: "${PAGERDUTY_SERVICE_ID}"
  opsgenie:
    enabled: false
    api_key: "${OPSGENIE_API_KEY}"
  jira:
    enabled: true
    url: "https://your-org.atlassian.net"
    email: "${JIRA_EMAIL}"
    api_token: "${JIRA_API_TOKEN}"
    project_key: "OPS"
  github:
    enabled: false
    token: "${GITHUB_TOKEN}"
    repo: "your-org/incidents"
  datadog:
    enabled: false
    api_key: "${DATADOG_API_KEY}"

# Security & Authentication
auth:
  enabled: true
  provider: "oidc"  # oidc, api_key, both
  oidc:
    issuer_url: "https://your-idp.com"
    client_id: "${OIDC_CLIENT_ID}"
    client_secret: "${OIDC_CLIENT_SECRET}"

# Privacy & Compliance
privacy:
  pii_redaction: true
  retention_days: 90
  audit_logging: true
```
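Conceptually, the `remediation` section drives a simple approval gate: low-risk actions run immediately, high-risk actions wait for a human, and anything unlisted is refused. A sketch of that decision (hypothetical function, mirroring the config lists above):

```python
# Mirrors the remediation.auto_approve / require_approval lists in the config
AUTO_APPROVE = {"restart_pod", "scale_replicas"}
REQUIRE_APPROVAL = {"rollback_deployment", "delete_pvc"}

def gate(action: str) -> str:
    """Decide how a proposed remediation proceeds."""
    if action in AUTO_APPROVE:
        return "execute"          # low risk: run immediately
    if action in REQUIRE_APPROVAL:
        return "await_approval"   # high risk: ping a human via ChatOps
    return "reject"               # unknown actions are never run

print(gate("restart_pod"))          # -> execute
print(gate("rollback_deployment"))  # -> await_approval
```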
Add Prometheus annotations to your deployment:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```
Create alert rules (optional):

```yaml
groups:
  - name: my-service
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx rate on {{ $labels.service }}"
```
```
kubeai-ops/
├── ai-incident-agent/          # The AI-powered incident response agent
│   ├── agent/                  # Core agent logic
│   │   ├── ai_engine/          # Multi-engine AI support
│   │   │   ├── base.py         # Abstract base class
│   │   │   ├── claude_engine.py    # Anthropic Claude
│   │   │   ├── openai_engine.py    # OpenAI GPT models
│   │   │   ├── ollama_engine.py    # Local Ollama models
│   │   │   ├── bedrock_engine.py   # AWS Bedrock
│   │   │   └── mock_engine.py      # Testing engine
│   │   ├── remediation/        # Auto-remediation system
│   │   ├── chatops/            # ChatOps integrations
│   │   │   ├── slack.py        # Slack with interactive commands
│   │   │   ├── discord.py      # Discord integration
│   │   │   └── teams.py        # Microsoft Teams
│   │   ├── integrations/       # External integrations
│   │   │   ├── pagerduty.py    # PagerDuty escalation
│   │   │   ├── opsgenie.py     # Opsgenie alerts
│   │   │   ├── jira.py         # Jira ticket creation
│   │   │   ├── github.py       # GitHub Issues
│   │   │   └── datadog.py      # Datadog forwarding
│   │   ├── auth/               # Authentication & authorization
│   │   │   ├── rbac.py         # Role-based access control
│   │   │   └── oidc.py         # OIDC/SSO integration
│   │   ├── learning/           # ML incident learning
│   │   │   └── incident_learner.py
│   │   ├── privacy/            # Privacy & compliance
│   │   │   └── pii_redactor.py
│   │   └── database/           # SQLAlchemy models
│   ├── config/                 # Agent configuration
│   └── tests/                  # 428+ passing tests
│
├── incident-dashboard/         # SvelteKit real-time dashboard
│   ├── src/
│   │   ├── routes/             # Pages and API routes
│   │   ├── lib/
│   │   │   ├── components/     # Reusable UI components
│   │   │   ├── stores/         # Svelte stores
│   │   │   └── api/            # API client
│   │   └── types/              # TypeScript definitions
│   └── tests/                  # 287+ passing tests
│
├── cli/                        # KubeAI CLI tool
│   ├── kubeai/
│   │   ├── commands/           # CLI commands
│   │   │   ├── status.py       # Service status
│   │   │   ├── diagnose.py     # AI diagnosis
│   │   │   ├── incidents.py    # Incident management
│   │   │   ├── runbooks.py     # Runbook execution
│   │   │   └── config.py       # Configuration
│   │   └── api.py              # API client
│   └── tests/                  # 261+ passing tests
│
├── observability/              # Pre-configured monitoring stack
│   ├── prometheus/             # Metrics + alert rules
│   ├── grafana/                # Dashboards
│   └── loki/                   # Log aggregation
│
├── kubernetes/                 # Kubernetes manifests
│   ├── base/                   # Base resources
│   └── overlays/               # Environment configs
│       ├── local/
│       ├── dev/
│       ├── staging/
│       └── prod/
│
├── terraform/                  # AWS infrastructure (optional)
│   ├── modules/                # Reusable modules
│   └── environments/           # Per-environment configs
│
├── docker-compose-demo/        # Local demo environment
│   ├── docker-compose.yml      # Full stack with all features
│   ├── demo-scenarios.sh       # Interactive demo script
│   └── grafana/                # Pre-configured dashboards
│
├── argocd/                     # GitOps setup
├── ci-cd/                      # GitHub Actions + policies
├── security/                   # Network policies, RBAC
├── local-dev/                  # One-command local setup
├── services/                   # Sample applications
│
└── docs/                       # Documentation
    ├── architecture.md         # System architecture
    ├── getting-started.md      # Setup guide
    ├── adr/                    # Architecture Decision Records
    └── runbooks/               # Operational runbooks
```
We welcome contributions! See CONTRIBUTING.md for guidelines.
```bash
# Clone your fork
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops

# Start local environment
./local-dev/setup.sh

# Run tests
cd ai-incident-agent && pytest
cd services/api-gateway && pytest
cd incident-dashboard && npm test

# Make changes, then submit a PR
```
MIT License - see LICENSE for details.
Use it, modify it, sell it, whatever. Just don't blame us if your AI agent becomes sentient and refuses to restart pods on Fridays.
Reduce alert fatigue with AI-powered incident response.