
KubeAI-Ops

AI-powered Kubernetes incident response platform. Detects issues, analyzes root causes using Claude/OpenAI/Ollama, and auto-remediates. Features ChatOps (Slack/Discord/Teams), PagerDuty/Jira integration, RBAC, ML-based learning, CLI, and real-time dashboard.

A production-ready, AI-powered Kubernetes incident response platform that works with any tech stack.

KubeAI-Ops automatically detects issues in your Kubernetes cluster, analyzes root causes with an AI engine (Claude, OpenAI, Ollama, or Bedrock), and takes remediation actions - all while you sleep.


Why KubeAI-Ops?

Traditional monitoring tells you something is wrong. KubeAI-Ops tells you why it's wrong and fixes it automatically.

Traditional Alerting:                    KubeAI-Ops:

Alert: "Pod CrashLoopBackOff"     ->    Alert received
       |                                     |
       v                                     v
  Page on-call engineer           AI analyzes metrics + logs
       |                                     |
       v                                     v
  SSH into cluster                 Root cause: "Memory leak in
       |                           user-service causing OOM kills.
       v                           Heap grew 300% in 2 hours."
  Dig through logs                          |
       |                                     v
       v                           Auto-remediation: Pod restarted,
  Maybe find the issue             deployment scaled, team notified
       |                                     |
       v                                     v
  Manual fix                       Team reviews summary,
                                   not firefighting

Works With Any Tech Stack

KubeAI-Ops doesn't care what language your services are written in. As long as they expose:

| Requirement | Purpose | Example |
| --- | --- | --- |
| `/metrics` endpoint | Prometheus scraping | `prom-client` (Node), `prometheus_client` (Python), `micrometer` (Java) |
| `/health` endpoint | Liveness probe | Return `{"status": "ok"}` |
| JSON logs to stdout | Log aggregation | `winston` (Node), `structlog` (Python), `logback` (Java) |

That's it. Three things, and your service is fully integrated with AI-powered incident response.

Language Examples

Node.js / Express

```javascript
// 1. Add metrics
const promClient = require('prom-client');
promClient.collectDefaultMetrics();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.send(await promClient.register.metrics());
});

// 2. Add health endpoint
app.get('/health', (req, res) => res.json({ status: 'ok' }));

// 3. Use JSON logging
const winston = require('winston');
const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});
```
Python / FastAPI

```python
# 1. Add metrics
from prometheus_client import make_asgi_app
app.mount("/metrics", make_asgi_app())

# 2. Add health endpoint
@app.get("/health")
def health():
    return {"status": "ok"}

# 3. Use JSON logging
import structlog
structlog.configure(
    processors=[structlog.processors.JSONRenderer()]
)
```
Go

```go
// 1. Add metrics
import "github.com/prometheus/client_golang/prometheus/promhttp"
http.Handle("/metrics", promhttp.Handler())

// 2. Add health endpoint
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
})

// 3. Use JSON logging
import "go.uber.org/zap"
logger, _ := zap.NewProduction()
```
Java / Spring Boot

```yaml
# application.yml - that's literally it for Spring Boot
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  endpoint:
    health:
      show-details: always
```

Add `spring-boot-starter-actuator` and `micrometer-registry-prometheus` to your `pom.xml`.
Rust

```rust
// 1. Add metrics (using actix-web-prom)
use actix_web_prom::PrometheusMetrics;
let prometheus = PrometheusMetrics::new("api", Some("/metrics"), None);
App::new().wrap(prometheus)

// 2. Add health endpoint
#[get("/health")]
async fn health() -> impl Responder {
    HttpResponse::Ok().json(json!({"status": "ok"}))
}

// 3. Use JSON logging (tracing + tracing-subscriber)
tracing_subscriber::fmt().json().init();
```
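Whatever the language, you can sanity-check the contract before wiring a service into the platform. A minimal sketch (not part of the KubeAI-Ops CLI; `fetch` is any callable of your choosing that returns the response body for a path):

```python
import json

def check_contract(fetch):
    """Verify a service meets the two HTTP integration requirements.

    fetch(path) -> response body as str; inject whatever HTTP client you use.
    Returns a dict of requirement name -> bool.
    """
    results = {}
    # /metrics should serve Prometheus text format: at least one sample line
    try:
        body = fetch("/metrics")
        results["metrics"] = any(
            line and not line.startswith("#") for line in body.splitlines()
        )
    except Exception:
        results["metrics"] = False
    # /health should return JSON with status "ok"
    try:
        results["health"] = json.loads(fetch("/health")).get("status") == "ok"
    except Exception:
        results["health"] = False
    return results
```

JSON logging is easiest to verify by eye: each stdout line should parse as a JSON object.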

Quick Start

Option 1: Local Development (5 minutes)

```bash
# Clone the repo
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops

# Start everything locally
./local-dev/setup.sh

# That's it! Access:
# - Your apps: http://localhost:8080
# - Grafana:   http://localhost:3000 (admin/admin)
# - Dashboard: http://localhost:5173
```

Option 2: Deploy to AWS EKS

```bash
# 1. Configure AWS credentials
aws configure

# 2. Deploy infrastructure
cd terraform/environments/dev
terragrunt apply

# 3. Deploy platform
kubectl apply -k kubernetes/overlays/dev
kubectl apply -k argocd/install
```

Option 3: Add to Existing Cluster

```bash
# Install just the AI agent and observability stack
helm install kubeai-ops ./kubernetes/helm-charts/app-chart \
  --set aiAgent.enabled=true \
  --set observability.enabled=true \
  --set sampleApps.enabled=false
```

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        YOUR APPLICATIONS                            │
│    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │
│    │ Node.js  │  │  Python  │  │   Go     │  │   Java   │  ...     │
│    │ Service  │  │ Service  │  │ Service  │  │ Service  │          │
│    └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │
│         │  /metrics   │  /metrics   │  /metrics   │  /metrics      │
│         │  /health    │  /health    │  /health    │  /health       │
│         │  JSON logs  │  JSON logs  │  JSON logs  │  JSON logs     │
└─────────┼─────────────┼─────────────┼─────────────┼────────────────┘
          │             │             │             │
          ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     KUBEAI-OPS PLATFORM                             │
│                                                                     │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐ │
│  │   PROMETHEUS    │    │      LOKI       │    │  ALERTMANAGER   │ │
│  │  (metrics)      │    │    (logs)       │    │   (alerts)      │ │
│  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘ │
│           │                      │                      │          │
│           └──────────────────────┼──────────────────────┘          │
│                                  ▼                                  │
│  ┌───────────────────────────────────────────────────────────────┐ │
│  │                      AI INCIDENT AGENT                        │ │
│  │  ┌─────────────────────────────────────────────────────────┐  │ │
│  │  │  AI ENGINES: Claude │ OpenAI │ Ollama │ Bedrock │ Mock  │  │ │
│  │  └─────────────────────────────────────────────────────────┘  │ │
│  │                                                                │ │
│  │  1. Receive alert (webhook)     5. Learn from resolution      │ │
│  │  2. Correlate metrics + logs    6. Notify via ChatOps         │ │
│  │  3. AI root cause analysis      7. Create tickets (Jira/GH)   │ │
│  │  4. Execute remediation         8. Escalate (PD/Opsgenie)     │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                 │                                   │
│    ┌────────────────────────────┼────────────────────────────────┐ │
│    ▼              ▼             ▼              ▼                 ▼ │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐│
│  │REMEDIATE │ │DASHBOARD │ │ CHATOPS  │ │ TICKETS  │ │ESCALATION││
│  │ Restart  │ │ Timeline │ │ Slack    │ │ Jira     │ │ PagerDuty││
│  │ Scale    │ │ Analytics│ │ Discord  │ │ GitHub   │ │ Opsgenie ││
│  │ Rollback │ │ Metrics  │ │ Teams    │ │ Issues   │ │          ││
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘│
│                                                                    │
│  ┌────────────────────────────────────────────────────────────────┐│
│  │  CLI: kubeai status │ diagnose │ incidents │ runbooks          ││
│  └────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────┘
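The numbered agent steps in the diagram amount to a pipeline: enrich the alert with telemetry, ask an AI engine for a diagnosis, then act on the result. A simplified sketch of that control flow (class and function names here are illustrative, not the agent's actual API):

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    root_cause: str
    action: str

class MockEngine:
    """Stand-in for Claude/OpenAI/Ollama/Bedrock: returns canned analyses."""
    def analyze(self, alert, metrics, logs):
        if "OOMKilled" in alert:
            return Diagnosis("Memory leak causing OOM kills", "restart_pod")
        return Diagnosis("Unknown root cause; needs human review", "escalate")

def handle_alert(alert, engine, remediate, notify):
    """Simplified agent loop covering steps 1-4 and 6 of the diagram."""
    # 2. Correlate metrics + logs (stubbed telemetry for the sketch)
    metrics = {"memory_mb": 980}
    logs = ["container killed: out of memory"]
    # 3. AI root cause analysis
    diagnosis = engine.analyze(alert, metrics, logs)
    # 4. Execute remediation (approval gating omitted in this sketch)
    remediate(diagnosis.action)
    # 6. Notify via ChatOps
    notify(f"{diagnosis.root_cause} -> {diagnosis.action}")
    return diagnosis
```

Learning, ticketing, and escalation (steps 5, 7, 8) hang off the same diagnosis object.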

What's Included

Core Platform

| Component | Description |
| --- | --- |
| AI Incident Agent | Multi-engine AI (Claude, OpenAI, Ollama, Bedrock) for root cause analysis and auto-remediation |
| Incident Dashboard | Real-time SvelteKit UI with timeline, analytics, and remediation controls |
| CLI Tool | Full-featured CLI with diagnosis, runbooks, incident management, and shell mode |
| Observability Stack | Pre-configured Prometheus, Grafana dashboards, Loki, AlertManager |

AI Engine Support

| Engine | Description |
| --- | --- |
| Claude (Anthropic) | Production-ready with Claude 3.5 Sonnet/Opus support |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-3.5 Turbo support |
| Ollama | Local/self-hosted LLMs (Llama 3, Mistral, CodeLlama) |
| AWS Bedrock | Managed AI with Claude, Titan, Llama models |
| Mock Engine | Configurable scenarios for testing without API costs |
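The mock engine's scenario matching (see `mock_scenarios` in the configuration below) can be reduced to a substring lookup. A sketch under the assumption that a scenario fires when its trigger string appears in the alert text (the shipped `mock_engine.py` may differ):

```python
def match_scenario(alert_text, scenarios):
    """Return the first mock scenario whose trigger appears in the alert."""
    for scenario in scenarios:
        if scenario["trigger"] in alert_text:
            return scenario
    return None

# Example scenarios, mirroring the agent-config.yaml shape
SCENARIOS = [
    {"trigger": "OOMKilled", "root_cause": "Memory leak in application",
     "action": "restart_pod"},
    {"trigger": "CrashLoopBackOff", "root_cause": "Bad image or config",
     "action": "rollback_deployment"},
]
```

This lets you exercise the full remediation and notification path in CI without spending a single AI API token.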

Integrations

| Category | Integrations |
| --- | --- |
| ChatOps | Slack, Discord, Microsoft Teams - interactive incident response |
| Alerting | PagerDuty, Opsgenie - escalation and on-call management |
| Ticketing | Jira, GitHub Issues - automatic ticket creation |
| Monitoring | Datadog - metrics forwarding and enrichment |

Security & Authentication

| Component | Description |
| --- | --- |
| RBAC | Role-based access control (Admin, Operator, Viewer, Service) |
| OIDC/SSO | Enterprise SSO with any OIDC provider |
| API Keys | Service-to-service authentication |
| Privacy Controls | PII redaction, data retention policies, audit logging |
| OPA/Gatekeeper | Policy-as-code for Kubernetes admission control |
| Network Policies | Default-deny with service-specific rules |
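PII redaction before any text reaches an external AI API is typically regex-based. A minimal sketch of the idea (the shipped `agent/privacy/pii_redactor.py` may use different patterns and placeholders):

```python
import re

# Ordered, illustrative patterns - not an exhaustive PII catalog
_PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
]

def redact(text):
    """Replace common PII with placeholders before logs leave the cluster."""
    for pattern, placeholder in _PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running redaction on log excerpts before prompt assembly means the AI engine sees the failure, not the user.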

Machine Learning

| Feature | Description |
| --- | --- |
| Incident Learning | Learns from past incidents to improve recommendations |
| Pattern Recognition | Identifies recurring issues and suggests preventive actions |
| Similarity Matching | Finds related past incidents for faster resolution |
| Feedback Loop | Operator feedback improves AI accuracy over time |
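Similarity matching can be as simple as comparing token sets of incident summaries. A sketch using Jaccard similarity (the shipped `incident_learner.py` may use embeddings or richer features; the threshold here is an arbitrary illustration):

```python
def jaccard(a, b):
    """Jaccard similarity between two incident summaries, as token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def most_similar(new_summary, past_incidents, threshold=0.3):
    """Return the most similar past incident above threshold, or None."""
    best, best_score = None, threshold
    for incident in past_incidents:
        score = jaccard(new_summary, incident["summary"])
        if score > best_score:
            best, best_score = incident, score
    return best
```

Surfacing the closest past incident alongside a new alert is what turns "what is this?" into "we fixed this last month, here's how".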

Infrastructure (Optional)

| Component | Description |
| --- | --- |
| Terraform Modules | Production-ready AWS infrastructure (VPC, EKS, RDS, S3) |
| Kubernetes Manifests | Kustomize-based deployments for local/dev/staging/prod |
| ArgoCD Setup | GitOps-ready ApplicationSets and projects |
| CI/CD Pipelines | GitHub Actions for testing, building, deploying |

Sample Applications

| Service | Purpose |
| --- | --- |
| API Gateway | FastAPI service demonstrating auth, rate limiting, circuit breaker |
| Order Service | CRUD service with PostgreSQL, events, proper error handling |
| Notification Service | Multi-channel notifications (email, SMS, webhook) |
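The circuit breaker demonstrated in the API Gateway follows the standard pattern: trip open after N consecutive failures, reject calls while open, allow a trial call after a cooldown. A generic sketch of that pattern (not the gateway's actual implementation):

```python
import time

class CircuitBreaker:
    """Closed -> open after max_failures; trial call after reset_timeout."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock           # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            self.opened_at = None    # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0            # success closes the circuit
        return result
```

Fast-failing while a dependency is down is exactly the behavior that keeps one unhealthy service from cascading into the incidents the agent then has to analyze.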

Configuration

AI Agent Configuration

```yaml
# ai-incident-agent/config/agent-config.yaml
ai_engine:
  backend: "claude"  # claude, openai, ollama, bedrock, mock

  claude:
    model: "claude-sonnet-4-20250514"
    api_key: "${ANTHROPIC_API_KEY}"

  openai:
    model: "gpt-4-turbo"
    api_key: "${OPENAI_API_KEY}"

  ollama:
    model: "llama3:8b"
    base_url: "http://ollama:11434"

  bedrock:
    model_id: "anthropic.claude-3-sonnet-20240229-v1:0"
    region: "us-east-1"

  mock:  # For testing without API costs
    response_delay_ms: 500
    mock_scenarios:
      - trigger: "OOMKilled"
        root_cause: "Memory leak in application"
        action: "restart_pod"

remediation:
  enabled: true
  auto_approve:
    - restart_pod        # Low risk - auto-approve
    - scale_replicas
  require_approval:
    - rollback_deployment  # Higher risk - require human approval
    - delete_pvc

# ChatOps integrations
chatops:
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_URL}"
    bot_token: "${SLACK_BOT_TOKEN}"
    channel: "#incidents"
    interactive: true  # Enable slash commands and buttons

  discord:
    enabled: false
    webhook_url: "${DISCORD_WEBHOOK_URL}"
    bot_token: "${DISCORD_BOT_TOKEN}"

  teams:
    enabled: false
    webhook_url: "${TEAMS_WEBHOOK_URL}"

# External integrations
integrations:
  pagerduty:
    enabled: true
    api_key: "${PAGERDUTY_API_KEY}"
    service_id: "${PAGERDUTY_SERVICE_ID}"

  opsgenie:
    enabled: false
    api_key: "${OPSGENIE_API_KEY}"

  jira:
    enabled: true
    url: "https://your-org.atlassian.net"
    email: "${JIRA_EMAIL}"
    api_token: "${JIRA_API_TOKEN}"
    project_key: "OPS"

  github:
    enabled: false
    token: "${GITHUB_TOKEN}"
    repo: "your-org/incidents"

  datadog:
    enabled: false
    api_key: "${DATADOG_API_KEY}"

# Security & Authentication
auth:
  enabled: true
  provider: "oidc"  # oidc, api_key, both
  oidc:
    issuer_url: "https://your-idp.com"
    client_id: "${OIDC_CLIENT_ID}"
    client_secret: "${OIDC_CLIENT_SECRET}"

# Privacy & Compliance
privacy:
  pii_redaction: true
  retention_days: 90
  audit_logging: true
```
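The `auto_approve` / `require_approval` split above implies a gate in front of every proposed remediation. A sketch of how such a gate might behave (illustrative, not the agent's actual code; note that unknown actions are rejected rather than run, failing safe):

```python
def gate_action(action, config):
    """Decide how to handle a proposed remediation action.

    Returns 'execute', 'await_approval', or 'reject'.
    """
    if not config.get("enabled", False):
        return "reject"
    if action in config.get("auto_approve", []):
        return "execute"
    if action in config.get("require_approval", []):
        return "await_approval"
    return "reject"  # fail safe: never run an unlisted action

# Mirrors the remediation section of agent-config.yaml
REMEDIATION = {
    "enabled": True,
    "auto_approve": ["restart_pod", "scale_replicas"],
    "require_approval": ["rollback_deployment", "delete_pvc"],
}
```

An allow-list keeps the AI's blast radius bounded: the model can only propose actions, and only operator-approved action types ever execute unattended.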

Adding Your Services

1. Add Prometheus annotations to your deployment:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```

2. Create alert rules (optional):

```yaml
# observability/prometheus/alerting-rules/my-service-alerts.yaml
groups:
  - name: my-service
    rules:
      - alert: MyServiceHighErrorRate
        expr: rate(http_requests_total{app="my-service", status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
```

3. Deploy and watch the magic happen.
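When a rule like the one above fires, Alertmanager delivers it to the agent's webhook as JSON in the standard Alertmanager payload format (an `alerts` list, each entry carrying `labels` and `annotations`). A sketch of extracting what the agent needs from that payload:

```python
import json

def parse_alertmanager_webhook(body):
    """Extract the fields the agent cares about from an Alertmanager
    webhook payload (version 4 JSON format)."""
    payload = json.loads(body)
    parsed = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        parsed.append({
            "name": labels.get("alertname", "unknown"),
            "severity": labels.get("severity", "none"),
            "summary": alert.get("annotations", {}).get("summary", ""),
            "firing": alert.get("status") == "firing",
        })
    return parsed
```

Anything your rules attach as labels or annotations (runbook links, team ownership) rides along and ends up in the AI's context.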

Project Structure

kubeai-ops/
├── ai-incident-agent/        # The AI-powered incident response agent
│   ├── agent/                # Core agent logic
│   │   ├── ai_engine/        # Multi-engine AI support
│   │   │   ├── base.py       # Abstract base class
│   │   │   ├── claude_engine.py    # Anthropic Claude
│   │   │   ├── openai_engine.py    # OpenAI GPT models
│   │   │   ├── ollama_engine.py    # Local Ollama models
│   │   │   ├── bedrock_engine.py   # AWS Bedrock
│   │   │   └── mock_engine.py      # Testing engine
│   │   ├── remediation/      # Auto-remediation system
│   │   ├── chatops/          # ChatOps integrations
│   │   │   ├── slack.py      # Slack with interactive commands
│   │   │   ├── discord.py    # Discord integration
│   │   │   └── teams.py      # Microsoft Teams
│   │   ├── integrations/     # External integrations
│   │   │   ├── pagerduty.py  # PagerDuty escalation
│   │   │   ├── opsgenie.py   # Opsgenie alerts
│   │   │   ├── jira.py       # Jira ticket creation
│   │   │   ├── github.py     # GitHub Issues
│   │   │   └── datadog.py    # Datadog forwarding
│   │   ├── auth/             # Authentication & authorization
│   │   │   ├── rbac.py       # Role-based access control
│   │   │   └── oidc.py       # OIDC/SSO integration
│   │   ├── learning/         # ML incident learning
│   │   │   └── incident_learner.py
│   │   ├── privacy/          # Privacy & compliance
│   │   │   └── pii_redactor.py
│   │   └── database/         # SQLAlchemy models
│   ├── config/               # Agent configuration
│   └── tests/                # 428+ passing tests
│
├── incident-dashboard/       # SvelteKit real-time dashboard
│   ├── src/
│   │   ├── routes/           # Pages and API routes
│   │   ├── lib/
│   │   │   ├── components/   # Reusable UI components
│   │   │   ├── stores/       # Svelte stores
│   │   │   └── api/          # API client
│   │   └── types/            # TypeScript definitions
│   └── tests/                # 287+ passing tests
│
├── cli/                      # KubeAI CLI tool
│   ├── kubeai/
│   │   ├── commands/         # CLI commands
│   │   │   ├── status.py     # Service status
│   │   │   ├── diagnose.py   # AI diagnosis
│   │   │   ├── incidents.py  # Incident management
│   │   │   ├── runbooks.py   # Runbook execution
│   │   │   └── config.py     # Configuration
│   │   └── api.py            # API client
│   └── tests/                # 261+ passing tests
│
├── observability/            # Pre-configured monitoring stack
│   ├── prometheus/           # Metrics + alert rules
│   ├── grafana/              # Dashboards
│   └── loki/                 # Log aggregation
│
├── kubernetes/               # Kubernetes manifests
│   ├── base/                 # Base resources
│   └── overlays/             # Environment configs
│       ├── local/
│       ├── dev/
│       ├── staging/
│       └── prod/
│
├── terraform/                # AWS infrastructure (optional)
│   ├── modules/              # Reusable modules
│   └── environments/         # Per-environment configs
│
├── docker-compose-demo/      # Local demo environment
│   ├── docker-compose.yml    # Full stack with all features
│   ├── demo-scenarios.sh     # Interactive demo script
│   └── grafana/              # Pre-configured dashboards
│
├── argocd/                   # GitOps setup
├── ci-cd/                    # GitHub Actions + policies
├── security/                 # Network policies, RBAC
├── local-dev/                # One-command local setup
├── services/                 # Sample applications
│
└── docs/                     # Documentation
    ├── architecture.md       # System architecture
    ├── getting-started.md    # Setup guide
    ├── adr/                  # Architecture Decision Records
    └── runbooks/             # Operational runbooks

Roadmap

Completed

  • Core AI incident agent
  • Multi-engine AI support (Claude, OpenAI, Ollama, Bedrock)
  • Prometheus/Grafana/Loki integration
  • ChatOps (Slack, Discord, Microsoft Teams)
  • Auto-remediation (restart, scale, rollback)
  • Incident dashboard (SvelteKit)
  • CLI tool with diagnosis and runbooks
  • Mock AI mode for testing
  • PagerDuty integration
  • Opsgenie integration
  • Jira/GitHub issue creation
  • Datadog integration
  • RBAC & OIDC authentication
  • Machine learning incident learner
  • Privacy controls & PII redaction
  • Comprehensive test suite (976 tests)

Planned

  • Multi-cluster support
  • Cost analysis integration
  • Runbook auto-generation
  • Custom remediation plugins
  • Mobile app
  • Advanced anomaly detection

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Development Setup

```bash
# Clone the repo (or your fork)
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops

# Start local environment
./local-dev/setup.sh

# Run tests (each in its own subshell, from the repo root)
(cd ai-incident-agent && pytest)
(cd services/api-gateway && pytest)
(cd incident-dashboard && npm test)

# Make changes, then submit a PR
```

Areas We Need Help

  • Multi-cluster federation support
  • Additional AI engine integrations
  • Language-specific integration guides (Ruby, .NET)
  • Mobile app development
  • Performance optimizations for large-scale deployments
  • Documentation translations

Community

  • GitHub Issues: Bug reports and feature requests
  • Discussions: Questions and ideas
  • Discord: Join our server (coming soon)

License

MIT License - see LICENSE for details.

Use it, modify it, sell it, whatever. Just don't blame us if your AI agent becomes sentient and refuses to restart pods on Fridays.


Reduce alert fatigue with AI-powered incident response.
