AI-powered troubleshooting system for OpenShift HyperShift clusters using Claude Code as an intelligent investigation agent.
This repository implements a GitOps-based system that enables Platform Engineers to trigger Claude Code troubleshooting sessions in OpenShift clusters. Claude Code runs as Kubernetes Jobs with full observability through OpenTelemetry/Logfire.
- Insecure Zone: Platform engineers edit prompts and trigger investigations via Git
- Secure Zone: Claude Code runs as pods within target clusters with appropriate RBAC
- Observability: Full execution traces exported to Logfire via OpenTelemetry
- Access to GitHub repository
- ArgoCD deployed in target clusters
- Logfire configured for OTLP ingestion
- Anthropic API key
# Clone repository
git clone https://github.com/intility/devinfra-claude-warroom
cd devinfra-claude-warroom
# Create API key secret in target cluster (from SAW)
kubectl create secret generic claude-api-secret \
  --from-literal=api-key="$ANTHROPIC_API_KEY" \
  -n claude-warroom
# Apply ArgoCD application (from SAW)
kubectl apply -f argocd/application-hub.yaml
# From insecure workstation
./scripts/trigger-investigation.sh api-unreachable production-cluster hub
# Or manually edit and commit
vim manifests/hub/job-override.yaml
git add -A && git commit -m "Investigate issue" && git push
Filter by:
- Service: claude-warroom
- Cluster name
- Investigation type
| Type | Description |
|---|---|
| comprehensive-dump | Full cluster diagnostic data collection |
| control-plane-health | Control plane component health check |
| api-unreachable | API server connectivity issues |
| nodes-not-joining | Nodes failing to join cluster |
| nodes-not-ready | Nodes in NotReady state |
| etcd-performance | etcd performance analysis |
| pod-scheduling-failures | Pod scheduling issues |
| operator-degraded | Degraded operator investigation |
| network-connectivity | Network connectivity problems |
| cluster-upgrade-stuck | Stuck cluster upgrade |
| hosted-cluster-partial | Cluster stuck in Partial state |
| control-plane-restart | Emergency control plane restart |
| pause-reconciliation | Pause/resume cluster reconciliation |
| scale-nodepool | Scale NodePool up/down |
├── manifests/
│ ├── base/ # Shared K8s resources
│ ├── hub/ # Hub cluster specific
│ ├── hosted-*/ # Hosted cluster specific
│ └── prompts/ # Investigation prompt templates
├── scripts/ # Helper scripts
├── argocd/ # ArgoCD applications
├── Dockerfile # Claude Code container image
└── .github/workflows/ # GitHub Actions for image build
- Create prompt file in manifests/prompts/
- Add to kustomization.yaml
- Use variables: ${CLUSTER_NAME}, ${NAMESPACE}, etc.
- Commit and push
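To illustrate how placeholders such as ${CLUSTER_NAME} and ${NAMESPACE} could end up as concrete values in a prompt, here is a minimal sketch. The `render_prompt` helper and the sample prompt text are hypothetical. How the repository actually expands variables is not shown here, so treat this as one possible approach, not the project's mechanism.

```shell
# Hypothetical helper (not part of the repository): expand the two
# placeholders named above in a prompt template read from stdin.
# Assumption: values contain no '|' or '&', which sed would treat specially.
render_prompt() {
  sed -e 's|\${CLUSTER_NAME}|'"$CLUSTER_NAME"'|g' \
      -e 's|\${NAMESPACE}|'"$NAMESPACE"'|g'
}

# Example values; the prompt sentence is made up for illustration.
CLUSTER_NAME=production-cluster
NAMESPACE=claude-warroom
printf '%s\n' 'Investigate ${CLUSTER_NAME} in namespace ${NAMESPACE}.' | render_prompt
```

Only the placeholders listed explicitly are replaced, so any other `$` text in a prompt passes through untouched.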
Jobs support these variables:
- CLUSTER_TYPE: hub or hosted
- CLUSTER_NAME: Target cluster name
- HOSTED_CLUSTER_NAME: For hub investigations
- NAMESPACE: Target namespace
- ISSUE_TYPE: Investigation type
- NODEPOOL_NAME: For scaling operations
- TARGET_REPLICAS: For scaling operations
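A pre-flight check along these lines could catch a missing variable before a Job is committed. This is a sketch only: `validate_job_env` is a hypothetical function, and which variables are mandatory is an assumption drawn from the descriptions above (the four general ones always, the NodePool ones only for scale-nodepool).

```shell
# Hypothetical pre-flight check (not part of the repository).
# Assumption: CLUSTER_TYPE, CLUSTER_NAME, NAMESPACE and ISSUE_TYPE are always
# required; NODEPOOL_NAME and TARGET_REPLICAS only for scale-nodepool.
validate_job_env() {
  for name in CLUSTER_TYPE CLUSTER_NAME NAMESPACE ISSUE_TYPE; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
  case "$CLUSTER_TYPE" in
    hub|hosted) ;;
    *) echo "CLUSTER_TYPE must be 'hub' or 'hosted'" >&2; return 1 ;;
  esac
  if [ "$ISSUE_TYPE" = "scale-nodepool" ]; then
    if [ -z "${NODEPOOL_NAME:-}" ] || [ -z "${TARGET_REPLICAS:-}" ]; then
      echo "scale-nodepool needs NODEPOOL_NAME and TARGET_REPLICAS" >&2
      return 1
    fi
  fi
}

# Example values modelled on the trigger command shown earlier.
CLUSTER_TYPE=hub
CLUSTER_NAME=production-cluster
NAMESPACE=claude-warroom
ISSUE_TYPE=api-unreachable
validate_job_env && echo "job variables look complete"
```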
- Claude Code runs with read-only RBAC, plus exec access for debugging
- API key stored in Kubernetes Secret
- Network policies restrict egress
- Jobs auto-cleanup after 24 hours
- Check API key secret exists
- Verify RBAC is applied
- Check image pull permissions
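The three checks above can be run by hand with kubectl; a rough sketch follows. The secret name and default namespace come from the setup commands earlier in this README. The `check_job_prereqs` wrapper is hypothetical, and listing ServiceAccounts, Roles and RoleBindings is only an assumption about where the RBAC objects live.

```shell
# Hypothetical wrapper (not part of the repository) around the three checks.
# Nothing runs until the function is called, e.g.:
#   check_job_prereqs claude-warroom
check_job_prereqs() {
  ns="${1:-claude-warroom}"

  # 1. API key secret, using the name from the setup step
  kubectl get secret claude-api-secret -n "$ns" >/dev/null \
    || { echo "secret claude-api-secret not found in $ns"; return 1; }

  # 2. RBAC objects present in the namespace (assumed to be namespaced)
  kubectl get serviceaccounts,roles,rolebindings -n "$ns" \
    || { echo "could not list RBAC objects in $ns"; return 1; }

  # 3. Image pull problems usually surface as events with reason=Failed
  kubectl get events -n "$ns" --field-selector reason=Failed
}
```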
- Verify OTEL_EXPORTER_OTLP_ENDPOINT
- Check network connectivity to Logfire
- Ensure Logfire collector is running
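Before looking at the network, it can help to confirm the endpoint variable is set and well-formed. The sketch below assumes the OTLP/HTTP transport, where the endpoint is a full URL. `check_otlp_endpoint` is a hypothetical helper, and a gRPC setup may legitimately use a different form.

```shell
# Hypothetical helper (not part of the repository): sanity-check the
# endpoint variable. Assumes OTLP over HTTP, i.e. an http(s) URL.
# Usage, from inside the Job's environment:
#   check_otlp_endpoint
check_otlp_endpoint() {
  endpoint="${OTEL_EXPORTER_OTLP_ENDPOINT:-}"
  case "$endpoint" in
    "")
      echo "OTEL_EXPORTER_OTLP_ENDPOINT is not set"
      return 1 ;;
    http://*|https://*)
      echo "endpoint looks well-formed: $endpoint" ;;
    *)
      echo "endpoint has no http(s) scheme: $endpoint"
      return 1 ;;
  esac
}
```

If the variable passes, a request such as `curl -sS -o /dev/null -w '%{http_code}\n' --max-time 5 "$OTEL_EXPORTER_OTLP_ENDPOINT"` from a pod in the same namespace shows whether the network path is open; any HTTP status code means the endpoint was reached.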
- Default timeout is 30 minutes
- Adjust activeDeadlineSeconds in job spec
- Break complex investigations into smaller parts
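Since activeDeadlineSeconds is expressed in seconds, the 30-minute default corresponds to 1800. The sketch below converts a limit in minutes and prints the matching spec fragment. `deadline_seconds` is a hypothetical helper and 45 minutes is an arbitrary example value. Where the fragment belongs in this repository's manifests is not shown here.

```shell
# Hypothetical helper (not part of the repository): minutes -> seconds,
# the unit activeDeadlineSeconds expects.
deadline_seconds() {
  echo $(( $1 * 60 ))
}

# Print a Job spec fragment for a 45-minute limit (example value only).
printf 'spec:\n  activeDeadlineSeconds: %s\n' "$(deadline_seconds 45)"
```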
- Fork the repository
- Create feature branch
- Add/modify prompts or features
- Test in development cluster
- Submit pull request
Internal use only - Intility