Kubernetes Operations Style and Safety
Conventions and guardrails for Kubernetes operations in Optum clusters, emphasizing read-only diagnostics and GitOps-driven changes.
Overview
This guide covers conventions and safety guardrails for Kubernetes operations in Optum clusters. Production namespaces are read-only for LLM agents; all changes flow through GitOps pipelines.
Critical Safety Rules
Read-Only Default
MUST treat every environment as read-only by default, and mutate only where the environment policy explicitly allows it:
# ✅ ALLOWED - Read-only operations
kubectl get pods -n production
kubectl describe deployment app -n production
kubectl logs pod/app-xyz123 -n production
kubectl top pods -n production
# ❌ FORBIDDEN - Direct mutations in production
kubectl delete pod app-xyz123 -n production
kubectl scale deployment app --replicas=0 -n production
kubectl apply -f manifest.yaml -n production
kubectl edit deployment app -n production
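The allow/forbid split above can be approximated with a small guard. This is a minimal sketch, not an official tool: the function names `classify_kubectl_verb` and `safe_kubectl` are hypothetical, and the allow-list only covers the verbs named in this guide. Anything it does not recognize, including read-only subcommands such as `rollout status` and invocations that start with a flag, is treated as mutating so the guard fails closed.

```shell
# Classify a kubectl verb as read-only or mutating (illustrative allow-list).
# Unknown verbs are reported as mutating so that the guard fails closed.
classify_kubectl_verb() {
    case "$1" in
        get|describe|logs|top|events) echo "read-only" ;;
        *) echo "mutating" ;;
    esac
}

# Hypothetical wrapper: forward to kubectl only when the verb is read-only.
safe_kubectl() {
    if [ "$(classify_kubectl_verb "$1")" = "read-only" ]; then
        kubectl "$@"
    else
        echo "refused: '$1' is not a read-only verb; use the GitOps flow" >&2
        return 1
    fi
}
```

For example, `safe_kubectl get pods -n production` is forwarded to kubectl, while `safe_kubectl delete pod app-xyz123 -n production` is refused with a non-zero status.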
Change Flow Requirements
ALL changes to production MUST flow through:
- Git commit to manifest repository
- Pull request with required reviews
- GitOps sync (ArgoCD or Flux)
- Automated validation before promotion
# Change flow diagram
# Developer → Git → PR Review → Merge → GitOps → Cluster
#                       ↑
#                 CI Validation
Allowed Operations by Environment
Development Environment
allowed_operations:
  read:
    - get, describe, logs, top, events
    - port-forward (for debugging)
  write:
    - apply, delete, scale, rollout
    - exec (for debugging)
restrictions:
  - No changes to shared infrastructure
  - No changes to istio-system, cert-manager
QA Environment
allowed_operations:
  read:
    - All read operations
  write:
    - Requires approval for mutations
    - GitOps preferred but direct apply allowed for hotfixes
restrictions:
  - No namespace deletion
  - No PVC deletion
  - Changes logged and audited
Production Environment
allowed_operations:
  read:
    - All read operations
  write:
    - GitOps only (no direct kubectl apply)
    - Emergency break-glass with dual approval
restrictions:
  - No direct mutations
  - No exec into pods (except break-glass)
  - No port-forward (use ingress/mesh)
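The three environment policies can be condensed into a single lookup. The sketch below is one possible reading of them, working only at the verb level: the function name `operation_policy` and the result strings are hypothetical, treating `port-forward` as allowed and `exec` as approval-gated in QA is an assumption, and namespace-level restrictions (for example istio-system in development) are not modeled.

```shell
# Decide how a kubectl verb should be handled in a given environment.
# Prints one of: allow, needs-approval, break-glass-only, deny.
operation_policy() {
    policy_env="$1"
    policy_verb="$2"
    # Group verbs into coarse kinds first
    case "$policy_verb" in
        get|describe|logs|top|events) policy_kind="read" ;;
        port-forward)                 policy_kind="port-forward" ;;
        exec)                         policy_kind="exec" ;;
        *)                            policy_kind="write" ;;
    esac
    # Then apply the per-environment rule; unknown environments are denied
    case "$policy_env:$policy_kind" in
        development:*|qa:read|qa:port-forward|production:read)
            echo "allow" ;;
        qa:*)
            echo "needs-approval" ;;
        production:port-forward)
            echo "deny" ;;
        production:*)
            echo "break-glass-only" ;;
        *)
            echo "deny" ;;
    esac
}
```

Calling `operation_policy production apply` prints `break-glass-only`, and `operation_policy qa scale` prints `needs-approval`.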
Diagnostic Commands
Pod Investigation
MUST use these patterns for pod diagnostics:
# List pods with status
kubectl get pods -n $NAMESPACE -o wide
# Get pod details
kubectl describe pod $POD_NAME -n $NAMESPACE
# Check pod events
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | grep $POD_NAME
# View pod logs
kubectl logs $POD_NAME -n $NAMESPACE --tail=100
# View previous container logs (after crash)
kubectl logs $POD_NAME -n $NAMESPACE --previous
# Multi-container pod logs
kubectl logs $POD_NAME -n $NAMESPACE -c $CONTAINER_NAME
# Stream logs
kubectl logs -f $POD_NAME -n $NAMESPACE
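The individual commands above can be bundled into one read-only collection step. This is a sketch under assumptions: `pod_diagnostics` is a hypothetical helper name, and it filters events with `--field-selector involvedObject.name=...` instead of `grep`, which narrows the result on the API server. For multi-container pods, add `-c $CONTAINER_NAME` to the log commands.

```shell
# Collect read-only diagnostics for one pod (uses only get, describe, logs).
# Arguments: namespace, pod name. Output goes to stdout.
pod_diagnostics() {
    diag_ns="$1"
    diag_pod="$2"
    if [ -z "$diag_ns" ] || [ -z "$diag_pod" ]; then
        echo "usage: pod_diagnostics NAMESPACE POD_NAME" >&2
        return 2
    fi
    echo "== status =="
    kubectl get pod "$diag_pod" -n "$diag_ns" -o wide
    echo "== describe =="
    kubectl describe pod "$diag_pod" -n "$diag_ns"
    echo "== events =="
    kubectl get events -n "$diag_ns" \
        --field-selector "involvedObject.name=$diag_pod" \
        --sort-by=.lastTimestamp
    echo "== logs (last 100 lines) =="
    kubectl logs "$diag_pod" -n "$diag_ns" --tail=100
    echo "== previous container logs, if any =="
    # --previous fails when the container never restarted; not an error here
    kubectl logs "$diag_pod" -n "$diag_ns" --previous --tail=100 2>/dev/null \
        || echo "(no previous container)"
}
```

A typical call would be `pod_diagnostics "$NAMESPACE" "$POD_NAME" > diag.txt`, so the output can be attached to a ticket.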
Resource Investigation
# Deployment status
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE
# Deployment history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE
# ReplicaSet details
kubectl get rs -n $NAMESPACE -o wide
# Service endpoints
kubectl get endpoints $SERVICE -n $NAMESPACE
# ConfigMap contents
kubectl get configmap $CM_NAME -n $NAMESPACE -o yaml
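# Secret metadata (never output values; avoid printing annotations, because
# kubectl.kubernetes.io/last-applied-configuration can embed the encoded data)
kubectl get secret $SECRET_NAME -n $NAMESPACE -o jsonpath='{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\t"}{.metadata.labels}{"\n"}'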
Cluster Health
# Node status
kubectl get nodes -o wide
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods -n $NAMESPACE
# Resource quotas
kubectl describe resourcequota -n $NAMESPACE
# Limit ranges
kubectl describe limitrange -n $NAMESPACE
Forbidden Operations
Never Execute in Production
| Operation | Risk | Alternative |
|---|---|---|
| kubectl delete namespace | Data loss | Archive and recreate via GitOps |
| kubectl delete pvc | Data loss | Backup first, delete via GitOps |
| kubectl scale --replicas=0 | Outage | Use GitOps with canary rollback |
| kubectl apply -f | Drift | Always use GitOps pipeline |
| kubectl edit | Drift | Update manifests in Git |
| kubectl exec -it $POD_NAME -- sh | Security | Use ephemeral debug containers |
| kubectl port-forward (prod) | Bypass security | Use proper ingress/mesh |
Dangerous Patterns
NEVER execute these patterns:
# ❌ Force delete - can cause data corruption
kubectl delete pod $POD --force --grace-period=0
# ❌ Delete all pods - will cause outage
kubectl delete pods --all -n $NAMESPACE
# ❌ Patch without review - causes drift
kubectl patch deployment $DEPLOYMENT -p '{"spec":{"replicas":0}}'
# ❌ Run privileged containers
kubectl run debug --image=alpine --privileged
# ❌ Mount host filesystem
kubectl run debug --image=alpine --overrides='{"spec":{"containers":[{"name":"debug","image":"alpine","volumeMounts":[{"mountPath":"/host","name":"host"}]}],"volumes":[{"name":"host","hostPath":{"path":"/"}}]}}'
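A pre-execution check can flag these patterns before a command runs. The sketch below is illustrative and not exhaustive: `find_dangerous_pattern` is a hypothetical name, the command is passed as one string, and the matching is plain substring matching, so it can be bypassed and should complement review rather than replace it.

```shell
# Print the reason a kubectl command line matches a forbidden pattern.
# Prints nothing and returns 1 when no pattern matches.
find_dangerous_pattern() {
    case "$1" in
        *--force*|*--grace-period=0*) echo "force delete" ;;
        *delete*--all*)               echo "bulk delete" ;;
        *" patch "*)                  echo "unreviewed patch" ;;
        *--privileged*)               echo "privileged container" ;;
        *hostPath*)                   echo "host filesystem mount" ;;
        *) return 1 ;;
    esac
}
```

Usage: `if reason="$(find_dangerous_pattern "$cmd")"; then echo "blocked: $reason"; fi`.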
GitOps Change Patterns
Manifest Update Flow
MUST follow this flow for changes:
# 1. Clone manifest repository
git clone https://github.com/org/k8s-manifests.git
cd k8s-manifests
# 2. Create feature branch
git checkout -b feature/update-app-replicas
# 3. Make changes to manifests
vim apps/production/app/deployment.yaml
# 4. Preview the change against the live cluster (kubectl diff is read-only)
kubectl diff -f apps/production/app/
# 5. Commit and push
git add .
git commit -m "feat(app): increase replicas to 5 for traffic spike"
git push origin feature/update-app-replicas
# 6. Create PR (GitOps will handle apply after merge)
Kustomize Patterns
PREFER Kustomize overlays for environment-specific changes:
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: app
          image: app:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: deployment-patch.yaml
# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
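Before opening a PR, an overlay can be rendered and compared with the cluster without changing anything. The following is a sketch under assumptions: `preview_overlay` is a hypothetical helper, and it relies on `kubectl kustomize` to print the merged manifests and on `kubectl diff -k` exiting with 0 for no differences, 1 for differences, and a higher status on error.

```shell
# Render an overlay and compare it with the live cluster (both read-only).
# Argument: path to an overlay directory, e.g. overlays/production.
preview_overlay() {
    overlay_dir="$1"
    if [ ! -f "$overlay_dir/kustomization.yaml" ]; then
        echo "no kustomization.yaml in $overlay_dir" >&2
        return 1
    fi
    # Print the fully merged manifests (base plus patches)
    kubectl kustomize "$overlay_dir" || return 1
    # Exit status 1 from kubectl diff means "differences found", not failure
    diff_status=0
    kubectl diff -k "$overlay_dir" || diff_status=$?
    if [ "$diff_status" -gt 1 ]; then
        echo "kubectl diff failed with status $diff_status" >&2
        return "$diff_status"
    fi
    return 0
}
```

Running `preview_overlay overlays/production` prints the rendered manifests followed by the diff against the current cluster state.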
Incident Response Operations
Break-Glass Procedure
For emergency operations requiring direct cluster access:
break_glass_procedure:
  prerequisites:
    - Active P1/P2 incident
    - GitOps too slow for remediation
    - Dual approval from on-call leads
  steps:
    1_document:
      - Create incident ticket
      - Record justification
      - Get verbal approval (record in ticket)
    2_execute:
      - Perform minimum necessary changes
      - Log all commands executed
      - Take screenshots of before/after
    3_reconcile:
      - Create PR to sync manifests with actual state
      - Document changes in incident postmortem
      - Review and refine runbooks
allowed_break_glass_operations:
  - Restart failing pods
  - Scale deployment (up or down)
  - Rollback to previous revision
  - Update ConfigMap for critical fixes
still_forbidden:
  - Namespace deletion
  - PVC deletion
  - Security policy changes
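The prerequisites and the "log all commands" step can be enforced by a thin wrapper. This is a hypothetical sketch, not an existing tool: the function name `break_glass_run` and the variables `INCIDENT_ID`, `APPROVER_1`, `APPROVER_2`, and `BREAK_GLASS_LOG` are all assumed names. It only blocks namespace and PVC deletion by substring match; security policy changes still need human review.

```shell
# Gate a command behind break-glass prerequisites and record it in a log.
break_glass_run() {
    if [ -z "${INCIDENT_ID:-}" ]; then
        echo "refused: INCIDENT_ID is required" >&2
        return 1
    fi
    if [ -z "${APPROVER_1:-}" ] || [ -z "${APPROVER_2:-}" ] \
        || [ "${APPROVER_1:-}" = "${APPROVER_2:-}" ]; then
        echo "refused: two distinct approvers are required" >&2
        return 1
    fi
    # Operations that stay forbidden even during an incident
    case " $* " in
        *" delete namespace"*|*" delete ns "*|*" delete pvc"*|*" delete persistentvolumeclaim"*)
            echo "refused: namespace and PVC deletion remain forbidden" >&2
            return 1 ;;
    esac
    # Log before executing so the record survives a failed command
    printf '%s incident=%s approvers=%s,%s cmd=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$INCIDENT_ID" \
        "$APPROVER_1" "$APPROVER_2" "$*" >> "${BREAK_GLASS_LOG:-break-glass.log}"
    "$@"
}
```

With the variables exported, a rollback would be run as `break_glass_run kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE`, and the log file then feeds the reconcile step.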
Emergency Rollback
# View rollout history
kubectl rollout history deployment/$DEPLOYMENT -n $NAMESPACE
# Rollback to previous revision
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE
# Rollback to specific revision
kubectl rollout undo deployment/$DEPLOYMENT -n $NAMESPACE --to-revision=2
# Verify rollback
kubectl rollout status deployment/$DEPLOYMENT -n $NAMESPACE
Manifest Best Practices
Resource Definitions
MUST include resource requests and limits:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
        version: v1
    spec:
      containers:
        - name: app
          image: registry.internal/app:v1.2.3
          resources:
            requests:
              memory: '256Mi'
              cpu: '100m'
            limits:
              memory: '512Mi'
              cpu: '500m'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
Security Context
MUST set security context:
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
Pod Disruption Budget
MUST define PDB for production services:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2 # Or use maxUnavailable: 1
  selector:
    matchLabels:
      app: app
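The effect of `minAvailable` depends on the replica count, which a short calculation makes concrete. This is a simplified sketch: `allowed_disruptions` is a hypothetical helper that mirrors how the budget is evaluated for an integer `minAvailable` (healthy pods minus the minimum, floored at zero) and ignores percentage values.

```shell
# Voluntary disruptions a PodDisruptionBudget with minAvailable permits.
# Arguments: number of healthy pods, minAvailable.
allowed_disruptions() {
    pdb_healthy="$1"
    pdb_min_available="$2"
    pdb_allowed=$((pdb_healthy - pdb_min_available))
    if [ "$pdb_allowed" -lt 0 ]; then
        pdb_allowed=0
    fi
    echo "$pdb_allowed"
}

allowed_disruptions 3 2   # prints 1: one pod may be evicted at a time
allowed_disruptions 2 2   # prints 0: evictions are blocked, so node drains stall
```

The second case is worth checking in review: a budget whose `minAvailable` equals the replica count blocks every voluntary eviction.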
Network Policies
MUST define network policies for production:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-network-policy
spec:
  podSelector:
    matchLabels:
      app: app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
Labeling Standards
MUST apply standard labels:
metadata:
  labels:
    # Required labels
    app.kubernetes.io/name: app
    app.kubernetes.io/instance: app-production
    app.kubernetes.io/version: v1.2.3
    app.kubernetes.io/component: backend
    app.kubernetes.io/part-of: platform
    app.kubernetes.io/managed-by: argocd
    # Optum required labels
    optum.com/owner: platform-team
    optum.com/environment: production
    optum.com/cost-center: PLAT-001
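A quick presence check for these label keys can run in CI. The sketch below assumes a hypothetical helper named `missing_labels`. It is a plain-text search, so it confirms that each key appears somewhere in the file, not that it sits under `metadata.labels`; a YAML-aware linter would be stricter.

```shell
# List required label keys that do not appear in a manifest file.
# Argument: path to a manifest. Prints one missing key per line.
missing_labels() {
    manifest_file="$1"
    for label_key in \
        app.kubernetes.io/name \
        app.kubernetes.io/instance \
        app.kubernetes.io/version \
        app.kubernetes.io/component \
        app.kubernetes.io/part-of \
        app.kubernetes.io/managed-by \
        optum.com/owner \
        optum.com/environment \
        optum.com/cost-center; do
        # -F treats the key as a fixed string, so dots are not wildcards
        if ! grep -qF "$label_key:" "$manifest_file"; then
            echo "$label_key"
        fi
    done
}
```

An empty result means every required key was found, for example `[ -z "$(missing_labels deployment.yaml)" ]`.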
Observability Requirements
Pod Annotations for Monitoring
metadata:
  annotations:
    # Prometheus scraping
    prometheus.io/scrape: 'true'
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
    # Logging configuration
    logging.optum.com/format: 'json'
    logging.optum.com/parser: 'application'
Required Metrics
MUST expose standard metrics:
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_flight | Gauge | Current requests |
| process_cpu_seconds_total | Counter | CPU usage |
| process_resident_memory_bytes | Gauge | Memory usage |
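Whether a service actually exposes these metrics can be checked against its scrape output. This is a minimal sketch, assuming the Prometheus text exposition format on stdin: `missing_metrics` is a hypothetical name, and the match accepts a metric name followed by `{`, `_` (for histogram suffixes such as `_bucket`), a space, or end of line, which can over-match longer names that share a prefix.

```shell
# Read Prometheus text-format metrics on stdin and print each required
# metric name that is absent. Returns 1 when anything is missing.
missing_metrics() {
    metrics_text="$(cat)"
    metrics_missing=0
    for metric_name in \
        http_requests_total \
        http_request_duration_seconds \
        http_requests_in_flight \
        process_cpu_seconds_total \
        process_resident_memory_bytes; do
        if ! printf '%s\n' "$metrics_text" \
            | grep -qE "^${metric_name}([{_ ]|\$)"; then
            echo "$metric_name"
            metrics_missing=1
        fi
    done
    return "$metrics_missing"
}
```

The input can come from any place the metrics endpoint is reachable, for example a saved scrape: `missing_metrics < scrape.txt`.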
Review Checklist
When reviewing Kubernetes changes:
Security
- SecurityContext defined (non-root, read-only fs)
- NetworkPolicy defined for production
- No privileged containers
- No hostPath mounts
- Secrets not hardcoded
Reliability
- Resource requests and limits set
- Liveness and readiness probes configured
- PodDisruptionBudget defined
- Replica count appropriate for environment
- Anti-affinity rules for HA
Observability
- Prometheus annotations present
- Logging format documented
- Standard labels applied
GitOps
- All changes in manifest repository
- No direct kubectl apply in PR
- Kustomize overlays for environment differences
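Parts of this checklist can be automated as a first pass before human review. The sketch below is illustrative only: `lint_manifest` is a hypothetical helper that searches a rendered manifest for a few required and forbidden strings, so it cannot tell which container or resource a field belongs to and does not replace the review itself.

```shell
# Check a rendered manifest file for a few items from the checklist.
# Prints one line per finding and returns 1 if any check fails.
lint_manifest() {
    lint_file="$1"
    lint_failed=0
    # Fields that must be present
    for required in "requests:" "limits:" "livenessProbe:" "readinessProbe:" \
                    "runAsNonRoot: true" "readOnlyRootFilesystem: true"; do
        if ! grep -qF "$required" "$lint_file"; then
            echo "missing: $required"
            lint_failed=1
        fi
    done
    # Fields that must be absent
    for forbidden in "privileged: true" "hostPath:"; do
        if grep -qF "$forbidden" "$lint_file"; then
            echo "forbidden: $forbidden"
            lint_failed=1
        fi
    done
    return "$lint_failed"
}
```

It can be combined with a rendered overlay, for example `kubectl kustomize overlays/production > rendered.yaml && lint_manifest rendered.yaml`.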

