Your approach is excellent because it follows a state-checking → action → verification → repeat loop. Here's how to optimize and expand it:
# Pipeline Health Check & Recovery
## Execution Mode: Iterative with Progress Tracking
Check our CI/CD pipeline status and execute recovery procedures with terminal progress indicators.
### Phase 1: Status Assessment
1. Check last commit status: `git log -1 --pretty=format:"%H %s" && git status`
2. Fetch latest CI/CD results from GitHub Actions/GitLab CI/Jenkins
3. Parse build logs for failures
4. Generate initial status report
### Phase 2: Failure Recovery (if needed)
If any builds failed:
- Extract error messages from logs
- Identify root cause (dependency, syntax, test failure, deployment)
- Apply fix based on failure type
- Commit fix with conventional commit message
- Push and re-check status
- **LOOP**: Repeat Phase 1 until all builds pass
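A minimal sketch of that loop in shell form — it assumes the GitHub CLI (`gh`) and `jq` are installed and authenticated, and caps retries at 3:

```bash
#!/usr/bin/env bash
# Sketch: poll the latest CI run for this branch; loop until green or give up.
set -euo pipefail
MAX_RETRIES=3

for attempt in $(seq 1 "$MAX_RETRIES"); do
  run_json=$(gh run list --branch "$(git branch --show-current)" \
                         --limit 1 --json databaseId,conclusion)
  status=$(jq -r '.[0].conclusion // "pending"' <<<"$run_json")
  run_id=$(jq -r '.[0].databaseId' <<<"$run_json")

  if [ "$status" = "success" ]; then
    echo "✓ All builds pass (attempt $attempt)"; exit 0
  fi

  echo "✗ Build status: $status - dumping failed-step logs"
  gh run view "$run_id" --log-failed || true
  # ...extract error, apply fix, commit, push; next iteration re-checks...
done

echo "⚠ Max retries reached - escalating to manual intervention"
exit 1
```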
### Phase 3: Pre-Merge Validation (only if Phase 2 passed)
Run this checklist with a terminal progress bar:
[█░░░░░░░░░] 1/9 Cleanup scripts (remove debug logs, temp files)
[██░░░░░░░░] 2/9 Update CHANGELOG.md with new changes
[███░░░░░░░] 3/9 Update PROJECT_STATUS.md with current sprint status
[████░░░░░░] 4/9 Update README.md (installation, usage, new features)
[█████░░░░░] 5/9 Review unfinished deliverables (check TODO, FIXME comments)
[██████░░░░] 6/9 Update VERSION file (semantic versioning)
[███████░░░] 7/9 Organize project root (move stray .md files to /docs)
[████████░░] 8/9 Wait for all GitHub Actions to complete
[█████████░] 9/9 Final verification - all green?
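If you want the bar rendered by an actual script rather than by Claude's text output, a tiny helper like this works (plain bash, no dependencies):

```bash
# Sketch: render a one-line progress bar for an N-step checklist
progress() {  # usage: progress <step> <total> <label>
  local step=$1 total=$2 label=$3 width=10 bar="" i
  local filled=$(( step * width / total ))
  for (( i = 0; i < width; i++ )); do
    (( i < filled )) && bar+="█" || bar+="░"
  done
  printf '\r[%s] %d/%d %s' "$bar" "$step" "$total" "$label"
}

progress 3 9 "Update PROJECT_STATUS.md"   # → [███░░░░░░░] 3/9 ...
```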
### Phase 4: Merge Readiness
If all checks pass:
- Verify branch is up-to-date with main/develop
- Check for merge conflicts
- Generate merge command or create PR
- Print final summary
### Phase 5: Post-Merge Cleanup
- Delete local feature branch
- Pull latest changes
- Verify deployment to staging/production
## Output Requirements
- Real-time progress updates
- Color-coded status (✓ green, ✗ red, ⚠ yellow)
- Error logs with file:line references
- Actionable recommendations for each failure
- Final status summary with metrics
## Error Handling
- Maximum 3 retry attempts per failed check
- Escalate to manual intervention if auto-fix fails
- Log all actions to `devops-log-$(date).txt`

---

## 📋 Prompt Templates

### 1. **Pre-Commit Quality Gate**
Run pre-commit quality checks before allowing commit:
1. **Linting**: Run ESLint/Pylint/RuboCop with auto-fix
2. **Type Checking**: Run TypeScript/MyPy strict mode
3. **Unit Tests**: Run fast tests only (< 30s)
4. **Format**: Run Prettier/Black auto-format
5. **Security**: Scan for secrets (git-secrets, trufflehog)
6. **Dependencies**: Check for vulnerabilities (npm audit, pip-audit)
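A minimal pre-commit hook covering a few of these gates might look like the sketch below. It assumes a Node/TypeScript project and trufflehog's v3 CLI — adapt the tools to your stack:

```bash
#!/usr/bin/env bash
# Sketch: .git/hooks/pre-commit (make executable with chmod +x).
# `set -e` aborts the commit on the first failing check.
set -euo pipefail

echo "[1/4] Linting with auto-fix"
npx eslint . --fix

echo "[2/4] Type checking"
npx tsc --noEmit

echo "[3/4] Fast unit tests"
npm test

echo "[4/4] Secret scan"
if command -v trufflehog >/dev/null; then
  trufflehog git file://. --fail   # non-zero exit if secrets found
fi

echo "✓ All pre-commit checks passed - commit allowed"
```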
If any check fails:
- Show specific failures with file paths
- Apply auto-fixes where possible
- Block commit if critical issues remain
- Suggest fixes for manual issues
Only proceed with commit if all checks pass.

### 2. **CI/CD Pipeline Debugging**
Debug failed CI/CD pipeline with systematic approach:
## Step 1: Identify Failure Point
- Fetch latest pipeline run from GitHub Actions
- Parse YAML workflow file
- Identify which job/step failed
- Extract error messages and exit codes
## Step 2: Local Reproduction
- Recreate the exact environment (Docker image, Node version, etc.)
- Run the failed command locally
- Capture detailed error output
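One way to approximate the CI environment locally, assuming Docker is available — the image below is an assumption, so check the `runs-on:`/`container:` entries in your workflow YAML first:

```bash
# Sketch: replay the failing CI step inside the image the workflow uses
docker run --rm -it \
  -v "$PWD":/workspace -w /workspace \
  -e CI=true \
  node:20-bullseye \
  bash -c "npm ci && npm test"
```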
## Step 3: Root Cause Analysis
Check common failure categories:
- [ ] Dependency version mismatch
- [ ] Environment variable missing
- [ ] File permission issues
- [ ] Network/API timeout
- [ ] Flaky test
- [ ] Resource constraints (memory, disk)
## Step 4: Apply Fix
Based on root cause:
- Update package.json/requirements.txt versions
- Add missing env vars to .env.example
- Fix file permissions in CI config
- Add retry logic for network calls
- Mark flaky tests as @skip with ticket reference
- Increase resource limits in CI config
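For the network-timeout case, a generic retry wrapper is often enough — a sketch with growing backoff (the health URL is a placeholder):

```bash
# Sketch: retry a flaky command up to N times with growing backoff
retry() {
  local max=$1; shift
  local n=1
  until "$@"; do
    if (( n >= max )); then
      echo "✗ '$*' failed after $max attempts" >&2
      return 1
    fi
    echo "⚠ attempt $n failed - retrying in $(( n * 5 ))s..."
    sleep $(( n * 5 ))
    (( n++ ))
  done
}

retry 3 curl -fsS https://api.example.com/health   # hypothetical endpoint
```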
## Step 5: Verification
- Push fix to feature branch
- Monitor new pipeline run
- Compare before/after build times
- Document fix in troubleshooting guide
## Step 6: Prevention
- Add pre-commit hook to catch this earlier
- Update CI/CD workflow with better error messages
- Add this failure pattern to runbook

### 3. **Production Release Checklist**
Prepare for production release with zero-downtime checklist:
## Pre-Release Validation (30-45 min)
### Code Quality
[█░░░] Run full test suite (unit, integration, e2e)
[██░░] Check code coverage (minimum 80%)
[███░] Run security audit (Snyk, npm audit, OWASP)
[████] Verify no console.logs or debug statements
### Documentation
[█░░░] Update CHANGELOG.md with all changes since last release
[██░░] Update API documentation (OpenAPI/Swagger)
[███░] Update deployment runbook
[████] Create release notes for stakeholders
### Infrastructure
[█░░░] Verify staging environment matches production
[██░░] Run database migration dry-run
[███░] Check disk space and resource availability
[████] Verify backup systems are operational
### Dependencies
[█░░░] Update all patch-level dependencies
[██░░] Check for breaking changes in minor updates
[███░] Verify third-party API status pages
[████] Test payment gateway in sandbox
### Rollback Plan
[█░░░] Tag current production as rollback point
[██░░] Document rollback procedure
[███░] Verify rollback can complete in < 5 minutes
[████] Assign rollback decision maker
## Release Execution
If all checks pass:
1. Merge to main branch
2. Tag with semantic version: `v{major}.{minor}.{patch}`
3. Trigger production deployment
4. Monitor error rates and performance metrics
5. Send release notification to team/stakeholders
## Post-Release Monitoring (2 hours)
- Watch error tracking (Sentry, Rollbar)
- Monitor APM metrics (response times, throughput)
- Check logs for anomalies
- Verify key user flows in production
- Update status page if needed
## Rollback Trigger
Execute rollback if:
- Error rate increases > 5%
- Response time increases > 50%
- Critical feature is broken
- Payment processing fails
Rollback command: `./scripts/rollback.sh v{previous_version}`

### 4. **Configuration Drift Detection**
Detect and fix configuration drift between environments:
## Scan Target: Dev → Staging → Production
### Phase 1: Inventory Collection
For each environment, collect:
- Environment variables (.env files)
- Infrastructure config (Terraform state, CloudFormation)
- Feature flags (LaunchDarkly, etc.)
- Service versions (Docker images, npm packages)
- Database schema versions
- SSL certificates and expiration dates
- DNS records and load balancer configs
### Phase 2: Drift Detection
Compare environments and flag differences:

🔴 CRITICAL DRIFT - Production has different database schema version
   Production: v2.4.1
   Staging: v2.5.0
   Action: Hold production deployment until schemas match

🟡 WARNING - Staging missing API key
   Variable: THIRD_PARTY_API_KEY
   Exists in: Production, Dev
   Missing in: Staging
   Action: Add to Staging secrets
🟢 OK - All other configs match
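Env-var drift in particular is cheap to detect by comparing variable names (not values, which are often secrets) — a sketch assuming per-environment dotenv files:

```bash
# Sketch: variable names present in one env file but not the other.
# Empty output means no env-var drift.
comm -3 <(cut -d= -f1 .env.staging    | sort) \
        <(cut -d= -f1 .env.production | sort)
```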
### Phase 3: Sync Recommendations
Generate sync commands:
```bash
# Apply to Staging
export THIRD_PARTY_API_KEY="***"
# Apply to Production (after approval)
kubectl set image deployment/api api=myapp:v2.5.0
terraform apply -target=aws_db_instance.main
```

### Phase 4: Verification
- Run smoke tests in each environment
- Verify API endpoints return expected responses
- Check application logs for errors
- Confirm monitoring dashboards show healthy metrics
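The smoke-test step can be as simple as this sketch (the URL scheme is a placeholder):

```bash
# Sketch: hit each environment's health endpoint and report status
for env in dev staging production; do
  url="https://${env}.example.com/healthz"   # placeholder URL scheme
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$code" = "200" ] && echo "🟢 $env healthy" || echo "🔴 $env returned $code"
done
```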
Report format:
Environment Sync Report
=======================
Dev ✓ Healthy
Staging ⚠ 2 warnings fixed
Production ✓ Synced successfully
Drift Score: 97% (3% drift resolved)
Last Sync: 2025-11-10 04:20 UTC
Next Scheduled Sync: 2025-11-11 04:20 UTC
### 5. **Kubernetes Cluster Health Check**
Perform comprehensive Kubernetes cluster health assessment:
## Health Check Sequence
### 1. Node Status
```bash
kubectl get nodes -o wide
# Check: All nodes Ready, sufficient disk/memory
```
Validate:
- All nodes in Ready state
- CPU usage < 80%
- Memory usage < 85%
- Disk usage < 75%
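To check those thresholds mechanically, a sketch — CPU and memory come from metrics-server via `kubectl top`; node disk usage needs node-level metrics and is omitted here:

```bash
# Sketch: flag nodes above CPU/memory thresholds.
# Columns from `kubectl top nodes`: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
kubectl top nodes --no-headers | awk '
  { cpu = $3 + 0; mem = $5 + 0 }      # "62%" + 0 == 62 in awk
  cpu > 80 { print "⚠ " $1 " CPU at " $3 }
  mem > 85 { print "⚠ " $1 " memory at " $5 }
'
```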
### 2. Pod Health
```bash
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
# Check: No pods in CrashLoopBackOff, Error, or Pending
```
For each unhealthy pod:
- Describe pod: `kubectl describe pod {pod_name}`
- Check logs: `kubectl logs {pod_name} --previous`
- Identify issue: OOMKilled, ImagePullBackOff, etc.
- Apply fix or restart
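A loop that does the describe/logs pass over every unhealthy pod in one go (a sketch; the tail length is arbitrary):

```bash
# Sketch: describe + previous logs for every non-Running pod
kubectl get pods -A --field-selector=status.phase!=Running \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' |
while read -r ns pod; do
  echo "── $ns/$pod ──"
  kubectl describe pod -n "$ns" "$pod" | grep -A5 'Events:'
  kubectl logs -n "$ns" "$pod" --previous --tail=20 2>/dev/null || true
done
```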
### 3. Resource Quotas
```bash
kubectl describe quota --all-namespaces
# Check: No quotas exceeded
```
### 4. Network Connectivity
Test service-to-service connectivity:
```bash
kubectl run test-pod --image=curlimages/curl --rm -it -- \
  curl http://service.namespace.svc.cluster.local
```
### 5. Storage
```bash
kubectl get pv,pvc --all-namespaces
# Check: All PVCs bound, no pending volumes
```
### 6. Certificates
Check cert-manager certificates:
```bash
kubectl get certificate --all-namespaces
# Check: No certificates expiring in < 30 days
```

## Remediation
If issues found:
- Scale up nodes if resource constrained
- Restart crashlooping pods (max 3 attempts)
- Evict pods from nodes with disk pressure
- Renew expiring certificates
- Clear old replica sets and completed jobs

Report format:
Kubernetes Cluster Health Report
================================
Cluster: production-us-east-1
Nodes: 8/8 healthy ✓
Pods: 156 running, 2 issues found ⚠
- api-worker-7d8f: CrashLoopBackOff (fixing...)
- cache-redis-9b2c: OOMKilled (increasing memory limit)
Resource Usage:
CPU: 62% (healthy ✓)
Memory: 71% (healthy ✓)
Disk: 54% (healthy ✓)
Certificates: 3 expiring in 45 days ⚠
- *.api.example.com (expires: 2025-12-25)
Actions Taken:
1. Restarted api-worker-7d8f
2. Updated cache-redis memory limit: 2Gi → 4Gi
3. Queued certificate renewal for December
Overall Status: HEALTHY with minor issues resolved
### 6. **Database Migration Safety Check**
Execute database migration with zero-downtime strategy:
## Pre-Migration Validation
### 1. Backup Verification
- [ ] Full database backup completed
- [ ] Backup file integrity verified (checksum)
- [ ] Backup restore tested in separate environment
- [ ] Backup stored in 3 locations (local, S3, offsite)
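The checksum and restore-test items can be scripted — a sketch assuming PostgreSQL client tools and a scratch database that doesn't exist yet:

```bash
# Sketch: checksum the backup, then prove it restores cleanly
pg_dump mydb > backup.sql
sha256sum backup.sql > backup.sql.sha256

sha256sum -c backup.sql.sha256            # integrity check
createdb restore_test                     # scratch DB (assumed absent)
psql restore_test < backup.sql            # restore test
psql restore_test -c 'SELECT COUNT(*) FROM users;'   # spot-check a table
dropdb restore_test
```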
### 2. Migration Review
Analyze migration file for risks:
```sql
-- Check for:
❌ DROP COLUMN (data loss risk)
❌ ALTER COLUMN type (potential data truncation)
❌ Locks on large tables (downtime risk)
✓ ADD COLUMN with DEFAULT (safe)
✓ CREATE INDEX CONCURRENTLY (PostgreSQL safe)
```

### 3. Timing Analysis
- Estimate migration duration (dry-run on staging)
- Check table sizes: `SELECT pg_size_pretty(pg_total_relation_size('table_name'));`
- Identify long-running queries that could block
- Schedule during low-traffic window
### 4. Rollback Plan
Create reverse migration:
```sql
-- If migration adds column:
ALTER TABLE users DROP COLUMN new_field;
-- If migration changes data:
-- Restore from backup taken at: 2025-11-10 04:20 UTC
-- Command: psql mydb < backup_20251110_0420.sql
```

## Migration Execution
Snapshot current state:
```bash
pg_dump mydb > pre_migration_$(date +%Y%m%d_%H%M%S).sql
```
Record table counts:
```sql
-- Approximate row counts per table (exact counts need dynamic SQL)
SELECT relname AS table_name, n_live_tup AS row_count
FROM pg_stat_user_tables
ORDER BY relname;
```
Apply with transaction wrapper:
```sql
BEGIN;
\i migrations/0042_add_user_preferences.sql
-- Verify changes
SELECT COUNT(*) FROM users WHERE preferences IS NOT NULL;
COMMIT; -- or ROLLBACK if verification fails
```

## Post-Migration Verification
- Compare row counts (before vs after)
- Run application smoke tests
- Check for slow queries (new indexes working?)
- Monitor error logs for 15 minutes
- Verify data integrity constraints
## Application Rollout Order
- Deploy backward-compatible app version first
- Verify old code works with new schema
- Deploy new app version using new columns
- Monitor performance metrics
## Rollback Scenarios
### Scenario A: Migration fails mid-transaction
```bash
# Database rolled back automatically by transaction
# No action needed - investigate and retry
```
### Scenario B: App incompatible with new schema
```bash
# Rollback application only
git revert HEAD
kubectl rollout undo deployment/api
# Database schema remains (backward compatible)
```
### Scenario C: Data corruption detected
```bash
# Full rollback required
# 1. Take current state snapshot (for forensics)
pg_dump mydb > corrupted_state_$(date +%Y%m%d_%H%M%S).sql
# 2. Restore from pre-migration backup
psql mydb < pre_migration_20251110_0420.sql
# 3. Rollback application
git revert HEAD && git push
kubectl rollout undo deployment/api
# 4. Post-mortem: Analyze corrupted_state file
```

## Success Criteria
✓ Migration completed in < 30 seconds
✓ Zero data loss
✓ All tests passing
✓ Error rate unchanged
✓ Response times within 10% of baseline
✓ No rollback needed for 24 hours
---
## 🎨 Advanced Pattern: State Machine DevOps
# Intelligent DevOps State Machine
Execute DevOps workflows with state tracking and automatic recovery.
## State Definition
```javascript
const pipelineStates = {
INIT: 'initializing',
CHECK: 'checking_status',
BUILD: 'building',
TEST: 'testing',
DEPLOY: 'deploying',
VERIFY: 'verifying',
ROLLBACK: 'rolling_back',
SUCCESS: 'completed',
FAILED: 'failed'
};
```
Load persisted state from: `.devops-state.json`

## State Transitions
INIT → CHECK → BUILD → TEST → DEPLOY → VERIFY → SUCCESS
  ↓       ↓       ↓       ↓        ↓        ↓
  └───────┴───────┴───────┴────────┴────────→ ROLLBACK → FAILED
## State: CHECK
Actions:
- `git status` - verify clean working directory
- `git fetch origin` - check for upstream changes
- Check CI/CD platform API for last build status
- Parse build logs for errors
Transitions:
- If checks pass → BUILD
- If checks fail → Fix issues, stay in CHECK
- If critical failure → FAILED
## State: BUILD
Actions:
- `npm run build` or equivalent
- Monitor build output for warnings
- Verify build artifacts created
- Check bundle size (< 500KB for frontend)
Transitions:
- If build succeeds → TEST
- If build fails → ROLLBACK or stay in BUILD (auto-retry 3x)
## State: TEST
Actions:
- `npm test -- --coverage`
- Run integration tests
- Run e2e tests (Playwright/Cypress)
- Generate coverage report (minimum 80%)
Transitions:
- If all tests pass → DEPLOY
- If tests fail → Fix and return to BUILD
- If > 3 failures → ROLLBACK
## State: DEPLOY
Actions:
- Tag release: `git tag v1.2.3`
- Push to deployment branch
- Trigger deployment pipeline
- Monitor deployment progress
Transitions:
- If deploy succeeds → VERIFY
- If deploy fails → ROLLBACK
## State: VERIFY
Actions:
- Run smoke tests against deployed environment
- Check health endpoints
- Monitor error rates for 10 minutes
- Verify key metrics (response time, throughput)
Transitions:
- If verification passes → SUCCESS
- If verification fails → ROLLBACK
## State: ROLLBACK
Actions:
- Revert to previous stable version: `git revert HEAD` or redeploy previous tag
- Restore database if migrations were applied
- Notify team of rollback
Transitions:
- Always → FAILED (requires manual intervention)
## State: SUCCESS
Actions:
- Update CHANGELOG.md
- Send success notification
- Archive build artifacts
- Clean up temporary files
- Update `.devops-state.json` with SUCCESS
## State: FAILED
Actions:
- Generate detailed failure report
- Attach logs and stack traces
- Create GitHub issue with failure details
- Notify on-call engineer
- Update `.devops-state.json` with FAILED
## State Persistence
Save after each state change:
```json
{
"currentState": "VERIFY",
"previousState": "DEPLOY",
"timestamp": "2025-11-10T04:20:00Z",
"attemptCount": 1,
"metadata": {
"commitHash": "abc123",
"branch": "feature/new-checkout",
"triggeredBy": "j.",
"buildNumber": "456"
},
"history": [
"INIT → CHECK → BUILD → TEST → DEPLOY → VERIFY"
]
}
```

## Crash Recovery
If the Claude process is interrupted:
- Load `.devops-state.json`
- Resume from `currentState`
- Replay last action with idempotency checks
- Continue state machine execution
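A resume shim in shell form might look like this sketch — the `run_*` step functions are hypothetical stand-ins for the state actions above:

```bash
# Sketch: resume the state machine from the persisted state file
STATE_FILE=.devops-state.json
current=$(jq -r '.currentState // "INIT"' "$STATE_FILE" 2>/dev/null || echo INIT)
echo "Resuming from state: $current"

case "$current" in
  INIT|CHECK) run_checks ;;   # hypothetical step functions
  BUILD)      run_build  ;;
  TEST)       run_tests  ;;
  DEPLOY)     run_deploy ;;
  VERIFY)     run_verify ;;
  *)          echo "State '$current' requires manual intervention" ;;
esac
```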
## Status Report Format
╔════════════════════════════════════════╗
║ DevOps State Machine - Status Report ║
╚════════════════════════════════════════╝
Current State: VERIFY
Progress: [████████░] 8/9 steps
State History:
✓ INIT (00:00:02)
✓ CHECK (00:00:15)
✓ BUILD (00:02:34)
✓ TEST (00:05:12)
✓ DEPLOY (00:03:45)
⟳ VERIFY (00:00:30) - In Progress
Next Actions:
- Monitoring error rates (8/10 min complete)
- Checking response times
- Validating user flows
Estimated completion: 2 minutes
---
## 🔧 Prompt Engineering Best Practices for DevOps
### 1. **Always Include State Checking**
Before taking any action:
1. Check current state
2. Verify preconditions
3. Document assumptions
### 2. **Make Steps Idempotent**
Each step should be safely repeatable:
- Check if already done before executing
- Use `if not exists` patterns
- Don't fail on already-completed steps

### 3. **Show Progress, Not Just Activity**
Show real-time progress:
[█████░░░░░] 50% - Running tests (234/500)
Not just:
Running tests...

### 4. **Define Clear Exit Conditions**
Explicit exit conditions:
✓ All tests pass (347/347)
✓ Coverage > 80% (actual: 87%)
✓ Build size < 500KB (actual: 423KB)
✓ No security vulnerabilities (0 found)
Result: PASS - Proceed to next stage

### 5. **Set Guardrails**
Maximum retry attempts: 3
Timeout per step: 5 minutes
Escalation: After 2 failures, require manual approval
Emergency abort: Type 'ABORT' to stop immediately

### 6. **Log Everything**
Create audit trail:
- Timestamp each action
- Save command outputs
- Capture error messages
- Record state changes
- Store in: `logs/devops-{timestamp}.log`

---

## 🚀 Quick-Fire Prompts

### Morning Health Check
Give me a 30-second DevOps health check:
1. Last commit status
2. CI/CD pipeline status
3. Production health metrics
4. Any critical alerts
Format: Traffic light (🟢/🟡/🔴) with one-line summary each.

### Emergency Rollback
EMERGENCY: Rollback production immediately
1. Identify current production version
2. Identify last known good version
3. Execute rollback
4. Verify rollback succeeded
5. Notify team
Time budget: 5 minutes maximum

### Safe Dependency Update
Update all dependencies safely:
1. Check for available updates
2. Categorize: patch, minor, major
3. Update patches automatically
4. Test minors in separate branch
5. Create tickets for majors
6. Run full test suite
7. Create PR if all tests pass

---

## ⚡ Performance Tips
- **Use Checkpoints**: Save state after each major step so you can resume if interrupted.
- **Parallel Where Possible**: Run independent checks concurrently - linting & formatting, unit tests, security scans - and wait for all to complete before proceeding (see the sketch after this list).
- **Fail Fast**: Check cheapest/fastest validations first. Order of checks (fast → slow):
  1. Syntax validation (1s)
  2. Linting (5s)
  3. Unit tests (30s)
  4. Integration tests (2m)
  5. E2E tests (10m)
- **Provide Context**: Help Claude understand the bigger picture:
  > Project: E-commerce checkout flow
  > Stack: Next.js 14, PostgreSQL, Redis
  > Deployment: Vercel
  > Critical path: Payment processing
  > Now run pre-deployment checks...
- **Version Your Prompts**: Keep prompt templates in version control:
  .claude/
  ├── prompts/
  │   ├── deploy-checklist-v1.2.md
  │   ├── rollback-procedure-v2.0.md
  │   └── health-check-v1.5.md
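The parallel pattern from the second tip, sketched in bash (the three checks are examples — substitute your own):

```bash
#!/usr/bin/env bash
# Sketch: run independent checks concurrently; fail if any one fails
pids=()
npx eslint .  & pids+=($!)
npm test      & pids+=($!)
npm audit     & pids+=($!)

status=0
for pid in "${pids[@]}"; do
  wait "$pid" || status=1   # collect each job's exit code
done

(( status == 0 )) && echo "✓ All parallel checks passed"
exit "$status"
```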
## 📊 Metrics to Track
Add to your prompts:
After completion, report:
- Total execution time
- Number of retries needed
- Commands executed (count)
- Data transferred (bytes)
- Cost estimate (API calls, compute time)
- Success rate (this run vs historical)

---

## Bottom Line
Your iterative check pattern is solid. Enhance it with:
- State persistence (so you can resume)
- Parallel execution (where safe)
- Better progress indicators (real-time updates)
- Automatic error classification (transient vs permanent)
- Contextual help (suggest fixes based on error type)
This creates a robust, resumable DevOps automation system that Claude can execute reliably.