A Terraform-managed AWS infrastructure that demonstrates self-healing capabilities through auto-scaling, chaos engineering, and automated monitoring.
- Auto Scaling: Automatically scales EC2 instances based on CPU utilization
- Health Checks: Application Load Balancer monitors
/actuator/healthendpoint - Instance Replacement: Failed instances are automatically replaced
- Cross-AZ Deployment: High availability across multiple availability zones
- CPU Stress Testing: Simulates high CPU load (80% → 90% → 100%)
- Instance Termination: Tests ASG recovery by stopping random instances
- AWS FIS Integration: Fault Injection Simulator for controlled chaos
- CloudWatch Dashboard: Real-time metrics visualization
- CPU Alarms: Triggers when average CPU > 70%
- SNS Notifications: Email alerts for critical events
- Custom Metrics: ASG instance activity tracking
| Component | Purpose | Technology |
|---|---|---|
| VPC | Network isolation | AWS VPC with public subnets |
| Load Balancer | Traffic distribution | Application Load Balancer |
| Auto Scaling | Dynamic scaling | Auto Scaling Groups |
| Application | Business logic | Spring Boot on EC2 |
| Chaos Testing | Resilience testing | AWS FIS + SSM |
| Monitoring | Observability | CloudWatch + SNS |
- AWS CLI configured with appropriate permissions
- Terraform >= 1.0
- Git
git clone https://github.com/isaactony/self-healing-infra.git
cd self-healing-infraterraform init
terraform plan
terraform apply# Start CPU stress experiment
aws fis start-experiment --experiment-template-id you_template_id
# Start instance termination experiment
aws fis start-experiment --experiment-template-id your_template_id- CloudWatch Dashboard: View real-time metrics
- SNS Notifications: Check email for alerts
- FIS Console: Monitor experiment progress
Access the dashboard to view:
- Auto Scaling Group instance activity
- Average CPU utilization
- High CPU alarm status
AWS/AutoScaling/GroupInServiceInstancesAWS/EC2/CPUUtilizationAWS/ELB/TargetResponseTime
- Phase 1: 80% CPU load for 5 minutes
- Phase 2: 90% CPU load for 5 minutes
- Phase 3: 100% CPU load for 5 minutes
- Expected Result: ASG scales up to handle load
- Action: Stop 1 random instance
- Duration: 5 minutes
- Expected Result: ASG launches replacement instance
desired_capacity: Number of instances (default: 5)max_size: Maximum instances (default: 5)min_size: Minimum instances (default: 2)
terraform destroy- IAM Roles: Least privilege access for EC2 and FIS
- Security Groups: Restrictive inbound/outbound rules
- SSM Agent: Secure command execution via Systems Manager
- No Hardcoded Secrets: All sensitive data excluded from version control
FIS Experiment Fails:
- Ensure SSM Agent is running on instances
- Verify IAM permissions for FIS role
Auto Scaling Not Working:
- Check CloudWatch alarms configuration
- Verify ASG health check settings
Application Not Responding:
- Check security group rules
- Verify target group health checks
Built with ❤️ using Terraform and AWS