isaactony/self-healing-infra

Self-Healing AWS Infrastructure

A Terraform-managed AWS infrastructure that demonstrates self-healing capabilities through auto-scaling, chaos engineering, and automated monitoring.

Features

Self-Healing Capabilities

  • Auto Scaling: Automatically scales EC2 instances based on CPU utilization
  • Health Checks: Application Load Balancer monitors /actuator/health endpoint
  • Instance Replacement: Failed instances are automatically replaced
  • Cross-AZ Deployment: High availability across multiple availability zones
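
The load balancer's health check can be reproduced by hand. The sketch below assumes the application returns the standard Spring Boot Actuator JSON (for example `{"status":"UP"}`); `check_health` and its base-URL argument are illustrative names, not part of this repository.

```shell
# check_health BASE_URL
# Prints the top-level "status" value from /actuator/health.
# BASE_URL is an assumption: pass whatever address reaches the load balancer.
# Prints nothing if the request fails or the field is missing.
check_health() {
  # -f: treat HTTP errors as failures, -sS: quiet but still report errors
  curl -fsS --max-time 5 "$1/actuator/health" |
    grep -o '"status" *: *"[^"]*"' |  # every "status":"..." pair
    head -n 1 |                       # the first one is the top-level status
    cut -d '"' -f 4                   # keep only the value
}
```

Usage: `check_health "http://<load-balancer-dns-name>"` should print `UP` while the target is healthy.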

Chaos Engineering

  • CPU Stress Testing: Simulates high CPU load (80% → 90% → 100%)
  • Instance Termination: Tests ASG recovery by stopping random instances
  • AWS FIS Integration: AWS Fault Injection Service (formerly Fault Injection Simulator) for controlled chaos

Monitoring & Alerting

  • CloudWatch Dashboard: Real-time metrics visualization
  • CPU Alarms: Triggers when average CPU > 70%
  • SNS Notifications: Email alerts for critical events
  • Custom Metrics: ASG instance activity tracking

Infrastructure Components

| Component     | Purpose              | Technology                  |
|---------------|----------------------|-----------------------------|
| VPC           | Network isolation    | AWS VPC with public subnets |
| Load Balancer | Traffic distribution | Application Load Balancer   |
| Auto Scaling  | Dynamic scaling      | Auto Scaling Groups         |
| Application   | Business logic       | Spring Boot on EC2          |
| Chaos Testing | Resilience testing   | AWS FIS + SSM               |
| Monitoring    | Observability        | CloudWatch + SNS            |

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Terraform >= 1.0
  • Git

Quick Start

1. Clone the Repository

git clone https://github.com/isaactony/self-healing-infra.git
cd self-healing-infra

2. Deploy Infrastructure

terraform init
terraform plan
terraform apply
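
For unattended runs, one option is to save the plan and apply exactly that file, so nothing changes between review and apply. This is a sketch rather than the repository's own workflow; the `deploy` function and the `tfplan` file name are illustrative.

```shell
# deploy
# Runs init, plan, and apply without prompting, stopping at the first failure.
# The plan is written to ./tfplan (an arbitrary file name) and then applied as-is.
deploy() {
  terraform init -input=false &&
    terraform plan -input=false -out=tfplan &&
    terraform apply -input=false tfplan
}
```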

3. Run Chaos Experiments

# Start the CPU stress experiment (use the ID of the CPU stress template)
aws fis start-experiment --experiment-template-id your_template_id

# Start the instance termination experiment (use the ID of the termination template)
aws fis start-experiment --experiment-template-id your_template_id
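
If the template IDs are not at hand, they can be looked up with the AWS CLI. The helper names below are illustrative; the sketch assumes the templates were created in the region your CLI is configured for.

```shell
# list_fis_templates
# Prints one line per FIS experiment template: its ID, then its description.
list_fis_templates() {
  aws fis list-experiment-templates \
    --query 'experimentTemplates[].[id, description]' \
    --output text
}

# start_fis_experiment TEMPLATE_ID
# Starts an experiment from the given template and prints the experiment ID.
start_fis_experiment() {
  aws fis start-experiment \
    --experiment-template-id "$1" \
    --query 'experiment.id' \
    --output text
}
```

Usage: run `list_fis_templates`, pick the ID whose description matches the experiment you want, then pass it to `start_fis_experiment`.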

4. Monitor Results

  • CloudWatch Dashboard: View real-time metrics
  • SNS Notifications: Check email for alerts
  • FIS Console: Monitor experiment progress
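
Experiment progress can also be followed from a terminal instead of the console. A minimal sketch, assuming you kept the experiment ID returned by `aws fis start-experiment`; `wait_for_experiment` is an illustrative name.

```shell
# wait_for_experiment EXPERIMENT_ID [INTERVAL_SECONDS]
# Polls the experiment status until it reaches a final state.
# Returns 0 on "completed" and 1 on "stopped", "failed", or "cancelled".
wait_for_experiment() {
  while :; do
    status=$(aws fis get-experiment --id "$1" \
      --query 'experiment.state.status' --output text) || return 1
    echo "experiment $1: $status"
    case "$status" in
      completed) return 0 ;;
      stopped|failed|cancelled) return 1 ;;
    esac
    sleep "${2:-30}"  # default: poll every 30 seconds
  done
}
```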

Monitoring

CloudWatch Dashboard

Access the dashboard to view:

  • Auto Scaling Group instance activity
  • Average CPU utilization
  • High CPU alarm status

Key Metrics

  • AWS/AutoScaling/GroupInServiceInstances
  • AWS/EC2/CPUUtilization
  • AWS/ApplicationELB/TargetResponseTime
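
The same metrics can be pulled with the AWS CLI when the dashboard is not available. The sketch below reads average CPU for the whole Auto Scaling group; the function name and the group name you pass in are assumptions, since the README does not state what Terraform names the group.

```shell
# asg_avg_cpu ASG_NAME START END
# Prints 5-minute average CPUUtilization for the group, oldest datapoint first.
# START and END are ISO 8601 timestamps, e.g. 2024-01-01T00:00:00Z.
asg_avg_cpu() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=AutoScalingGroupName,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Average \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Average]' \
    --output text
}
```

During a CPU stress run, the printed averages should climb past the 70% alarm threshold.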

Chaos Experiments

CPU Stress Test

  • Phase 1: 80% CPU load for 5 minutes
  • Phase 2: 90% CPU load for 5 minutes
  • Phase 3: 100% CPU load for 5 minutes
  • Expected Result: ASG scales up to handle load

Instance Termination Test

  • Action: Stop 1 random instance
  • Duration: 5 minutes
  • Expected Result: ASG launches replacement instance
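
To confirm that a replacement was launched, the group's recent scaling activities can be listed. A sketch, with an illustrative function name; the group name is whatever your Terraform configuration assigns.

```shell
# recent_scaling_activities ASG_NAME [COUNT]
# Shows the latest scaling activities (default 5) with status and description.
recent_scaling_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --max-items "${2:-5}" \
    --query 'Activities[].[StartTime, StatusCode, Description]' \
    --output table
}
```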

Configuration

Key Variables

  • desired_capacity: Number of instances (default: 5)
  • max_size: Maximum instances (default: 5)
  • min_size: Minimum instances (default: 2)
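
These variables can be overridden on the command line with Terraform's `-var` flag. The sketch below first checks that the values are ordered sensibly; `check_capacity` and `apply_capacity` are illustrative helpers, not part of this repository.

```shell
# check_capacity MIN DESIRED MAX
# Succeeds only when MIN <= DESIRED <= MAX.
check_capacity() {
  if [ "$1" -le "$2" ] && [ "$2" -le "$3" ]; then
    echo "ok: min=$1 desired=$2 max=$3"
  else
    echo "invalid: need min <= desired <= max (got min=$1 desired=$2 max=$3)" >&2
    return 1
  fi
}

# apply_capacity MIN DESIRED MAX
# Validates the values, then passes them to terraform apply as overrides.
apply_capacity() {
  check_capacity "$1" "$2" "$3" || return 1
  terraform apply \
    -var "min_size=$1" \
    -var "desired_capacity=$2" \
    -var "max_size=$3"
}
```

With the defaults listed above, desired_capacity equals max_size, so the group starts at its ceiling. Something like `apply_capacity 2 2 5` would leave headroom for the CPU stress test to produce a visible scale-out.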

Cleanup

terraform destroy

Security

  • IAM Roles: Least privilege access for EC2 and FIS
  • Security Groups: Restrictive inbound/outbound rules
  • SSM Agent: Secure command execution via Systems Manager
  • No Hardcoded Secrets: All sensitive data excluded from version control

Troubleshooting

Common Issues

FIS Experiment Fails:

  • Ensure SSM Agent is running on instances
  • Verify IAM permissions for FIS role
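
Both checks can be done from the CLI. In this sketch the function names are illustrative and the FIS role name is an assumption you must supply, since the README does not name it.

```shell
# ssm_ping_status
# Lists managed instances with their SSM Agent ping status and agent version.
# Instances used in experiments should report "Online".
ssm_ping_status() {
  aws ssm describe-instance-information \
    --query 'InstanceInformationList[].[InstanceId, PingStatus, AgentVersion]' \
    --output text
}

# fis_role_policies ROLE_NAME
# Lists the managed policies attached to the given IAM role.
fis_role_policies() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}
```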

Auto Scaling Not Working:

  • Check CloudWatch alarms configuration
  • Verify ASG health check settings
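
A sketch of how both items might be inspected from the CLI; the function names are illustrative and the group name is whatever your Terraform configuration assigns.

```shell
# alarms_in_state [STATE]
# Lists CloudWatch metric alarms in the given state (default: ALARM).
alarms_in_state() {
  aws cloudwatch describe-alarms \
    --state-value "${1:-ALARM}" \
    --query 'MetricAlarms[].[AlarmName, StateValue, Threshold]' \
    --output text
}

# asg_health_settings ASG_NAME
# Prints health check type, grace period, and min/desired/max sizes.
asg_health_settings() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[].[HealthCheckType, HealthCheckGracePeriod, MinSize, DesiredCapacity, MaxSize]' \
    --output text
}
```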

Application Not Responding:

  • Check security group rules
  • Verify target group health checks
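
Target health can be read directly from the load balancer. A sketch, assuming you know the target group ARN (for example from the EC2 console); `target_health` is an illustrative name.

```shell
# target_health TARGET_GROUP_ARN
# Prints each registered target with its health state and, if unhealthy, the reason code.
target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id, TargetHealth.State, TargetHealth.Reason]' \
    --output text
}
```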

Built with ❤️ using Terraform and AWS
