AWS SysOps: Automating Incident Response with CloudWatch, Lambda, and SNS

During my recent attempt at the AWS Certified SysOps Administrator exam, I encountered several questions focused on monitoring and automation. Most of them revolved around CloudWatch, CloudTrail, and responding to incidents using Lambda, SNS, or SQS.

One scenario stood out: developers were dealing with a memory leak in an application. While they worked on a fix, the SysOps Administrator needed to keep the service running with minimal disruption. This inspired me to write a post detailing how AWS tools can help us automate operational responses like this.

🧠 Real-World Scenario: Memory Leak in Production

Let’s say an EC2 instance is running an application that occasionally crashes due to memory leaks. The development team acknowledges the issue but needs more time to patch it.

Goal: Keep the application available by automatically restarting it when memory usage reaches a critical threshold.

🧰 Key AWS Services for Automation and Monitoring

Before we build the solution, let’s briefly cover the core AWS services used by SysOps Administrators:

Amazon CloudWatch: A monitoring service that collects metrics, logs, and events from AWS resources and custom applications. It enables you to create alarms based on thresholds (like CPU or memory usage) and trigger automated actions.
AWS Lambda: A serverless compute service that lets you run code in response to events without provisioning or managing servers. It’s commonly used to automate remediation tasks, such as restarting a service or sending notifications.
Amazon SNS (Simple Notification Service): A fully managed pub/sub messaging service. It allows you to send messages (notifications) to multiple subscribers (like Lambda functions, email addresses, or SMS) when something happens.
Amazon SQS (Simple Queue Service): A managed message queue that decouples microservices, distributed systems, and serverless apps. It’s ideal for handling asynchronous workflows or buffering high-volume events for later processing.
AWS CloudTrail: A logging service that records API calls and changes made within your AWS account (management events by default; data events must be enabled separately). It’s essential for auditing and governance, and its records can drive automation based on specific actions, like resource creation or policy changes.

🔍 Step 1: Monitoring with CloudWatch

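Create a CloudWatch alarm on a memory metric. EC2 does not publish memory usage by default, so the metric has to come from the CloudWatch Agent (which reports mem_used_percent in the CWAgent namespace on Linux) or from a custom metric such as MemoryUtilization. The alarm must use the same namespace, metric name, and dimensions as the published metric; otherwise it never receives data. Adjust the dimensions below to match your agent configuration:

aws cloudwatch put-metric-alarm \
  --alarm-name "HighMemoryUsage" \
  --metric-name "mem_used_percent" \
  --namespace "CWAgent" \
  --dimensions Name=InstanceId,Value=i-0abcdef1234567890 \
  --statistic Average \
  --period 60 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:NotifyOps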
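
To make the --period 60 and --evaluation-periods 2 settings concrete, here is a simplified Python model of when this alarm would fire. It is only a sketch: the function name breaches_alarm is hypothetical, and the model ignores details such as missing-data handling and the separate "datapoints to alarm" setting.

```python
def breaches_alarm(averages, threshold=85.0, evaluation_periods=2):
    """Return True if the most recent datapoints would put the alarm in ALARM.

    `averages` is a list of per-period averages (one value per 60-second
    period), oldest first. Simplified model: every one of the last
    `evaluation_periods` values must be strictly greater than the threshold,
    mirroring GreaterThanThreshold.
    """
    if len(averages) < evaluation_periods:
        # Not enough data yet to evaluate the alarm.
        return False
    recent = averages[-evaluation_periods:]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    print(breaches_alarm([70.0, 88.0, 91.0]))  # True: last two periods exceed 85
    print(breaches_alarm([88.0, 91.0, 84.0]))  # False: latest period is below 85
```

In other words, a single one-minute spike does not trigger remediation; memory has to stay above 85% for two consecutive minutes.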

📣 Step 2: Notification with SNS

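The alarm publishes to an SNS topic, which lets us notify people and trigger automated workflows from the same event. Create the topic before the alarm, because the alarm’s --alarm-actions value is the topic’s ARN: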

aws sns create-topic --name NotifyOps

Then, subscribe a Lambda function to this topic.
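
A minimal sketch of that subscription in Python might look like the following. The function name and the StatementId value are made up for illustration; the two clients are passed in so the same code works with boto3.client("sns") and boto3.client("lambda") or with test doubles. Besides the subscription itself, the function needs a resource-based permission that lets SNS invoke it.

```python
def subscribe_lambda_to_topic(sns, lambda_client, topic_arn, function_arn):
    """Subscribe a Lambda function to an SNS topic and allow SNS to invoke it.

    `sns` and `lambda_client` are expected to behave like boto3 SNS and
    Lambda clients.
    """
    # Allow the SNS service to invoke the function, scoped to this one topic.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId="AllowInvokeFromNotifyOps",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=topic_arn,
    )
    # Register the function as a subscriber of the topic.
    response = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="lambda",
        Endpoint=function_arn,
    )
    return response["SubscriptionArn"]
```

With real clients, this would be called with the NotifyOps topic ARN and the ARN of your remediation function.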

🛠️ Step 3: Remediation with Lambda

SNS invokes your Lambda function each time the alarm publishes to the topic. The function can then perform actions such as:

Restarting a service via SSM (for EC2)
Rebooting the instance
Starting a new container task (for ECS)
Scaling out by launching another instance (if using Auto Scaling)

Example: Restarting EC2 Service via SSM

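import boto3

ssm = boto3.client('ssm')

def lambda_handler(event, context):
    # Placeholder instance ID; replace it with the instance running the application.
    instance_id = "i-0abcdef1234567890"
    command = "sudo systemctl restart myapp"
    # Ask Systems Manager Run Command to execute the restart on the instance.
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [command]},
    )
    # Return only the command ID: the raw response contains datetime objects,
    # which Lambda cannot serialize to JSON.
    return {"CommandId": response["Command"]["CommandId"]}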

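Make sure SSM Agent is installed and running on the instance, that the instance profile lets Systems Manager manage it (for example, through the AmazonSSMManagedInstanceCore managed policy), and that the Lambda execution role allows ssm:SendCommand.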
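
The handler above hard-codes the instance ID. If the alarm is defined with an InstanceId dimension, the ID can instead be read from the alarm notification that SNS delivers. The sketch below assumes the usual shape of a CloudWatch alarm message (a JSON string under Records[0].Sns.Message with a Trigger.Dimensions list); the function name is hypothetical.

```python
import json


def instance_id_from_sns_event(event, default=None):
    """Extract the InstanceId dimension from a CloudWatch alarm delivered via SNS.

    Returns `default` when the alarm carries no InstanceId dimension.
    """
    # SNS wraps the alarm payload as a JSON-encoded string.
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    for dimension in alarm.get("Trigger", {}).get("Dimensions", []):
        if dimension.get("name") == "InstanceId":
            return dimension.get("value", default)
    return default
```

Inside lambda_handler, the result could replace the fixed instance_id, so one function can serve alarms for several instances.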

📜 Bonus: Audit Trails with CloudTrail

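While CloudWatch handles metrics, CloudTrail records API activity in your account. Combined with EventBridge rules or CloudWatch Logs metric filters, those records can act as event-based triggers. You can detect actions like: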

Unauthorized access
API failures
EC2 reboots

CloudTrail logs can be delivered to S3 or streamed to CloudWatch Logs, then queried with Athena or CloudWatch Logs Insights. GuardDuty also analyzes CloudTrail events for threat detection.
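
As an illustration of what that analysis looks for, here is a small Python sketch that scans the JSON text of a CloudTrail log file (already decompressed) for the three kinds of events listed above. The category labels and the function name are made up for this example, and the error-code matching (codes starting with AccessDenied or ending with UnauthorizedOperation) is an assumption about how denied calls typically appear, not an exhaustive rule.

```python
import json


def find_events_of_interest(log_file_text):
    """Return (category, eventName, eventTime) tuples from a CloudTrail log file."""
    records = json.loads(log_file_text).get("Records", [])
    findings = []
    for record in records:
        name = record.get("eventName")
        when = record.get("eventTime")
        error = record.get("errorCode")
        if error:
            # Failed calls: separate permission denials from other API errors.
            denied = error.startswith("AccessDenied") or error.endswith(
                "UnauthorizedOperation"
            )
            category = "unauthorized" if denied else "api_failure"
            findings.append((category, name, when))
        elif name == "RebootInstances":
            # Successful EC2 reboot request.
            findings.append(("ec2_reboot", name, when))
    return findings
```

The same conditions could be expressed as a CloudWatch Logs metric filter or an Athena query; the Python version is just easier to read as a first pass.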

🎯 Alternative Remediation Strategies

Send alerts to SQS → processed by a central orchestrator Lambda
Trigger a Systems Manager Automation document
Launch a replacement EC2 instance from an AMI
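
As a sketch of the Systems Manager Automation option, the function below starts the AWS-managed AWS-RestartEC2Instance runbook for one instance. The function name is hypothetical, the client is passed in (for example boto3.client("ssm")), and the caller’s role is assumed to be allowed to start automation executions. Note that this runbook stops and starts the whole instance, which is heavier than restarting a single service.

```python
def restart_instance_with_automation(ssm, instance_id):
    """Start the AWS-RestartEC2Instance runbook and return the execution ID.

    `ssm` is expected to behave like a boto3 SSM client.
    """
    response = ssm.start_automation_execution(
        DocumentName="AWS-RestartEC2Instance",
        # Automation parameters are passed as lists of strings.
        Parameters={"InstanceId": [instance_id]},
    )
    return response["AutomationExecutionId"]
```

The returned execution ID can be logged so the restart is traceable later in the Systems Manager console.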

💡 Final Thoughts

This kind of operational automation isn’t just useful; it’s essential. Whether you’re preparing for the AWS SysOps certification or managing production systems, learning how to detect, notify, and remediate using AWS-native tools can save hours of manual work and keep your services reliable. If you’ve failed the exam like I did, that’s okay. Use the feedback to go deeper into areas like CloudWatch alarms, SNS topics, SSM automation, and Lambda integrations. Keep studying, keep building. You’re getting closer with every iteration.