Role Overview:
Looking for a Site Reliability Engineer (SRE) to support the reliability, availability, and performance of business-critical systems. This role focuses on AWS cloud infrastructure, DevOps tools, and core SRE practices. You will work closely with development, platform, and operations teams to ensure systems are stable, scalable, and well monitored.
Key Responsibilities:
Reliability & Operations:
- Support high availability, scalability, and performance of production systems
- Implement and maintain SLIs, SLOs, and SLAs for services
- Identify and reduce operational toil through automation and process improvement
- Support design and implementation of fault-tolerant and resilient systems
- Manage and operate systems hosted on AWS (EC2, EKS/ECS, RDS, S3, Lambda, CloudWatch, IAM, VPC)
- Support cloud deployments and infrastructure changes following best practices
- Assist with backup, disaster recovery, and resiliency planning
- Work with CI/CD pipelines and DevOps tools to support reliable deployments
- Use Infrastructure as Code tools such as Terraform or CloudFormation
- Automate repetitive tasks using scripts (Python, Bash, etc.)