We are seeking a senior cloud engineer to ensure the reliability, scalability, and performance of our cloud infrastructure. This role combines software engineering skills with cloud operations expertise, focusing on automation, monitoring, incident response, and performance optimization across AWS services.
- Ensure Availability & Reliability: Maintain highly available and resilient cloud infrastructure, meeting agreed SLO targets.
- Monitoring & Alerting: Configure and optimize monitoring and alerting solutions to detect anomalies early and maintain system health.
- Performance Optimization: Analyze and tune system performance, networking, and workloads to improve efficiency and reduce operational costs.
- Incident Response & Change Requests: Respond to infrastructure incidents, perform root cause analysis, and implement permanent preventive solutions. Carry out change requests.
- Collaboration with DevOps & Development Teams: Partner with ACP Platform teams, contribute to product design, and provide operational feedback to ensure seamless service delivery and support operations.
- Disaster Recovery & Resilience Engineering: Lead disaster recovery efforts and ensure that backup, replication, and failover plans across AWS Regions are well maintained and tested for business continuity.
- Postmortem & Continuous Improvement: Document incidents, update runbooks, and improve processes based on lessons learned.
- Automation: Build self-healing systems and automated remediation workflows to reduce manual intervention.
- Capacity Planning: Forecast and optimize AWS resources to handle traffic spikes using Auto Scaling and Load Balancing.
- Security & Compliance: Follow security best practices and ensure compliance with Accor security standards.
- Infrastructure-as-Code (IaC): Maintain IaC using tools such as Terraform, CloudFormation, or AWS CDK to deploy and manage AWS resources.