Role Overview
We are seeking a Senior Site Reliability Engineer with strong experience in building and
maintaining scalable, resilient systems. The ideal candidate will have hands-on expertise in
cloud-native technologies, infrastructure as code, observability, and automation, with a
focus on Google Cloud Platform (GCP).
Key Responsibilities
- Ensure the stability and reliability of cloud-native applications deployed on GCP, containerized with Docker and orchestrated via Kubernetes.
- Define, implement, and monitor SLIs, SLOs, and SLAs to measure system performance and user experience.
- Automate infrastructure provisioning using Terraform and manage Kubernetes configurations with Kustomize and Helm.
- Develop and maintain monitoring and alerting systems using Datadog and GCP-native tools.
- Conduct incident analysis and postmortems to drive continuous improvement.
- Collaborate with development teams to integrate reliability practices into CI/CD pipelines built on GitHub Actions.
- Manage and troubleshoot database systems, particularly PostgreSQL and Cassandra.
- Apply networking knowledge and Linux system administration skills to troubleshoot and optimize system connectivity and performance.