Why You’ll Love Working With Us:

True Blameless Culture: We tackle incidents as a team. Our policy is strict: fix the incident first, investigate the root cause later, and never point fingers.
100% Cloud & Massive Scale: Our systems run entirely on Google Cloud (GCP) and Google Kubernetes Engine (GKE), including automatic scale-up for high-traffic events.
AI Integration: We actively use AI to speed up daily tasks, automate log analysis and troubleshooting, and accelerate software releases.
Empowerment & Trust: Access rights start at a minimum but scale up based on your capability. Master the system, and you’ll be granted the highest level of system access.

Key Responsibilities (50% Automation / 50% Operations):

This is a key role requiring solid engineering knowledge, production experience, and hands-on implementation ability. You will:

Act as the first line of defense for incident handling, tackling issues manually and promptly when they occur.
Ensure the highest levels of production system performance, availability, and scalability.
Automate the provisioning of cloud infrastructure, systems, and software.
Design and operate build & release pipelines, configuration management, and code deployments to multiple environments.
Work closely with the development team to integrate new deployment processes and strategies.
Identify problems and opportunities in critical, high-impact areas and act on them.

Your First 6 Months:

Months 1-2 (Learning Phase): Dedicate time to learning Chợ Tốt's core infrastructure. We will sponsor your Coursera courses so you can study for and pass the mandatory Google Cloud and Kubernetes (K8s) certifications. You will gain a working understanding of the infrastructure across all 3 environments.
Months 3-6 (Execution Phase): Fully master the infrastructure, especially Production. You will handle support requests from Engineers, take on Group-level tasks, and participate in on-call duties.

Site Reliability Engineer (GCP)

Job Description