This engineer is expected to lead by example through hands-on contributions, deep technical expertise, and cross-team influence, particularly in the area of infrastructure bootstrap orchestration and automation at scale.
Key Responsibilities:
Platform Ownership & Reliability:
Own the end-to-end lifecycle (design, provisioning, upgrades, and decommissioning) of core platform components, including:
- Cloud infrastructure primitives
- Kubernetes clusters and cluster services
- Networking, ingress, and service discovery
- Service mesh and supporting data-plane components
Ensure platform components are resilient by design, applying SRE principles such as:
- Fault isolation and graceful degradation
- Capacity planning and saturation control
- Reduced operational toil and clear failure modes
Continuously assess and mitigate reliability risks, proactively improving platform stability and operational readiness.
Infrastructure Bootstrap & Automation Leadership:
Lead the design and implementation of infrastructure bootstrap orchestration, including:
- Automated cluster and environment provisioning
- Deterministic, repeatable platform bring-up and teardown
- Dependency-aware orchestration across cloud, network, and Kubernetes layers
Drive a strong Infrastructure-as-Code and GitOps-first approach, ensuring:
- Platform components are reproducible and auditable
- Changes are automated, testable, and reversible
- Manual intervention is minimized or eliminated
Identify automation gaps and lead initiatives that significantly reduce human effort, onboarding time, and operational risk.
SRE Practices & Operational Excellence:
Apply and promote SRE practices across the platform, including:
- Clear ownership and runbooks for platform components
- Participation in the on-call rotation as a platform reliability escalation point
- Incident response, post-incident reviews, and problem management
Improve platform operability by:
- Simplifying day-2 operations
- Standardizing upgrade and rollback strategies
- Reducing Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR)
Ensure platform operations align with security, compliance, and internal control requirements.
This is a hybrid position. Your hiring manager will confirm the expected number of in-office days.