Platform Engineering & Operational Intelligence
Context & Mission
Believe’s Platform Engineering organization is building a resilient, observable, and scalable cloud platform serving engineering teams across the company.
Within Operational Intelligence (OI), our mission is to improve performance, safety, productivity, and operational maturity while preserving team autonomy through a Platform-as-a-Product approach.
We are looking for a Staff Resilience Engineer to elevate the platform’s reliability posture and proactively strengthen its behavior under failure. This role operates at the intersection of resilience engineering, incident leadership, observability strategy, and distributed systems architecture.
You will not only respond to critical incidents; you will design systems, practices, and experiments that ensure failures are anticipated, controlled, and continuously learned from.
What You Will Own
1. Incident Leadership & Systemic Improvement
Own the organization’s response to high-severity incidents.
Establish a clear, scalable incident management model across teams.
Turn incidents into structural platform improvements.
Improve reliability KPIs (MTTR, detection latency, recurrence).
2. Resilience & Failure Engineering
Identify systemic architectural risks and scalability limits.
Design and institutionalize proactive resilience practices.
Lead and scale chaos engineering and controlled failure experimentation.
Ensure systems behave predictably under stress and partial failure.
Drive the evolution of self-healing and failover capabilities.
3. Observability & Reliability Strategy
Define the platform-wide observability vision and standards.
Improve signal quality, detection speed, and SLO maturity.
Align telemetry architecture with performance and cost efficiency goals.
Standardize instrumentation and reliability practices across squads.
4. Platform Reliability Evolution
Influence the long-term reliability posture of the platform.
Embed operational excellence in platform capabilities.
Partner with engineering and product leadership on roadmap priorities.
Raise the reliability bar across the organization.
Staff-Level Expectations
At Staff level, you are expected to:
Influence architecture and operational strategy across multiple engineering groups.
Lead large-scale, multi-quarter initiatives with high autonomy.
Identify cross-team risks before they materialize.
Shape engineering culture around resilience and operational maturity.
Mentor engineers and elevate platform squads technically.
Drive alignment among diverse stakeholders.
Balance strategic thinking with hands-on technical execution.