The Role
As a Senior Site Reliability Engineer at Blitzy's Pune headquarters, you will be the backbone of our platform's reliability, scalability, and operational excellence. You'll work at the intersection of software engineering and infrastructure, ensuring our AI-powered development platform remains highly available and performant as we scale rapidly. This is a high-impact, hands-on role for an engineer who thrives in a fast-moving environment and takes deep ownership of the systems they build.
What Success Looks Like
- In 30 days: You have a deep understanding of Blitzy's infrastructure architecture, have identified key reliability risks, and are actively contributing to on-call rotations.
- In 90 days: You have shipped meaningful improvements to observability, incident response workflows, and deployment pipelines that measurably reduce MTTR and increase system uptime.
- In 6 months: You have driven at least one major reliability initiative from inception to production, established SLO/SLA frameworks for critical services, and are a trusted technical voice shaping our infrastructure roadmap.
Areas of Ownership
- Design, build, and operate scalable, fault-tolerant infrastructure across cloud environments (AWS, GCP, or Azure).
- Define and enforce SLOs, SLAs, and error budgets; lead blameless postmortems and drive systemic improvements.
- Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure.
- Own observability: design and maintain logging, metrics, tracing, and alerting stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
- Partner closely with software engineering teams to embed reliability practices into the development lifecycle.
- Drive capacity planning, performance benchmarking, and cost optimization across our infrastructure.
- Champion security best practices within the infrastructure and deployment layers.
Required Experience
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- Strong proficiency in at least one major cloud platform (AWS preferred); experience with Kubernetes and container orchestration at scale.
- Hands-on experience with infrastructure-as-code tools (Terraform, Pulumi, or equivalent).
- Proven track record of designing and maintaining high-availability distributed systems.
- Deep expertise in observability tooling, incident management, and on-call practices.
- Strong scripting and automation skills (Python, Go, Bash, or similar).
- Excellent communication skills, including the ability to collaborate across engineering teams and present technical findings to leadership.
What Makes You Stand Out
- Experience supporting AI/ML workloads or GPU-accelerated infrastructure.
- Prior experience in a high-growth startup environment where you wore multiple hats.
- Familiarity with eBPF, service mesh technologies (Istio, Linkerd), or advanced networking.
- Contributions to open-source SRE/DevOps tooling or communities.
- Experience building global, multi-region infrastructure with strict latency and