The Role
As a Senior Site Reliability Engineer at Blitzy's Kendall Square headquarters, you'll be a foundational force behind the reliability, scalability, and operational excellence of our AI-powered software development platform. Sitting at the intersection of software engineering and infrastructure, you'll ensure that the systems enabling enterprise customers to autonomously build production-ready software remain performant, resilient, and always available. This is a high-ownership, high-impact role for an engineer who operates with urgency, thinks in systems, and takes pride in building infrastructure that doesn't break.
What Success Looks Like
- Blitzy's platform maintains industry-leading uptime — incidents are rare, and when they occur, they are resolved quickly with clear root cause analysis and lasting fixes.
- SLOs and error budgets are defined for every critical service and actively used to drive engineering decisions, not just tracked passively.
- Observability is a first-class capability — engineers across the company have the dashboards, traces, and alerts they need to understand system behavior without asking SRE.
- Deployment pipelines are fast, safe, and reliable — releases go out with confidence and rollbacks are automated when something goes wrong.
- Infrastructure is entirely codified — no manual provisioning, no configuration drift, every environment reproducible from source.
- Engineering teams are more productive because of your work — platform friction is low, developer tooling is sharp, and SRE is seen as an accelerant, not a gatekeeper.
- You are a trusted technical leader at HQ, influencing how Blitzy thinks about reliability as we scale our platform and our team.
Areas of Ownership
- Design, build, and operate highly available, fault-tolerant infrastructure across cloud environments supporting Blitzy's AI platform and enterprise customers.
- Define and own SLOs, SLAs, and error budgets for critical services; lead blameless postmortems and drive systemic improvements that prevent recurrence.
- Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure that empower engineers to ship with speed and safety.
- Own the full observability stack — logging, metrics, distributed tracing, and alerting (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
- Manage Kubernetes clusters and container infrastructure supporting AI agent workloads and production application services.
- Drive infrastructure-as-code practices using Terraform; ensure all provisioning is automated, auditable, and version-controlled.
- Partner with engineering teams at HQ to embed reliability and operational best practices early in the development lifecycle.
- Lead capacity planning, performance benchmarking, and cloud cost optimization as the platform scales.
Required Experience
- 5–8 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- Deep expertise in Kubernetes — cluster management, workload deployment, scaling strategies, and troubleshooting in production.
- Strong proficiency with at least one major cloud platform (AWS preferred); experience designing and operating distributed, high-availability systems.
- Hands-on Terraform experience for infrastructure-as-code provisioning and management.
- Proven ability to define and operationalize SLOs, SLAs, and incident response processes.
- Strong scripting and automation skills in Python, Go, or Bash.
- Experience designing and maintaining comprehensive observability systems across complex, multi-service environments.