We are seeking a talented Site Reliability Engineer (SRE) to join our team in Pune, India. As an SRE, you will play a crucial role in ensuring the reliability, scalability, and performance of our large-scale distributed systems. You will work closely with development teams to implement and maintain robust infrastructure solutions that support our growing business needs.
Role Overview
The Site Reliability Engineer (SRE) is responsible for designing, implementing, and maintaining scalable, reliable, and secure infrastructure and applications. This role blends software engineering with systems engineering to ensure high availability, performance, and observability across cloud-native environments.
Key Responsibilities
- Architect for Resilience: Design systems with redundancy, fault tolerance, and graceful degradation.
- Observability & Monitoring: Implement full-stack observability including monitoring, logging, tracing, and alerting.
- Automation First: Build workflows to automate deployments, incident response, and routine tasks.
- Incident Management: Enable blameless postmortems and continuous improvement.
- Release Planning: Collaborate with DevOps and engineering teams to manage lifecycle work items and release cycles.
- Global Collaboration: Work in a shared responsibility model, maintaining a 50–60% overlap with onshore teams for effective communication.
Required Skills & Experience
- Cloud Platforms: Azure (preferred), AWS (acceptable with an upskilling plan)
- Infrastructure as Code: Terraform, Helm, GitHub Actions
- Containerization & Orchestration: Docker, Kubernetes, Argo CD, Flux
- DevOps Tools: CI/CD pipelines, GitOps, REST APIs
- Programming: Bash, Python (moderate proficiency)
- Data Ecosystems: Azure Data Factory, Databricks, Fabric (optional but preferred)
Team Integration & Expectations
- Work closely with technical leads on support tasks and playbook development.
- Participate in onboarding and training programs outlined in internal documentation.
- Contribute to offshore delivery excellence and maintain high standards of reliability and performance.
Required Technical Skills
- Strong Azure Infrastructure and Networking skills.
- Strong Terraform IaC experience and skills.
- Bicep knowledge is a plus.
- Strong track record of troubleshooting complex issues on unfamiliar tech stacks.
- Strong GitHub Actions/Azure DevOps Pipelines experience.
- Moderate knowledge of operations and infrastructure patterns for data ecosystems.
- Data Factory, Databricks, Fabric knowledge a plus.
- Moderate knowledge of SRE practices, observability, and related maintenance work.
- Moderate Bash skills.
- Moderate Python skills.
- Moderate experience in CI/CD and Git operations for software releases.
- Moderate AKS/Helm/Kustomize skills.
- Flux/Argo/GitOps experience a plus.
- Moderate Docker operations knowledge.
- Moderate REST API knowledge.