The primary goal of this role is to ensure high availability, scalability, and reliability of production systems by applying best practices in reliability and performance engineering. This position blends software development expertise with operational knowledge to automate processes, monitor infrastructure and applications, and respond swiftly to incidents.
As a Senior SRE, you will play a key role in designing and maintaining monitoring systems, conducting root cause analyses, and improving deployment processes. You will own the overall reliability strategy, drive operational efficiency, and foster collaboration between development and operations teams.
- Performance Monitoring & Tuning
Configure and optimize monitoring tools (e.g., Dynatrace, Zabbix, Prometheus, Grafana, ELK stack) to meet application team needs.
- Dashboard Development & Optimization
Build and enhance dashboards for resource usage and system performance analysis (Dynatrace, GCP, Grafana).
- Synthetic Monitoring
Implement external network monitoring using Dynatrace or custom platforms (Java, Python, .NET, JavaScript, TypeScript, Selenium, Playwright, CucumberJS).
- CI/CD Pipeline Management
Design and maintain CI/CD pipelines (GitLab CI, GCP).
- Architecture Consulting
Advise on software architecture with a focus on resilience, performance, and scalability.
- Software Quality Assurance
Analyze logs, traces, performance metrics, errors, and security using tools like Dynatrace and SonarQube.
- Infrastructure as Code (IaC)
Design, implement, and maintain infrastructure using Terraform and other IaC tools.
- Code Review & Testing
Review code across multiple languages (NodeJS, JavaScript, Python, Bash, Terraform, Java, Go).
- Cloud Migration Support
Assist in migrating services to Google Cloud Platform (GCP), including system assessment, configuration optimization, and IAM management.
- Secure Infrastructure Design
Build infrastructure aligned with security best practices and cloud IAM configurations.
- FinOps Operations
Configure GCP billing accounts, set budget alerts, and monitor cloud spending.
- Root Cause Analysis & Incident Resolution
Investigate production incidents, identify root causes, and implement long-term solutions.
- Cross-Team Collaboration
Facilitate effective communication and collaboration between development and operations teams.