Anticipated Contract End Date/Length: August 28, 2026
Work Setup: Hybrid (must be eligible for BPSS clearance)
Our client in the Information Technology and Services industry is looking for a Site Reliability Engineer (SRE) to support and enhance a complex, multi-cloud Kubernetes platform environment. This role is focused on driving platform reliability, automation, observability, and security across AWS, Azure, and on-premises infrastructure.
The successful candidate will play a key role in improving uptime, reducing operational toil through GitOps and automation, strengthening platform security posture, and enabling scalable onboarding of new tenants and workloads. This is a hands-on engineering role operating within regulated environments and modern cloud-native architectures.
What you will do:
- Operate and enhance Kubernetes platforms across AWS, Azure, and on-premises environments.
- Lead incident response, problem management, and root cause analysis activities.
- Deliver cluster lifecycle management including upgrades, patching, node pool management, CNI and CSI configuration, ingress management, and Rancher operations.
- Own observability strategy including dashboards, alerting, monitoring, and definition of SLOs and SLIs.
- Implement GitOps practices using Fleet and reduce operational toil through automation and governance.
- Apply secure API gateway and Web Application Firewall (WAF) patterns.
- Design and support distributed systems including event brokers and asynchronous messaging architectures.
- Maintain platform security posture including CVE remediation, GRC controls, and security scanning pipelines.
- Provision and manage infrastructure using Terraform and Crossplane as orchestration layers.
- Implement and maintain CI/CD pipelines using Concourse, GitHub Actions, and Azure DevOps.
- Ensure the platform complies with PCI DSS and GDPR requirements.