We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems.
This is an SRE-heavy, infrastructure-first role focused on ensuring that AI systems running in production are:
Reliable
Observable
Scalable
Secure
Cost-efficient
Safe to deploy and operate
You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.
Key Responsibilities
1. Infrastructure Provisioning & Automation
Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)
Provision and maintain Kubernetes clusters and supporting services
Automate environment setup across development, staging, and production
Manage networking, IAM, secrets, storage, and compute scaling
Ensure high availability, resilience, and disaster recovery readiness
2. CI/CD & Deployment Engineering
Build and maintain CI/CD pipelines for AI services, agent frameworks, orchestrators, and model artifacts
Implement automated testing and reliability validation gates
Enable blue/green and canary deployments
Build safe rollback mechanisms for services and models
Integrate reliability and health checks into deployment workflows
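For illustration only, a canary validation gate of the kind described above might look like the minimal Python sketch below. The names (`Stats`, `canary_decision`) and the thresholds (`min_requests`, `max_ratio`, `abs_floor`) are hypothetical and not part of any existing tooling; a real gate would typically also consider latency and saturation signals.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request counters for one deployment track (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a track with no traffic as having a zero error rate.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Stats,
    canary: Stats,
    min_requests: int = 500,   # assumed minimum sample size before judging
    max_ratio: float = 1.5,    # canary may be at most 1.5x the baseline error rate
    abs_floor: float = 0.001,  # tolerance floor when the baseline is near zero
) -> str:
    """Return "wait", "promote", or "rollback" for a canary release."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to decide either way
    allowed = max(baseline.error_rate * max_ratio, abs_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    print(canary_decision(Stats(10_000, 20), Stats(1_000, 2)))
    print(canary_decision(Stats(10_000, 20), Stats(1_000, 10)))
```

Returning an explicit "wait" state keeps the pipeline from promoting or rolling back on too small a sample.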
3. Model & Agent Deployment Governance
Package, version, and deploy models into containerized environments
Manage model artifact storage and promotion across environments
Monitor model performance and detect degradation
Support retraining cycle integration and model refresh workflows
Ensure safe rollout and rollback of model versions
Implement monitoring for inference latency, throughput, and cost
4. Data Pipelines for Telemetry & Observability
Design and maintain data pipelines to ingest, clean, and process high-volume telemetry (logs, metrics, traces, events)
Enable structured telemetry for AI and orchestration workflows
Ensure reliable real-time and batch processing
Optimize pipeline scalability and performance
5. AIOps Platform Integration
Evaluate, deploy, and integrate AIOps platforms
Improve anomaly detection, correlation, and alert intelligence
Reduce alert noise and improve signal quality
Integrate AIOps outputs into operational workflows and incident management
6. Intelligent Incident Automation
Automate incident detection and remediation workflows
Build self-healing scripts and intelligent runbooks
Reduce MTTD and MTTR through automation
Integrate AI-driven root cause analysis insights into operational tooling
Prevent recurring incidents
7. Production Reliability & SRE Excellence
Define and manage SLIs, SLOs, and error budgets
Implement monitoring, dashboards, and alerting systems
Participate in on-call rotation
Lead incident triage and root cause analysis
Improve resilience, scaling, and failure handling
Implement circuit breakers, rate limits, and failover mechanisms
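As a rough illustration of the SLO and error-budget work listed above, the Python sketch below computes how much of a request-based error budget remains in a measurement window. The function name and the request-based budget definition are assumptions chosen for the example; teams may equally define budgets over time windows or latency targets.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window.

    slo_target is the success-rate objective, e.g. 0.999 for "99.9%".
    Returns 1.0 when nothing is spent, 0.0 when exactly exhausted,
    and a negative value when the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic means no budget consumed
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget remains")
```

A value like this can drive release policy, for example pausing risky deployments once the remaining budget falls below an agreed threshold.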
8. Security & Governance
Implement least-privilege access controls
Manage secrets and credential rotation
Enforce environment isolation
Ensure auditability and compliance for AI systems