POSITION:
We’re seeking a mid-level DevOps Engineer, reporting to the Manager, Software Engineering, to design, automate, and scale cloud-native infrastructure and CI/CD pipelines. You’ll collaborate closely with product, data, and ML engineering teams to enable rapid, reliable delivery, including MLOps workflows for training, deploying, and monitoring machine learning models in production.
RESPONSIBILITIES:
Platform & Automation
- Design, implement, and maintain CI/CD pipelines using GitHub Actions, Jenkins, and ArgoCD for application and ML model delivery.
- Build Infrastructure as Code (IaC) with Terraform and Ansible across environments (dev/stage/prod).
- Containerize services with Docker and manage orchestration on Kubernetes (including Helm charts, Operators, and secrets management).
- Implement artifact versioning and release management for application and ML model artifacts.
Cloud & Networking
- Deploy and operate workloads on AWS (EC2, ECS/EKS, Lambda, S3, CloudFront, ELB) and integrate security controls (IAM, Cognito, Secrets Manager/Vault).
- Support multi-cloud patterns for data/ML pipelines, with basic exposure to GCP (BigQuery, Pub/Sub, Dataflow) and Azure (Data Factory, Synapse Analytics).
- Optimize networking, load balancing, caching (e.g., ElastiCache/Redis), and CDN configurations for performance and cost-efficiency.
Observability & Reliability
- Implement end-to-end monitoring, logging, and tracing with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.
- Establish SLOs/SLIs, alerting, and incident workflows using PagerDuty/Opsgenie; drive post-incident reviews and reliability improvements.
- Build observability for ML systems (data drift, model performance metrics, feature store health, pipeline latency).
Security & Compliance
- Enforce security-by-design: least-privilege IAM, vulnerability scanning, image signing, and automated certificate management (e.g., Let’s Encrypt).
- Implement secrets and keys management via Vault and AWS Secrets Manager.
- Support data governance & compliance (e.g., GDPR, HIPAA) alongside engineering and risk teams; contribute to audit-ready documentation.
Data/ML & MLOps Enablement
- Productionize ML workflows using MLflow (tracking, model registry) and Kubeflow (pipelines, serving).
- Support Generative AI integrations and RAG pipelines on Amazon Bedrock and model endpoints (e.g., Anthropic Claude).
- Operationalize ETL/ELT jobs and data pipelines with Airflow/dbt, PySpark, and streaming systems (Kafka, RabbitMQ).
- Partner with data scientists/ML engineers to standardize feature stores, model packaging, A/B testing, canary/blue-green deployments, and shadow mode releases.
- Set up model monitoring: accuracy/latency, data quality, concept drift, and automated rollback or retraining triggers.
Collaboration & Quality
- Work closely with product managers and cross-functional teams to deliver software solutions.
- Participate in agile development processes including design, implementation, and deployment.
- Write technical documentation and contribute to end-user guides.
REQUIREMENTS:
- 3–5 years of experience in DevOps, SRE, or platform engineering roles.
- Strong hands-on experience with Docker and Kubernetes (Helm, Operators, multi-namespace/multi-tenant setups).
- Proven experience building CI/CD pipelines with GitHub Actions, Jenkins, and ArgoCD.
- Proficiency with Terraform (modules, workspaces) and Ansible (playbooks, roles).
- Solid AWS experience: EC2, ECS/EKS, Lambda, S3, CloudFront, ELB, and CloudWatch/X-Ray.
- Experience with monitoring and observability using Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.
- Scripting proficiency in Python and/or Bash.
- Understanding of security best practices, secrets management (Vault/Secrets Manager), and compliance requirements (GDPR, HIPAA).
- Experience with MLflow/Kubeflow and ML deployment patterns (batch/real-time serving, GPU scheduling).
- Exposure to Amazon Bedrock, Anthropic Claude, and RAG architectures (vector stores, embedding pipelines).
- Familiarity with dbt, Airflow, PySpark, Kafka/RabbitMQ for data/ML pipelines.
- Knowledge of OpenSearch, ElastiCache/Redis, and PostgreSQL/MySQL/Snowflake.
- Experience with performance tuning and cost optimization (rightsizing, spot instances, autoscaling), plus FinOps awareness.