We are seeking an exceptional Senior Lead who combines deep hands-on SysOps/HPC expertise with the strategic vision of a solution architect. This is a rare dual-track role: you operate at the intersection of elite technical execution and client-facing presales, designing and running mission-critical GPU, HPC, and Kubernetes platforms while simultaneously co-creating opportunities with our commercial teams.
This role requires both SysOps/HPC depth and DevOps breadth. You are expected to spend at least 60% of your time on implementation and technical execution.
Presales & Business Development
• Partner with sales and solution teams to identify and qualify new opportunities
• Lead or support technical presales activities: discovery workshops, RFP responses, architecture presentations
• Build and deliver proof-of-concepts (POCs) that demonstrate platform capabilities to prospective clients
• Prepare high-quality technical materials
• Act as a trusted technical advisor during client conversations, proposing solutions aligned to business goals
In-Account Delivery — SysOps & DevOps Execution
• Operate directly within client accounts as a senior SysOps/DevOps engineer
• Run, troubleshoot, and optimize production-grade Kubernetes clusters and GPU/HPC environments hands-on
• Own Linux system administration at a deep level: kernel tuning, storage, networking, performance profiling
• Implement and maintain IaC pipelines, GitOps workflows, and CI/CD systems
• Serve as the senior escalation point for complex operational incidents within accounts
Architecture & Solution Design
• Design end-to-end platform architectures spanning cloud, hybrid, and on-premises HPC environments
• Define workload isolation models, networking architectures, and storage strategies for multi-tenant platforms
• Recommend and validate technology choices aligned to client scale, budget, and team maturity
• Produce architecture decision records (ADRs), solution blueprints, and technical runbooks
1. Architecture & System Design
• Design production-grade multi-cluster Kubernetes platforms:
◦ RKE2, EKS (AWS), AKS (Azure) at enterprise scale
◦ GPU-aware clusters: NVIDIA H100 / A100 / B200 node pools
◦ Hybrid cloud + on-premises HPC infrastructure
• Define and document:
◦ Workload isolation: namespaces, MIG partitioning, multi-tenancy models
◦ Networking: BGP peering, Ingress controllers, service mesh (Istio / Cilium)
◦ Storage: Longhorn, Ceph, distributed and high-throughput file systems
2. Platform Engineering & GitOps Strategy
• Define and enforce platform standards across the delivery lifecycle
• GitOps tooling: ArgoCD, Fleet — declarative cluster management
• CI/CD pipelines: Azure DevOps, Jenkins — build, test, promote
• Infrastructure as Code: Terraform (modules, remote state, workspaces), Ansible
• Standardize cluster bootstrapping, app deployment lifecycle, environment promotion (Dev → QA → Prod)
3. AI / GPU Infrastructure Architecture (Priority Competency)
• Design and operate GPU compute platforms at scale:
◦ GPU Operator deployment and lifecycle management
◦ MIG (Multi-Instance GPU) partitioning for multi-tenant workloads
◦ Advanced scheduling: Run:ai, Kubernetes-native GPU scheduling (device plugins)
• Understand AI workload classes and their infrastructure implications:
◦ Distributed training workloads (data/model/pipeline parallelism)
◦ Inference pipelines — NVIDIA Triton Inference Server, TensorRT optimization
• Align infrastructure to the full AI stack:
◦ CUDA stack, cuDNN, NCCL collective communication libraries
◦ High-speed networking: InfiniBand (HDR/NDR), RoCE for RDMA
◦ GPUDirect RDMA / GPUDirect Storage for low-latency data paths
4. Observability & Reliability Engineering
• Define and implement full-stack observability:
◦ Metrics: Prometheus, Thanos (long-term retention, multi-cluster)
◦ Logs: Loki, Fluent Bit
◦ GPU telemetry: DCGM Exporter, NVIDIA Nsight Systems
• Build operational frameworks:
◦ SLO / SLA definitions and error budget tracking
◦ Alerting strategy — noise reduction, severity routing
◦ Incident response playbooks and on-call runbooks
5. Security & Multi-Tenancy Architecture
• Design zero-trust security postures for multi-tenant platforms
• Secret management: HashiCorp Vault, External Secrets Operator
• Identity and access: IAM, RBAC, SSO/OIDC integration
• Network isolation: NetworkPolicy, micro-segmentation, mTLS
• Secure GPU sharing: MIG isolation, vGPU licensing, tenant boundary enforcement
6. HPC, Data & Storage Architecture (Priority Competency)
• Understand high-performance storage for AI/HPC workloads:
◦ GPUDirect Storage — bypassing CPU for GPU-native I/O
◦ Distributed file systems: Weka (high-throughput NFS/S3), Ceph (scalable object/block)
◦ Storage tiering, caching strategies, and data lifecycle management
• Size and validate storage architectures against workload I/O profiles
7. Operational Leadership & Linux Systems
• Lead incident response and root cause analysis (RCA) for critical production issues
• Define upgrade strategies, change management procedures, and disaster recovery plans
• Write and maintain runbooks, operational playbooks, and knowledge base content
• Integrate organizational processes, compliance requirements, and security policies into operational frameworks
• Deep Linux expertise:
◦ Kernel tuning (CPU governor, NUMA, IRQ affinity, hugepages)
◦ Storage I/O scheduling, NVMe optimization
◦ Network stack tuning for RDMA / InfiniBand
◦ System performance profiling and bottleneck analysis
• You are comfortable running production systems
• You have stronger SysOps and HPC depth than DevOps breadth, and you embrace that identity
• You can shift fluidly between running a live incident, presenting an architecture to a CTO, and reviewing a POC demo environment
• You communicate technical complexity clearly — to engineers and to C-level stakeholders
• You understand why specific tooling choices matter (not just how to configure them) and can articulate trade-offs in presales conversations
• You are comfortable owning outcomes across both commercial (presales) and delivery (operations) dimensions
• You thrive in ambiguity and can scope both short POCs and long-horizon platform programs
Requirements
Required
• 10+ years in platform/infrastructure engineering, with at least 2 years in architect-l