About Cantina:
Cantina Labs is a social AI company, developing a suite of advanced real-time models that push the boundaries of expression, personality, and realism. We bring characters to life, transforming how people tell stories, connect, and create. We build and power ecosystems. Cantina, our flagship social AI platform, is just the beginning.
If you're excited about the potential AI has to shape human creativity and social interactions, join us in building the future!
About the Role:
We're looking for an Applied ML Engineer with hands-on experience building large-scale video generation models: from data and training to distillation and acceleration into a fast, production-ready model. Our models are human-centric and product-oriented: think interactive characters that can respond to text/audio/image inputs and generate video with very low latency.
This is an applied research + engineering role: you'll work on training runs, data, model optimization, and the "make it fast" path that turns a capable research model into a real-time experience.
Typical time split (roughly):
60-75% training / fine-tuning / distillation of large video models
15-25% inference optimization (latency/memory/cost), model runtime work
10-15% prototyping + product integration (demos → shipped features)
What You'll Do:
Train and scale video generation models: run large-scale training/fine-tuning on multi-GPU (and when needed multi-node) setups; own the training loop, stability, checkpoints, and iteration speed.
Own data for video modeling: build and improve video datasets/pipelines (decode/sampling, filtering/quality, conditioning alignment, storage formats), and keep the pipeline fast and reliable at scale.
Distill and compress big models into fast ones: teacher-student distillation, step reduction, architectural simplifications, and quality/speed trade-offs to hit real-time constraints.
Make models run in real time: profiling, memory optimizations, quantization-aware tactics where appropriate, kernel/runtime improvements, and practical throughput/latency wins.
Build the bridge to product: package models into simple inference APIs and prototypes; collaborate with product to turn research progress into user-facing experiences (interactive characters, conversational video).
Evaluate what matters: set up evaluation harnesses that track perceptual quality + temporal consistency + identity/character fidelity + latency/cost.
What You'll Bring:
2+ years building and shipping ML systems (or equivalent), with clear ownership and delivery.
Strong PyTorch + Python, comfortable touching both training and inference code.
Hands-on experience training or scaling generative models, ideally video generation (diffusion/transformers/VAEs or similar), not just using pre-trained checkpoints.
Experience with distributed training and large runs (e.g., DDP/FSDP/DeepSpeed-style workflows), and the practical debugging that comes with them.
Proven ability to improve performance in practice: latency/memory/cost optimizations, profiling, and shipping measurable wins.
Product mindset: can move from research ideas → robust implementation → iterating against real constraints.
Bonus Points For:
Experience with multimodal conditioning: audio-to-video, text+audio+image control, lip-sync / gesture / character animation constraints.
Endâtoâend distillation experience (teacher/student design, eval strategy, failure analysis).
Familiarity with acceleration toolchains (torch.compile, Triton, TensorRT, ONNX, custom kernels) or model compression (quantization, pruning) where applicable.
Experience with real-time streaming / WebRTC prototypes or low-latency media delivery (helpful, but not the core of the role).
Technical Stack You'll Work With:
ML: PyTorch (training + inference)
Models: large video generation (diffusion/transformers/VAEs), multimodal conditioning
Optimization: distillation, inference acceleration, multi-GPU strategies
Product: rapid prototyping, lightweight inference APIs
Infra (supporting, not primary): Docker; cloud basics (AWS-like services)
Location:
This role can be performed remotely in Europe, within GMT +/- 2 hours.
Compensation:
The anticipated annual base salary range for this role is between €190,000 and €225,000, plus bonus. When determining compensation, a number of factors will be considered, including skills, experience, job scope, location, and competitive compensation market data.