About the role
The role focuses on the camera flow inside our clients' mobile apps. We don't do single-photo verification - our models run over the live camera stream, every frame, in real time, on the user's phone. What the user sees and feels while framing the shot is the problem you'd be working on.
The interesting problem: model output is probabilistic and noisy, the scene is different every time, and the user is operating in the real world - on foot, in the rain, gloves on, one hand free. Visual UI alone usually isn't enough. Haptics, symbolic cues (think game HUD), and visual feedback all need to play together, and at any given moment you have to decide what to surface, on which channel, in what order.
What you'll do
- Real-time feedback design. A scooter rider photographs a finished trip in the rain. The camera streams at ~30fps; the on-device model gives a confidence-shaped output per frame. The job is to turn that stream into something the rider can act on inside their visual loop - usually under 300ms.
- Parallel feedback channels. Visual, haptic, symbolic. Each has different bandwidth, different attentional cost, different latency. Mapping the right signal to the right channel - and prioritising across them when several things are wrong at once - is the headline challenge.
- Generalising across scenes. A courier at a doorstep, a rider at trip-end - same SDK, different scenes, different priorities. We want a feedback system that generalises rather than one hand-tuned per client.
- End-to-end ownership. You'd scope, prototype, ship, and measure one piece of work over the internship. A 1:1 mentor helps you scope it in week one and reviews throughout.
Technical challenges
Real-time stream verification
- The camera runs at ~30fps. Each frame is an input. The on-device model emits probabilistic output per frame. The feedback layer aggregates that stream and decides what to surface, when (real-time systems, signal processing).
- Sub-300ms end-to-end budget. Anything you display or vibrate has to land inside the user's visual loop, not after they've moved on (HCI, perception of latency).
- Multi-frame smoothing - confidence over the last N frames, thresholds for triggering feedback, asymmetric thresholds (different for "shot is good" vs "shot is bad") (signal processing, applied statistics). A sketch follows this list.
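To make the smoothing concrete, here's a minimal Kotlin sketch - illustrative only, not our SDK's API, and every name in it (ConfidenceSmoother, onFrame, the threshold values) is invented for the example. It assumes the model emits one confidence value in [0, 1] per frame, averages over a short window, and uses asymmetric thresholds (hysteresis) so the "good shot" state doesn't flicker with per-frame noise:

```kotlin
// Illustrative sketch only - not our SDK's API. Assumes one
// confidence value in [0.0, 1.0] arrives per camera frame.
class ConfidenceSmoother(
    private val windowSize: Int = 10,        // ~330ms of history at 30fps
    private val goodThreshold: Double = 0.8, // a shot must clearly earn "good"...
    private val badThreshold: Double = 0.5,  // ...and clearly lose it
) {
    private val window = ArrayDeque<Double>()

    var isGood = false
        private set

    /** Feed one frame's confidence; returns true if the state flipped. */
    fun onFrame(confidence: Double): Boolean {
        window.addLast(confidence)
        if (window.size > windowSize) window.removeFirst()

        val mean = window.average()
        val wasGood = isGood
        // The gap between the two thresholds absorbs frame-to-frame
        // noise, so feedback doesn't oscillate around a single cutoff.
        isGood = when {
            !isGood && mean >= goodThreshold -> true
            isGood && mean < badThreshold -> false
            else -> isGood
        }
        return isGood != wasGood
    }
}
```

At 30fps a 10-frame window spans roughly 330ms, and a moving average lags by about half its window - so even this trivial smoother eats half of the sub-300ms budget. That tension is the job.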
Parallel feedback channels and priority ordering
- Visual UI is one channel. Haptics and symbolic / game-HUD cues are others. Each has different bandwidth, different attentional cost, different latency to perception (multi-modal interfaces, HCI).
- Mapping signal to channel: haptics are good for "now, wrong" pulses; visual for fine-grained framing guidance. Why? When does that mapping break? (interaction design, sensory psychology adjacent).
- Priority ordering: if three things are wrong with the shot at once, which do you tell them first? Why? Does the answer change once they've corrected one? (information design, game UI / HUD design). One possible ordering is sketched below.
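One concrete way to frame it: treat each detected problem as an issue with a priority and a preferred channel, and surface at most one issue per channel at a time. A minimal Kotlin sketch - the issue names, priorities, and channel mapping are all invented for the example, not a spec:

```kotlin
// Illustrative sketch - every name, priority, and mapping here is
// an assumption for the example, not how our SDK defines them.
enum class Channel { HAPTIC, VISUAL, SYMBOLIC }

// Lower number = more urgent. Haptics carry the "now, wrong" pulse;
// the visual overlay carries fine-grained framing guidance.
enum class Issue(val priority: Int, val channel: Channel) {
    CAMERA_COVERED(0, Channel.HAPTIC),
    TOO_DARK(1, Channel.SYMBOLIC),
    SUBJECT_OUT_OF_FRAME(2, Channel.VISUAL),
    SUBJECT_TOO_CLOSE(3, Channel.VISUAL),
}

/**
 * Surface at most one issue per channel, most urgent first, so three
 * simultaneous problems never compete for the same slice of attention.
 */
fun plan(active: Set<Issue>): Map<Channel, Issue> =
    active.sortedBy { it.priority }
        .groupBy { it.channel }
        .mapValues { (_, issues) -> issues.first() }
```

Re-running the plan after every correction answers the last question above for free: once the rider fixes the most urgent issue, the next one surfaces on the following frame.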
Context-dependence across scenes
- Same SDK, different scenes: courier at a doorstep, rider at trip-end, driver photographing a damaged parcel. Different lighting, different priorities, different "what does a good shot look like" (HCI, generalisation).
- The model output is one signal. Time-of-day, IMU motion, ambient light, and the user's prior attempts in this session are others. How do you combine them into a single coherent feedback layer? (sensor fusion, applied ML).
- How do you build a system that generalises rather than one hand-tuned per client? (software architecture, declarative configuration). A sketch of the configuration idea follows below.
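One way into that generalisation: scene differences as data, not code. The Kotlin sketch below is illustrative - every field name, signal, and weight is an assumption for the example, and the weighted mean stands in for whatever fusion actually works:

```kotlin
// Illustrative only: fields, signals, and weights are invented for
// the example. The point is the shape - scene differences live in
// data, the feedback engine stays generic.
data class FrameSignals(
    val modelConfidence: Double, // per-frame output of the on-device model
    val stability: Double,       // IMU-derived steadiness, 0..1
    val ambientLight: Double,    // normalised light-sensor reading, 0..1
)

data class SceneConfig(
    val modelWeight: Double,
    val stabilityWeight: Double,
    val lightWeight: Double,
    val goodThreshold: Double,   // asymmetric thresholds, tuned per scene
    val badThreshold: Double,
)

/** One generic fusion step: a weighted mean of the available signals. */
fun score(s: FrameSignals, c: SceneConfig): Double =
    (c.modelWeight * s.modelConfidence +
        c.stabilityWeight * s.stability +
        c.lightWeight * s.ambientLight) /
        (c.modelWeight + c.stabilityWeight + c.lightWeight)

// A dusk doorstep leans on light; a trip-end shot leans on steadiness.
val doorstep = SceneConfig(
    modelWeight = 0.6, stabilityWeight = 0.1, lightWeight = 0.3,
    goodThreshold = 0.85, badThreshold = 0.55,
)
val tripEnd = doorstep.copy(stabilityWeight = 0.3, lightWeight = 0.1)
```

A new client scene should then mean a new config, not a new code path - which is what "generalises" has to cash out to in practice.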
Qualifications