Preference Model is building the next generation of training data to power the future of AI. Today's models are powerful but fail to reach their potential across diverse use cases because so many of the tasks that we want to use these models are out of distribution. Preference Model creates RL environments where models encounter research and engineering problems, iterate, and learn from realistic feedback loops.
Our founding team has previous experience on Anthropic’s data team building data infrastructure, tokenizers, and datasets behind Claude. We are partnering with leading AI labs to push AI closer to achieving its transformative potential. We are backed by a16z.
Every RL environment we ship needs to survive a model that is actively trying to game it. A task with a weak grader or an exploitable reward signal is worse than no task at all: it teaches the model to hack rather than reason. We need someone whose full-time job is finding those holes before the model does.
We've learned that domain knowledge alone doesn't make a good reviewer. The people who are best at this have spent time thinking adversarially: designing problems that are hard to game, breaking other people's problems, or researching reward hacking directly.
Review RL environments and training tasks for correctness, robustness, and resistance to reward hacking
Identify ways a model could exploit graders, game evaluation criteria, or shortcut past the intended reasoning
Work directly with environment authors to tighten graders, fix reward signals, and redesign tasks that don't hold up
Develop and maintain review standards and checklists as we scale from hundreds to thousands of tasks per month
Advise on grader design during environment planning, before tasks are built, not after
You think like an attacker. You've spent real time designing problems that are hard to game, or breaking problems other people thought were solid. You have enough ML knowledge to understand what a model might try, and enough engineering sense to evaluate whether a grader actually tests what it says it tests.
Track record of adversarial or constructive problem design: competitive programming problem authoring (ICPC, Codeforces, etc.), CTF challenge design, or similar
Familiarity with RL, reward hacking, and specification gaming (you've read Amodei et al., Krakovna's list, or similar work, and you've thought about it beyond surface level)
Strong Python reading skills
Ability to articulate clearly in writing why a task is broken and what needs to change
Published research on reward hacking, specification gaming, RLHF robustness, or AI safety
Background in security engineering, penetration testing, or red-teaming (with enough ML context to apply that mindset to RL environments)
Experience authoring or reviewing problems for competitive programming contests
You've built automated evaluation systems and know where they break
You've worked on LLM evaluation, benchmarking, or alignment research
Send your resume and a short note (2-3 sentences is fine) about a time you broke something that was supposed to be robust, or designed a problem that was hard to game. Links to published problems, research, or writeups are more useful than a long cover letter.
Preference%20model
https://preference%20model.com