Research Engineer, Infrastructure

Cognition · San Francisco

On-site

Orchestration · GPU · Throughput · PyTorch · Deep learning · Python · C++ · Latency

WE ARE AN APPLIED AI LAB BUILDING END-TO-END SOFTWARE AGENTS.

We're the makers of Devin, the first AI software engineer, and Windsurf, the AI-native IDE. Together, they represent our vision for collaborative AI teammates that enable engineers to focus on more interesting problems and empower teams to strive for more ambitious goals.

Our team is small and talent-dense. Among our founding team, we have world-class competitive programmers, former founders, and leaders from companies at the cutting edge of AI including Scale AI, Palantir, Cursor, Waymo, Tesla, Lunchclub, Modal, Google DeepMind, and Nuro.

Building Devin is just the first step; our hardest challenges still lie ahead. If you’re excited to solve some of the world’s biggest problems and build AI that can reason on real-world tasks, apply to join us.

ROLE MISSION

Research moves at the speed of the infrastructure underneath it. Every training run, evaluation loop, and experimental iteration depends on systems that are fast, reliable, and built to scale. This role exists to make sure nothing in the stack becomes the bottleneck that slows down the frontier.

You will own the core systems that researchers depend on daily: distributed training infrastructure, experiment orchestration, data pipelines, and the tooling that turns raw compute into usable research velocity.

This is not a support role. You will work directly alongside researchers, understand the science deeply enough to anticipate what they need next, and build systems that hold up under the pressure of training jobs running across thousands of GPUs. We don't distinguish between research and engineering; the best infrastructure engineers here are also the ones who understand why the research works.

WHAT YOU'LL ACCOMPLISH

- Distributed Training Infrastructure: Build and own the systems that run large-scale training jobs reliably across GPU clusters. This includes job launchers, checkpointing and recovery, fault tolerance, and the monitoring th
