AI Systems Engineer, Codex Agents

OpenAI · San Francisco

$230k–385k/yr · On-site

Orchestration · Observability · GPU · Latency · Python · Rust · LLM-as-judge · Eval harnesses · Throughput · Tool use · MCP · OpenAI API · Anthropic API · Docker · Kubernetes · Ray · CI/CD · AWS · GCP · Azure · Deep learning · NLP · Computer vision

AI Systems Engineer - Codex Core Agents

About The Team

The Codex Core Agents team builds the agent harness that turns model capability into real-world action. We own the systems around the model: prompting and interpreting model outputs, executing actions safely in real environments, and feeding production experience back into better models and better agent behavior.

This team sits close to research and works across the stack: harness, model interaction, inference, sandboxed execution, orchestration, evals, production reliability, and the performance envelope around tokens, latency, cost, capacity, and quality. The harness is open source and increasingly part of how models are trained and evaluated, making this one of the highest-leverage layers in Codex.

About The Role

We’re looking for engineers to build the AI systems that make Codex agents dependable in production. The ideal candidate is an agent-systems builder: hands-on across low-level systems and ML workflows, able to debug Codex behavior end to end across the harness, model behavior, inference/runtime stack, GPU fleet, and product surface.

You’ll work with research, infrastructure, and product to design agent harness capabilities, run experiments and ablations across the model + system prompt + harness stack, build frameworks for assessing production agent performance, and turn messy failures into durable improvements.

What You’ll Do

- Design and build the core agent harness and execution loop that lets Codex agents interpret model outputs, use tools, execute code, and complete long-horizon tasks safely.
- Build sandboxing, isolation, orchestration, state, and workflow infrastructure for agents operating in real development environments.
- Develop evaluation, experimentation, and debugging systems that distinguish harness issues, model behavior, inference/runtime issues, and product failures.
- Run ablations across prompts, model-facing interfaces, context construction, tool-use strategies, an
