
Senior ML Systems Engineer, Frameworks & Tooling

Cohere · London

Remote · Senior
vLLM · Orchestration · TensorRT · Throughput · PyTorch · Kubernetes · Docker · Ray · Deep learning · GPU · Python · Latency · Qdrant · Weaviate

WHO ARE WE?

Cohere is the leading security-first enterprise AI company. We build cutting-edge foundation AI models and end-to-end products that are designed to solve real-world business problems. We’re training and deploying frontier models for enterprises who are building AI systems. We believe that our work is instrumental to the widespread adoption of AI and we are looking for folks that want to be part of that.

We obsess over what we build. Each one of us is responsible for contributing to increasing the capabilities of our models and the value they drive for our customers. Cohere is a team of researchers, engineers, designers, and more, who are all passionate about their craft. We are a global technology company co-headquartered in Toronto and San Francisco, with key offices in London, New York City, Montreal, Seoul, Germany and Paris. Join us!

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training, and build the tooling that connects research ideas to thousands of GPUs. If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

WHAT YOU’LL WORK ON

- Build and own the training framework responsible for large-scale LLM training.
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
- Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100).
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
- Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-per
