Staff Python / PyTorch Developer — Frontend Inference Compiler – Dubai

On-site Staff

vLLM, Observability, GPU, TensorRT, Latency, Throughput, PyTorch, Deep learning, Python, C++, Computer vision, CI/CD, Docker, Kubernetes, Quantization

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. This architecture allows Cerebras to deliver industry-leading training and inference speeds; over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Cerebras works with the leading model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership (https://openai.com/index/cerebras-partnership/) with Cerebras, to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.

About the Role

The Inference ML Engineering team at Cerebras Systems is dedicated to enabling our fast generative inference solution through simple APIs powered by a distributed runtime that runs on large clusters of our own hardware. Our mission is to empower enterprises, developers, and researchers to unlock the full potential of our platform, leveraging its performance, scalability, and flexibility.

As a Senior Software Engineer on the Inference ML Engineering team, you will design and implement APIs, machine learning features, and tools that enable state-of-the-art generative AI models to run efficiently on our custom hardware. You will collaborate with cross-functional teams to build scalable, high-performance inference solutions while helping shape the evolution of our ML ecosystem.

Responsibilities

- Drive and provide technical guidance to a team of software engineers working on complex machine learning integration projects.
- Design and implement ML features (e.g., structured outputs, biased sampling, predicted outputs) that improve performance of generative AI models at inference time.
- Design and implement high-throughput, low-latency multimodal inference models that support delivery of image, audio, and video i
