Member of Technical Staff, AI Training Infrastructure

On-site Staff

Orchestration, PyTorch, Kubernetes, Docker, AWS, Azure, GCP, GPU, Deep learning, Latency, Throughput, CI/CD

THE ROLE:
As a Training Infrastructure Engineer, you'll design, build, and optimize the infrastructure that powers our large-scale model training operations. Your work will be essential to developing high-performance AI training infrastructure. You'll collaborate with AI researchers and engineers to create robust training pipelines, optimize distributed training workloads, and ensure reliable model development.

KEY RESPONSIBILITIES:
- Design and implement scalable infrastructure for large-scale model training workloads
- Develop and maintain distributed training pipelines for LLMs and multimodal models
- Optimize training performance across multiple GPUs, nodes, and data centers
- Implement monitoring, logging, and debugging tools for training operations
- Architect and maintain data storage solutions for large-scale training datasets
- Automate infrastructure provisioning, scaling, and orchestration for model training
- Collaborate with researchers to implement and optimize training methodologies
- Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
- Troubleshoot complex performance issues in distributed training environments

MINIMUM QUALIFICATIONS:
- Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
- 3+ years of experience with distributed systems and ML infrastructure
- Experience with PyTorch
- Proficiency in cloud platforms (AWS, GCP, Azure)
- Experience with containerization, orchestration (Kubernetes, Docker)
- Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)

PREFERRED QUALIFICATIONS:
- Master's or PhD in Computer Science or related field
- Experience training large language models or multimodal AI systems
- Experience with ML workflow orchestration tools
- Background in optimizing high-performance distributed computing systems
- Familiarity with ML DevOps practices
- Co
