
Member of Technical Staff

Fireworks AI · New York

On-site · Staff
Latency · Kubernetes · Docker · Ray · Kubeflow · AWS · Azure · GCP · Python · TypeScript · Go · C++ · Throughput

THE ROLE:

As a Training Infrastructure Engineer, you'll design, develop, and maintain large-scale backend and cloud-native infrastructure to support distributed machine learning training, inference, and data processing pipelines for our generative AI platform. You'll architect scalable, resilient backend infrastructure, lead technical design discussions, mentor engineers, and establish best practices for large-scale machine learning systems.

KEY RESPONSIBILITIES:

- Architect and build scalable, resilient backend infrastructure to support distributed training, inference, and data processing pipelines
- Lead technical design discussions, mentor engineers, and establish best practices for large-scale machine learning systems
- Design and implement core backend services with a focus on efficiency and low latency
- Drive infrastructure optimization initiatives for compute cost, storage lifecycle management, and network performance
- Collaborate with machine learning, DevOps, and product teams to translate research and product requirements into robust infrastructure solutions
- Evaluate and integrate cloud-native and open-source technologies such as Kubernetes, Ray, Kubeflow, and MLFlow to enhance platform reliability
- Own end-to-end systems from design to deployment, emphasizing reliability, fault tolerance, and operational excellence

MINIMUM QUALIFICATIONS:

- Bachelor's degree or equivalent in Computer Science or related field plus four (4) years of experience in software engineering or related role
- 4 years of experience designing, building, and optimizing large-scale backend infrastructure and distributed data systems (e.g., PostgreSQL, MySQL, DynamoDB, Apache Spark, Apache Flink, Apache Kafka) in cloud environments (AWS, GCP, Azure, or equivalent), including cloud-native platforms, core infrastructure components, and optimization techniques (caching, indexing, sharding, replication, transactions, ACID)
- 4 years of experience with major ser
