Software Engineer, Frontier Clusters Infrastructure


The Frontier Clusters team at OpenAI builds, launches, and supports the largest supercomputers used for cutting-edge model training. We design data center systems, develop software for large-scale frontier model training, and ensure these hyperscale supercomputers are reliable and efficient.

About the Role

You will design, build, operate, and maintain large-scale compute clusters for AI research. Your responsibilities include developing software to manage these clusters, resource allocation, and automating lifecycle operations. We seek distributed systems engineers experienced with managing large compute environments or hyperscale infrastructure.

This role is based in San Francisco, CA, with a hybrid work model (3 days in-office) and relocation support.

In this role, you will:

- Develop and optimize high-performance cluster systems for AI workloads
- Ensure reliability, scalability, and efficiency of cluster infrastructure
- Collaborate with researchers and engineers to optimize resource use
- Implement security measures for cluster systems

You might thrive if you:

- Understand distributed systems and have experience with scalable, reliable compute clusters
- Have programming skills in Python, Go, or similar languages
- Have experience with public cloud platforms, especially Azure
- Are familiar with high-performance computing, GPU workloads, or AI/ML compute patterns
- Enjoy working proactively in a fast-paced environment

About OpenAI

OpenAI is dedicated to ensuring AI benefits all humanity, pushing AI capabilities safely, and deploying AI responsibly. We value diverse perspectives and are an equal opportunity employer. We consider applicants with arrest and conviction records and provide accommodations for disabilities. Join us in shaping the future of AI technology.
Location:
San Francisco, CA, United States
Salary:
$200,000 - $250,000
Category:
IT & Technology