Software Engineer, Frontier Clusters Infrastructure
40 Days Old
Software Engineer, Frontier Clusters Infrastructure The Frontier Clusters team at OpenAI builds, launches, and supports the largest supercomputers used for cutting-edge model training.
We design data center systems, develop software for large-scale frontier model training, and ensure these hyperscale supercomputers are reliable and efficient.
About the Role
You will design, build, operate, and maintain large-scale compute clusters for AI research.
Your responsibilities include developing software to manage these clusters, resource allocation, and automating lifecycle operations.
We seek distributed systems engineers experienced with managing large compute environments or hyperscale infrastructure.
This role is based in San Francisco, CA, with a hybrid work model (3 days in-office) and relocation support.
In this role, you will:
Develop and optimize high-performance cluster systems for AI workloads
Ensure reliability, scalability, and efficiency of cluster infrastructure
Collaborate with researchers and engineers to optimize resource use
Implement security measures for cluster systems
You might thrive if you:
Understand distributed systems and have experience with scalable, reliable compute clusters
Have programming skills in Python, Go, or similar languages
Have experience with public cloud platforms, especially Azure
Are familiar with high-performance computing, GPU workloads, or AI/ML compute patterns
Enjoy working proactively in a fast-paced environment
About OpenAI
OpenAI is dedicated to ensuring AI benefits all humanity, pushing AI capabilities safely, and deploying AI responsibly. We value diverse perspectives and are an equal opportunity employer.
We consider applicants with arrest and conviction records and provide accommodations for disabilities. Join us in shaping the future of AI technology.
#J-18808-Ljbffr
- Location:
- San Francisco, CA, United States
- Salary:
- $200,000 - $250,000
- Category:
- IT & Technology