Staff Site Reliability Engineer, Storage, San Francisco, CA, United States

Crusoe is building the world’s favorite AI-first cloud infrastructure company. We’re pioneering purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Our mission is to align the future of computing with the future of the climate, with a platform recognized for reliability and performance, powered by clean energy.

About This Role

Our Site Reliability Engineering (SRE) team maintains the performance and reliability of our AI-optimized cloud infrastructure. The Storage-focused SRE ensures the availability, performance, and scalability of Crusoe’s cloud storage products, supporting AI and HPC workloads. You will build and optimize distributed, fault-tolerant storage systems at scale to support our sustainable cloud platform.

Responsibilities

- Develop automation and self-healing tools for our distributed storage infrastructure, including block, file, and object storage systems.
- Drive reliability initiatives around data replication, encryption, backup, restore strategies, and failover mechanisms.
- Collaborate with storage engineers to implement high-performance NVMe and SSD-backed volumes supporting large-scale AI compute clusters.
- Support user-facing storage services, focusing on availability, performance, and error budget adherence.
- Investigate and resolve storage incidents using telemetry, logs, and profiling; diagnose low-level I/O issues with hardware and kernel teams.
- Contribute to designing fault-tolerant, scalable storage architectures for AI cloud environments.

Qualifications

- 8+ years of experience in Storage SRE, systems, or storage engineering.
- Hands-on experience with distributed storage systems like Ceph, GlusterFS, OpenEBS.
- Proficiency in programming languages such as Go, Python, Java, or C.
- Experience with Infrastructure as Code tools like Terraform, Ansible, or Puppet.
- Deep knowledge of Linux internals, especially I/O, memory management, and storage scheduling.
- Familiarity with storage protocols such as NFS, SMB, iSCSI, NVMe-oF.
- Experience with container orchestration platforms like Kubernetes and Docker.
- Strong troubleshooting, incident response, and documentation skills.
- Experience managing storage services on cloud platforms (AWS, GCP, Azure).
- Excellent communication skills and ability to pass background checks.

Benefits

- Hybrid work schedule
- Competitive salary and Restricted Stock Units
- Health insurance, HSA contributions, paid parental leave, life insurance, disability coverage
- Additional perks: Teladoc, 401(k) match, paid time off, cell reimbursement, tuition reimbursement, wellness subscriptions, legal services, commuter benefits

Compensation

Up to $250,000/year plus bonus and RSUs, based on experience and internal equity.

Additional Information

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to legally protected statuses.

Job Details

- Seniority level: Mid-Senior level
- Employment type: Full-time
- Job function: Engineering and IT

Location: San Francisco, CA, United States
Salary: $250,000+
Job Type: Full-time
Category: IT & Technology