Staff Site Reliability Engineer, Compute

Crusoe is building the world’s favorite AI-first cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the climate. Our AI platform is recognized as the "gold standard" for reliability and performance. Our data centers are optimized for AI workloads and powered by clean, renewable energy.
Be part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team setting the pace for responsible, transformative cloud infrastructure.
About This Role: At Crusoe, we are building the most sustainable, AI-first cloud infrastructure. Our Compute-focused Site Reliability Engineers are essential to this mission. This role focuses on supporting virtualization, hypervisor, and kernel-level performance for Crusoe’s compute infrastructure. You will deploy and optimize bare-metal and virtualized compute platforms, ensuring performance, security, and scalability for modern AI and HPC workloads.
What You'll Be Working On: You will develop automation and observability tools to monitor Crusoe’s compute infrastructure, from the kernel to orchestration layers. You will support and scale the virtualization stack, including technologies such as KVM and QEMU, and collaborate with Linux kernel and hardware teams to identify and resolve performance bottlenecks, driver issues, and hardware offload problems. A key focus is optimizing performance across CPU, GPU, and DPU/NIC resources for AI and HPC workloads. You will participate in root cause analysis for kernel crashes, hardware-software issues, and performance regressions, and work on hypervisor enhancements to improve VM reliability and workload isolation. The role involves tuning kernel subsystems such as process scheduling, NUMA, memory management, and interrupt handling. You will also work with platform teams on support for emerging hardware such as SmartNICs, BlueField devices, and TPUs.
What You’ll Bring to the Team:
8+ years of experience in Compute SRE, Linux system engineering, or related roles.
Strong knowledge of Linux kernel internals, including scheduler, memory, and driver subsystems.
Experience with virtualization technologies like KVM, Xen, QEMU, or VMware.
Familiarity with SmartNICs/DPUs (e.g., NVIDIA BlueField) and kernel bypass techniques.
Expertise in at least one programming language: Go, C, or Rust.
Experience with system-level debugging, including kdump, kexec, and kernel panic analysis.
Proficiency with Infrastructure as Code tools and CI/CD practices for bare-metal or cloud environments.
Deep understanding of compute scheduling, resource management, and high-throughput networking.
Benefits:
Hybrid work schedule
Competitive salary and Restricted Stock Units
Comprehensive health insurance including HDHP, PPO, vision, and dental
Employer HSA contributions
Paid Parental Leave, life insurance, disability benefits
Additional perks: Teladoc, 401(k) match, paid time off, cell reimbursement, tuition reimbursement, Calm subscription, legal services, commuter benefit
Compensation Range: Up to $250,000 per year plus bonus, with RSUs included. Compensation depends on experience, skills, and internal equity.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, or any other protected status.
Location:
San Francisco, CA, United States
Salary:
$250,000 +
Category:
Engineering