Site Reliability Engineer, San Francisco, CA, United States

Primer helps B2B products break out of the B2C-centric marketing box. Our platform turns consumer ad channels, data streams, and emerging AI workflows into measurable growth engines for go-to-market teams. We ingest billions of rows from first- and third-party sources, map them to rich company context, and surface hyper-targeted audiences and real-time performance alerts, all without vendor lock-in.

That only works if the lights stay on, queries stay fast, and incidents stay rare. That’s where you come in.

As our first dedicated Site Reliability Engineer, you’ll be the force multiplier who designs, builds, and operates the infrastructure that powers everything: petabyte-scale data pipelines, LLM-backed services, and the APIs our customers (and engineers!) rely on every day. You’ll pair hard-won ops experience with a mentor’s mindset, levelling up the whole team while keeping us four steps ahead of failure.

YOUR MISSION

- Own reliability from design to customer.
- Define and uphold SLOs / SLIs, manage error budgets, and lead blameless post-mortems.
- Automate toil out of existence: CI/CD, infra-as-code, capacity planning, and chaos testing.
- Drive incident response end-to-end: detection, mitigation, root-cause analysis, and long-term fixes.
- Scale multi-cloud data pipelines (Prefect, ClickHouse, Iceberg) and GPU/LLM workloads.
- Teach best practices, review designs, and coach engineers so reliability becomes a team sport.

WHAT YOU’LL DO

- Design, implement, and tune distributed systems that handle high-throughput B2B traffic.
- Harden our AWS stack with IaC (e.g. Terraform).
- Instrument everything: logs, traces, metrics, and AI-powered anomaly detection.
- Champion security, cost optimization, and disaster-recovery strategies.
- Jump into the weeds when something breaks, fix it fast, then automate it away.

WHAT YOU’LL BRING

Must-Haves

- 5+ years owning production systems at meaningful scale (sub-second latency, “four-nines” targets).
- Mastery of SRE fundamentals: SLO/SLI design, error budgets, incident playbooks.
- Deep hands-on experience with Linux, networking, containers/K8s, and at least one major cloud (AWS/GCP/Azure).
- Proven track record automating infra with Terraform, Helm, or similar IaC tooling.
- Fluency in at least one systems / scripting language (Go, Python, Rust, etc.).
- Experience operating complex data pipelines (Prefect, Airflow, Temporal) or real-time streaming systems.
- History of mentoring engineers and embedding reliability culture across teams.
- Pragmatic decision-making that balances uptime, velocity, and cost for startup reality.
- Curiosity for AI-augmented ops (LLM chat-ops, anomaly detection, self-healing).

Nice-to-Haves

- Experience managing GPU clusters and ML inference workloads.
- Experience operating data lakes / lakehouses at scale (Iceberg, Delta, etc.).
- Meaningful open-source contributions in SRE, DevOps, or data-infra projects.

WHY PRIMER

- Mission with impact – We’re unlocking new growth channels for thousands of B2B marketers.
- High-trust, low-ego culture – Fully distributed team, meeting-light weeks, Friday focus days.
- Work & life, balanced – Five weeks PTO, generous parental leave, and flexibility for families.
- Career rocket-fuel – Small team, huge problems, real ownership. Shape the future with bold innovators, driving impact that redefines industries.
- Diverse & global – Teammates span six countries, and counting.

INTERVIEW PROCESS

1. Intro Call with Engineering Manager – 30 min
2. System Design – 60 min
3. Operational Excellence Drill-down – 60 min
4. Strategic Pragmatism Chat with CTO – 45 min
5. Technical Coding/Systems Deep Dive – 30 min
6. Culture & Values with CEO – 45 min

Decision typically within 24-48 hrs of final conversation.

READY TO LEVEL UP B2B MARKETING INFRASTRUCTURE?

Email careers@sayprimer.com with your résumé, LinkedIn, GitHub, or anything that showcases your reliability superpowers. Let’s build the future, without the fire-drills.

Location: San Francisco, CA, United States
Salary: $250,000+
Job Type: Full-time
Category: Engineering