Our reliability vision
Our product is built on complex infrastructure: AI systems, enterprise integrations, and large-scale data pipelines. For our customers, downtime and high latency are not an option. Reliability is not only about keeping the lights on; it’s about designing systems that scale predictably, self-heal, and meet enterprise-grade SLAs.
As our first Site Reliability Engineer, you will own the foundations that make Maki stable and scalable. You’ll set the standards for observability, incident response, and automation, ensuring that reliability becomes a core product feature.
What you will do
Build resilient infrastructure: Design and maintain scalable, fault-tolerant systems that support our AI-powered HR platform.
Improve observability: Implement monitoring, logging, and alerting systems to ensure proactive detection and resolution of issues.
Automate operations: Develop tools and processes to eliminate manual toil—making deployments, scaling, and incident response smoother and faster.
Ensure performance & SLAs: Define and track service-level objectives (SLOs), and make sure we meet the service-level agreements (SLAs) our enterprise customers expect.
Run incident response: Lead incident management processes; run blameless post-mortems and continuously improve reliability practices.
Collaborate with engineering: Work closely with product and AI engineers to embed reliability, scalability, and security into every feature.
Stay ahead of growth: Anticipate scaling needs as Maki grows internationally and across enterprise customers.
Who we’re looking for
Experience: 4+ years in site reliability engineering, DevOps, or infrastructure roles in SaaS or enterprise environments.
Technical skills: Proficiency in cloud environments (AWS, GCP, or Azure), containers and orchestration (Docker, Kubernetes), and infrastructure-as-code (Terraform, Pulumi, etc.).
Observability mindset: Experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, etc.).
Automation first: Strong scripting/programming skills (Python, Go, or similar) to automate workflows.
Mindset: Pragmatic, rigorous, and proactive—you care about reliability as much as innovation.
Collaboration: Clear communicator who can work with both engineers and leadership to prioritize reliability.
Bonus points if you:
Have scaled systems in a startup or hyper-growth environment.
Have worked with AI/ML infrastructure and understand its unique reliability challenges.
Have experience running security or compliance-related reliability audits.
Are passionate about incident culture—blameless post-mortems, chaos engineering, continuous improvement.
Mochi screen – 15 min
A call with Mochi, our AI recruiter, to confirm your eligibility and assess your skills in a structured way.
Intro call – 30 min
Intro with Ben (Cofounder & CPO). A first conversation to get to know you and share more context on Maki and the role.
Technical case – 90 min
With our engineering team. You’ll solve a reliability-focused challenge and walk us through your approach.
Founder interview – 30 min
Meet one of our founders to validate culture and values fit.
Final wrap-up
Offer call (and ideally, a celebration!).