We’re hiring a Lead Site Reliability Engineer (SRE) to revitalize and lead the SRE function at Welcome to the Jungle, a ~300-person scale-up. You’ll be stepping into a critical role: rather than starting from scratch, you’ll build on and evolve existing foundations.
As the first new hire in this transition, you’ll be a hands-on technical leader responsible for continuing to define and implement our SRE strategy, ensuring the reliability, performance, and scalability of our platform.
You’ll collaborate closely with engineering, product, and security teams to improve automation, observability, and infrastructure practices in a modern, cloud-native stack. The ideal candidate brings deep technical expertise, strong leadership, and the ability to both stabilize and grow a function in flux.
Key Responsibilities
Define the vision, standards, and roadmap for Site Reliability Engineering at Welcome to the Jungle.
Lead the design and implementation of scalable, secure infrastructure in AWS using IaC (Terraform, Terragrunt).
Champion GitOps and CI/CD best practices via ArgoCD and CircleCI.
Own the development and enforcement of service-level objectives (SLOs) and indicators (SLIs).
Drive observability across the stack using OpenTelemetry and Datadog to ensure proactive issue detection and resolution.
Establish disaster recovery strategies, high-availability design patterns, and cost-effective infrastructure choices.
Lead incident response processes, postmortems, and on-call rotation design.
Build and maintain operational documentation and automation to reduce manual toil.
Ensure robust alerting, logging, and telemetry across all environments.
Proactively identify and remove bottlenecks in the infrastructure and deployment workflows.
Improve platform performance and reliability through rigorous monitoring, testing, and system design.
Collaborate with development teams to ensure new services are production-ready and follow reliability best practices.
Partner with Security and DevOps to ensure infrastructure meets compliance and security standards.
Mentor developers and influence reliability-focused engineering culture across the company.
Lead internal knowledge sharing and help scale the SRE mindset organization-wide.
Act as a trusted advisor to engineering leadership on system reliability, scalability, and tooling.
You have at least 5 years of infrastructure/systems engineering experience and want to maintain a strong hands-on technical focus.
You’re comfortable:
Building and maintaining large-scale distributed systems.
Managing incident response in line with SLAs.
Implementing automation and self-healing systems.
Developing utility scripts and functions.
Working in both French and English, in a remote context.
It’s not required, but having experience with our tech stack (Elixir, React.js) is a significant advantage.
You have strong problem-solving skills and can troubleshoot complex systems issues.
You’re reliability-focused: passionate about building resilient systems, measuring and improving reliability through data-driven approaches, and establishing sustainable operational practices.
You demonstrate excellent communication skills and can effectively collaborate with various technical and non-technical stakeholders.
A deep dive into our stack:
Our main cloud provider is AWS;
We use Kubernetes as our container orchestrator;
Our Infrastructure-as-Code is managed with Terraform and Terragrunt;
We use ArgoCD and CircleCI as our integration and deployment tools;
We use OpenTelemetry and Datadog to monitor our platforms;
Our applications run on GNU/Linux systems, such as Debian.
And if you’re not an expert in all of these areas, you can still join us: we love sharing our knowledge.
An initial conversation with Fattoum, Talent Acquisition Manager
A take-home case, followed by a live expertise interview with the tech team
And finally, two competency interviews based on our company values