Site Reliability Engineer

Job summary
Permanent contract
Salary: Not specified
Experience: > 3 years

The position

Job description

We are looking for a talented and motivated Site Reliability Engineer (SRE) to join our team of five SREs. As an SRE, you will play a crucial role in ensuring the reliability, availability, resilience, and performance of the platform. If you are a curious problem-solver with a solid understanding of cloud technologies, Linux, and Kubernetes, you’ll fit right in!

Your missions:

  • Design, build, and maintain a scalable and highly available cloud infrastructure to support our cybersecurity SaaS platform.

  • Implement and enhance monitoring, alerting, and incident response systems to proactively identify and resolve any performance or availability issues.

  • Optimize system performance and troubleshoot infrastructure bottlenecks to maintain high reliability and responsiveness.

  • Automate deployment, configuration, and management processes using industry-standard tools and frameworks.

  • Participate in designing and implementing disaster recovery strategies to ensure business continuity in the face of potential failures or disasters.

  • Conduct thorough root cause analysis for incidents and implement preventive measures to minimize the risk of recurrence.

  • Troubleshoot and resolve infrastructure issues promptly to minimize downtime.

  • Stay updated with the latest industry trends and emerging technologies related to cloud computing, Linux, Kubernetes, and other relevant areas.

📍 The position is based in Rennes or Paris, or can be fully remote.

Preferred experience

🤩 We are excited to meet you if:

  • You have an engineering degree with a cloud computing major (or equivalent).

  • You have at least 3 years of prior experience in SRE-focused roles responsible for supporting, scaling and ensuring the reliability of end-to-end infrastructures.

  • You have experience with observability and monitoring of systems and services, and defining KPIs to track their health.

  • You have experience with production troubleshooting, including distributed systems, code, storage, networking, operating systems (Linux) and databases.

  • You have moderate-to-advanced programming experience, preferably in a high-level language like Python.

  • You have experience participating in a 24x7 on-call rotation for a large-scale deployment, covering infrastructure management, deployment and observability.

  • You have knowledge of containerization technologies like Docker and container orchestration platforms like Kubernetes.

  • You are able to write automation scripts, perform system-level tasks, and develop tools to streamline infrastructure management processes.

  • You have a solid understanding of networking concepts, protocols, and best practices, including TCP/IP, DNS, load balancing, and firewall configuration.

  • You have experience with relational databases like PostgreSQL for managing data storage and retrieval, as well as knowledge of SQL queries, performance optimization, and replication mechanisms.

  • You are familiar with monitoring tools such as Prometheus and Grafana to monitor system performance, collect metrics, and analyze logs for troubleshooting and performance tuning.

  • You have knowledge of messaging platforms like Kafka for building scalable and fault-tolerant event-driven architectures, as well as an understanding of topics, partitions, producers, and consumers within a Kafka ecosystem.

We don’t expect you to know our whole stack from the start. We are looking for curious and passionate individuals who like to learn new things while having a positive impact on their colleagues.

👀 You are interested in this job but feel you haven’t ticked all the boxes? Don’t hesitate to apply, and tell us in the cover letter section why we absolutely must meet!

Recruitment process

📝 Here’s what’s in store for you if you apply:

  1. HR interview with Clémentine, Talent Acquisition (45 min)

  2. Skills-fit interview with Léo, Head of Infrastructure, and another member of the team (60 min)

  3. Interview with Georges, CTO (60 min)

Our process usually takes about 3 weeks, depending on availability, and may include reference calls.

The program: discussions rather than trick questions! These discussions will help you understand how the company works and what it stands for. But they are also (and above all) an opportunity for you to tell us about your career path and your expectations for your next job!
