This position is no longer available.

Site Reliability Engineer - SRE

Salary: Not specified
Fully remote


Job description

About us 👇🏼

At lempire, we're a passionate team on a mission to help individuals and businesses grow.

As a B2B SaaS company, valued at $150 million, we offer an array of products:

🔵 lemlist: our platform designed to assist people in personalising their outreach and securing more meetings with leads.

🟠 lemwarm: our premier deliverability tool, which helps users keep their emails out of the spam folder.

🔴 lemcal: our scheduling tool, complete with a personalised booking page, to arrange meetings 2X faster.

🟣 Taplio & Tweet Hunter: our tools aimed at building strong personal brands on LinkedIn & Twitter.

Plus, 2 more products launching in 2024.

At lempire, we also believe in fostering a sense of community. We achieve this by bringing like-minded people together through our online community and onsite events.

Additionally, we provide free online learning resources, all with the goal of helping our audiences grow their businesses exponentially faster.

Our mantra: Keep growing! We live by it each and every day.

✨ Job Description

We're looking for a Senior Site Reliability Engineer (SRE) to join our team. You will play a key role in ensuring the reliability, scalability, and performance of our infrastructure. You will be responsible for maintaining and improving the stability of our systems while working closely with cross-functional teams to implement best practices in monitoring, automation, security, and incident response. Your focus will be on building and maintaining robust systems to minimize downtime and ensure a seamless user experience. We need someone who is passionate, autonomous, and loves technical challenges.

Your main missions will be:

System Monitoring and Maintenance:

Improve and maintain monitoring tools to track system health and performance.

Analyze system performance data and work proactively to prevent potential issues.

Conduct system audits to identify areas for improvement and optimization.

Incident Response and Troubleshooting:

Respond to and resolve incidents promptly, ensuring minimal impact on services.

Investigate root causes of incidents and implement preventive measures.

Develop runbooks and documentation for common incident scenarios.

Automation and Tool Development:

Design and develop automation tools for system provisioning, configuration, and deployment.

Create and maintain scripts and tools to streamline repetitive tasks and enhance efficiency.

Collaborate with development teams to integrate automation into the software development lifecycle.

Scalability and Reliability Improvement:

Work on capacity planning and scalability improvements for the infrastructure.

Identify potential bottlenecks and areas for enhancement in the system architecture.

Implement strategies to ensure high availability and reliability of systems.

Collaboration and Cross-Functional Support:

Collaborate with the engineering team to ensure smooth deployments.

Provide technical guidance and support to different teams to maintain system reliability.

Provide support during incidents.

Tech Stack:

Our technology stack includes, but is not limited to:

Bare metal: Ubuntu at OVHCloud, Hetzner

Cloud platforms: Clever Cloud, OVHCloud, Cloudflare

Monitoring and logging tools: Prometheus, Grafana, Loki, homemade tools

Automation and configuration management: Ansible, Terraform

Programming languages: JavaScript, Python, Bash

Software: Caddy, Redis, MongoDB, Elasticsearch, Tailscale, S3/R2, Nexus, Jenkins, Sonar, Node.js, GitHub, Slack, Google Workspace…

Key results

Within 3 months, you will have:

Successfully familiarized yourself with the existing technology stack, gaining a comprehensive understanding of the infrastructure, tools, and systems in place.

Contributed to the improvement of system monitoring and logging, actively identifying areas for enhancement and implementing necessary changes to ensure better visibility and reliability.

Collaborated with the team to implement at least one significant automation enhancement using Ansible, Terraform, or other relevant tools, streamlining a key aspect of our infrastructure.

Actively participated in at least one incident response and resolution process, contributing insights and recommendations for preventing similar incidents in the future.

Demonstrated a proactive approach to identifying and addressing potential system weaknesses, presenting at least one proposal for improved system resilience or performance optimization.

Within 12 months, you will have:

Played a pivotal role in enhancing the overall reliability and performance of the systems, contributing to a measurable decrease in system downtime or incidents.

Implemented and refined scalable solutions for the infrastructure, adapting the systems to accommodate growth and increased demand.

Led or significantly contributed to a major project focused on improving the system's capacity planning and scaling, showcasing a clear impact on the efficiency of resource utilization.

Developed and implemented best practices for documentation, ensuring that it is comprehensive, updated, and serves as a valuable resource for the team.

Actively engaged in knowledge-sharing initiatives within the team, mentoring newer members and contributing to a collaborative and innovative work environment.

Proposed and implemented innovative solutions that have positively impacted the overall reliability, security, or efficiency of the systems, showing a proactive and forward-thinking approach to system improvements.

What’s in it for you?

An impactful role: Contribute significantly to the reliability and performance of critical systems, influencing the success and stability of our products and services.

Fast-paced environment: Be part of an environment that values agility and quick iteration, providing opportunities to implement solutions rapidly and see the direct impact of your work on our systems and services.

🎯 Preferred experience

Must have:
  • Previous Experience: 4+ years of experience in a Site Reliability Engineering or similar role.
  • System Administration Skills: Proficient in Ubuntu Linux system administration and troubleshooting.
  • Proficiency in Tools: Strong working knowledge and hands-on experience with Prometheus, Grafana, Loki, Ansible, Terraform, and other related tools within the technology stack mentioned.
  • Problem-Solving Skills: Strong analytical and problem-solving abilities with a proactive mindset toward identifying and addressing potential issues.
  • Cloud and Bare Metal Expertise: Experience in managing both bare metal and cloud-based infrastructure.
  • Team Collaboration: Excellent communication skills and the ability to collaborate effectively with cross-functional teams in a dynamic environment.
Nice to have:
  • Scripting and Programming: Proficiency in scripting languages such as Python, JavaScript, and Bash, with the ability to develop and maintain automation tools and scripts.

🎁 Additional Information

  • 💰 65k€ - 70k€ + bonuses
  • ⛺️ Where you live and work is totally up to you. We also have an office in Paris if you enjoy office life 🇫🇷
  • 📈 Profit sharing: When lempire wins, all team members share the profits
  • 🩺 Alan Blue: Comprehensive 100% premium medical coverage for you and your family
  • 🧠 Alan Mind: Premier mental health service
  • 🍽️ Swile Meal Tickets: Enjoy daily meal tickets to fuel productivity
  • 🚌 Navigo Card: Seamless commuting with a 100% covered Navigo card
  • 🏡 Full-remote Work Setup: A well-funded home office budget to ensure seamless remote work
  • 💻 Gear: Get a laptop + tools and equipment you need for your job
  • ✈️ Team building: We all meet once per year at really cool places around the world (you can check our video here ;) )

⚙️ Recruitment process

  • Chat with Isabella, our Talent Acquisition Manager
  • Chat with Alban, our SRE Lead
  • Technical Challenge
  • Chat with Mickael, our CTO
  • Chat with Charles, our COO

Wondering what it's like to work with us? Peek into our world here 👉🏻
