Senior Site Reliability Engineer

Salary: Not specified
No remote work




The position

Job description

Meet Gorgias, the customer service platform designed for ecommerce merchants and built to provide an amazing experience to shoppers at scale on Shopify, BigCommerce, and Magento. Our product empowers merchants to manage all their customer service in one place over email, live chat, voice, Facebook, Instagram, and SMS.

Everything we do is for our customers, and we’re currently serving over 12,000 ecommerce merchants, including Steve Madden, Timbuk2, Decathlon, and Sports Illustrated. They love us for our innovative product, our focus on their ecommerce needs, and, of course, our lightning-fast customer service response time.

We raised $25 million in our Series B round in December 2020 and $30 million in our Series C round in 2022. We more than doubled in size in every meaningful way: annual recurring revenue, the size of our customer base, and the size of our Gorgias team, for starters.

We’re still growing fast and looking for new teammates who want to grow with us.

About The SRE Team

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our team. As an SRE at Gorgias, you will play a crucial role in ensuring the reliability, scalability, and performance of our systems, enabling the seamless delivery of our products and services.

The SRE team at Gorgias maintains the core infrastructure and services that make up the heart of our product. We have the privilege of working with high-throughput systems and TB-scale data stores serving billions of queries per day, most with sub-millisecond response times.

We also design and maintain the software delivery stack, offering features such as metrics-based canary rollout strategies to all internal development teams.

We currently have a team of 4 Senior and Staff SREs operating together globally, with the aim of growing to 6 in the near term. We focus on scalable methods to provide the largest impact across the organization.

Some achievements we’re proud of:

  • Partitioned multi-TB tables in Postgres to reduce vacuum time by 5x

  • For the partitioning, we studied the problem and the partitioning strategy, analyzed all queries to avoid bad surprises, and used Debezium and Kafka to do a live copy, completing the migration with a maintenance window of less than 20 minutes and no data loss

  • Split the PostgreSQL connection proxy into multiple pools to guarantee quotas per service of our product, so that sub-systems that hit the database heavily are contained and do not create a large incident blast radius

  • For the connection proxying, we went deeper into the backend to propose solutions, coded part of the fix there, provided the migration path, and helped teams move to the new methodology, ultimately eliminating incidents caused by DB connection starvation

  • Worked with all product-engineering teams to achieve SOC 2 certification, ran a HackerOne program, refactored our whole incident management with Rootly for better visibility and resolution time, and improved our overall security posture

  • To keep the lights on, the team is constantly upgrading our self-hosted Postgres and RabbitMQ, alongside other critical infrastructure components, with minimal downtime and high accuracy

What You Will Do:

  • Manage multi-TB PostgreSQL clusters in the public cloud, and optimize parameters, storage settings, and data structures

  • Operate RabbitMQ and Redis with tens of thousands of operations per second

  • Manage 10+ full-featured GKE clusters worldwide, with 10k+ tenants

  • Adopt a new stack: Kafka, Debezium, and Apache Flink

  • Facilitate rollout strategies at scale with GitLab CI and ArgoCD

  • Roll out best practices around Kubernetes/Helm/Operators, SLIs/SLOs, Incident Management, Observability, Security, and Disaster Recovery to all Product-Engineering teams and drive adoption by them

  • Automate complex infrastructure pieces for our worldwide footprint, using IaC best practices with Terraform and strong scripting in Python/Golang

What You Should Have:

  • Experience with cloud-native web systems at scale

  • Bachelor's degree in Computer Science or equivalent work experience.

  • 5+ years of experience as a Site Reliability Engineer or in a similar role, with a focus on maintaining high-performance, scalable, and reliable high-throughput web systems.

  • Proficiency in using Kubernetes for container orchestration and management.

  • 5+ years of experience with Cloud Providers (AWS, GCP) and a deep understanding of cloud services and architectures.

  • Proficiency in scripting and programming languages such as Python, Bash, Go, or Node.js.

  • Comfortable and confident in Linux systems and the command line.

  • Solid understanding of infrastructure as code (IaC) principles and experience with tools like Terraform.

  • Experience with continuous integration and deployment (CI/CD) pipelines.

  • Excellent problem-solving and troubleshooting skills.

  • Strong communication and collaboration skills with the ability to work effectively in a team environment.

Bonus Points If You Have:

  • Certification in Kubernetes (e.g., Certified Kubernetes Administrator - CKA).

  • Certification in a Cloud Provider platform (e.g., AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect).

  • Experience in managing and optimizing PostgreSQL databases.

Perks and Benefits

  • 🏖️ 5 weeks of vacation plus 2 weeks of RTT (we follow each country's applicable PTO laws)

  • 🤕 Paid sick leave

  • 🧸 Paid parental leave (16 weeks)

  • 💻 MacBook Pro

  • 🍽️ Personal credit card to buy lunches (we use Swile)

  • 🏥 We provide private health insurance (we use Alan)

  • 💆🏻‍♀️ Get up to €700 to set up your workstation at home (working from home should feel breezy)

  • 📚 Get up to €2000 of learning material and wellness support per year! This includes €1500 for learning material (such as books, courses, and individual coaching sessions) directly linked to your job scope, as well as a €500 wellness budget. Take advantage of these resources to grow in your role and prioritize your personal development and wellness.

  • 🥰 Every quarter, we organize an online company-wide summit to discuss where we’re going and strengthen social bonds. Once per year we organize offsite team retreats and company retreats!

Why join us?

🚀 We're among the fastest-growing startups in the eCommerce ecosystem

🦄 We've built an extremely efficient go-to-market engine

🥇 Work with a talented team you'll learn a lot from

🙏 Join a company where automation and good & clean data are core beliefs shared by all

🎥 Here is an interview with one of our team members about their experience at our most recent company retreat in Cancun!

More cool things to know about Gorgias... 😁

Gorgias ensures equal employment opportunity without discrimination or harassment based on race, color, religion, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity or expression, age, disability, national origin, marital or domestic/civil partnership status, genetic information, citizenship status, veteran status, or any other characteristic protected by law.

Gorgias is committed to the full inclusion of all qualified individuals and will take steps to ensure that people with disabilities are provided reasonable accommodations. Accordingly, if reasonable accommodation is required to fully participate in the job application or interview process, to perform the essential functions of the position, and/or to receive all other benefits and privileges of employment, please contact
