THE ROLE
Join our platform-oriented Site Reliability Engineering (SRE) team, focused on enabling product engineering teams to run production services reliably at scale. Our mission is to improve reliability by defining and tracking reliability KPIs, standardizing observability tooling (Datadog, ELK, PagerDuty), and governing the incident lifecycle program. We provide the platform, standards, automation, runbooks, and coaching that make service teams successful. In this role, you will focus on scaling our operations and driving preventive practices (like PRRs and chaos experiments) to reduce MTTR.
WHAT YOU'LL DO
- Automate the measurement and reporting of SLOs, error budgets, and SLAs to provide clear visibility into service reliability and progress.
- Standardize and maintain our critical observability and incident tooling (e.g., Datadog, ELK, PagerDuty), ensuring platform stability and consistency across all service teams.
- Lead initiatives to improve the on-call experience, focusing on enhancing alert quality, reducing toil through automation, and refining incident response processes.
- Manage and harden critical platform infrastructure components (e.g., AWS EKS, AWS MSK, observability agents) to provide a stable, reliable foundation for all services.
- Collaborate with and coach product engineering teams on SRE principles, advocating for preventive reliability practices (e.g., Pre-Launch Reviews, chaos experiments) and best practices for change management and rollout strategies.
WHAT YOU BRING
- 4+ years of experience in software engineering, platform, or Site Reliability Engineering roles, with strong coding skills in a modern language (e.g., Python, Go, or Java).
- Demonstrated expertise in troubleshooting distributed, cloud-native systems (preferably AWS), including deep knowledge of networking, HTTP, and container technologies.
- Hands-on experience defining SLOs/error budgets, developing effective alert strategies, and leveraging automation to significantly reduce MTTA/MTTR.
- Solid understanding of infrastructure and delivery fundamentals, including containers/Kubernetes, CI/CD concepts, and Infrastructure as Code (IaC, e.g., Terraform/Helm).
- Excellent written and verbal communication skills, with the ability to influence and lead change across multiple engineering and product teams.
- We are primarily an in-office environment; therefore, you will be expected to work from the Prague, Czech Republic office in compliance with Pure’s policies unless you are on PTO, work travel, or other approved leave.