Introduction
Claire Joster is currently recruiting on behalf of a leading client in the car rental sector, which aims to strengthen its internal team by hiring a Site Reliability Engineer (M/F).
Role
- Define Reliability: Design, implement, and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for our production services;
- Automation: Write code and scripts (e.g., Python, Go, Bash) to automate operational tasks, system provisioning, and incident remediation;
- Incident Response: Act as a key responder for production incidents. Participate in a 24/7 on-call rotation, lead troubleshooting efforts, and drive incidents to resolution;
- Blameless Post-mortems: Lead and participate in blameless post-incident reviews to identify root causes and implement lasting corrective actions;
- System Architecture: Partner with development teams to design, build, and deploy scalable, highly available, and fault-tolerant systems;
- Monitoring & Observability: Build and maintain comprehensive monitoring and logging solutions (e.g., Prometheus, Grafana, ELK Stack, Datadog) to proactively detect and diagnose issues;
- Capacity Planning: Monitor system performance and usage, forecast demand, and plan for future capacity needs;
- Reduce Toil: Identify and eliminate manual, repetitive operational work by building durable, automated solutions.
Requirements
- Minimum 5 years of experience in Site Reliability Engineering, software engineering, or large-scale systems administration;
- Strong experience with cloud platforms (AWS, Azure);
- Proficiency with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible, CloudFormation);
- Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions);
- Solid understanding of containerization technologies (Docker) and orchestration systems (Kubernetes);
- Experience with version control systems, particularly Git;
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack);
- A systematic, data-driven approach to problem-solving and troubleshooting;
- Experience with on-call rotations and incident management.
12/11/2025