Site Reliability Engineer (M/F) – Lisboa

Introduction

Claire Joster is recruiting a Site Reliability Engineer (M/F) for a leading client in the car rental sector that wants to strengthen its internal team.

Role

  • Define Reliability: Design, implement, and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for our production services;
  • Automation: Write code and scripts (e.g., Python, Go, Bash) to automate operational tasks, system provisioning, and incident remediation;
  • Incident Response: Act as a key responder for production incidents. Participate in a 24/7 on-call rotation, lead troubleshooting efforts, and drive incidents to resolution;
  • Blameless Post-mortems: Lead and participate in blameless post-incident reviews to identify root causes and implement lasting corrective actions;
  • System Architecture: Partner with development teams to design, build, and deploy scalable, highly available, and fault-tolerant systems;
  • Monitoring & Observability: Build and maintain comprehensive monitoring and logging solutions (e.g., Prometheus, Grafana, ELK Stack, Datadog) to proactively detect and diagnose issues;
  • Capacity Planning: Monitor system performance and usage, forecast demand, and plan for future capacity needs;
  • Reduce Toil: Identify and eliminate manual, repetitive operational work by building durable, automated solutions.

Requirements

  • Minimum 5 years of experience in Site Reliability Engineering, software engineering, or large-scale systems administration;
  • Strong experience with cloud platforms (e.g., AWS, Azure);
  • Proficiency with Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible, CloudFormation);
  • Hands-on experience with CI/CD tools (e.g., Jenkins, GitLab CI, GitHub Actions);
  • Solid understanding of containerization technologies (Docker) and orchestration systems (Kubernetes);
  • Experience with version control systems, particularly Git;
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack);
  • A systematic, data-driven approach to problem-solving and troubleshooting;
  • Experience with on-call rotations and incident management.
12/11/2025