Site Reliability Engineering Director
Posted 4 hours 3 minutes ago by Onum
Company
Onum is a data optimization and analytics company based in Madrid. We specialize in real-time data analysis to enable rapid decision-making regarding cybersecurity, network performance, and infrastructure management. Onum helps you optimize your data analytics costs by reducing data, avoiding vendor lock-in, and aligning the value of each dataset with actions taken.
About the Role
As the Director of Site Reliability Engineering, you will lead a small but high-impact team of SREs focused on ensuring the reliability, scalability, and efficiency of our infrastructure. This role combines strategic thinking with technical leadership, giving you the opportunity to shape our reliability practices while remaining close to day-to-day operations.
You will collaborate with Engineering, DevOps, and Product teams to embed reliability into everything we build, and drive continuous improvement across systems, processes, and automation. Your leadership will be critical in setting standards, prioritizing initiatives, and elevating our platform's resilience.
Responsibilities
Team Leadership & Development:
- Lead, mentor, and develop a team of 5 Site Reliability Engineers, fostering a culture of technical excellence, accountability, and collaboration.
- Set clear goals and expectations, conduct regular one-on-ones, and support career growth.
- Partner with recruiting to attract and hire top SRE talent.
Technical Strategy & Direction:
- Define the team's roadmap in alignment with company priorities, focusing on scalability, reliability, and automation.
- Lead key technical initiatives, including infrastructure modernization, observability, and incident response improvements.
- Establish and promote SRE best practices across teams.
Hands-on Technical Leadership:
- Lead by example by participating in technical discussions, incident resolution, and troubleshooting critical system issues.
- Provide guidance on best practices for system reliability, automation, and performance optimization.
- Support the team in designing and implementing reliable, scalable cloud infrastructure, ensuring smooth deployment pipelines and reducing manual toil.
Incident Management & Operational Excellence:
- Own the on-call process and incident response framework, ensuring effective resolution, communication, and postmortems.
- Continuously improve monitoring, alerting, and system health metrics to detect and respond to issues proactively.
- Reduce operational toil through automation and process optimization.
Cross-functional Collaboration:
- Work closely with Engineering, Product, DevOps, and Security teams to ensure reliability is embedded throughout the development lifecycle.
- Serve as a subject matter expert in reliability to influence technical and product direction.
Automation & Process Improvement:
- Identify opportunities for automation in daily operations, helping to improve deployment speed, incident response, and reliability of the platform.
- Ensure the team is leveraging infrastructure-as-code (e.g., Terraform) and other automation tools to reduce manual processes and increase scalability.
Operational Metrics & Monitoring:
- Work with your team to ensure systems are well-monitored and metrics are effectively captured using tools like Prometheus, Grafana, or Datadog.
- Track key performance indicators (KPIs) for system uptime, reliability, and team performance, identifying areas for continuous improvement.
Qualifications
- 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role, including at least 5 years of experience leading a small team or mentoring junior engineers.
- Strong understanding of cloud platforms (AWS, GCP, or Azure) and modern infrastructure practices (e.g., containerization with Docker/Kubernetes, CI/CD pipelines).
- Hands-on experience with infrastructure-as-code tools (Terraform, Ansible, etc.) and cloud automation.
- Proven ability to troubleshoot complex infrastructure issues, perform root cause analysis, and implement system improvements.
- Experience with monitoring and alerting systems like Prometheus, Grafana, Datadog, or equivalent.
- Excellent communication and collaboration skills, with the ability to work cross-functionally and explain technical concepts to non-technical stakeholders.
Our Values
Own it: We take full ownership from input to outcome, lead by doing and following through, and hold ourselves accountable by listening, learning, and stepping up when things go wrong.
No Mask: We speak clearly, directly, and respectfully, stay humble by learning from our mistakes, and build trust through radical clarity.
United: We collaborate fluidly across functions and teams, lift each other up by sharing the load, and genuinely enjoy working together.
Move Boldly: We stay curious, sharp, and technically bold, learning in motion, challenging the status quo with intention, and focusing our talent on solving the highest-impact problems.