Site Reliability Engineer

Posted 3 days 21 hours ago by Orgvue Limited

Permanent

Not Specified

Other

London, United Kingdom

Job Description

Orgvue is an organisational design and planning platform that empowers your business to transform its workforce by understanding the work people do and the skills they have. Our platform connects strategy to structure, providing clarity of vision, so you can build a more adaptable, better performing organisation that thrives in a constantly changing world of work.

The world's largest and best-known enterprises and consulting firms use Orgvue to visualise and model current and future states of the organisation and make faster, more informed decisions. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney.

Role: Principal Site Reliability Engineer

You will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure. You will collaborate across product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale.

This role combines hands-on technical skills with strategic vision, helping us build a world-class reliability culture and a robust engineering foundation for growth. We seek someone with technical expertise, excellent communication skills, and a collaborative spirit.

Responsibilities:

Define and enforce SLOs, SLIs, and error budgets across critical services
Develop and implement cloud infrastructure and tooling strategies
Enhance SRE practices across the organisation
Implement robust observability (metrics, logs, and traces) using our observability tools
Guide the team in building automated, self-healing systems
Own and evolve incident response processes, including on-call practices and post-mortem culture
Mentor engineers on reliability, operational readiness, and scalable infrastructure best practices
Drive Infrastructure as Code (IaC) initiatives using Terraform, Kubernetes, CloudFormation, and GitOps practices
Collaborate with security, DevOps, and software teams to ensure compliance and operational excellence
Evaluate and adopt tools and practices to improve platform performance and reliability

Desired Skills & Experience:

Experience leading SRE transformations
Hands-on expertise with Kubernetes (EKS preferred) in production
Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
Proficiency in Infrastructure as Code using Terraform and knowledge of GitOps workflows
Strong background in observability: metrics, visualization, logging, tracing
Understanding of CI/CD pipelines, deployment automation, and release strategies
Experience with incident management, disaster recovery, root cause analysis, and post-incident reviews

Additional Benefits: