Site Reliability Engineer
Posted 6 days 21 hours ago by AI Tech Suite
Permanent
Not Specified
I.T. & Communications Jobs
London, United Kingdom
Job Description
Find the latest job opportunities in AI and tech.
RunPod offers GPU cloud computing for AI/ML, providing secure and community cloud options, on-demand and spot pods, and serverless GPU scaling.
The flexibility of remote work with an inclusive, collaborative team.
An opportunity to grow with a company that values innovation and user-centric design.
Generous vacation policy to ensure work-life harmony and well-being.
Contribute to a company with a global impact based in the US, Canada, and Europe.
Experience Requirements:
- 5+ years of experience in Site Reliability Engineering or a similar role
- 3+ years of experience in a technical leadership or management position
- Deep understanding of Linux systems, containerization, virtualization, and networking technologies
- Strong background in managing and monitoring large-scale distributed systems and bare-metal fleets
- Expertise in infrastructure-as-code and configuration management tools
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of innovation, continuous learning, and technical excellence
- Develop and implement strategic plans to enhance the reliability, scalability, and efficiency of our infrastructure
- Collaborate with cross-functional teams to align SRE initiatives with broader organizational goals
- Establish and maintain SLIs, SLOs, and SLAs for critical systems and services
- Drive the adoption of best practices in automation, monitoring, and incident response
Fireworks AI offers a fast and efficient platform for building and deploying generative AI applications with a focus on speed, value, and scalability.
Tyk AI Studio is an AI gateway and management solution that helps organizations harness AI's potential while ensuring governance, security, compliance, and control.
Experience Requirements:
- Proven experience in a senior SRE role or similar.
- Strong knowledge of cloud technologies and of SLA, SLO, and SLI management.
- Experience leading teams and implementing SCRUM processes.
- Excellent communication and leadership skills.
- Experience line managing, mentoring, and coaching.
- Collaborate with the Principal SRE to shape and implement the SRE strategic plan.
- Lead the SRE team in translating strategy into actionable plans, coordinating these through the SCRUM process.
- Address wellbeing and performance concerns, fostering a positive and productive team environment.
- Work with the Principal SRE and Scrum Master to analyze wellbeing survey outcomes and develop improvement plans.
Education Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent experience.
- 5+ years of experience building and managing infrastructure at scale, particularly on the edge.
- Proficiency in Python, Docker, Linux systems, and scripting (Bash, Python).
- Strong expertise with infrastructure automation tools (Terraform, Ansible).
- Experience managing observability and monitoring systems, particularly Prometheus.
- Deep understanding of networking concepts and protocols.
- Design, build, and maintain scalable and resilient infrastructure on the edge.
- Develop automation and infrastructure-as-code solutions using Terraform, Ansible, and scripting languages (Python, Bash).
- Deploy and manage containerized applications using Docker and related technologies.
- Ensure system observability by building and optimizing monitoring systems, particularly using Prometheus.
- Troubleshoot and optimize Linux-based systems (e.g., Red Hat, CentOS, Ubuntu).
Experience Requirements:
- Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go.
- Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty.
- Expert knowledge of deployment technologies such as Pulumi or Terraform.
- Expert knowledge of Kubernetes.
- Improving our observability by adding/adjusting metrics.
- Building easily parsable dashboards.
- Designing and overseeing our on-call rotations.
- Improving our deployment process to increase reliability.
Education Requirements:
- Bachelor's or Master's degree with a First or 2:1, preferably in a technical subject.
- Excellent problem-solving skills, including diagnosing issues within complex systems.
- Ability and desire to identify root causes of issues, and propose and implement structural improvements.
- Strong communication skills and the ability to perform well in urgent situations.
- Knowledge of the design and operation of web-based software applications, based on technologies such as node.js, PostgreSQL, or Elasticsearch.
- Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible, Prometheus.
Experience Requirements:
- 6+ years of experience.
- Scaling existing tools.
- Enhancing automation for scaling infrastructure.
- Playing a key role in diversifying and scaling the platform.
- Evaluating options to replace the existing real-time data pipeline.
- Providing platform support to engineering.
Education Requirements:
- BS in a field related to Computational Linguistics, Computer/Data Science.
- 2+ years of industry experience (desirable for Site Reliability Engineer role).
- Strong knowledge of Linux.
- Strong knowledge of AWS.
- Docker.
- Scripting languages (Bash, Python).
- Familiarity with load-testing tools.
- Must be a U.S. citizen capable of obtaining a Secret clearance (for Computational Linguist and Linguist roles).
- On-call first-level response.
- Respond to customer issue reports.
- Troubleshoot problems to maintain service SLAs.
- End-to-end monitoring across infrastructure and services for metrics/alerts/logs.
Education Requirements:
- B.S. in Computer Science or a related field.
- 1+ years of site reliability engineering experience.
- Familiarity with at least one cloud service provider, preferably AWS.
- Familiarity with basic SQL commands and Internet protocols.
- Proficient in cloud application orchestration tools like Kubernetes, Helm.
- Experience with monitoring stacks, preferably Datadog.
- Collaborate with engineering teams to define and maintain service SLAs.
- Monitor metrics, alerts, logs across infrastructure and applications.
- Create and maintain tools to monitor the platform.
- Respond to incidents, troubleshoot, investigate root causes.
- Conduct post-incident investigation and report.
Travel to exotic places around the world.
Ask Sage is a versatile, secure Generative AI platform for government and commercial use, offering significant productivity improvements and LLM-agnostic support.
Experience Requirements:
- 3+ years in site reliability engineering, Kubernetes administration, or related role.
- Deep expertise in Kubernetes and containers.
- Strong understanding of cloud infrastructure, automation tools, and best practices for high availability and performance.
- Monitor system performance and reliability.
Experience Requirements:
- 4+ years of software development experience at a venture-backed startup or top technology firm.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
- Strong expertise in managing CI/CD pipelines and deployment automation.
- Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
- Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.