Site Reliability Engineer

Posted 6 days 21 hours ago by AI Tech Suite

Permanent

Not Specified

I.T. & Communications Jobs

London, United Kingdom

Job Description

Find the latest job opportunities in AI and tech.

RunPod offers GPU cloud computing for AI/ML, providing secure and community cloud options, on-demand and spot pods, and serverless GPU scaling.

The flexibility of remote work with an inclusive, collaborative team.

An opportunity to grow with a company that values innovation and user-centric design.

Generous vacation policy to ensure work-life harmony and well-being.

Contribute to a company with a global impact based in the US, Canada, and Europe.

Experience Requirements:

5+ years of experience in Site Reliability Engineering or a similar role
3+ years of experience in a technical leadership or management position
Deep understanding of Linux systems, containerization, virtualization, and networking technologies
Strong background in managing and monitoring large-scale distributed systems and bare-metal fleets
Expertise in infrastructure-as-code and configuration management tools

Responsibilities:

Lead and mentor a team of Site Reliability Engineers, fostering a culture of innovation, continuous learning, and technical excellence
Develop and implement strategic plans to enhance the reliability, scalability, and efficiency of our infrastructure
Collaborate with cross-functional teams to align SRE initiatives with broader organizational goals
Establish and maintain SLIs, SLOs, and SLAs for critical systems and services
Drive the adoption of best practices in automation, monitoring, and incident response

Software Engineer, Site Reliability Engineer.

Fireworks AI offers a fast and efficient platform for building and deploying generative AI applications with a focus on speed, value, and scalability.

Tyk AI Studio is an AI gateway and management solution that helps organizations harness AI's potential while ensuring governance, security, compliance, and control.

Experience Requirements:

Proven experience in a senior SRE role or similar.
Strong knowledge of cloud technologies and SLA, SLO, and SLI management.
Experience leading teams and implementing Scrum processes.
Excellent communication and leadership skills.
Experience line managing, mentoring, and coaching.

Responsibilities:

Collaborate with the Principal SRE to shape and implement the SRE strategic plan.
Lead the SRE team in translating strategy into actionable plans, coordinating these through the Scrum process.
Address wellbeing and performance concerns, fostering a positive and productive team environment.
Work with the Principal SRE and Scrum Master to analyze wellbeing survey outcomes and develop improvement plans.

Invisible AI is an on-premise computer vision platform for manufacturing that uses AI to improve worker productivity and safety by analyzing manual assembly work.

Education Requirements:

Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent experience.

Experience Requirements:

5+ years of experience building and managing infrastructure at scale, particularly on the edge.
Proficiency in Python, Docker, Linux systems, and scripting (Bash, Python).
Strong expertise with infrastructure automation tools (Terraform, Ansible).
Experience managing observability and monitoring systems, particularly Prometheus.
Deep understanding of networking concepts and protocols.

Responsibilities:

Design, build, and maintain scalable and resilient infrastructure on the edge.
Develop automation and infrastructure-as-code solutions using Terraform, Ansible, and scripting languages (Python, Bash).
Deploy and manage containerized applications using Docker and related technologies.
Ensure system observability by building and optimizing monitoring systems, particularly using Prometheus.
Troubleshoot and optimize Linux-based systems (e.g., Red Hat, CentOS, Ubuntu).

xAI's Grok is a powerful, multilingual large language model available on X and via API, focused on accelerating scientific discovery.

Experience Requirements:

Expert in at least one programming language that compiles to machine code such as Rust, C++, or Go.
Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty.
Expert knowledge of deployment technologies such as Pulumi or Terraform.
Expert knowledge of Kubernetes.

Responsibilities:

Improving our observability by adding/adjusting metrics.
Building easily parsable dashboards.
Designing and overseeing our on-call rotations.
Improving our deployment process to increase reliability.

Luminance is an AI-powered legal tech platform that streamlines contract lifecycle management with features including AI-powered negotiation and an intelligent contract repository.

Education Requirements:

Bachelor's or Master's degree with a First or 2:1, preferably in a technical subject.

Other Requirements:

Excellent problem-solving skills, including diagnosing issues within complex systems.
Ability and desire to identify root causes of issues, and propose and implement structural improvements.
Strong communication skills and the ability to perform in urgent situations.
Knowledge of the design and operation of web-based software applications, based on technologies such as Node.js, PostgreSQL, or Elasticsearch.
Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible, and Prometheus.

Senior Site Reliability Engineer (Remote)

Fathom is a free AI meeting assistant that records, transcribes, and summarizes your meetings, saving you time and improving productivity.

Experience Requirements:

6+ years.

Responsibilities:

Scaling existing tools.
Enhancing automation for scaling infrastructure.
Playing a key role in diversifying and scaling the platform.
Evaluating options to replace the existing real-time data pipeline.
Providing platform support to engineering.

AppTek.ai provides AI-powered speech and language solutions including ASR, NMT, NLP/U, LLMs, and TTS, serving diverse industries globally.

Education Requirements:

BS in a field related to Computational Linguistics, Computer/Data Science.

Experience Requirements:

2+ years of industry experience (desirable for Site Reliability Engineer role).

Other Requirements:

Strong knowledge of Linux.
Strong knowledge of AWS.
Docker.
Scripting languages (Bash, Python).
Familiarity with load-testing tools.
Must be a U.S. citizen capable of obtaining a Secret clearance (for Computational Linguist and Linguist roles).

Responsibilities:

On-call first-level response.
Respond to customer issue reports.
Troubleshoot problems to maintain service SLAs.
End-to-end monitoring across infrastructure and services for metrics/alerts/logs.

Linc's CX automation platform uses AI to streamline retail customer service, boosting efficiency and delighting customers.

Education Requirements:

B.S. in Computer Science or a related field.

Experience Requirements:

1+ years of site reliability engineering experience.

Other Requirements:

Familiarity with at least one cloud service provider, preferably AWS.
Familiar with basic SQL commands and Internet protocols.
Proficient in cloud application orchestration tools such as Kubernetes and Helm.
Experience with monitoring stacks, preferably Datadog.

Responsibilities:

Collaborate with engineering teams to define and maintain service SLAs.
Monitor metrics, alerts, logs across infrastructure and applications.
Create and maintain tools to monitor the platform.
Respond to incidents, troubleshoot, investigate root causes.
Conduct post-incident investigations and report the findings.

QED.ai provides AI-driven solutions for data scarcity in health and agriculture, offering tools for data digitization, geospatial mapping, and spectroscopy.

Travel to exotic places around the world.

Ask Sage is a versatile, secure Generative AI platform for government and commercial use, offering significant productivity improvements and LLM-agnostic support.

Experience Requirements:

3+ years in site reliability engineering, Kubernetes administration, or related role.
Deep expertise in Kubernetes and containers.
Strong understanding of cloud infrastructure, automation tools, and best practices for high availability and performance.

Responsibilities:

Monitor system performance and reliability.

Hebbia is an enterprise-grade AI platform that empowers knowledge workers by automating complex tasks and providing insights from various data sources. It's designed for seamless integration and high security.

Experience Requirements:

4+ years of software development experience at a venture-backed startup or top technology firm.
Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role.
Strong expertise in managing CI/CD pipelines and deployment automation.
Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop).
Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes.

Other Requirements: