ML Infrastructure Engineer (ML Platform)

Posted 5 hours 35 minutes ago by CuspAI

Permanent
Full Time
Other
Cambridgeshire, Cambridge, United Kingdom, CB1 0
Job Description
About CuspAI

CuspAI is the frontier AI company on a mission to solve the breakthrough materials needed to power human progress. While nature took billions of years to perfect molecules, we are harnessing AI to unlock trillion-dollar materials breakthroughs in months, not millennia. Our founding team, the most cited in the world, comprises world-class researchers in AI, chemistry and engineering.

We are working on some of the hardest and most important challenges including energy, clean water, the future of compute, and carbon capture, and this is just the start of what our 'search engine' for next-generation materials will unlock.

We invite you to be part of a diverse, innovative team at the intersection of AI and materials science, working to create impactful partnerships that drive innovation, scalability, and industry collaboration. This work matters. Your work matters.

We're on the cusp of the on-demand materials era. Join us.

The Role

We are seeking an ML Infrastructure Engineer / ML Platform Engineer with strong Python coding, DevOps and cloud platform expertise to build and maintain the infrastructure that powers our ML research teams. You will be the architect of the systems that enable our researchers to train and deploy models at scale.

Your Impact

In this role, you will build and maintain the ML infrastructure platform that empowers our AI researchers and materials scientists to conduct cutting-edge experiments.

You will own the entire ML operations stack - from cloud architecture to deployment pipelines - ensuring our research teams can focus on science while you handle the systems.

As this is a newly created team and position, you will have the chance to help shape our entire ML infrastructure strategy and make a significant impact on our platform's architecture.

What You Will Do
  • Build the ML Infrastructure Platform: Design and implement a cloud-native (GCP) platform on Kubernetes that enables researchers to easily train, evaluate, and deploy models without worrying about infrastructure complexity.

  • Own the MLOps Stack: Implement and maintain CI/CD pipelines, model registries, experiment tracking systems, and deployment automation - the full lifecycle of ML operations.

  • Scale Distributed ML Model Training: Build infrastructure to support distributed training across many GPUs, implementing solutions for data pipelines, checkpointing, and resource optimisation.

  • Platform Reliability: Ensure 99.9% uptime for our ML platform through monitoring, alerting, and automated recovery systems. Be the guardian of infrastructure stability.

  • Configuration Management: Use Kapitan to manage complex multi-environment Kubernetes configurations and ensure consistent deployments across dev, staging, and production.

  • Cost Optimisation: Implement resource management strategies to optimise cloud spending and detect inefficient usage while maximising computational throughput for our research teams.

  • Developer Experience: Create tools and abstractions that make it simple for researchers to go from experiment to production without deep infrastructure knowledge.

  • Interdisciplinary Collaboration: Partner closely with ML Researchers, Chemists, Materials Scientists and Software Engineers to understand their needs, and to build the infrastructure for groundbreaking projects.

Must Have Skills and Qualifications:
  • You are excited by the opportunity to enable scientists to work on world-changing challenges in this domain, and you have a personal interest in the potential applications of the technology that CuspAI is building.

  • You're a builder of tools and infrastructure who enjoys making life as easy as possible for the teams you support. As part of your interest in this area, you stay up to date with the latest relevant technologies, tools, and open source projects.

  • You are a seasoned engineer with deep expertise in building and maintaining ML infrastructure (not model development), ideally in startup environments. We need you to be able to hit the ground running and work autonomously as a subject matter expert, providing input and guidance around best practice where appropriate.

  • You bring deep experience with ML workloads on Kubernetes and multi-cloud platforms (AWS, GCP, neoclouds).

  • You have strong Python and/or Go programming skills for infrastructure automation - not just scripting.

  • You have expert-level proficiency in Infrastructure as Code (IaC) using Terraform, Helm, Kapitan, and GitOps workflows.

  • You have experience operating distributed ML platforms for JAX and/or PyTorch (TensorFlow would be less relevant).

  • You have GPU infrastructure experience, specifically in managing GPU clusters and optimising their performance.

Bonus Points (But Not Critical):
  • Previous experience as a Software Engineer before moving into DevOps / MLOps / ML Infrastructure, or at least a habit of writing code regularly (at work and/or in your spare time), as this should equip you with the kind of coding skills we're looking for.

  • Experience using our tech stack (Flyte, Kapitan, Pants) or similar tools

  • Experience with HPC environments and job schedulers (Slurm, PBS)

  • Knowledge of ML serving infrastructure (Triton, TorchServe, KServe)

  • Familiarity with scientific computing workflows and data management

  • Experience with multi-tenant ML platforms

  • Background in supporting research teams (understanding their unique needs vs. production ML)

What We Offer
  • A competitive salary plus equity package so you have a stake in the success of the company

  • 28 days holiday

  • Professional development budget for scientific conferences and technical training

  • Opportunity to work at the forefront of AI-driven scientific discovery with world-class researchers

  • Direct impact on advancing materials science through cutting-edge technology

  • Collaborative environment bridging AI research, computational chemistry, and experimental science

Additional Considerations

This role could be based in our Cambridge, London, Amsterdam or Berlin offices, with the expectation of being in the office three days per week. Additionally, there may be regular travel required to our other offices for collaboration and project oversight.

Join us in shaping the future of materials with AI. Together, we can create groundbreaking solutions for a more sustainable world.

CuspAI is an equal opportunities employer committed to building a diverse and inclusive workplace. We do not discriminate on the basis of sex, race, religion or belief, ethnic or national origin, disability, age, citizenship, marital, domestic or civil partnership status, sexual orientation, gender identity, pregnancy or related condition (including breastfeeding), veteran status, or any other basis protected by applicable law.

We actively encourage applications from all backgrounds and value the unique perspectives and contributions that diversity brings to our team.

Please let us know if you require any specific adjustments during or after the interview process. We will do everything we can within reason to accommodate them.
