Director of Software Engineering (Fleet Management)
Posted 9 days 12 hours ago by Nscale
London
About Nscale
Nscale is taking on the hyperscalers by building a vertically integrated GenAI cloud platform. We own the data centres, software, and applications that power today's AI applications using sustainable technology solutions. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. Collaboration is key, and we work together swiftly and respectfully, embracing adaptability and resilience in all we do.
About the Role
We're hiring a Director of Software Engineering (Fleet Management) to lead the team that keeps Nscale's bare-metal GPU fleet running. This is a hands-on role in which you will be a core individual contributor as well as a leader of people and technology: you'll write production Python deployed using Helm and Kubernetes, design distributed systems, and steer architecture, while also hiring, mentoring, and driving delivery for a growing engineering team.
Fleet Manager automates the entire operational lifecycle of our compute infrastructure, from initial device enrolment through multi-day burn-in testing to ongoing health monitoring and automated remediation. The problems are challenging and the stakes are high: the software you design and build will determine Nscale's success in scaling its GPU fleet to meet demand, putting you at the centre of some of the highest-impact work in the company.
What you'll work on
- Large-scale, business-critical automation that configures BMCs, manages DHCP reservations, drives bare-metal provisioning state machines, and runs GPU burn-in tests and remediation workflows.
- Complex workflow orchestration and event-driven state machines that span multiple days, survive crashes, resume from checkpoints, support human-in-the-loop approval gates, and let thousands of concurrent idempotent workflows operate without stepping on each other.
- Multi-site hub-and-spoke infrastructure tooling that works across geographically distributed data centres with independent trust boundaries.
- Integration with data centre inventory management (DCIM) tooling, bare-metal provisioning systems, credential stores, and monitoring infrastructure, and keeping data consistent across them.
- Observability: structured logging, metrics, distributed tracing and tooling that lets operators troubleshoot effectively.
- Leading, from the front, a team of highly talented software engineers building hardware lifecycle automation.
- The technical roadmap and architecture for how Nscale provisions, validates, monitors, and remediates hardware at massive scale.
- Writing code in critical areas of the codebase, shipping to production regularly, and setting the bar for execution: getting things done.
- Engineering standards: code review, testing, CI/CD, incident response, and on-call practices.
- Tight collaboration with Product, Infrastructure, Platform, SRE, and UI/UX to capture requirements early, align on interfaces, and ship integrations that meet operator needs.
- Hiring and developing engineers who thrive in a high-autonomy, high-accountability environment.
- 10+ years building, owning, and operating complex distributed systems, with at least 2 years leading engineering teams.
- Hands-on experience with workflow orchestration (Temporal, Airflow, Prefect, or similar).
- Bare-metal expertise across compute, networking, and storage: BMC/IPMI/Redfish, PXE boot, DHCP, VLAN management, and provisioning systems like Ironic, MAAS, or equivalent.
- Confidence working at the intersection of software and physical infrastructure: debugging sometimes means asking "is the cable plugged in?"
- You've built systems that had to be fault tolerant, resumable, and observable (so failures don't turn into 3am pages).
- You stay effective while context switching between deep work, judgement calls, and people leadership - writing a workflow activity, reviewing an ADR, and unblocking a team member the same morning.
- Use of modern AI tools as a force multiplier to speed up specs, scaffolding, tests, refactors, data exploration, incident triage, and docs.
- You've worked with OpenStack Ironic, NetBox, or similar data centre inventory and management platforms.
- You've designed multi-site architectures for infrastructure tooling.
- You've built hardware burn-in, validation, or remediation automation.
- You've owned results storage, analysis, and reporting for large-scale computational testing. Experience with HPC simulations or ML training is a plus.
At Nscale, you'll find a collaborative, supportive, and innovative environment where your contributions spark real impact. We're building something extraordinary, and we want you at the core.
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.
At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we encourage applications from candidates of all backgrounds, experiences, and abilities. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.
If there's anything we can do to accommodate your specific situation, please let us know.
For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.