# Senior ML Infrastructure Engineer

**Company:** [Ellison Institute of Technology](http://jobs.workable.com/companies/tYoBQmFCuLVKUygNZbHUKp.md)
**Location:** Oxford, United Kingdom
**Workplace:** hybrid
**Department:** MLOps

[Apply for this job](http://jobs.workable.com/view/510a1e1f-6a1b-4ca1-9618-0d73aafb6eab)

## Description

At the Ellison Institute of Technology (EIT), we’re on a mission to translate scientific discovery into real-world impact. We bring together visionary scientists, technologists, policy makers, and entrepreneurs to tackle humanity’s greatest challenges in four transformative areas:

-   Health, Medical Science & Generative Biology
-   Food Security & Sustainable Agriculture
-   Climate Change & Managing CO₂
-   Artificial Intelligence & Robotics

This is ambitious work - work that demands curiosity, courage, and a relentless drive to make a difference. At EIT, you’ll join a community built on excellence, innovation, tenacity, trust, and collaboration, where bold ideas become real-world breakthroughs. Together, we push boundaries, embrace complexity, and create solutions that scale ideas from lab to society. Explore more at [www.eit.org](https://www.eit.or).

## Requirements

**Our MLOps team**

Join our MLOps team to build the cloud and compute foundation that enables scientific breakthroughs. Deliver reliable, secure platforms and self-service guardrails that accelerate experimentation and turn ideas into results—faster, at scale, and with confidence. 

**Day-to-day, you might:**

-   Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management. 
-   Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation). 
-   Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference. 
-   Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments. 
-   Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines. 

**What makes you a great fit:**

-   Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale 
-   A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions 
-   Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems 
-   Expertise with high-throughput storage systems for ML/HPC workloads 
-   Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks 
-   A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

## Benefits

**We offer the following salary and benefits:**

-   Enhanced holiday pay
-   Pension
-   Life Assurance
-   Income Protection
-   Private Medical Insurance
-   Hospital Cash Plan
-   Therapy Services
-   Perk Box
-   Electric Car Scheme


**Why work for EIT:**

At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. Valuing emotional intelligence, empathy, respect, and resilience, we encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!
