# Lead Technical DevOps / Infrastructure Engineer

**Company:** [Deep Origin](http://jobs.workable.com/companies/mCJUxMvKLrVjcDfkuLpd4V.md)
**Location:** Remote
**Workplace:** Remote
**Department:** Deep Origin

[Apply for this job](http://jobs.workable.com/view/7b05d105-b90e-4386-9359-3d920846847c)

## Description

Deep Origin is a biotech startup building an operating system for science that transforms how life science research is conducted. Led by Michael Antonov, co-founder of Oculus, and backed by Formic Ventures, we are redefining the infrastructure behind modern drug discovery. As we scale our AI-driven platform and strategic programs, exceptional talent is a critical lever in accelerating our mission to dramatically reduce disease and extend human healthspan.

### About the role

We are looking for a Lead Technical DevOps / Infrastructure Engineer to join our existing DevOps team. This is a senior IC role with broad technical scope: you will own complex initiatives end-to-end, drive collaboration across engineering and science teams, and set a high bar for how we build and operate infrastructure. A significant part of this role is supporting our R&D teams by running and evolving the compute clusters that power bioinformatics pipelines, ML training, and other HPC workloads.

This role is highly autonomous: you will operate with minimal guidance, prioritize your own work, and take full ownership of infrastructure decisions and outcomes.

## Requirements

### Must-Have

-   10+ years of infrastructure and DevOps engineering experience, with a proven track record in senior or lead IC roles
-   Ability to take end-to-end ownership of complex, multi-team initiatives and drive them from design through to production
-   Hands-on experience running HPC or research compute clusters: bare-metal provisioning, Slurm (or equivalent), GPU infrastructure, and shared storage (NFS, Lustre, or similar)
-   Comfortable operating in environments with a mix of cloud, VPS, and bare-metal systems, including legacy or non-standard setups
-   Experience supporting scientific or R&D teams with mixed workloads: long-running CPU batch jobs, GPU training jobs, and interactive compute
-   Deep, hands-on AWS expertise: EKS/Kubernetes, IAM, VPC networking, S3, RDS, and cost management
-   Solid Terraform skills and a principled approach to infrastructure-as-code
-   Strong Linux fundamentals and experience managing multi-node environments at scale
-   Experience owning and improving production observability systems (Prometheus/Grafana, OpenTelemetry, ELK, or similar)
-   Strong security fundamentals: threat modeling, least-privilege access design, vulnerability management, and compliance frameworks
-   Experience owning incident management end-to-end, including process design and continuous improvement
-   Excellent communication skills; able to work directly with researchers and scientists as well as with engineering and leadership
-   Fluent English

### Nice-to-Have

-   Background in biotech, bioinformatics, or scientific computing environments
-   SOC 2 Type II audit experience
-   Experience with monorepo tooling and developer platform engineering

### Key Responsibilities

-   Own our cloud infrastructure across AWS and third-party hosting and compute providers; ensure it is reliable, scalable, and cost-efficient
-   Own and operate bare-metal compute clusters: node provisioning, configuration management, networking, secure access, and ongoing reliability
-   Build and maintain configuration management using Ansible (or similar), ensuring reproducible and scalable server provisioning
-   Set up and maintain Slurm for job scheduling across CPU and GPU node pools; ensure researchers can submit, monitor, and manage jobs without DevOps involvement
-   Design and manage cluster networking: management and storage networks, inter-node communication, DNS, and secure perimeter access, including bastion/jump host setup
-   Manage Linux-based infrastructure hands-on, including networking, firewalls, VPNs, and performance tuning in distributed environments
-   Own disaster recovery and business continuity: define RTO/RPO targets, maintain runbooks, and run regular tests
-   Manage and optimize infrastructure spend through capacity planning, right-sizing, and intelligent use of reserved and spot capacity
-   Manage Kubernetes clusters, networking, and workload scheduling across cloud and on-premise environments
-   Enable infrastructure-as-code practices in Terraform; drive consistency, modularity, and auditability across the codebase
-   Evolve our observability platform: improve coverage, reduce alert noise, and ensure engineering teams have the visibility they need to detect and resolve issues quickly
-   Own security posture across the platform: IAM policies, secrets management, network segmentation, vulnerability management, and SOC 2 compliance
-   Lead incident management: on-call processes, escalation policies, runbooks, and blameless post-mortems
-   Drive CI/CD improvements and developer workflow initiatives that meaningfully increase engineering throughput
-   Evolve internal tooling and CLI infrastructure that engineering teams depend on daily

### Values & Working Style

-   Ownership mindset: you take responsibility end to end
-   Comfortable navigating ambiguity in a fast-moving startup environment
-   Clear communicator who can collaborate across technical and non-technical teams
-   Pragmatic problem solver focused on impact

### Why This Role Matters Now

As we scale our AI platform and expand into new initiatives, engineering velocity and platform reliability directly impact research outcomes and product milestones. This role plays a key part in strengthening our technical foundation through this phase of growth.
