Slurm Administration & Systems Architecture (Santa Rosa) Job at Midjourney, Santa Rosa, CA

  • Midjourney
  • Santa Rosa, CA

Job Description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. You will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC/AI/ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI/ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.
