Job Description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

Participate in the design and bring-up of bare-metal HPC/AI/ML environments.
Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
Integrate heterogeneous hardware platforms into cohesive scheduling environments.
Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

Configure and operate the Slurm Workload Manager.
Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

Assist cluster users with developing workflows that make efficient use of compute resources.
Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
Automate cost accounting and cluster usage reporting.

Qualifications

7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
Familiarity with common AI/ML software package dependencies and workflows.
Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
Strong background in Linux system administration, networking, and performance tuning for HPC environments.
Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
Demonstrated ability to operate GPU-accelerated clusters at scale.

Slurm Administration & Systems Architecture (Santa Rosa) Job at Midjourney, Santa Rosa, CA
