Slurm Administration & Systems Architecture Job at Midjourney, San Francisco, CA

  • Midjourney
  • San Francisco, CA

Job Description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC/AI/ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI/ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.
