Building a Production-Ready HPC Cluster with Docker, SLURM, and OpenMPI

High-performance computing shouldn’t require a six-figure budget and a dedicated data center. I built Docker HPC Cluster—a fully containerized, production-ready HPC environment that you can spin up on your laptop, homelab, or cloud infrastructure in minutes.

The Problem

Traditional HPC cluster deployment is painful. You’re looking at weeks of configuration, complex dependency management, and the constant headache of keeping everything in sync across nodes. Want to test a SLURM job scheduler configuration? Good luck setting up bare-metal infrastructure just to experiment.

I wanted something different: an HPC environment that’s portable, reproducible, and scales from a single laptop to multi-node deployments without rewriting configuration files.

What I Built

The project delivers a complete HPC stack using containerization:

| Feature | Description |
|---|---|
| SLURM Workload Manager | Industry-standard job scheduler with accounting |
| Dynamic Scaling | Scale compute nodes up/down with a single command |
| GPU Support | NVIDIA GPU computing with CUDA 12.2 |
| Monitoring | Prometheus + Grafana dashboards out of the box |
| NFS Storage | Shared filesystem across all nodes |
| OpenMPI | Full MPI support for parallel computing |
| Scientific Stack | NumPy, SciPy, pandas, scikit-learn, mpi4py |
| Job Accounting | Full job history with MariaDB backend |

Architecture

The cluster uses a classic controller/compute topology, fully orchestrated through Docker Compose.

Each component runs in its own container with proper service discovery, shared Munge authentication, and persistent volumes for data and logs.


Prerequisites

Required

  • Docker 20.10 or later
  • Docker Compose 2.0 or later
  • 8GB+ RAM (16GB recommended for larger clusters)
  • 10GB+ disk space

Optional

  • NVIDIA Container Toolkit (for GPU support)

Verify Installation

# Check Docker version
docker --version
# Docker version 24.0.0 or later

# Check Docker Compose version
docker compose version
# Docker Compose version v2.20.0 or later

# Check available memory
free -h  # Linux
vm_stat  # macOS

# For GPU support, verify NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-rockylinux9 nvidia-smi

Installation

1. Clone or Create Directory

# Clone the repository, or create this directory and copy all project files into it
git clone https://github.com/marcosfermin/hpc.git ~/hpc-cluster
cd ~/hpc-cluster

2. Build Base Image

# Build the base image (required first)
docker build -t hpc-base:latest -f Dockerfile.base .

3. Build All Images

# Using Make
make build

# Or using Docker Compose
docker compose build

4. Start the Cluster

# Start with 4 compute nodes
make up NODES=4

# Or without Make
docker compose up -d --scale compute=4

5. Verify Installation

# Check all containers are running
docker compose ps

# Verify SLURM is operational
make status
# Or: docker exec hpc-controller sinfo

Quick Start

# 1. Build and start cluster
make up NODES=4

# 2. Wait for cluster to initialize (~60 seconds)
make status

# 3. Run a test job
make test

# 4. Access the controller
make shell

# 5. Submit your first job
srun hostname

Usage

Scaling

Scale compute nodes dynamically without restarting the cluster:

# Scale up to 8 nodes
make scale NODES=8

# Scale down to 2 nodes
make scale NODES=2

# Check node status
docker exec hpc-controller sinfo -N -l

GPU Support

Prerequisites

Install NVIDIA Container Toolkit:

# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Start GPU Cluster

# Build GPU image and start
make gpu NODES=2

# Or with Docker Compose
docker compose --profile gpu up -d --scale compute-gpu=2

Submit GPU Job

# Check GPU partition
sinfo -p gpu

# Submit GPU job
sbatch examples/gpu_job.sh

# Interactive GPU session
srun -p gpu --gres=gpu:1 --pty bash

Monitoring

Enable Monitoring Stack

# Start with Prometheus and Grafana
make monitoring NODES=4

# Or with Docker Compose
docker compose --profile monitoring up -d --scale compute=4

Access Dashboards

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | (none) |
| cAdvisor | http://localhost:8080 | (none) |

Pre-configured Metrics

  • CPU usage per node
  • Memory utilization
  • Network I/O
  • Container metrics
  • SLURM job statistics

Job Examples

Interactive Jobs

# Run single command
srun hostname

# Run on specific number of tasks
srun -n 4 hostname

# Interactive shell on compute node
srun --pty bash

Batch Jobs

# Submit batch job
sbatch examples/submit_job.sh

# Check job status
squeue

# View job details
scontrol show job <jobid>

# Cancel job
scancel <jobid>
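
For orientation, here is a minimal sketch of what a basic batch script such as examples/submit_job.sh could look like; the job name, resource requests, and output path are illustrative placeholders rather than the repository's actual contents.

#!/bin/bash
#SBATCH --job-name=hello            # name shown in squeue
#SBATCH --partition=compute         # default partition from slurm.conf
#SBATCH --ntasks=1                  # single task
#SBATCH --time=00:05:00             # wall-clock limit
#SBATCH --output=hello_%j.out       # %j expands to the job ID

# Everything below runs on the allocated compute node
hostname
date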

MPI Jobs

# Compile MPI program
cd /home/hpcuser
mpicc -o hello_mpi /examples/hello_mpi.c

# Run interactively
srun -n 4 ./hello_mpi

# Submit as batch job
sbatch examples/multinode_mpi.sh
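
As a rough illustration (not the repository's actual examples/multinode_mpi.sh), a multi-node MPI batch script could look like this; node and task counts are placeholders.

#!/bin/bash
#SBATCH --job-name=mpi_hello
#SBATCH --nodes=2                   # spread ranks across two compute nodes
#SBATCH --ntasks-per-node=2         # two MPI ranks per node
#SBATCH --time=00:10:00
#SBATCH --output=mpi_hello_%j.out

# srun launches the ranks; OpenMPI picks up the SLURM allocation automatically
srun ./hello_mpi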

Array Jobs

# Submit array of 10 tasks
sbatch examples/array_job.sh

# Check array job status
squeue -u hpcuser

# Cancel specific array task
scancel <jobid>_5
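
For reference, a sketch of what an array job script in the spirit of examples/array_job.sh might contain; the range and workload are placeholders.

#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-10                # ten independent tasks
#SBATCH --ntasks=1
#SBATCH --output=array_%A_%a.out    # %A = job ID, %a = array index

# Each array task receives its own index via SLURM_ARRAY_TASK_ID
echo "Processing chunk ${SLURM_ARRAY_TASK_ID} on $(hostname)"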

Python MPI Jobs

# Monte Carlo Pi calculation
srun -n 4 python3 /examples/parallel_pi.py

# Matrix benchmark
srun -n 4 python3 /examples/matrix_benchmark.py
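
A batch wrapper for the mpi4py examples (along the lines of examples/submit_pi.sh, though the exact contents here are assumed) could look like:

#!/bin/bash
#SBATCH --job-name=parallel_pi
#SBATCH --ntasks=4                  # four mpi4py ranks
#SBATCH --time=00:15:00
#SBATCH --output=pi_%j.out

# mpi4py picks up its rank and communicator size from the srun launch
srun python3 /examples/parallel_pi.py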

Parameter Sweeps

# Submit parameter sweep
sbatch examples/submit_sweep.sh

# Results saved to sweep_results_<jobid>/
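
A sketch of how a sweep submission script could be structured, assuming one array task per parameter value; the --index flag passed to parameter_sweep.py is hypothetical and the real examples/submit_sweep.sh may differ.

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=0-9                 # one array task per parameter value
#SBATCH --output=sweep_%A_%a.out

# Collect per-task output under a job-specific directory
RESULTS_DIR="sweep_results_${SLURM_ARRAY_JOB_ID}"
mkdir -p "${RESULTS_DIR}"

# Pass the array index to the sweep script as the parameter to vary (flag is illustrative)
srun python3 /examples/parameter_sweep.py --index "${SLURM_ARRAY_TASK_ID}" \
    > "${RESULTS_DIR}/result_${SLURM_ARRAY_TASK_ID}.txt"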

GPU Jobs

# Submit GPU job
sbatch examples/gpu_job.sh

# Interactive GPU session
srun -p gpu --gres=gpu:1 python3 -c "import torch; print(torch.cuda.is_available())"
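
For context, a minimal sketch of a GPU batch script in the spirit of examples/gpu_job.sh; the resource requests are placeholders.

#!/bin/bash
#SBATCH --job-name=gpu_check
#SBATCH --partition=gpu             # GPU partition from slurm.conf
#SBATCH --gres=gpu:1                # request one GPU
#SBATCH --time=00:10:00
#SBATCH --output=gpu_%j.out

# Confirm the GPU is visible inside the allocation
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"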

Command Reference

Make Commands

| Command | Description |
|---|---|
| make help | Show all available commands |
| make build | Build all Docker images |
| make build-gpu | Build GPU-enabled images |
| make up NODES=N | Start cluster with N compute nodes |
| make down | Stop the cluster |
| make restart | Restart the cluster |
| make scale NODES=N | Scale to N compute nodes |
| make status | Show cluster and job status |
| make nodes | Show detailed node information |
| make jobs | Show job queue |
| make logs | Follow all container logs |
| make logs-controller | Follow controller logs |
| make logs-compute | Follow compute node logs |
| make shell | Open root shell on controller |
| make shell-user | Open shell as hpcuser |
| make test | Run cluster tests |
| make test-mpi | Run MPI test job |
| make test-python | Run Python MPI test |
| make examples | Copy examples to controller |
| make monitoring | Start with monitoring stack |
| make gpu | Start with GPU nodes |
| make backup | Backup cluster data |
| make clean | Stop and remove containers |
| make clean-volumes | Remove all data volumes |
| make clean-all | Remove everything |
| make info | Show cluster information |

SLURM Commands

| Command | Description |
|---|---|
| sinfo | View cluster/partition status |
| sinfo -N -l | Detailed node list |
| squeue | View job queue |
| squeue -u <user> | View user's jobs |
| srun <cmd> | Run interactive job |
| srun -n N <cmd> | Run with N tasks |
| srun --pty bash | Interactive shell |
| sbatch <script> | Submit batch job |
| scancel <jobid> | Cancel job |
| scancel -u <user> | Cancel all user's jobs |
| scontrol show job <id> | Job details |
| scontrol show node <node> | Node details |
| sacct | Job accounting history |
| sacct -j <jobid> | Specific job accounting |
| sstat -j <jobid> | Running job statistics |

Useful Aliases (Pre-configured)

sq   # squeue
si   # sinfo
sc   # scancel

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| NODES | 2 | Number of compute nodes |
| TZ | UTC | Timezone |
| MYSQL_ROOT_PASSWORD | rootpassword | Database root password |
| MYSQL_USER | slurm | Database user |
| MYSQL_PASSWORD | slurmdbpass | Database password |
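
A minimal sketch of overriding these defaults at startup, assuming Docker Compose reads them from the shell environment or a local .env file; the values are placeholders.

# Override defaults for this shell session, then start the cluster
export TZ=America/New_York
export MYSQL_ROOT_PASSWORD=change-me-root
export MYSQL_PASSWORD=change-me-slurm
make up NODES=6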

Compute Node Resources

Edit docker-compose.yml:

compute:
  deploy:
    resources:
      limits:
        cpus: '4'        # CPUs per node
        memory: 8G       # Memory per node
      reservations:
        cpus: '1'        # Minimum CPUs
        memory: 1G       # Minimum memory

SLURM Partitions

Edit config/slurm.conf:

# Default compute partition
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Quick debug partition (1 hour max)
PartitionName=debug Nodes=ALL MaxTime=01:00:00 State=UP Priority=100

# GPU partition
PartitionName=gpu Nodes=ALL MaxTime=INFINITE State=UP

# High memory partition
PartitionName=highmem Nodes=ALL MaxTime=24:00:00 State=UP MinMemoryNode=16000
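
Jobs target a specific partition with -p at submission time, for example:

# Run a short test in the one-hour debug partition
srun -p debug -t 00:05:00 -n 2 hostname

# Send a batch job to the high-memory partition
sbatch -p highmem examples/submit_job.sh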

Job Defaults

Edit config/slurm.conf:

# Default job settings
DefMemPerCPU=1000          # 1GB per CPU
MaxJobCount=10000          # Max queued jobs
MaxArraySize=1001          # Max array size
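
Individual jobs can override the memory default at submission time, for example:

# Request 2 GB per CPU instead of the 1 GB default
sbatch --mem-per-cpu=2000 examples/submit_job.sh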

File Structure

hpc/
├── docker-compose.yml           # Main orchestration file
├── Makefile                     # Management commands
├── README.md                    # This file
│
├── Dockerfile.base              # Base image with SLURM/MPI
├── Dockerfile.controller        # Controller node
├── Dockerfile.compute           # Compute nodes
├── Dockerfile.slurmdbd          # Database daemon
├── Dockerfile.gpu               # GPU-enabled nodes
├── Dockerfile.nfs               # NFS server
│
├── config/
│   ├── slurm.conf              # SLURM configuration
│   ├── slurmdbd.conf           # Database daemon config
│   ├── cgroup.conf             # Cgroup configuration
│   └── gres.conf               # GPU resource config
│
├── scripts/
│   ├── start-controller.sh     # Controller startup
│   ├── start-compute.sh        # Compute node startup
│   ├── start-slurmdbd.sh       # Database daemon startup
│   ├── start-gpu-compute.sh    # GPU node startup
│   ├── start-nfs.sh            # NFS server startup
│   └── generate-munge-key.sh   # Munge key generator
│
├── monitoring/
│   ├── prometheus.yml          # Prometheus configuration
│   └── grafana/
│       └── provisioning/
│           ├── dashboards/
│           │   ├── dashboards.yml
│           │   └── hpc-cluster.json
│           └── datasources/
│               └── datasources.yml
│
└── examples/
    ├── hello_mpi.c             # MPI hello world (C)
    ├── parallel_pi.py          # Monte Carlo Pi (Python)
    ├── matrix_benchmark.py     # Matrix benchmark
    ├── parameter_sweep.py      # Parameter sweep example
    ├── submit_job.sh           # Basic batch job
    ├── multinode_mpi.sh        # Multi-node MPI
    ├── array_job.sh            # Array job
    ├── gpu_job.sh              # GPU job
    ├── submit_pi.sh            # Python MPI job
    └── submit_sweep.sh         # Parameter sweep job

Troubleshooting

Common Issues

Cluster Won’t Start

# Check container status
docker compose ps -a

# View startup logs
docker compose logs

# Rebuild images
make clean && make build && make up NODES=4

Nodes Not Registering

# Check compute node logs
docker compose logs compute

# Verify munge authentication
docker exec hpc-controller bash -c "munge -n | unmunge"

# Restart compute nodes
docker compose restart compute

Jobs Stuck in Pending

# Check why job is pending
scontrol show job <jobid> | grep Reason

# Check node availability
sinfo -N -l

# Check resource requests
squeue -o "%.10i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %.20R"

Database Issues

# Check database connectivity
docker exec hpc-database mysql -u slurm -pslurmdbpass -e "SELECT 1"

# View slurmdbd logs
docker compose logs slurmdbd

# Restart database services
docker compose restart database slurmdbd

Munge Authentication Failures

# Check munge is running
docker exec hpc-controller pgrep munged

# Test munge
docker exec hpc-controller bash -c "munge -n | unmunge"

# Regenerate munge key (requires restart)
docker compose down
docker volume rm hpc-munge-key
make up NODES=4

Debug Mode

# Run with verbose logging
docker compose up --scale compute=2

# Check SLURM debug logs
docker exec hpc-controller tail -f /var/log/slurm/slurmctld.log

# Check compute node logs
docker exec hpc-compute-1 tail -f /var/log/slurm/slurmd.log

Performance Tuning

Memory Settings

# docker-compose.yml
compute:
  deploy:
    resources:
      limits:
        memory: 8G
  environment:
    - SLURM_MEM_PER_NODE=8000

CPU Pinning

# config/slurm.conf
TaskPlugin=task/affinity
TaskPluginParam=cores

Network Optimization

# docker-compose.yml
networks:
  hpc-network:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 9000

MPI Tuning

# In job scripts
export OMPI_MCA_btl_tcp_if_include=eth0
export OMPI_MCA_mpi_yield_when_idle=1

Security

Default Credentials

| Service | Username | Password |
|---|---|---|
| MariaDB | root | rootpassword |
| MariaDB | slurm | slurmdbpass |
| Grafana | admin | admin |
| SSH | hpcuser | (key-based) |

Recommendations for Production

  1. Change default passwords in docker-compose.yml and config/slurmdbd.conf (see the sketch after this list)
  2. Enable TLS for Grafana and Prometheus
  3. Restrict network access using firewall rules
  4. Use secrets management for sensitive data
  5. Enable SLURM accounting for audit trails
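
A minimal sketch of the first recommendation, assuming Docker Compose reads the MYSQL_* variables listed above from a local .env file; adapt it to your own secrets workflow.

# Generate random replacement passwords into a local .env file (keep it out of git)
umask 077
cat > .env <<EOF
MYSQL_ROOT_PASSWORD=$(openssl rand -base64 24)
MYSQL_PASSWORD=$(openssl rand -base64 24)
EOF

# Also update the matching database password in config/slurmdbd.conf so slurmdbd can still connect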

Cleanup

# Stop cluster (preserves data)
make down

# Stop and remove containers
make clean

# Remove all data volumes
make clean-volumes

# Complete cleanup (containers, volumes, images)
make clean-all

# Remove specific volumes
docker volume rm hpc-db-data hpc-slurm-logs

Why This Matters

For me, this project solved a real problem: I needed to experiment with job scheduling, test MPI code, and prototype distributed algorithms without provisioning physical hardware. Now I can iterate on SLURM configurations, test scaling behavior, and develop parallel applications—all from my homelab.

For others, it’s a learning tool, a development environment, or a foundation for building actual production clusters. The entire configuration is transparent and hackable.


Technical Stack

| Component | Version |
|---|---|
| Rocky Linux | 9 |
| SLURM | 23.x |
| OpenMPI | 4.x |
| Python | 3.9+ |
| MariaDB | 10.11 |
| Prometheus | 2.48 |
| Grafana | 10.2 |
| CUDA | 12.2 |

Get the Code

The project is open source and available on GitHub:

github.com/marcosfermin/hpc

Clone it, customize it, break it, fix it. PRs welcome.
