Building a Production-Ready HPC Cluster with Docker, SLURM, and OpenMPI

High-performance computing shouldn’t require a six-figure budget and a dedicated data center. I built Docker HPC Cluster—a fully containerized, production-ready HPC environment that you can spin up on your laptop, homelab, or cloud infrastructure in minutes.

The Problem

Traditional HPC cluster deployment is painful. You’re looking at weeks of configuration, complex dependency management, and the constant headache of keeping everything in sync across nodes. Want to test a SLURM job scheduler configuration? Good luck setting up bare-metal infrastructure just to experiment.

I wanted something different: an HPC environment that’s portable, reproducible, and scales from a single laptop to multi-node deployments without rewriting configuration files.

What I Built

The project delivers a complete HPC stack using containerization:

| Feature | Description |
|---|---|
| SLURM Workload Manager | Industry-standard job scheduler with accounting |
| Dynamic Scaling | Scale compute nodes up/down with a single command |
| GPU Support | NVIDIA GPU computing with CUDA 12.2 |
| Monitoring | Prometheus + Grafana dashboards out of the box |
| NFS Storage | Shared filesystem across all nodes |
| OpenMPI | Full MPI support for parallel computing |
| Scientific Stack | NumPy, SciPy, pandas, scikit-learn, mpi4py |
| Job Accounting | Full job history with MariaDB backend |

Architecture

The cluster uses a classic controller/compute topology, fully orchestrated through Docker Compose.

Each component runs in its own container with proper service discovery, shared Munge authentication, and persistent volumes for data and logs.


Prerequisites

Required

  • Docker 20.10 or later
  • Docker Compose 2.0 or later
  • 8GB+ RAM (16GB recommended for larger clusters)
  • 10GB+ disk space

Optional

  • NVIDIA Container Toolkit (for GPU support)

Verify Installation

# Check Docker version
docker --version
# Docker version 24.0.0 or later

# Check Docker Compose version
docker compose version
# Docker Compose version v2.20.0 or later

# Check available memory
free -h  # Linux
vm_stat  # macOS

# For GPU support, verify NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-rockylinux9 nvidia-smi

Installation

1. Clone or Create Directory

# Clone the repository, or create this directory and copy all project files into it
git clone https://github.com/marcosfermin/hpc.git ~/hpc-cluster
cd ~/hpc-cluster

2. Build Base Image

# Build the base image (required first)
docker build -t hpc-base:latest -f Dockerfile.base .

3. Build All Images

# Using Make
make build

# Or using Docker Compose
docker compose build

4. Start the Cluster

# Start with 4 compute nodes
make up NODES=4

# Or without Make
docker compose up -d --scale compute=4

5. Verify Installation

# Check all containers are running
docker compose ps

# Verify SLURM is operational
make status
# Or: docker exec hpc-controller sinfo

Quick Start

# 1. Build and start cluster
make up NODES=4

# 2. Wait for cluster to initialize (~60 seconds)
make status

# 3. Run a test job
make test

# 4. Access the controller
make shell

# 5. Submit your first job
srun hostname

Usage

Scaling

Scale compute nodes dynamically without restarting the cluster:

# Scale up to 8 nodes
make scale NODES=8

# Scale down to 2 nodes
make scale NODES=2

# Check node status
docker exec hpc-controller sinfo -N -l

GPU Support

Prerequisites

Install NVIDIA Container Toolkit:

# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Start GPU Cluster

# Build GPU image and start
make gpu NODES=2

# Or with Docker Compose
docker compose --profile gpu up -d --scale compute-gpu=2

Submit GPU Job

# Check GPU partition
sinfo -p gpu

# Submit GPU job
sbatch examples/gpu_job.sh

# Interactive GPU session
srun -p gpu --gres=gpu:1 --pty bash

Monitoring

Enable Monitoring Stack

# Start with Prometheus and Grafana
make monitoring NODES=4

# Or with Docker Compose
docker compose --profile monitoring up -d --scale compute=4

Access Dashboards

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | (none) |
| cAdvisor | http://localhost:8080 | (none) |

Pre-configured Metrics

  • CPU usage per node
  • Memory utilization
  • Network I/O
  • Container metrics
  • SLURM job statistics

Job Examples

Interactive Jobs

# Run single command
srun hostname

# Run on specific number of tasks
srun -n 4 hostname

# Interactive shell on compute node
srun --pty bash

Batch Jobs

# Submit batch job
sbatch examples/submit_job.sh

# Check job status
squeue

# View job details
scontrol show job <jobid>

# Cancel job
scancel <jobid>
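
For orientation, here is a minimal sketch of what a basic batch script such as examples/submit_job.sh could look like; the job name, resource requests, and output path are illustrative placeholders rather than the repository's actual contents.

#!/bin/bash
#SBATCH --job-name=hello            # name shown in squeue
#SBATCH --partition=compute         # default partition from slurm.conf
#SBATCH --ntasks=1                  # single task
#SBATCH --time=00:05:00             # wall-clock limit
#SBATCH --output=hello_%j.out       # %j expands to the job ID

# Everything below runs on the allocated compute node
hostname
date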

MPI Jobs

# Compile MPI program
cd /home/hpcuser
mpicc -o hello_mpi /examples/hello_mpi.c

# Run interactively
srun -n 4 ./hello_mpi

# Submit as batch job
sbatch examples/multinode_mpi.sh
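
As a rough illustration (not the repository's actual examples/multinode_mpi.sh), a multi-node MPI batch script could look like this; node and task counts are placeholders.

#!/bin/bash
#SBATCH --job-name=mpi_hello
#SBATCH --nodes=2                   # spread ranks across two compute nodes
#SBATCH --ntasks-per-node=2         # two MPI ranks per node
#SBATCH --time=00:10:00
#SBATCH --output=mpi_hello_%j.out

# srun launches the ranks; OpenMPI picks up the SLURM allocation automatically
srun ./hello_mpi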

Array Jobs

# Submit array of 10 tasks
sbatch examples/array_job.sh

# Check array job status
squeue -u hpcuser

# Cancel specific array task
scancel <jobid>_5
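
For reference, a sketch of what an array job script in the spirit of examples/array_job.sh might contain; the range and workload are placeholders.

#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-10                # ten independent tasks
#SBATCH --ntasks=1
#SBATCH --output=array_%A_%a.out    # %A = job ID, %a = array index

# Each array task receives its own index via SLURM_ARRAY_TASK_ID
echo "Processing chunk ${SLURM_ARRAY_TASK_ID} on $(hostname)"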

Python MPI Jobs

# Monte Carlo Pi calculation
srun -n 4 python3 /examples/parallel_pi.py

# Matrix benchmark
srun -n 4 python3 /examples/matrix_benchmark.py
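
A batch wrapper for the mpi4py examples (along the lines of examples/submit_pi.sh, though the exact contents here are assumed) could look like:

#!/bin/bash
#SBATCH --job-name=parallel_pi
#SBATCH --ntasks=4                  # four mpi4py ranks
#SBATCH --time=00:15:00
#SBATCH --output=pi_%j.out

# mpi4py picks up its rank and communicator size from the srun launch
srun python3 /examples/parallel_pi.py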

Parameter Sweeps

# Submit parameter sweep
sbatch examples/submit_sweep.sh

# Results saved to sweep_results_<jobid>/
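
A sketch of how a sweep submission script could be structured, assuming one array task per parameter value; the --index flag passed to parameter_sweep.py is hypothetical and the real examples/submit_sweep.sh may differ.

#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --array=0-9                 # one array task per parameter value
#SBATCH --output=sweep_%A_%a.out

# Collect per-task output under a job-specific directory
RESULTS_DIR="sweep_results_${SLURM_ARRAY_JOB_ID}"
mkdir -p "${RESULTS_DIR}"

# Pass the array index to the sweep script as the parameter to vary (flag is illustrative)
srun python3 /examples/parameter_sweep.py --index "${SLURM_ARRAY_TASK_ID}" \
    > "${RESULTS_DIR}/result_${SLURM_ARRAY_TASK_ID}.txt"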

GPU Jobs

# Submit GPU job
sbatch examples/gpu_job.sh

# Interactive GPU session
srun -p gpu --gres=gpu:1 python3 -c "import torch; print(torch.cuda.is_available())"
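
For context, a minimal sketch of a GPU batch script in the spirit of examples/gpu_job.sh; the resource requests are placeholders.

#!/bin/bash
#SBATCH --job-name=gpu_check
#SBATCH --partition=gpu             # GPU partition from slurm.conf
#SBATCH --gres=gpu:1                # request one GPU
#SBATCH --time=00:10:00
#SBATCH --output=gpu_%j.out

# Confirm the GPU is visible inside the allocation
nvidia-smi
python3 -c "import torch; print(torch.cuda.is_available())"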

Command Reference

Make Commands

| Command | Description |
|---|---|
| make help | Show all available commands |
| make build | Build all Docker images |
| make build-gpu | Build GPU-enabled images |
| make up NODES=N | Start cluster with N compute nodes |
| make down | Stop the cluster |
| make restart | Restart the cluster |
| make scale NODES=N | Scale to N compute nodes |
| make status | Show cluster and job status |
| make nodes | Show detailed node information |
| make jobs | Show job queue |
| make logs | Follow all container logs |
| make logs-controller | Follow controller logs |
| make logs-compute | Follow compute node logs |
| make shell | Open root shell on controller |
| make shell-user | Open shell as hpcuser |
| make test | Run cluster tests |
| make test-mpi | Run MPI test job |
| make test-python | Run Python MPI test |
| make examples | Copy examples to controller |
| make monitoring | Start with monitoring stack |
| make gpu | Start with GPU nodes |
| make backup | Backup cluster data |
| make clean | Stop and remove containers |
| make clean-volumes | Remove all data volumes |
| make clean-all | Remove everything |
| make info | Show cluster information |

SLURM Commands

| Command | Description |
|---|---|
| sinfo | View cluster/partition status |
| sinfo -N -l | Detailed node list |
| squeue | View job queue |
| squeue -u <user> | View user's jobs |
| srun <cmd> | Run interactive job |
| srun -n N <cmd> | Run with N tasks |
| srun --pty bash | Interactive shell |
| sbatch <script> | Submit batch job |
| scancel <jobid> | Cancel job |
| scancel -u <user> | Cancel all user's jobs |
| scontrol show job <id> | Job details |
| scontrol show node <node> | Node details |
| sacct | Job accounting history |
| sacct -j <jobid> | Specific job accounting |
| sstat -j <jobid> | Running job statistics |

Useful Aliases (Pre-configured)

sq   # squeue
si   # sinfo
sc   # scancel

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| NODES | 2 | Number of compute nodes |
| TZ | UTC | Timezone |
| MYSQL_ROOT_PASSWORD | rootpassword | Database root password |
| MYSQL_USER | slurm | Database user |
| MYSQL_PASSWORD | slurmdbpass | Database password |
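
A minimal sketch of overriding these defaults at startup, assuming Docker Compose reads them from the shell environment or a local .env file; the values are placeholders.

# Override defaults for this shell session, then start the cluster
export TZ=America/New_York
export MYSQL_ROOT_PASSWORD=change-me-root
export MYSQL_PASSWORD=change-me-slurm
make up NODES=6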

Compute Node Resources

Edit docker-compose.yml:

compute:
  deploy:
    resources:
      limits:
        cpus: '4'        # CPUs per node
        memory: 8G       # Memory per node
      reservations:
        cpus: '1'        # Minimum CPUs
        memory: 1G       # Minimum memory

SLURM Partitions

Edit config/slurm.conf:

# Default compute partition
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Quick debug partition (1 hour max)
PartitionName=debug Nodes=ALL MaxTime=01:00:00 State=UP Priority=100

# GPU partition
PartitionName=gpu Nodes=ALL MaxTime=INFINITE State=UP

# High memory partition
PartitionName=highmem Nodes=ALL MaxTime=24:00:00 State=UP MinMemoryNode=16000
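
Jobs target a specific partition with -p at submission time, for example:

# Run a short test in the one-hour debug partition
srun -p debug -t 00:05:00 -n 2 hostname

# Send a batch job to the high-memory partition
sbatch -p highmem examples/submit_job.sh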

Job Defaults

Edit config/slurm.conf:

# Default job settings
DefMemPerCPU=1000          # 1GB per CPU
MaxJobCount=10000          # Max queued jobs
MaxArraySize=1001          # Max array size
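
Individual jobs can override the memory default at submission time, for example:

# Request 2 GB per CPU instead of the 1 GB default
sbatch --mem-per-cpu=2000 examples/submit_job.sh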

File Structure

hpc/
├── docker-compose.yml           # Main orchestration file
├── Makefile                     # Management commands
├── README.md                    # This file
│
├── Dockerfile.base              # Base image with SLURM/MPI
├── Dockerfile.controller        # Controller node
├── Dockerfile.compute           # Compute nodes
├── Dockerfile.slurmdbd          # Database daemon
├── Dockerfile.gpu               # GPU-enabled nodes
├── Dockerfile.nfs               # NFS server
│
├── config/
│   ├── slurm.conf              # SLURM configuration
│   ├── slurmdbd.conf           # Database daemon config
│   ├── cgroup.conf             # Cgroup configuration
│   └── gres.conf               # GPU resource config
│
├── scripts/
│   ├── start-controller.sh     # Controller startup
│   ├── start-compute.sh        # Compute node startup
│   ├── start-slurmdbd.sh       # Database daemon startup
│   ├── start-gpu-compute.sh    # GPU node startup
│   ├── start-nfs.sh            # NFS server startup
│   └── generate-munge-key.sh   # Munge key generator
│
├── monitoring/
│   ├── prometheus.yml          # Prometheus configuration
│   └── grafana/
│       └── provisioning/
│           ├── dashboards/
│           │   ├── dashboards.yml
│           │   └── hpc-cluster.json
│           └── datasources/
│               └── datasources.yml
│
└── examples/
    ├── hello_mpi.c             # MPI hello world (C)
    ├── parallel_pi.py          # Monte Carlo Pi (Python)
    ├── matrix_benchmark.py     # Matrix benchmark
    ├── parameter_sweep.py      # Parameter sweep example
    ├── submit_job.sh           # Basic batch job
    ├── multinode_mpi.sh        # Multi-node MPI
    ├── array_job.sh            # Array job
    ├── gpu_job.sh              # GPU job
    ├── submit_pi.sh            # Python MPI job
    └── submit_sweep.sh         # Parameter sweep job

Troubleshooting

Common Issues

Cluster Won’t Start

# Check container status
docker compose ps -a

# View startup logs
docker compose logs

# Rebuild images
make clean && make build && make up NODES=4

Nodes Not Registering

# Check compute node logs
docker compose logs compute

# Verify munge authentication
docker exec hpc-controller bash -c "munge -n | unmunge"

# Restart compute nodes
docker compose restart compute

Jobs Stuck in Pending

# Check why job is pending
scontrol show job <jobid> | grep Reason

# Check node availability
sinfo -N -l

# Check resource requests
squeue -o "%.10i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %.20R"

Database Issues

# Check database connectivity
docker exec hpc-database mysql -u slurm -pslurmdbpass -e "SELECT 1"

# View slurmdbd logs
docker compose logs slurmdbd

# Restart database services
docker compose restart database slurmdbd

Munge Authentication Failures

# Check munge is running
docker exec hpc-controller pgrep munged

# Test munge
docker exec hpc-controller bash -c "munge -n | unmunge"

# Regenerate munge key (requires restart)
docker compose down
docker volume rm hpc-munge-key
make up NODES=4

Debug Mode

# Run with verbose logging
docker compose up --scale compute=2

# Check SLURM debug logs
docker exec hpc-controller tail -f /var/log/slurm/slurmctld.log

# Check compute node logs
docker exec hpc-compute-1 tail -f /var/log/slurm/slurmd.log

Performance Tuning

Memory Settings

# docker-compose.yml
compute:
  deploy:
    resources:
      limits:
        memory: 8G
  environment:
    - SLURM_MEM_PER_NODE=8000

CPU Pinning

# config/slurm.conf
TaskPlugin=task/affinity
TaskPluginParam=cores

Network Optimization

# docker-compose.yml
networks:
  hpc-network:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 9000

MPI Tuning

# In job scripts
export OMPI_MCA_btl_tcp_if_include=eth0
export OMPI_MCA_mpi_yield_when_idle=1

Security

Default Credentials

| Service | Username | Password |
|---|---|---|
| MariaDB | root | rootpassword |
| MariaDB | slurm | slurmdbpass |
| Grafana | admin | admin |
| SSH | hpcuser | (key-based) |

Recommendations for Production

  1. Change default passwords in docker-compose.yml and config/slurmdbd.conf (see the sketch after this list)
  2. Enable TLS for Grafana and Prometheus
  3. Restrict network access using firewall rules
  4. Use secrets management for sensitive data
  5. Enable SLURM accounting for audit trails
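
A minimal sketch of the first recommendation, assuming Docker Compose reads the MYSQL_* variables listed above from a local .env file; adapt it to your own secrets workflow.

# Generate random replacement passwords into a local .env file (keep it out of git)
umask 077
cat > .env <<EOF
MYSQL_ROOT_PASSWORD=$(openssl rand -base64 24)
MYSQL_PASSWORD=$(openssl rand -base64 24)
EOF

# Also update the matching database password in config/slurmdbd.conf so slurmdbd can still connect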

Cleanup

# Stop cluster (preserves data)
make down

# Stop and remove containers
make clean

# Remove all data volumes
make clean-volumes

# Complete cleanup (containers, volumes, images)
make clean-all

# Remove specific volumes
docker volume rm hpc-db-data hpc-slurm-logs

Why This Matters

For me, this project solved a real problem: I needed to experiment with job scheduling, test MPI code, and prototype distributed algorithms without provisioning physical hardware. Now I can iterate on SLURM configurations, test scaling behavior, and develop parallel applications—all from my homelab.

For others, it’s a learning tool, a development environment, or a foundation for building actual production clusters. The entire configuration is transparent and hackable.


Technical Stack

| Component | Version |
|---|---|
| Rocky Linux | 9 |
| SLURM | 23.x |
| OpenMPI | 4.x |
| Python | 3.9+ |
| MariaDB | 10.11 |
| Prometheus | 2.48 |
| Grafana | 10.2 |
| CUDA | 12.2 |

Get the Code

The project is open source and available on GitHub:

github.com/marcosfermin/hpc

Clone it, customize it, break it, fix it. PRs welcome.
