
Cluster

Acknowledgement

If your work is or has been supported by the use of this cluster or the Lamarr infrastructure in general, it is strongly recommended that you include an acknowledgement in any publication, to allow continued use by you and others. We suggest the following wording:

This research has been funded/supported by the Federal Ministry of Education and Research of Germany and the state of North Rhine-Westphalia as part of the Lamarr Institute for Machine Learning and Artificial Intelligence.


This documentation describes how to use the Slurm-managed computer cluster of the Lamarr Institute.


Quick start

This is a condensed introduction of how to use our cluster. More detailed information is provided in the later sections.

Overview

Our cluster allows us to perform compute-heavy research that needs large amounts of data or large amounts of processing power. To this end, it provides multiple nodes with many CPU cores, powerful GPUs and lots of main memory, as well as large storage capabilities. Users can access this compute power by starting Slurm jobs, a form of light-weight virtual machines, with the desired hardware resources and software installation.

(Figure: overview of the cluster architecture)

The cluster is hidden behind 3 gateway servers (most importantly, the default gwkilab server). Using SSH, users can connect to their jobs via a gateway machine and run their experiments. Jobs have a maximum wall-time; after this time, they will be terminated automatically. Results can be persisted in the network file system cephfs, which is shared across all nodes in the cluster and mounted under /home by default. Additionally, each node has fast local storage, mounted under /raid by default.

Getting access to the cluster

How can you get access to the cluster?

  1. You are required to have an official Lamarr account, including access to Mattermost. Contact your supervisor to have the account created for you.
  2. Contact Dominik Baack via Mattermost to have a cluster account created, ideally using this template:
Name:
E-Mail:
Type of Work:
Affiliation: Fraunhofer, Dortmund, Bonn, ...
Chair: 
Supervisor:
Approximate access duration:
Telephone (optional):
Please remember to add an SSH key file
  3. Wait until you receive your login information; this usually takes no more than 7 days.

Slurm: Basic vocabulary

Slurm is a workload manager, which, in our case, manages the execution of Docker containers. As such, Slurm is the main interface between you and our cluster.

A Slurm job resembles a computer that you can access remotely via SSH to perform computations, to load and save files, and so on. However, a job is only virtual and it will be removed after some time. Therefore, in order to preserve the results of your work, you will need to save them to specific locations in our network file system.

Each Slurm job is based on a Docker container, an isolated environment in terms of the software that is installed. You can think of a container as an Anaconda or a venv environment - covering, however, not only a set of Python packages but a complete operating system. When a Slurm job terminates, you can start another job that is based on the same container: the new job will have the same software and configuration, as if you had rebooted an actual computer.

You will also encounter the concept of Docker images. An image is a blueprint for a container, i.e., it defines which software is initially installed and how the container is initially configured. When creating a container, you will have to decide which image to use.

Managing Slurm jobs

This section describes how to create and maintain Slurm jobs and their Docker containers.

Creating a Slurm job

  1. Use SSH to log in to gwkilab, the main gateway server in our cluster.

    ssh <username>@gwkilab.cs.tu-dortmund.de
  2. Start a tmux session. This session is needed later, to let your container run in the background.

    tmux new -s <name>

    It is useful to choose a short and descriptive name for your session.

  3. Create a new Slurm job.

    srun --mem=64GB   \ # specify the amount of RAM
         --export ALL \ 
      -c 8        \ # specify the number of cores
         --container-name=<name> \ # container name (e.g., use the same as for your tmux session)
          -p CPU      \ # container queue (CPU, GPU1, GPU2, GPU4, GPU8)
         --container-image=nvcr.io/ml2r/interactive_ubuntu:22.04 \ # container image name
     --mail-user=email@address.xx \ # Please use your email address!
         --mail-type=ALL \ 
         --pty /bin/bash

    or as GPU Job

    srun --mem=64GB   \ # specify the amount of RAM
         --export ALL \ 
      -c 8        \ # specify the number of cores
         --gres=gpu:1 \
         --container-name=<name> \ # container name (e.g., use the same as for your tmux session)
     --job-name="SomeName" \  # arbitrary job name to make it easier for you to identify in squeue
         -p GPU1      \ # container queue (CPU, GPU1, GPU2, GPU4, GPU8)
         --container-image=nvcr.io/ml2r/interactive_cuda \ # container image name
     --mail-user=email@address.xx \ # Please use your email address!
         --mail-type=ALL \ 
         --pty /bin/bash

    (The inline comments above are for explanation only; remove them when running the command, or write everything in a single line instead of breaking lines with \.) As you can see, starting a job requires you to choose a Docker image.

    --pty must be placed last and defines which shell or task is started by default.

  4. Do stuff!

  5. Learn how to detach and re-attach to your job (or simply exit when you're done).

By the way, the above command has not only created a Slurm job, but also a Docker container, which defines the environment of the job. When your job terminates, you will be able to recover your container from the --container-name that you have assigned.

Partition and Hardware specification

The partition specified with the parameter -p GPU1 etc. ensures that the job is placed in the correct queue. The queue name represents the maximum number of GPUs that can be allocated. At present there is no restriction on requesting more GPUs than the queue provides, but the job can then be stopped at any time without warning by an automatic cleanup script.

The actual number of GPUs the job receives is specified by the parameter --gres. There are currently 3 configurations possible here:

--gres=gpu:1
--gres=gpu:nvidia_a100-sxm4-40gb:1
--gres=gpu:nvidia_a100-sxm4-80gb:1

The first method assigns the next available GPU; methods 2 and 3 specify exactly which GPU model the job should use, in this case the 40 GB or 80 GB variant of the A100. Unless you specifically need one model, the automatic variant is preferable.

The last parameter, here ":1", specifies the actual number of GPUs assigned. This must be less than or equal to the number allowed by the queue:

GPU1 -> 1
GPU2 -> 1, 2
GPU4 -> 1, 2, 3, 4
GPU8 -> 1, 2, 3, 4, 5, 6, 7, 8
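
For example, a two-GPU job should be submitted to the GPU2 queue with a matching --gres request; the resource values below are illustrative:

srun --mem=128GB --export ALL -c 16 \
     -p GPU2 --gres=gpu:2 \
     --container-name=<name> \
     --container-image=nvcr.io/ml2r/interactive_cuda \
     --pty /bin/bash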

The available accelerators can be checked via sinfo -o %G. The list above should usually be up to date.

Available Docker images

The --container-image argument has the following form:

nvcr.io/ml2r/interactive_<base name>:<version>

The following choices for <base name> and <version> are available:

base name     | versions
pytorch       | 22.05-py3, 22.12-py3, 23.01-py3, 23.07-py3, 23.09-py3, 23.12-py3
tensorflow    | 22.05-tf2-py3, 22.12-tf2-py3, 23.01-tf2-py3, 23.07-tf2-py3, 23.09-tf2-py3
ubuntu        | 18.04, 20.04, 22.04, 23.04
cuda          | 12.2.0
cuda + OpenGL | 11.4.2-ubuntu20.04

For example, if you want an Ubuntu 22.04, you need to specify nvcr.io/ml2r/interactive_ubuntu:22.04. When no version is specified, the latest one is used.

Starting from any of the above images, you can later install additional Python packages or system packages. You can also create custom images, which is, however, a more involved process than simply starting from an available image.

The build process of all available Docker images is published in the custom-container-example repository.

If you need images from other repositories, you can use the following syntax docker://REPOSITORY#Image:TAG, for example:

--container-image="docker://ghcr.io#ghcr.io/huggingface/text-generation-inference:latest"

Detaching and re-attaching to your job

If you want to leave your job running in the background, you cannot simply close the terminal; if you did, the container would be terminated and you might lose your progress. For this reason, we have already created our job inside a tmux session, which we can now simply detach from:

  1. Inside your tmux session, press Ctrl+b and then d to detach. Now you are back on gwkilab.
  2. To re-attach your session, run tmux attach -t <name>, where <name> is the name of your session. Now you should be back in your container.

Tip

If you have only a single session, it suffices to run tmux a to re-attach.

Stopping a job

If your work is done, please stop your job manually. To do so, re-attach to your job (if not already attached) and simply exit the terminal session, either by running exit or by pressing Ctrl+d. Your tmux session should be terminated automatically.

Alternatively, the Slurm command scancel ${JOBID} can be used to cancel a job, even if it is not attachable or accessible via the gateway.
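
For example, you can look up the job ID with squeue and then cancel it (the ID below is illustrative):

# on gwkilab: list your jobs, then cancel the one you no longer need
squeue -u $USER
scancel 1234567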

Starting a job from an existing container

If you recently started and stopped a Slurm job, you can recover its container by starting a new job with the same --container-name but without the --container-image argument. This procedure resembles rebooting an actual computer: all software and configuration of your container will be present in your new job, too.

What is even better than a reboot is that you can also re-configure the resources of your job. A restart of the above example job would look like this (note the missing --container-image argument):

srun --mem=32GB   \ # specify the amount of RAM (might be different than before)
     --export ALL \ 
      -c 4        \ # specify the number of cores (might be different than before)
     --container-name=<name> \ # container name
      -p CPU      \ # container queue (CPU, GPU1, GPU2, GPU4, GPU8)
     --pty /bin/bash

You can find all your cached containers in /cephfs/containers/user-$( id -u ). Make sure to clean up here every once in a while - just delete unused folders with rm -rf {CONTAINER-NAME}.
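
For example (the container name below is illustrative):

# list your cached containers on cephfs
ls /cephfs/containers/user-$( id -u )
# remove a container you no longer need
rm -rf /cephfs/containers/user-$( id -u )/my_old_container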

Rules and limits

The following table gives an overview of the resource limits in each Slurm queue (the one you select with the -p argument in your srun command). If no -p argument is given, the default GPU1 queue is chosen.

queue | CPUs (default) | memory (default) | wall-time (default)
CPU   | 32             | 128984 MB        | 7 days (max 21 days)
GPU1  | 32             | 128984 MB        | 7 days (max 14 days)
GPU2  | 64             | 257968 MB        | 3 days (max 7 days)
GPU4  | 128            | 515936 MB        | 3 days (max 3 days)
GPU8  | 256            | 1031870 MB       | 1 day (max 2 days)

Hardware quota

To provide all users with a fair share of the cluster, some limits and constraints must be enforced. Foremost, this is a runtime limit depending on the number of GPUs requested. In addition, limits on memory (RAM) and the number of CPU cores are automatically set to reasonable default values. It is possible to override the core count with the -c flag in srun, and the memory limit with --mem. Keep the available hardware of a single node in mind: e.g., if you request 1 TB of RAM, no other job will be able to run on the same machine.

Queue and wall-time

If more jobs are scheduled than the cluster can serve, jobs are put into a queue where they wait until the requested resources become available. The queues separate CPU-only jobs from jobs with varying numbers of GPUs and enforce the constraints in the table above. As of now, the number of requested GPUs and the queue the job is placed in are separate parameters of the Slurm command. If both values do not match, in particular if more GPUs are requested than the queue allows, the job can be killed at any time by automatic scripts without further notification!

Fair share and accounting

To enable fair use, all requested resources are logged - independent of the actual consumption! Users with high resource consumption get a lower overall priority compared to new users. Multi-GPU jobs command a higher "price" than the same number of single-GPU jobs, e.g. 8x1 GPU reduces your share by a smaller amount than 1x8 GPUs when utilized for the same time.
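
Assuming standard Slurm accounting is enabled on this cluster (an assumption, not documented here), you can inspect your fair-share standing with the sshare command:

# show fair-share and usage information for your user
sshare -u $USER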

Best practices

The Slurm scheduling system allows you to set a specific name for your container via --container-name=xxx. This name takes priority over --container-image: if a container with that name already exists, it is loaded regardless of which image is specified, without any warning. Therefore, choose unique container names to avoid surprises.

Monitoring

Check the cluster status

When you are on gwkilab, you can run multiple commands to get status information about the cluster and your containers.

Cluster status

To view information about the cluster status, run sinfo:

user@gwkilab:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
CPU          up 14-00:00:0      7   unk* ml2ran[01,03-08]
CPU          up 14-00:00:0      1   drng ml2ran09
CPU          up 14-00:00:0      1    mix ml2ran10
CPU          up 14-00:00:0      3   idle ml2ran[02,11-12]
GPU1*        up 14-00:00:0      1   drng ml2ran09
GPU1*        up 14-00:00:0      1    mix ml2ran10
GPU2         up 7-00:00:00      1    mix ml2ran10
GPU2         up 7-00:00:00      3   idle ml2ran[02,11-12]
GPU4         up 3-00:00:00      3   idle ml2ran[02,11-12]
GPU8       down 2-00:00:00      2   idle ml2ran[02,12]

The STATE field provides information about the current status of this queue and the corresponding nodes. In this case the CPU queue is available on several machines and has different states depending on the node.

For some of the states it is possible to check for reasons via sinfo --long --list-reasons

STATE | Full name   | Meaning
*     | Unreachable | The state information is stale and the node currently cannot be reached
unk   | Unknown     | The queue status on these nodes is unknown
idle  | Idle        | The node is currently empty and waiting for jobs
mix   | Mixed usage | Jobs are currently running on this node and queue, but capacity is still available
full  | Full        | Queue and node are full and cannot process further jobs
drng  | Draining    | No new jobs can be scheduled on this queue and node; check the reason for details
down  | Down        | The queue is not available on this node; check the reason for details

Job status

To view info about your own jobs, run squeue -u $USER.

For more detailed information scontrol show job ${ID} can be used.

Finished job metrics

sacct is a powerful tool for querying all information logged by Slurm. For now, not all fields are populated; for example, energy consumption is still work in progress.

sacct --format="CPUTime,MaxRSS"
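
A more detailed query for a single finished job might look like this (the job ID is illustrative):

sacct -j 1234567 --format="JobID,JobName,Partition,Elapsed,CPUTime,MaxRSS,State"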

Visual live metrics

(Screenshot: Grafana dashboard with live cluster metrics)

Monitoring with visual output can be done via the Grafana website running on the cluster. For privacy reasons the website is not available outside the cluster network and therefore requires a dedicated port tunnel or a corresponding proxy setup.

Currently, Grafana is reachable via ml2rsn06s0:3000/ and no public website is available. This will change soon.
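
Until then, a simple SSH port forward through the gateway should work (assuming the gwkilab host entry from the SSH section below):

# forward the Grafana port to your local machine
ssh -L 3000:ml2rsn06s0:3000 gwkilab
# then open http://localhost:3000 in your local browser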

Installing Python packages

Installation of Python packages works as usual via pip, or via conda in images where conda is installed.
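
For example (the package names are illustrative):

# inside your job: install packages with pip
pip install numpy pandas
# or, in images that ship conda:
conda install -c conda-forge numpy pandas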

Cleanup

Data on /raid will be removed if no associated process is found running on the machine. The time frame in which this check occurs is flexible.

Basic topics

This section provides basic background information on the topics discussed above.

Storage

Our cluster provides the two main kinds of working storage described below. Neither is intended for long-term storage: if your account is deactivated, data will be removed after a short time due to privacy concerns. For good scientific practice, data should be kept for 10 years, so you need to copy your data to an archive at the end of your thesis!

Global, persistent storage (/home/{user}): This storage option is shared across all containers. Made persistent and secure by the Ceph file system across multiple storage nodes, it is, however, presumably slower than the local storage described next.

Local, non-persistent storage (/raid/{user}): The local SSD memory of every A100 node is the fastest way to store data. However, upon container termination or node failure, data is likely lost or inaccessible. Stored data is not directly purged after every job, but will eventually be removed after longer inactivity.

Best practices

  • Use persistent storage (e.g., /home/{user}) for any important data: code, configurations, experimental results
  • Use non-persistent storage (e.g., /raid/{user}) for temporary files and fast access: model snapshots, downloaded data sets
    • It is recommended to copy your datasets to /raid/{user} when required for training and clean up after your job has finished
  • You can start by dumping all results to local storage, and then periodically copy them to global storage
  • If you want to run compute-intensive experiments, it might make sense to explicitly install Python packages to the local SSD
  • You can specify which node you want to use when starting containers - this allows you to re-access its local storage!
  • While both storage types have high limits, please clean up after yourself, and do not store extremely big files persistently, i.e., on /home/{user} or inside your container workspace
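
A typical workflow following these practices might look like this (all paths below are illustrative):

# inside your job: stage the dataset on the fast local SSD
mkdir -p /raid/$USER/datasets
cp -r /home/$USER/datasets/my_dataset /raid/$USER/datasets/

# ... run your training against /raid/$USER/datasets/my_dataset ...

# persist the results on cephfs and clean up the local storage
cp -r /raid/$USER/results /home/$USER/experiments/
rm -rf /raid/$USER/datasets/my_dataset /raid/$USER/results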

Note

See Advanced storage topics for more information on the file systems.

SSH

For easy access with SSH, it is recommended that you add the address of your Slurm job to the ~/.ssh/config file of your local computer. The .ssh folder is found in your user's home folder.

Example entry for accessing a job that runs on the node ml2ran10s0 with port 26000:

Host gwkilab
	HostName gwkilab.cs.tu-dortmund.de
	User <username>

Host slurmjob
	HostName ml2ran10s0
	ProxyJump gwkilab
	Port 26000

For <username>, you need to enter your LS8 account user name.

With the above configuration, you can type ssh slurmjob to open a terminal in your Slurm job. This configuration is also the basis for several common use cases for development. If problems arise, check whether ~/.ssh/authorized_keys inside the job contains a key. If not, you must generate a new key file on gwkilab with the following commands and copy it into the job.

# ON **GWKILAB**

cd ~/.ssh/
ssh-keygen  # Press Enter 5 times

echo "IdentityFile ~/.ssh/id_rsa" >> ~/.ssh/config

echo "Copy to JOB ~/.ssh/authorized_keys:"
cat id_rsa.pub # Copy this out

Important

It is important that the newly generated key is copied into the first line of ~/.ssh/authorized_keys (the public key of your laptop should then be at the second line).

Common use cases for development

VSCode remote extension

You can connect your local Visual Studio Code window to the container and run code remotely. To do this, you need to set up your ~/.ssh/config as described above. Next, open a new Visual Studio Code window, click on Extensions in the left pane, search for the extension named Remote - SSH by Microsoft and install it. You may need to restart Visual Studio Code afterwards.

Then, you can connect your window by pressing CTRL+SHIFT+p (CMD+SHIFT+p on Mac) and searching for Remote-SSH: Connect Current Window to Host.... There, you select slurmjob (or any other name you have chosen in your ~/.ssh/config file). After the connection is established, you can open files, edit projects etc. as if they were on your local machine.

Jupyter notebooks

Inside a tmux session in your Slurm job, set a notebook password first:

jupyter notebook password

You will be prompted to enter a password <password>, which you will need later.

Then, start Jupyter like this:

jupyter notebook --ip="*"

When running this command, take note of the port Jupyter reports on the command line. As an example output:

[I 13:04:26.192 NotebookApp] Jupyter Notebook 6.4.10 is running at:
[I 13:04:26.192 NotebookApp] http://hostname:8888/

Then, from your local machine, you can forward this port by running the following command in a new terminal window:

ssh -L 8080:localhost:8888 slurmjob

This forwards port 8888 from the container to your local machine's port 8080; the forwarding stops when you close this terminal. You can then access the notebooks on your local machine by visiting localhost:8080 in your web browser. Type in the password <password> you set previously and you are good to go.

Instead of manually forwarding the ports, you can add the following to your ~/.ssh/config file to make this persistent:

Host slurmjob
	HostName ml2ran10s0
	ProxyJump gwkilab
	Port 26000
	LocalForward 8080 localhost:8888

Then, you just need to call ssh slurmjob in a terminal and you can connect to Jupyter in your browser as before.

Default Container

We provide multiple default containers with essential development tools preinstalled, ready to use for interactive development sessions. The images are based on Nvidia's NGC containers and are therefore optimized for the Nvidia architecture.

These images are intended for interactive development, not for batch processing.

Interactive PyTorch

This container has PyTorch preinstalled, but it is not directly accessible from your venv by default. To solve this, for versions after 22.05 you must execute:

/bin/python -m venv --system-site-packages .venv/sw

once at the beginning to initialize your venv with the preinstalled PyTorch package. It can also be found manually under /opt/pytorch.
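
You can then verify that the preinstalled PyTorch is visible from the new venv, for example:

# activate the venv created above and check the bundled PyTorch
source .venv/sw/bin/activate
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"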

Name      | PyTorch Version  | Ubuntu Version | Disclaimer
22.05-py3 | 1.12.0a0+8a1a93a | 20.04          | Uses Conda
22.12-py3 | 1.14.0a0+410ce96 | 20.04          | Requires to use /bin/python
23.01-py3 | 1.14.0a0+44dac51 | 20.04          | Requires to use /bin/python
23.07-py3 | 2.1.0a0+b5021ba  | 22.04          | Requires to use /bin/python
23.09-py3 | 2.1.0a0+32f93b1  | 22.04          | Requires to use /bin/python
23.12-py3 | 2.2.0a0+81ea7a4  | 22.04          | Requires to use /bin/python

https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/

Interactive Tensorflow

Ubuntu

Cuda

Advanced topics

This section describes topics that go slightly beyond the basic usage of the cluster. It assumes that you are familiar with all basic topics.

Installing system packages

System packages are typically installed through apt install <package name>, which requires root privileges. If you have started your job the recommended way, these privileges are unavailable for security reasons.

Just do

apt-get -o Dpkg::Options::="--force-not-root" install <package name>

to trick apt-get into installing anyway.

Legacy solution, if you need root in scenarios other than apt-get (note that you can do a lot without being root, e.g. editing files in /etc/):

To install a system package nevertheless, we need to start a separate root-privileged session.

  1. Stop your current job: attach to your screen / tmux session and terminate the running srun command by pressing CTRL+C. Caution: This will terminate all SSH sessions and all computations in your job. You can, however, return to this job later; all of its configuration (like installed Python packages) is persisted.
  2. Start a spin-off of your job with root privileges:
# inside a screen / tmux session

srun -c 1 --container-remap-root --container-name=<my container name> --pty /bin/bash
  3. Inside the newly opened terminal, you can freely install system packages. You can also carry out any other configuration that requires root privileges.
# inside your job

apt install <package name>
  4. Stop the root-privileged session by pressing CTRL+D.
  5. Restart your actual job without root privileges.

Custom Docker images

How to create your own images

Overview

In order to deploy your own Docker images via Slurm, you have to follow these steps:

  1. Build an image on your local machine¹ and push it into the nvcr.io registry²
  2. Log into gwkilab and start the freshly built image via srun or sbatch³
  3. Start whatever you wanted to start

¹ You can build Docker images from any machine that contains your nvcr API token (see below), except gwkilab, gwkilab1, gwkilab2 due to their Docker registry configuration. We recommend that you use your local machine.

² Nvidia ships their own Docker version with DGX clusters that is geared towards GPU usage, but comes with its own image registry. At this moment there does not seem to be a way to host our own registry (?)

³ Make sure that you tmux or screen beforehand to make jobs persistent as discussed below.

flowchart LR
PC --->|push| nvcr.io/ml2r
PC --->|ssh| gwkilab --->|srun| ml2ran01 --->|docker| Container
nvcr.io/ml2r --->|auto pull| gwkilab
gwkilab --->|srun| ml2ran02 --->|docker| Container
gwkilab --->|srun| ... --->|docker| Container

1) Getting an nvcr API token

  • Go to https://catalog.ngc.nvidia.com/ and log in (top right corner). Make sure you are part of the appropriate Lamarr team, e.g. the Lamarr/lamarr-dortmund team. Note: all images are stored under the URL nvcr.io/ml2r/$TEAM/$IMAGE_NAME, which differs from the default URL nvcr.io/ml2r/$IMAGE_NAME that does not contain the $TEAM name.
  • Click on your account name in the top right corner and go to Setup and then click on Get API Key.
  • In the top right corner you find Generate API key to generate a new API key.
    • Note Carefully store this API key. The webpage does not store it for you. There is no way to view this API key afterwards!

2) Building and pushing the docker image from your local machine

  • Make sure that docker is correctly setup on your local machine.
  • Add the nvcr API token to your local docker setup
user@PC:~$ sudo docker login nvcr.io
Username: $oauthtoken 
Password: YOUR_API_KEY

Note: $oauthtoken is a special user name that indicates that you will authenticate with an API key and not a username and password.

Note: Oftentimes the password prompt shows no characters (e.g. in bash), but you can still copy and paste the API key. If you are in doubt whether your API key was stored correctly, you can check ~/.docker/config.json or /root/.docker/config.json (depending on your system setup), which should contain the correct API key.

  • Build your docker image with the name $IMAGE_NAME and the team $TEAM, tag it correctly, and push it to nvcr:
user@PC:~$ sudo docker build -f Dockerfile --network=host -t $IMAGE_NAME .
user@PC:~$ sudo docker tag $IMAGE_NAME nvcr.io/ml2r/$TEAM/$IMAGE_NAME
user@PC:~$ sudo docker push nvcr.io/ml2r/$TEAM/$IMAGE_NAME

Note

The build process of all available Docker images is published in the custom-container-example repository. We recommend using this build process as your starting point.

Note

We understand that the path nvcr.io/ml2r/$TEAM/ looks somewhat weird as it contains ml2r and lamarr, but this cannot be changed at the moment due to historic reasons and due to the account management of Nvidia.

3) Starting your container as usual.

You should be able to start your container as usual via srun / sbatch. However, make sure that you adapt the path correctly to contain your team's name, i.e.

user@gwkilab:~$ srun --mem=96GB --export ALL -c 16 --container-name=my_nice_container -p GPU1 --gres=gpu:1 --container-image=nvcr.io/ml2r/$TEAM/$CONTAINERNAME --pty /bin/bash

Note: The first loading and creation of a container including the file system can take some time, more than 10 minutes for a very large container.

Advanced storage topics

The storage is divided between two distinct units: very fast local storage (/raid) and slower remotely mounted storage (cephfs). Both systems have distinct advantages and disadvantages:

\             | Raid                                                     | CephFS
Speed         | Fast                                                     | Slow
Latency       | Low                                                      | High
Reliability   | Data can be deleted after a job ends or the node stops   | Redundant storage distributed across multiple nodes
Accessibility | Shared between jobs on a single node                     | Shared between all jobs in the cluster environment and the gateways
Size          | Ranging from 14 to 28 TB depending on the node           | Several hundred TB
Use case      | Temporary data, checkpoints, intermediate results, training data | Final results

Expert topics

Slurm

Some Slurm commands and options can be useful but are not generally used in everyday work:

Exclude / Force Nodes

With the options -w (allow list) and -x (block list), the job can be scheduled on specific nodes or excluded from others. The block list can be used when a specific job does not start on a node because of a hardware failure or system error and the node is not automatically removed from the queue. If you experience such errors, please notify the admin team regardless.
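
For example (the node names are illustrative):

# force the job onto a specific node
srun -w ml2ran10 -p GPU1 --gres=gpu:1 --container-name=<name> --pty /bin/bash
# exclude a node that is suspected to be faulty
srun -x ml2ran05 -p GPU1 --gres=gpu:1 --container-name=<name> --pty /bin/bash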

Mass Processing

To process several files or configurations at once, a combination of environment variables and scripts can be used to steer the job. In non-container use cases it is recommended to use sbatch for this kind of workload; unfortunately, this does not currently work well with the container environment, so srun is used instead.

The first step for creating a job is creating a directory in your cephfs system where your scripts will be stored and data will be logged. For this execute the following commands on gwkilab:

mkdir -p /cephfs/users/${USER}/scripts
mkdir -p /cephfs/users/${USER}/logs
ln -s /cephfs/users/${USER}/scripts ~/
ln -s /cephfs/users/${USER}/logs ~/

Now the execution and submit scripts can be created:

cat ~/scripts/run.sh

#!/bin/bash

#SBATCH --time=02-00:00:00
#SBATCH --partition=CPU
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=32000mb
#SBATCH --mail-type=FAIL,TIME_LIMIT_90,TIME_LIMIT
#SBATCH --mail-user=XXX
#SBATCH -e ~/logs/sbatch_example-%j.err
#SBATCH -o ~/logs/sbatch_example-%j.out


echo "Job Started"
hostname
echo $SLURM_JOB_ID
echo $TEST1
echo $TEST2
cat ~/scripts/submit.sh

for itr in 1 2 3 4 5
do
    srun --job-name=test_job_${itr} --export="TEST1=1,TEST2=${itr}" --container-name=test_sbatch --container-workdir=$HOME ./run.sh &
done
wait  # wait for all parallel srun calls to finish

You need to adapt both scripts to your own data and workload.

Hardware

The cluster contains a large amount of different hardware. This includes the compute nodes, the storage infrastructure, the gateways, InfiniBand and Ethernet switches, and service machines. The following sections give an overview of the essential components:

Computing Nodes - DGX A100

(Image: DGX A100 compute node)

The cluster utilizes two versions of the DGX A100 node. Nodes 1-8 use the 40 GB GPU variant, nodes 9-12 the 80 GB variant. RAM and local disks are scaled accordingly.

Hardware affinity

For the highest performance of every container, it is necessary to pay attention to the internal architecture of the node. Each CPU socket has a fixed set of memory and PCI controllers linked to individual cores. To communicate with hardware that is not interfaced by those cores, several synchronization barriers must be passed, which significantly slows down communication; in the worst case, data must be transferred via the CPU-CPU link. The primary bottleneck on the PCI bus is the transfer of your training data from host to GPU device memory. Most of this is taken care of by Slurm, but when executing multiple programs in parallel in the same job you effectively bypass this configuration and need to handle it yourself.
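
As a rough sketch of what manual placement can look like, you could bind a process to the cores and memory of the NUMA node that belongs to its GPU, based on the affinity tables below (numactl and the training script are illustrative assumptions, not part of the cluster setup):

# run a hypothetical training script on GPU0, bound to its NUMA node (node 3 according to the table below)
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=3 --membind=3 python train.py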

CPU device affinity

nvidia-smi topo -mp
GPU CPU NUMA
GPU0 48-63,176-191 3
GPU1 48-63,176-191 3
GPU2 16-31,144-159 1
GPU3 16-31,144-159 1
GPU4 112-127,240-255 7
GPU5 112-127,240-255 7
GPU6 80-95,208-223 5
GPU7 80-95,208-223 5

Network device affinity

nvidia-smi topo -m
GPU Network Card
GPU0 mlx5_0, mlx5_1
GPU1 mlx5_0, mlx5_1
GPU2 mlx5_2, mlx5_3
GPU3 mlx5_2, mlx5_3
GPU4 mlx5_4, mlx5_5
GPU5 mlx5_4, mlx5_5
GPU6 mlx5_6, mlx5_7
GPU7 mlx5_6, mlx5_7

Storage device affinity

https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html

Storage (expert)

Hardware Description
CPU 2x EPYC Rome 7302
RAM 16x 32GB 3200MHz ECC
SSD (OS) 2x 1.92TB
SSD (Storage) 9x 15.36TB Micron 9300 Pro
Network 1x Dual 200GBe Ethernet

Networking

The storage and nodes are connected via a dual 200 Gb Ethernet network over two switches. The GPUs support collision-free communication over a fully spanning 200 Gb/s InfiniBand tree.

Enroot

Enroot in combination with Pyxis is the container interface of Slurm. It allows container images to be automatically downloaded and executed inside of jobs. Compared to Docker, the interface is more lightweight, but it does not support the full level of abstraction Docker does.