1. Prerequisites

This is a "quick start" introduction to using the BA-HPC cluster at the Bibliotheca Alexandrina. It covers the general activities most users will deal with when using the cluster.

In order to properly follow this quick start guide, you should have

  1. an account on the BA-HPC cluster
  2. an account at our support system
  3. knowledge of how to use SSH
  4. basic familiarity with Unix

If you do not have an account, follow the links above before proceeding with this quick start.

2. Logging into the login node

The cluster has a login node available for users to log into. From this node you can submit and monitor your jobs, look at the results of your jobs, and so on.


  • Avoid running application commands directly on the login node; doing so may result in memory faults or segmentation errors. The best practice is to run commands via Slurm, which distributes the work to the compute nodes, whether CPU or GPU.
  • DO NOT RUN computationally intensive processes on the login node. The maximum runtime of any process on the login node is 30 minutes.
  • On the login node, the maximum number of simultaneous processes per user is 100.
  • Note that your home directory $HOME is limited to 100 MB. Use the data directory (linked at $HOME/data) for large data.

For most tasks you will wish to accomplish, you will start by logging into the login node. To do that, you use the Secure Shell protocol (SSH). It is installed by default as ssh on Unix-like systems, and clients are available for Windows and Mac. If you are using a non-Unix system such as Windows, you must install an SSH client.
On Unix-like systems, you can log in by executing ssh -i path/to/private/key username@login01.c2.hpc.bibalex.org or
ssh -i path/to/private/key username@login02.c2.hpc.bibalex.org in the terminal. If you're using the Bitvise client, set the host to login01.c2.hpc.bibalex.org or login02.c2.hpc.bibalex.org, the initial method to public key, and the port to 22.



- Note that you'll need to use your real username instead of username
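
For example, a first login from a Unix-like terminal might look like the sketch below. The key path shown is only a placeholder; generating a key pair is needed only if you do not already have one registered with your account (how the public key gets registered is covered by the account-creation links above).

# Optional: generate an SSH key pair locally if you do not already have one.
# The path below is an example; keep the private key secret.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/ba_hpc_key

# Log in with your private key; replace "username" with your actual username.
ssh -i ~/.ssh/ba_hpc_key username@login01.c2.hpc.bibalex.org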

3. Setting up your environment modules

The software environment used on the BA-HPC cluster can be managed via modules. Modules facilitate the task of updating applications and provide a user-controllable mechanism for accessing software revisions and controlling combinations of versions. For your job to execute correctly, you must load any required modules before submitting it.

Common commands to work with modules:

                        
module avail                    # lists available modules
module list                     # lists current loaded modules
module help module-name         # help on specific module
module whatis module-name       # brief description on a specific module
module display module-name      # display changes by a given module
module load module-name         # load a specific module
module unload module-name       # unloads a specific module
module clear                    # unloads all loaded modules
                        
                    
  • Do not load multiple versions of the same module at the same time (including the same version built for different compilers). The module command will report a conflict if you attempt to do so.
  • If you work with licensed software, kindly note that users are responsible for providing their own licenses for software not in the public domain. It is recommended to install licensed software (a Linux version that supports parallelism) under your data directory.

4. Submitting parallel jobs

To handle the queuing, scheduling, and execution of jobs, the BA-HPC cluster uses a batch scheduling system called Slurm (Simple Linux Utility for Resource Management). Normally, you will submit jobs by writing a job script file and submitting the job to Slurm with the sbatch command.

The sbatch command takes a number of options (some of which can be omitted or defaulted). These options define various requirements of the job, which the scheduler uses to figure out what is needed to run your job and to schedule it to run as soon as possible, subject to the constraints of the system, usage policies, and the other users of the cluster. The options to sbatch can be given on the command line, or in most cases inside the job script. When given inside the job script, each option is placed alone on a line starting with #SBATCH, followed by a space and then the option.


Kindly note that these #SBATCH lines must come before any non-comment/non-blank line in the script; directives that appear after the first command are ignored.
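
To make the placement concrete, here is a minimal job script sketch. The resource values and the program name (my_program) are placeholders, not recommendations:

#!/bin/bash
#SBATCH --job-name=example   # all #SBATCH directives come first,
#SBATCH --ntasks=1           # before any non-comment command line
#SBATCH --time=00:10:00

# Commands to run come after the directives.
./my_program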

Choosing a Queue

On the BA-HPC cluster, you only specify a partition when you want to run your job on the GPU-enabled nodes. To request GPUs for your job, you need to add the #SBATCH --partition=gpu and #SBATCH --gres=gpu:N options to your job script file, where N specifies the number of GPUs that you are requesting. Kindly note that we have at most 2 GPUs per node, and the total number of GPU-enabled nodes is 16.

Currently, we do not charge GPU usage directly. GPU-based jobs are charged for the CPUs they consume on the GPU-enabled node. Every GPU-enabled node has 2 GPUs and 16 CPU cores, and since all jobs run in exclusive mode, consuming 1 GPU also consumes 8 CPU cores.
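
For example, a job that needs both GPUs on a node (and therefore, given the exclusive mode described above, all 16 of its CPU cores) would include request lines like this sketch:

#SBATCH --partition=gpu   # run on the GPU-enabled nodes
#SBATCH --gres=gpu:2      # both GPUs on one node, which also ties up its 16 CPU cores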

Choosing Slurm Account

By default you have one Slurm account for CPU usage and one for GPU usage, and the CPU account is used unless you specify otherwise. For GPU jobs, you need to set the non-default Slurm bank account in the header block of your job script using your project name, e.g.: #SBATCH --account=g.alex044


Setting Job Time

You will need to add the #SBATCH --time=00:15:00 option to your job script file to set a limit on the total run time of the job allocation. The time you request should be based on your estimate of how long the job will take to finish.

Skipping the time parameter in your job script may leave your job in a PENDING state with reason AssocGrpCPUMinutesLimit.
[username@login01 ~]$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 84353       cpu  jobname username PD       0:00      1 (AssocGrpCPUMinutesLimit)


Besides setting a time limit, it is helpful to make sure you have enough time quota for running your job; otherwise you may encounter the AssocGrpCPUMinutesLimit reason again, either before your job starts or when your job changes state from RUNNING back to PENDING.

To check your CPU time usage:

[username@login01 ~]$ cpumins

To check your GPU time usage:

[username@login01 ~]$ gpumins

5. Creating and submitting MPI job

The Message Passing Interface (MPI) is a standardized and portable system for communication between the various tasks of parallelized jobs in HPC environments. A number of different implementations of MPI libraries are available on our cluster. Although the MPI interface itself is standardized, the different implementations are not binary compatible, so it is important that the MPI implementation you run with matches the one your code was compiled with. The recommended MPI library on the BA-HPC cluster is the Intel MPI library.

Let's start by compiling a sample MPI program written in C. The program initializes a defined number of processes, each of which prints a 'Hello World' line to a file along with its process rank. The source code for this program can be found at this github gist.

To start using the MPI environment, load the Intel MPI (impi) module:

[username@login01 ~]$ module load impi

Then use the newly loaded module to compile the C program with the MPI C compiler wrapper mpicc:

[username@login01 ~]$ mpicc hello-mpi.c -o hello-mpi.bin

Now that we have our binary file, let's create a job script to submit it to Slurm.

Here's an example of a simple script that specifies the necessary job parameters; we'll call it hello-mpi.sh:

                    
#! /bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --time=00:15:00

mpirun -np 24 ./hello-mpi.bin
                    
                
  1. #SBATCH --job-name=mpi_job specifies the job name.
  2. #SBATCH --ntasks=24 specifies the number of tasks (MPI processes) to run; here we request 24.
  3. #SBATCH --cpus-per-task=1 specifies the number of cores per task; we need only one core per process.
  4. mpirun -np 24 ./hello-mpi.bin runs the MPI executable and specifies the number of processes.

Now that you have a job script, you need to submit the job to the cluster with the sbatch command. Make sure the Intel MPI libraries are loaded first, then use sbatch to submit the job to the scheduler:



                        
[username@login01 ~]$ module list
Currently Loaded Modulefiles:
  1) GCCcore/5.4.0
  2) binutils/2.26-GCCcore-5.4.0
  3) icc/2016.3.210-GCC-5.4.0-2.26
  4) ifort/2016.3.210-GCC-5.4.0-2.26
  5) iccifort/2016.3.210-GCC-5.4.0-2.26
  6) impi/5.1.3.181-iccifort-2016.3.210-GCC-5.4.0-2.26

[username@login01 ~]$ sbatch hello-mpi.sh
Submitted batch job 156
                        
                        
The number that is returned to you is your job identifier. You should use this ID any time you want to find out more information about your job, and you should always include it when opening a support ticket about a job.

At this point, your job has been placed in the queue, and will wait its turn for resources to be available. Depending on how heavily used the cluster is at that time, and how many resources you are requesting, your job might start within minutes or it might wait for hours.

Once resources become available, our scheduler will assign resources to your job, including one or more nodes.

The standard output and standard error streams will be directed to a file, by default slurm-<job-id>.out in the directory from which you submitted the job, where <job-id> is the job number described above.

Output from your job can be viewed in this file shortly after the job starts running (assuming it has produced output). This can be used to check the status of your job, although if your code generates a lot of output it is recommended to redirect it to a separate file.

For our trivial example above, when the job completes we should see something like

                                
[username@login01]$ cat slurm-156.out
Hello world: rank 12 of 24 running on comp085.local
Hello world: rank 1 of 24 running on comp085.local
Hello world: rank 2 of 24 running on comp085.local
Hello world: rank 4 of 24 running on comp085.local
Hello world: rank 7 of 24 running on comp085.local
Hello world: rank 8 of 24 running on comp085.local
Hello world: rank 9 of 24 running on comp085.local
Hello world: rank 14 of 24 running on comp085.local
Hello world: rank 15 of 24 running on comp085.local
Hello world: rank 16 of 24 running on comp085.local
Hello world: rank 17 of 24 running on comp085.local
Hello world: rank 18 of 24 running on comp085.local
Hello world: rank 20 of 24 running on comp085.local
Hello world: rank 21 of 24 running on comp085.local
Hello world: rank 0 of 24 running on comp085.local
Hello world: rank 3 of 24 running on comp085.local
Hello world: rank 5 of 24 running on comp085.local
Hello world: rank 6 of 24 running on comp085.local
Hello world: rank 10 of 24 running on comp085.local
Hello world: rank 11 of 24 running on comp085.local
Hello world: rank 13 of 24 running on comp085.local
Hello world: rank 19 of 24 running on comp085.local
Hello world: rank 22 of 24 running on comp085.local
Hello world: rank 23 of 24 running on comp085.local
                            
                        

As you can see in the output file above, the MPI program executed and each process was assigned a unique rank, which was printed along with the hostname.

6. Creating and submitting CUDA job

CUDA is a parallel computing platform and API model created by Nvidia. It allows you to use a CUDA-enabled GPU for general purpose processing – an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

Again, let's start by compiling a sample CUDA program written in C. The program uses the GPU to add two vectors of integers in parallel. It starts by generating two vectors of size n and copying them to GPU memory; each GPU core then sums a single element from each of the two input vectors and writes the result into the output vector. Finally, only the first m elements are printed to the output file. The source code for this program can be found at this github gist.

To start using the CUDA library, load the Intel C compiler and CUDA modules:

[username@login01 ~]$ module load icc CUDA

Then use the Nvidia CUDA Compiler (nvcc) to compile the source code:

[username@login01 ~]$ nvcc vector-add.cu -o vector-add.bin

Here's an example of a simple script that specifies the necessary job parameters for a GPU-based job; we'll call it cuda-vec_add.sh:

                    
#!/bin/bash
#SBATCH --job-name=first-cuda-job
#SBATCH --account=g.projectname
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00

./vector-add.bin 100000 10
                    
                

There are some additional options in this job script:

  1. #SBATCH --account=g.projectname specifies your GPU Slurm account.
  2. #SBATCH --partition=gpu submits the job to the gpu partition.
  3. #SBATCH --gres=gpu:1 specifies the number of GPUs; in this example we only need one GPU card.
  4. #SBATCH --nodes=1 specifies the number of required nodes; in this example we only need one node.
  5. #SBATCH --ntasks=1 specifies the number of CPU cores/processes to be used; in this example we only need one process to initiate our program's execution on the GPU. The maximum number of CPU cores/processes is 16.
  6. ./vector-add.bin 100000 10 generates two vectors of length 100000 and prints only the first 10 elements to the output file.

It is possible to run MPI programs that use GPUs, but only within a single node (a maximum of 2 GPUs and 16 CPU cores).

Now that you have a job script, let's submit the job to the cluster. Make sure the CUDA module is loaded first, then use sbatch to submit the job to the scheduler:

                    
[username@login01 ~]$ module list
Currently Loaded Modulefiles:
1) CUDA/8.0.44

[username@login01 ~]$ sbatch cuda-vec_add.sh
Submitted batch job 157
                    
                

After the job completes, we should see something like

                        
[username@login01]$ cat slurm-157.out
h_x = 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0

h_y = 100000.0 99999.0 99998.0 99997.0 99996.0 99995.0 99994.0 99993.0 99992.0 99991.0

The sum is:
100001.0 100001.0 100001.0 100001.0 100001.0 100001.0 100001.0 100001.0 100001.0 100001.0

                    
                

7. Creating and submitting Quantum Espresso job

Currently, we have Quantum Espresso version 6.3 (q-e-qe-6.3). A job script and a sample input can be found at this github gist.

To be able to run QE, load the Intel 2018 compiler toolchain:

[username@login01 ~]$ module load intel/2018b

Then run it in parallel via Slurm:

[username@login01 ~]$ sbatch qe-sample.sh

After the job completes, you can check the output from the Linux terminal.
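
The actual job script is the one in the gist above; the sketch below only illustrates the general shape of such a script, following the same pattern as the earlier MPI example. The input and output file names and the task count are placeholders, and it assumes Quantum Espresso's pw.x executable is on your PATH, all of which may differ from the real qe-sample.sh:

#!/bin/bash
#SBATCH --job-name=qe_sample
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

module load intel/2018b
# pw.x is Quantum Espresso's plane-wave executable; the input file name is a placeholder.
mpirun -np 24 pw.x -input sample.in > sample.out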

8. Setting up your Python environment

Working with Python pip

As mentioned previously, your home quota, which is around 100 MB, may be consumed quickly by large cache files. To be able to work with pip packages without exceeding your home quota, you can execute the commands below from your home directory to move the relevant files to your Lustre quota, i.e. your data directory.

Here, you will move the bin and lib directories to your data directory and create symbolic links under .local which reference the moved files:

[username@login01 ~]$ cd .local
[username@login01 ~]$ mv bin lib ../data/
[username@login01 ~]$ ln -s ../data/bin .
[username@login01 ~]$ ln -s ../data/lib .

Here, you will move the pip directory under .cache to your data directory (renamed to pip_cache) and create a symbolic link under .cache which references it:

[username@login01 ~]$ cd
[username@login01 ~]$ cd .cache
[username@login01 ~]$ mv pip ../data/pip_cache
[username@login01 ~]$ ln -s ../data/pip_cache pip
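
With the cache and .local directories relocated as above, pip can be used as usual (loading a Python module first if your workflow requires one). For example, a user-level install; the package name is just an illustration:

[username@login01 ~]$ pip install --user numpy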

Working with Anaconda

If your work needs Anaconda packages, you can execute the following in your home directory:

[username@login01 ~]$ module load Anaconda3
[username@login01 ~]$ source activate /share/apps/conda_envs/ba-hpc
[username@login01 ~]$ conda list

conda list will show you the libraries available in this environment, which covers those most researchers use on the BA-HPC.

If the Anaconda package you need does not exist, you can mail us at supercomputer@bibalex.org.

Working with TensorFlow

Since TensorFlow plays a large role in many machine learning projects, a ready-made TensorFlow working environment is available:

[username@login01 ~]$ module load Anaconda3
[username@login01 ~]$ source activate /share/apps/conda_envs/tensorflow

9. Creating and submitting Python job

Pi approximation

This example covers serial and parallel Python implementations of the Leibniz formula for approximating the value of pi, and demonstrates the time difference between serial and parallel execution. The source code for this program and the submission scripts can be found at the BA-HPC code repository.

You will need to load the mpi4py module before submitting your script:

[username@login01 ~]$ module load mpi4py
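
The submission scripts in the repository are the authoritative ones; the sketch below only illustrates what a submission script for the parallel version might look like. The script name pi_parallel.py and the task count are placeholders, and it assumes the mpi4py module provides the Python interpreter and MPI launcher:

#!/bin/bash
#SBATCH --job-name=pi_mpi
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --time=00:15:00

module load mpi4py
mpirun -np 24 python pi_parallel.py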

"Hello, Tensorflow"

We maintain a code repository that collects scripts intended as executable examples for carrying out tasks related to the TensorFlow machine learning framework on a High-Performance Computing (HPC) system.

Listing GPU devices

The source code for this program and the submission script can be found at the BA-HPC code repository.
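
The submission script in the repository is the authoritative one; the sketch below only illustrates the general shape of such a GPU TensorFlow job, combining the GPU options from section 6 with the TensorFlow environment from section 8. The account name and the script name list_gpus.py are placeholders:

#!/bin/bash
#SBATCH --job-name=tf_list_gpus
#SBATCH --account=g.projectname
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00

module load Anaconda3
source activate /share/apps/conda_envs/tensorflow
python list_gpus.py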

Dynamic Recurrent Neural Network

A simple Tensorflow implementation of a Recurrent Neural Network (LSTM) that performs dynamic computation over sequences with variable length on a toy dataset. The source code for this program can be found at this link. The scripts to fetch and run this program can be found at BA-HPC code repository.

Convolutional Neural Network

A TensorFlow tutorial demonstrating training a simple Convolutional Neural Network (CNN) to classify CIFAR images, which are a frequently used benchmark for image classification tasks. The source code for this program can be found at this link. The scripts to fetch and run this program can be found at the BA-HPC code repository.

10. Monitoring job status

The basic command for monitoring your jobs' status is the squeue command. Because you are normally only interested in your own jobs, it is advisable to add the -u username flag, which speeds up the command and shows only your jobs. Replace username with your actual username.
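
For example, to list only your own jobs (with username replaced by your actual username):

[username@login01 ~]$ squeue -u username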

To check elapsed time after the job finishes in the same session:

[username@login01 ~]$ sacct -Xo jobid,jobname,elapsed

To generally check all your jobs' status in any session:

[username@login01 ~]$ sacct -S 2019-01-01 -u username

11. Monitoring Lustre quota

You could use the following command to check your account storage:

[username@login01 ~]$ lfs quota -hg projectname /lfs01

We should see something like

               [username@login01 ~]$ lfs quota -hg projectname /lfs01
               Disk quotas for grp alex036 (gid 1034):
                    Filesystem    used   quota   limit   grace   files   quota   limit   grace
                         /lfs01  2.785G     10G   10.1G       -     155  100000  105000       -
               

Here, the output shows the storage used under your data directory and also the number of files, which is 155 in this example.

It's advisable to make sure that your software doesn't generate so many files that it exceeds the per-user file limit.