This script automates the submission of multi-node jobs using Slurm.
master.sh generates an individual copy of original.sh for each node and submits them to the Slurm scheduler.
- Execute master.sh to create a shell script for each node and submit them to Slurm.
- Create your own script (e.g., [original.sh]) and set its path as the ORIGINAL_SH variable in [master.sh].
- Include your Distributed Data Parallel (DDP) launcher, such as "torchrun" or "torch.distributed.launch", in [original.sh].
- Pass parameters into [original.sh] using $1, $2, and so on, keeping their order consistent (see the sketch after this list).
- Customize the DDP parameters in [master.sh]. In this example, "torch.distributed.launch" parameters such as MASTER_NODE, MASTER_PORT, etc. are parameterized.
- Adapt the script according to your preferences.
Feel free to modify the design for better readability and clarity.
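For illustration, a minimal sketch of how [original.sh] might read those positional parameters is shown below; the variable names mirror the arguments master.sh passes in the sbatch call further down and are assumptions, not fixed by this repository:

#!/bin/bash
# Positional parameters passed by master.sh (assumed order; see the sbatch call below)
MASTER_NODE=$1    # hostname of the master node
NODE_RANK=$2      # rank of this node (0 on the master)
MASTER_PORT=$3    # port used for the DDP rendezvous
SLURM_OUT_DIR=$4  # directory for Slurm .out/.err files
NODES=$5          # total number of nodes
GPU_PER_NODE=$6   # GPUs used on each node
# ... your DDP launch goes here (see the original.sh template at the end) ...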
When master.sh submits your scripts across your N nodes, each node's DDP module handshakes with the others (a communication check).
master.sh moves into the $DIR directory and creates OUT and SH folders. In the OUT folder it creates the $JOB_NAME.out and $JOB_NAME.err files; in the SH folder it places a copy of $ORIGINAL_SH for each node.
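In shell terms, the folder setup amounts to something like the following sketch (the orders array holding the node names is defined in the node list further down; exact details may differ from master.sh):

cd "$DIR"
mkdir -p OUT SH                      # OUT: Slurm logs, SH: per-node scripts
for node in "${orders[@]}"; do
    cp "$ORIGINAL_SH" "SH/$node.sh"  # one copy of the original script per node
done
# Slurm writes $JOB_NAME.out and $JOB_NAME.err into OUT/ as the jobs run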
JOB_NAME=My_multinode_job # Slurm job name you want
ORIGINAL_SH=./original.sh # Path of the shell script you want to submit
DIR=./result/             # Directory for the Slurm .out and .err files
List the node names mapped to the IP addresses of the servers you are using.
Set one of the nodes as MASTER_NODE and specify the number of GPUs to be used per node.
In the example below, I use 4 nodes with 4 GPUs each, totaling 16 GPUs.
MASTER_NODE="cherry-y2"
declare -A NODE_LIST  # associative array: node name -> GPU count (must be declared before use)
orders=()             # keeps the node order for rank assignment
NODE_LIST["cherry-y1"]=; orders+=( "cherry-y1" )   # left empty: not used in this 4-node example
NODE_LIST["cherry-y2"]=4; orders+=( "cherry-y2" )
NODE_LIST["cherry-y3"]=4; orders+=( "cherry-y3" )
NODE_LIST["cherry-y4"]=4; orders+=( "cherry-y4" )
NODE_LIST["cherry-y5"]=4; orders+=( "cherry-y5" )
NODE_LIST["cherry-y6"]=; orders+=( "cherry-y6" )   # left empty: not used in this example
NODE_LIST["cherry-y7"]=; orders+=( "cherry-y7" )   # left empty: not used in this example
You can rearrange the order of the arguments passed to original.sh here.
sbatch -w $node \
    -J $JOB_NAME \
    --gres=gpu:$GPU_PER_NODE \
    --output $SLURM_OUT_DIR/%j_master_${JOB_NAME}.out \
    --error $SLURM_OUT_DIR/%j_master_${JOB_NAME}.err \
    $EACH_NODE_SH/$node.sh \
    ${MASTER_NODE} ${NODE_RANK} ${MASTER_PORT} ${SLURM_OUT_DIR} ${NODES} ${GPU_PER_NODE}
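With this ordering, [original.sh] receives MASTER_NODE as $1, NODE_RANK as $2, MASTER_PORT as $3, SLURM_OUT_DIR as $4, NODES as $5, and GPU_PER_NODE as $6, matching the positional reads sketched earlier.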
#!/bin/bash
#--------------------------------------------
# Fixed Slurm parameters that master.sh does not need to control
# e.g., --time, workers, CPUs, partition, ...
#--------------------------------------------
export NCCL_SOCKET_IFNAME="YOUR SOCKET NAME" # Be sure to set this to your network interface (check with ifconfig)
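Continuing the template, here is a hedged sketch of the launch step; "torch.distributed.launch" matches the example named above, while train.py is a placeholder for your own entry point:

# Positional arguments, in the order master.sh passes them
MASTER_NODE=$1; NODE_RANK=$2; MASTER_PORT=$3
SLURM_OUT_DIR=$4; NODES=$5; GPU_PER_NODE=$6

python -m torch.distributed.launch \
    --nproc_per_node=$GPU_PER_NODE \
    --nnodes=$NODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_NODE \
    --master_port=$MASTER_PORT \
    train.py  # placeholder training script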