Distributed computing on multiple hosts #996
Good day! Is there any valid value for KERASTUNER_ORACLE_IP other than "127.0.0.1"? I tried setting it to the external IP of the host for the tuners, but this does not work even when the chief and the tuners are on the same host.

Here is my bash script to run such an experiment:

#!/bin/bash
# Runs a file in the current directory tune_model.py
# in parallel computing mode.
# Where we write the logs:
log_dir='../../reports/experiments/KT'
mkdir -p "$log_dir"
# General environment parameters:
export CUDA_DEVICE_ORDER='PCI_BUS_ID'
export KERASTUNER_ORACLE_PORT='8000'
# If no argument has been specified,
# then the control process is deployed on this host:
if [ $# -eq 0 ]; then
echo Starting the control process...
# General environment parameters:
export CUDA_VISIBLE_DEVICES=''
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_TUNER_ID='chief'
nohup python tune_model.py >> "$log_dir"/tune_chief.log &
# If the argument is given, then this is the name/ip of the host with the controlling process:
else
echo Using the control process on "$1"...
export KERASTUNER_ORACLE_IP="$1"
fi
# Determining the number of GPUs:
#num_gpus=`lshw -C display 2>/dev/null | grep "vendor: NVIDIA Corporation" -c`
num_gpus=`nvidia-smi -L | wc -l`
# We run 1 worker for each GPU:
let "max_gpu_ind=$num_gpus-1"
for gpu_ind in $(seq 0 $max_gpu_ind); do
export CUDA_VISIBLE_DEVICES=$gpu_ind
export KERASTUNER_TUNER_ID=`hostname`_tuner$gpu_ind
echo Starting training on GPU $gpu_ind:
nohup python tune_model.py >> "$log_dir"/`hostname`_tuner$gpu_ind.log &
done
# After all launches, we delete the changed environment variables:
unset CUDA_DEVICE_ORDER
unset CUDA_VISIBLE_DEVICES
unset KERASTUNER_ORACLE_IP
unset KERASTUNER_ORACLE_PORT
unset KERASTUNER_TUNER_ID

For example, I have two hosts with the following IP addresses: 192.168.0.10 and 192.168.0.20.
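With two hosts like these, it can help to confirm that the chief's port is reachable before starting any tuners. The following is only a minimal sketch: the function name `oracle_reachable` is hypothetical and not part of KerasTuner, and it assumes that bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command are installed. The default address is the first host from the example above.

```shell
#!/bin/bash
# Hypothetical helper (not part of KerasTuner): succeeds only if a TCP
# connection to host:port can be opened within 3 seconds.
oracle_reachable() {
    local host="$1" port="$2"
    # /dev/tcp/<host>/<port> is a bash redirection feature, so bash is
    # invoked explicitly; timeout guards against silently dropped packets.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Defaults are assumptions taken from the example: chief on 192.168.0.10,
# port 8000 as exported in the script.
chief_ip="${KERASTUNER_ORACLE_IP:-192.168.0.10}"
chief_port="${KERASTUNER_ORACLE_PORT:-8000}"

if oracle_reachable "$chief_ip" "$chief_port"; then
    echo "Chief is reachable at $chief_ip:$chief_port"
else
    echo "Cannot connect to $chief_ip:$chief_port" >&2
fi
```

If the check fails from 192.168.0.20 while the chief is running, one thing to look at is the address the chief itself was started with. Assuming the chief listens on the address given in its own KERASTUNER_ORACLE_IP (worth verifying for your KerasTuner version), a chief started with "127.0.0.1" would only accept connections from its own host.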
Replies: 1 comment
I used the above script under Docker and it turned out that everything works if I run the container with the --network="host" parameter!
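For anyone repeating this, here is a rough sketch of such a launch. The image name `my-tuning-image`, the script name `run_tuning.sh`, and the wrapper function `run_in_container` are all hypothetical placeholders, and `--gpus all` is an assumption that only applies if the NVIDIA Container Toolkit is set up. A likely explanation for why this helps: with `--network=host` the container shares the host's network stack, so the chief's port is exposed directly on the host's own IP instead of on a container-private address.

```shell
#!/bin/bash
# Hypothetical image name; replace with your own image.
image='my-tuning-image'

# Runs a command in a container that shares the host's network stack.
# Set DRY_RUN=1 to print the docker command instead of executing it.
run_in_container() {
    # --gpus all is an assumption (requires the NVIDIA Container Toolkit).
    ${DRY_RUN:+echo} docker run --rm --gpus all --network=host "$image" "$@"
}

# Dry run: prints the command a second host would use to start its tuners
# against a chief on 192.168.0.10 (run_tuning.sh is a hypothetical name
# for the script above).
DRY_RUN=1 run_in_container ./run_tuning.sh 192.168.0.10
```

Dropping `DRY_RUN=1` runs the container for real. On the chief's host, the same call without the IP argument would start the control process together with the local tuners.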