
Training failed exitcode : 1 (pid: 34080) / train_db.py FAILED (Multi-GPU failure) #3033

Open
Horang1173 opened this issue Jan 1, 2025 · 9 comments

@Horang1173

Horang1173 commented Jan 1, 2025

Context: Attempting to train a LoRA for the first time. Cannot figure out what this error is...
Steps to Reproduce: Run gui.bat ... Load config ... run training...
Actual Result: Failure with the following log
Expected Result: Training to start / finish.
I also can't do multi-GPU... I get an error and the training doesn't finish.

17:14:31-800933 INFO SD v2 model selected. Setting --v2 and --v_parameterization parameters
17:14:45-206542 INFO Start training Dreambooth...
17:14:45-208074 INFO Validating lr scheduler arguments...
17:14:45-211137 INFO Validating optimizer arguments...
17:14:45-212669 INFO Validating H:/LoraTrainer/Namelessimpy/log existence and writability... SUCCESS
17:14:45-213691 INFO Validating H:/LoraTrainer/Namelessimpy/model existence and writability... SUCCESS
17:14:45-214712 INFO Validating stabilityai/stable-diffusion-2 existence... SKIPPING: huggingface.co model
17:14:45-215734 INFO Validating H:/LoraTrainer/Namelessimpy/img existence... SUCCESS
17:14:45-217266 INFO Folder 150_Impy By Impy: 150 repeats found
17:14:45-218288 INFO Folder 150_Impy By Impy: 180 images found
17:14:45-222372 INFO Folder 150_Impy By Impy: 180 * 150 = 27000 steps
17:14:45-223393 INFO Regulatization factor: 1
17:14:45-224925 INFO Total steps: 27000
17:14:45-225946 INFO Train batch size: 2
17:14:45-226968 INFO Gradient accumulation steps: 1
17:14:45-228499 INFO Epoch: 2
17:14:45-229521 INFO Max train steps: 1600
17:14:45-230543 INFO lr_warmup_steps = 0
17:14:45-244330 INFO Saving training config to H:/LoraTrainer/Namelessimpy/model\ImpyIllust0.1_20250101-171445.json...
17:14:45-246373 INFO Executing command:
C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 18 H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py --config_file H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445.toml
17:14:45-265268 INFO Command executed.
[2025-01-01 17:14:50,218] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
2025-01-01 17:14:59 INFO Loading settings from H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445.toml... train_util.py:4174
INFO H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445 train_util.py:4193
WARNING v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません train_util.py:3876
2025-01-01 17:14:59 INFO prepare tokenizer train_util.py:4665
tokenizer/tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 824/824 [00:00<00:00, 1.62MB/s]
C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\huggingface_hub\file_download.py:149: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Horan\.cache\huggingface\hub\models--stabilityai--stable-diffusion-2. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer/vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 3.44MB/s]
tokenizer/merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 25.1MB/s]
tokenizer/special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 460/460 [00:00<?, ?B/s]
2025-01-01 17:15:01 INFO update token length: 75 train_util.py:4682
INFO prepare images. train_util.py:1815
INFO found directory H:\LoraTrainer\Namelessimpy\img\150_Impy By Impy contains 180 image files train_util.py:1762
INFO 27000 train images with repeating. train_util.py:1856
INFO 0 reg images. train_util.py:1859
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1864
INFO [Dataset 0] config_util.py:572
batch_size: 2
resolution: (1280, 1280)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True

                           [Subset 0 of Dataset 0]
                             image_dir: "H:\LoraTrainer\Namelessimpy\img\150_Impy By Impy"
                             image_count: 180
                             num_repeats: 150
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             caption_separator: ,
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0.0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             alpha_mask: False,
                             is_reg: False
                             class_tokens: Impy By Impy
                             caption_extension: .txt


                INFO     [Dataset 0]                                                                                              config_util.py:578
                INFO     loading image sizes.                                                                                      train_util.py:911

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 180/180 [00:00<00:00, 15329.75it/s]
INFO make buckets train_util.py:917
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます train_util.py:934
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:963
INFO bucket 0: resolution (448, 832), count: 150 train_util.py:968
[... bucket lines 1-75 omitted for brevity ...]
INFO bucket 76: resolution (2304, 704), count: 150 train_util.py:968
INFO mean ar error (without repeats): 0.029467308740607177 train_util.py:973
INFO prepare accelerator train_db.py:106
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "H:\LoraTrainer\kohya_ss\sd-scripts\train_db.py", line 529, in <module>
    train(args)
  File "H:\LoraTrainer\kohya_ss\sd-scripts\train_db.py", line 116, in train
    accelerator = train_util.prepare_accelerator(args)
  File "H:\LoraTrainer\kohya_ss\sd-scripts\library\train_util.py", line 4743, in prepare_accelerator
    accelerator = Accelerator(
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\state.py", line 217, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2025-01-01 17:15:07,348] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 34080) of binary: C:\Users\Horan\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-01_17:15:07
host : HOST
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 34080)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

17:15:08-714423 INFO Training has ended.
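
Note on the final RuntimeError: Windows builds of PyTorch ship without the NCCL backend, so any multi-process launch that defaults to nccl fails exactly like this; gloo is the only distributed backend available on Windows. A minimal sketch of a Windows-compatible process-group init, with hypothetical rank/world-size values that a launcher would normally supply:

import os
import torch.distributed as dist

# On Windows, PyTorch is built without NCCL; "gloo" is the only usable
# backend for multi-GPU data parallelism. MASTER_ADDR / MASTER_PORT are
# normally set by the launcher (accelerate / torchrun); the values here
# are hypothetical for a single-machine, 2-GPU run.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",   # backend="nccl" raises the RuntimeError seen above
    rank=0,           # hypothetical: this process's rank
    world_size=2,     # hypothetical: one process per GPU
)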

@Sylsatra

Sylsatra commented Jan 7, 2025

Hi, I have some questions about your script:

  1. Can I see your config?
  2. Can I see your accelerate config?
  3. Can you provide your system specification (GPU, CPU)?

I suspect your accelerate config is wrong, because the IP address is very weird.

@Horang1173
Author

Hi, I have some questions about your script:

  1. Can I see your config?
    "LoRA_type": "Standard",
    "LyCORIS_preset": "full",
    "adaptive_noise_scale": 0,
    "additional_parameters": "",
    "async_upload": false,
    "block_alphas": "",
    "block_dims": "",
    "block_lr_zero_threshold": "",
    "bucket_no_upscale": true,
    "bucket_reso_steps": 64,
    "bypass_mode": false,
    "cache_latents": true,
    "cache_latents_to_disk": false,
    "caption_dropout_every_n_epochs": 0,
    "caption_dropout_rate": 0,
    "caption_extension": ".txt",
    "clip_skip": 2,
    "color_aug": false,
    "constrain": 0,
    "conv_alpha": 1,
    "conv_block_alphas": "",
    "conv_block_dims": "",
    "conv_dim": 1,
    "dataset_config": "",
    "debiased_estimation_loss": false,
    "decompose_both": false,
    "dim_from_weights": false,
    "dora_wd": false,
    "down_lr_weight": "",
    "dynamo_backend": "no",
    "dynamo_mode": "default",
    "dynamo_use_dynamic": false,
    "dynamo_use_fullgraph": false,
    "enable_bucket": true,
    "epoch": 2,
    "extra_accelerate_launch_args": "",
    "factor": -1,
    "flip_aug": false,
    "fp8_base": false,
    "full_bf16": false,
    "full_fp16": false,
    "gpu_ids": "",
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": false,
    "huber_c": 0.1,
    "huber_schedule": "snr",
    "huggingface_path_in_repo": "",
    "huggingface_repo_id": "",
    "huggingface_repo_type": "",
    "huggingface_repo_visibility": "",
    "huggingface_token": "",
    "ip_noise_gamma": 0,
    "ip_noise_gamma_random_strength": false,
    "keep_tokens": 0,
    "learning_rate": 0.0002,
    "log_tracker_config": "",
    "log_tracker_name": "",
    "log_with": "",
    "logging_dir": "H:/LoraTrainer/kohya_ssDocker/kohya_ss/dataset/logs",
    "loss_type": "l2",
    "lr_scheduler": "constant",
    "lr_scheduler_args": "",
    "lr_scheduler_num_cycles": 1,
    "lr_scheduler_power": 1,
    "lr_warmup": 0,
    "main_process_port": 0,
    "masked_loss": false,
    "max_bucket_reso": 2048,
    "max_data_loader_n_workers": 1,
    "max_grad_norm": 1,
    "max_resolution": "1024,1024",
    "max_timestep": 1000,
    "max_token_length": 75,
    "max_train_epochs": 0,
    "max_train_steps": 1600,
    "mem_eff_attn": false,
    "metadata_author": "",
    "metadata_description": "",
    "metadata_license": "",
    "metadata_tags": "",
    "metadata_title": "",
    "mid_lr_weight": "",
    "min_bucket_reso": 256,
    "min_snr_gamma": 0,
    "min_timestep": 0,
    "mixed_precision": "fp16",
    "model_list": "custom",
    "module_dropout": 0,
    "multi_gpu": false,
    "multires_noise_discount": 0,
    "multires_noise_iterations": 0,
    "network_alpha": 1,
    "network_dim": 8,
    "network_dropout": 0,
    "network_weights": "",
    "noise_offset": 0,
    "noise_offset_random_strength": false,
    "noise_offset_type": "Original",
    "num_cpu_threads_per_process": 18,
    "num_machines": 1,
    "num_processes": 1,
    "optimizer": "AdamW8bit",
    "optimizer_args": "",
    "output_dir": "H:/LoraTrainer/kohya_ssDocker/kohya_ss/dataset/outputs",
    "output_name": "ImpyIllust0.1",
    "persistent_data_loader_workers": false,
    "pretrained_model_name_or_path": "H:/Stable Diffusion Full Reinstall/stable-diffusion-webui/models/Stable-diffusion/ntrMIXIllustriousXL_xiii.safetensors",
    "prior_loss_weight": 1,
    "random_crop": false,
    "rank_dropout": 0,
    "rank_dropout_scale": false,
    "reg_data_dir": "",
    "rescaled": false,
    "resume": "",
    "resume_from_huggingface": "",
    "sample_every_n_epochs": 0,
    "sample_every_n_steps": 0,
    "sample_prompts": "",
    "sample_sampler": "euler_a",
    "save_as_bool": false,
    "save_every_n_epochs": 1,
    "save_every_n_steps": 0,
    "save_last_n_steps": 0,
    "save_last_n_steps_state": 0,
    "save_model_as": "safetensors",
    "save_precision": "fp16",
    "save_state": false,
    "save_state_on_train_end": false,
    "save_state_to_huggingface": false,
    "scale_v_pred_loss_like_noise_pred": false,
    "scale_weight_norms": 0,
    "sdxl": true,
    "sdxl_cache_text_encoder_outputs": false,
    "sdxl_no_half_vae": false,
    "seed": 1234,
    "shuffle_caption": false,
    "stop_text_encoder_training": 0,
    "text_encoder_lr": 0.0001,
    "train_batch_size": 2,
    "train_data_dir": "H:/LoraTrainer/kohya_ssDocker/kohya_ss/dataset/img",
    "train_norm": false,
    "train_on_input": true,
    "training_comment": "",
    "unet_lr": 0.0001,
    "unit": 1,
    "up_lr_weight": "",
    "use_cp": false,
    "use_scalar": false,
    "use_tucker": false,
    "v2": false,
    "v_parameterization": false,
    "v_pred_like_loss": 0,
    "vae": "",
    "vae_batch_size": 0,
    "wandb_api_key": "",
    "wandb_run_name": "",
    "weighted_captions": false,
    "xformers": "xformers"
    }
  2. Can I see your accelerate config?

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

  3. Can you provide your system specification (GPU, CPU)?

CPU: AMD Ryzen 9 5900x 12-Core Processor (24 CPUs) ~3.7GHZ
RAM: G.Skill Trident Z RGB 128 GB (4 x 32 GB) DDR4-4000 CL18 Memory
C: Samsung 970 Evo Plus 2 TB M.2-2280 PCIe 3.0 X4 NVME Solid State Drive
D: Sabrent Rocket 4 Plus 8 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive
GPU1: MSI RTX 3090 SUPRIM X 24G GeForce RTX 3090 24 GB Video Card
GPU2: Asus ROG STRIX GAMING OC GeForce RTX 4090 24 GB Video Card
OS: Microsoft Windows 10 Home OEM

I suspect your accelerate config is wrong, because the IP address is very weird.
I hope it's that simple XD... I configured accelerate before running the program.
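
For reference, the accelerate config above is single-process (distributed_type: 'NO', num_processes: 1), so on its own it would never drive both cards; a two-GPU, single-machine run would plausibly change just these two keys (a sketch, everything else kept as shown above):

distributed_type: MULTI_GPU
num_processes: 2

Even then, on Windows the launch still defaults to the nccl backend and hits the same NCCL error shown in the log.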

@Horang1173
Author

Horang1173 commented Jan 7, 2025

Hi, I have some questions about your script:

  1. Can I see your config?
  2. Can I see your accelerate config?
  3. Can you provide your system specification (GPU, CPU)?

I suspect your accelerate config is wrong, because the IP address is very weird.

Update! I attempted a single-GPU config and, aside from an out-of-memory error, it started training! It seems something is wrong with attempting 2 GPUs.
Update 1 (U1): For a single GPU (the 4090), this is the error I get.

Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 0 bytes is free. Of the allocated memory 22.40 GiB is allocated by PyTorch, and 634.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

U2: Made some config changes, and for some reason it refuses to use the 4090 in slot 0, and will only train using the 3090 in slot 1... Strange :/
U3: Changed the config to use GPUs 0,1,2 and suddenly it registers the 4090... I have no idea what is happening, but it's working! Still unable to train on dual GPUs for some reason.
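
Two environment variables may be relevant to U1-U3 (hedged guesses, but consistent with the symptoms): the OOM message itself recommends max_split_size_mb to reduce allocator fragmentation, and CUDA enumerates devices fastest-first by default rather than by PCIe slot, which can make "GPU 0" mean the 4090 even when the 3090 occupies slot 0. A sketch of what to set before running gui.bat (values are illustrative):

rem Reduce allocator fragmentation, as the OOM message suggests (illustrative value)
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
rem Make CUDA device indices follow PCIe slot order instead of fastest-first
set CUDA_DEVICE_ORDER=PCI_BUS_ID
rem Expose both cards (or a single index to pin one card)
set CUDA_VISIBLE_DEVICES=0,1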

@Horang1173 Horang1173 changed the title Training failed exitcode : 1 (pid: 34080) / train_db.py FAILED Training failed exitcode : 1 (pid: 34080) / train_db.py FAILED (Multi-GPU failure) Jan 7, 2025
@Sylsatra

Sylsatra commented Jan 7, 2025

Hi, I have some questions about your script:

  1. Can I see your config?
  2. Can I see your accelerate config?
  3. Can you provide your system specification (GPU, CPU)?

I suspect your accelerate config is wrong, because the IP address is very weird.

Update! I attempted a single-GPU config and, aside from an out-of-memory error, it started training! It seems something is wrong with attempting 2 GPUs. For a single GPU (the 4090), this is the error I get. Made some config changes, and for some reason it refuses to use the 4090 in slot 0, and will only train using the 3090 in slot 1... Strange :/

Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 0 bytes is free. Of the allocated memory 22.40 GiB is allocated by PyTorch, and 634.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Great news!
By any chance, is your 4090 on another port? Because the error is shouting all about it.
For the OOM I can help you if you provide your config.

@Horang1173
Author

Horang1173 commented Jan 7, 2025

Great news! By any chance, is your 4090 on another port? Because the error is shouting all about it. For the OOM I can help you if you provide your config.

preset124324151.json

Both GPUs are installed in the same motherboard in different PCIe slots. I managed to wiggle out of the OOM, figuring it was due to being overzealous with my training speed lol. Config is above. Of note, with this config I'm only using 16/24 GB of my 4090. I'd like to optimize it for 22-23 GB, so I suppose that is my new task for experimentation.

Update: On the OOM, it seems a batch size of 2 was too much; however, it will only utilize one card at a time since multi-GPU seems to not want to function. I suppose that is the cause of the original error. On attempting to activate multi-GPU support, I get the same error as originally posted.

@Sylsatra

Sylsatra commented Jan 8, 2025

The setting for multiple GPUs is quite straightforward; for 2 GPUs, just copy this (you have to find the main process port by yourself):
[Screenshot 2025-01-08 072411: GUI multi-GPU launch settings]
Edit: bf16, not fp16
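
In command form, those GUI settings correspond roughly to this launch line (a sketch modeled on the command already shown in the log; 29500 is just an example port, and <your-config>.toml is a placeholder):

accelerate.EXE launch --multi_gpu --num_processes 2 --num_machines 1 --gpu_ids 0,1 --main_process_port 29500 --mixed_precision bf16 --num_cpu_threads_per_process 18 H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py --config_file <your-config>.toml

Note that on Windows this still initializes the default nccl backend and fails with the NCCL error above unless the backend can be forced to gloo.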

@Horang1173
Author

Horang1173 commented Jan 8, 2025

How do I go about finding my main process port? It defaulted to 0 in the config. Note: I have no idea what this is lol...
I attempted 7860 and 7861 to no avail... It seems no value will enable a connection. I have the exact settings from above, apart from the main process port, and it spits out the same error.

@Sylsatra

Sylsatra commented Jan 8, 2025

How do I go about finding my main process port? It defaulted to 0 in the config. Note: I have no idea what this is lol... I attempted 7860 and 7861 to no avail... It seems no value will enable a connection. I have the exact settings from above, apart from the main process port, and it spits out the same error.

If your 4090 is in your PC, then you don't need it; just add "--same_network" to the Extra Accelerate Launch arguments.
If your 4090 is NOT in your PC, then you need to change "Number of machines" to "2", and in "Main process IP" you write the IP address of machine rank 0.
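
Concretely, the single-machine case just appends the flag to the same launch line (a sketch; the GUI's "Extra Accelerate Launch arguments" field passes extra flags through verbatim, and <your-config>.toml is a placeholder):

accelerate.EXE launch --multi_gpu --num_processes 2 --num_machines 1 --same_network H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py --config_file <your-config>.toml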

@Horang1173
Author

If your 4090 is in your PC, then you don't need it; just add "--same_network" to the Extra Accelerate Launch arguments.

Still running into the same error as above with --same_network added to the argument line.
