Training failed exitcode : 1 (pid: 34080) / train_db.py FAILED (Multi-GPU failure) #3033
Comments
Hi, I have some questions about your script:
I suspect your accelerate config is wrong, because the IP address is very weird.
command_file: null
CPU: AMD Ryzen 9 5900X 12-Core Processor (24 CPUs) ~3.7 GHz
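For reference, on a recent accelerate version a freshly generated config for a single machine with two GPUs looks roughly like the sketch below. The keys come from `accelerate config`; the values are assumptions for this setup, not a verified fix. The kubernetes.docker.internal address in the log further down suggests the rendezvous is resolving to a Docker Desktop alias rather than localhost.

```yaml
# Rough sketch of what `accelerate config` writes for a local two-GPU machine.
# Exact keys vary by accelerate version; values below are illustrative assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU   # a single-GPU run would use 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2              # one process per GPU
rdzv_backend: static
same_network: true
use_cpu: false
```

Even with a clean config, note the RuntimeError further down ("Distributed package doesn't have NCCL built in"): Windows builds of PyTorch do not ship the NCCL backend, so a MULTI_GPU launch can still fail at that point regardless of the addresses in the config.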
Update! I attempted a single-GPU config and, aside from an out-of-memory error, it started training! It seems something is wrong with attempting to use 2 GPUs.
Update 2: Did some config changes, and for some reason it refuses to use the 4090 in slot 0 and will only train using the 3090 in slot 1... Strange :/
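In case it helps narrow that down: CUDA's device numbering does not have to match the physical slot order, and the launcher only uses the devices it is told about. One way to pin the run to a specific card is the `gpu_ids` key in the accelerate config (equivalently, the `--gpu_ids` launch flag or the CUDA_VISIBLE_DEVICES environment variable). A minimal sketch; the index 0 here is an assumption, so check which index the 4090 actually gets on your machine first:

```yaml
# Illustrative only: run a single process on one specific card.
# Index 0 is an assumption; CUDA enumeration may not follow PCIe slot order.
distributed_type: 'NO'
gpu_ids: '0'
num_processes: 1
```

nvidia-smi shows one enumeration of the cards, but CUDA's default ordering can differ from it, so it may take a quick test to confirm which index is which.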
Great news!
Both GPUs are installed in the same motherboard on different PCIe ports. I managed to wiggle out of the OOM, figuring it was due to being overzealous with my training speed lol. Config is above. Of note, with this config I'm only using 16/24 GB of my 4090. I'd like to optimize it for 22-23 GB, so I suppose that is my new task for experimentation.

Update: On the OOM, it seems a batch size of 2 was too much; however, it will only utilize one card at a time, since multi-GPU seems to not want to function. I suppose that is the cause of the original error. On attempting to activate multi-GPU support, I get the same error as originally posted.
How do I go about finding my main process port? It defaulted to 0 in the config. Note: I have no idea what this is lol...
If your 4090 is in your PC then you don't need it; just add "--same_network" to the Extra Accelerate Launch arguments.
Still running into the same error as above with --same_network added to the argument line.
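For what it's worth, the main process port isn't something you look up; it's just the local TCP port the launcher uses for rendezvous between its worker processes (the log below shows it trying kubernetes.docker.internal:29500). If the config is part of the problem, one thing to try is pinning the rendezvous to localhost and a fixed port, either via the `--main_process_ip` / `--main_process_port` launch flags or the matching config keys. A sketch with illustrative values, not a confirmed fix:

```yaml
# Illustrative values only: keep the rendezvous on the local machine.
main_process_ip: 127.0.0.1
main_process_port: 29500
```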
Context: Attempting to train a LoRA for the first time. Cannot figure out what this error is...
Steps to Reproduce: Run gui.bat ... Load config ... run training...
Actual Result: Failure with the following log
Expected Result: Training to start / finish.
I also can't do multi-GPU... I get an error and the training doesn't finish.
17:14:31-800933 INFO SD v2 model selected. Setting --v2 and --v_parameterization parameters
17:14:45-206542 INFO Start training Dreambooth...
17:14:45-208074 INFO Validating lr scheduler arguments...
17:14:45-211137 INFO Validating optimizer arguments...
17:14:45-212669 INFO Validating H:/LoraTrainer/Namelessimpy/log existence and writability... SUCCESS
17:14:45-213691 INFO Validating H:/LoraTrainer/Namelessimpy/model existence and writability... SUCCESS
17:14:45-214712 INFO Validating stabilityai/stable-diffusion-2 existence... SKIPPING: huggingface.co model
17:14:45-215734 INFO Validating H:/LoraTrainer/Namelessimpy/img existence... SUCCESS
17:14:45-217266 INFO Folder 150_Impy By Impy: 150 repeats found
17:14:45-218288 INFO Folder 150_Impy By Impy: 180 images found
17:14:45-222372 INFO Folder 150_Impy By Impy: 180 * 150 = 27000 steps
17:14:45-223393 INFO Regulatization factor: 1
17:14:45-224925 INFO Total steps: 27000
17:14:45-225946 INFO Train batch size: 2
17:14:45-226968 INFO Gradient accumulation steps: 1
17:14:45-228499 INFO Epoch: 2
17:14:45-229521 INFO Max train steps: 1600
17:14:45-230543 INFO lr_warmup_steps = 0
17:14:45-244330 INFO Saving training config to H:/LoraTrainer/Namelessimpy/model\ImpyIllust0.1_20250101-171445.json...
17:14:45-246373 INFO Executing command:
C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 18 H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py --config_file H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445.toml
17:14:45-265268 INFO Command executed.
[2025-01-01 17:14:50,218] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
2025-01-01 17:14:59 INFO Loading settings from H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445.toml... train_util.py:4174
INFO H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445 train_util.py:4193
WARNING v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません train_util.py:3876
2025-01-01 17:14:59 INFO prepare tokenizer train_util.py:4665
tokenizer/tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 824/824 [00:00<00:00, 1.62MB/s]
C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\huggingface_hub\file_download.py:149: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Horan\.cache\huggingface\hub\models--stabilityai--stable-diffusion-2. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer/vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 3.44MB/s]
tokenizer/merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 25.1MB/s]
tokenizer/special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 460/460 [00:00<?, ?B/s]
2025-01-01 17:15:01 INFO update token length: 75 train_util.py:4682
INFO prepare images. train_util.py:1815
INFO found directory H:\LoraTrainer\Namelessimpy\img\150_Impy By Impy contains 180 image files train_util.py:1762
INFO 27000 train images with repeating. train_util.py:1856
INFO 0 reg images. train_util.py:1859
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1864
INFO [Dataset 0] config_util.py:572
batch_size: 2
resolution: (1280, 1280)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 180/180 [00:00<00:00, 15329.75it/s]
INFO make buckets train_util.py:917
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます train_util.py:934
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:963
INFO bucket 0: resolution (448, 832), count: 150 train_util.py:968
[... INFO lines for buckets 1 through 75 omitted ...]
INFO bucket 76: resolution (2304, 704), count: 150 train_util.py:968
INFO mean ar error (without repeats): 0.029467308740607177 train_util.py:973
INFO prepare accelerator train_db.py:106
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "H:\LoraTrainer\kohya_ss\sd-scripts\train_db.py", line 529, in
train(args)
File "H:\LoraTrainer\kohya_ss\sd-scripts\train_db.py", line 116, in train
accelerator = train_util.prepare_accelerator(args)
File "H:\LoraTrainer\kohya_ss\sd-scripts\library\train_util.py", line 4743, in prepare_accelerator
accelerator = Accelerator(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\accelerator.py", line 371, in init
self.state = AcceleratorState(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\state.py", line 758, in init
PartialState(cpu, **kwargs)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\state.py", line 217, in init
torch.distributed.init_process_group(backend=self.backend, **kwargs)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
default_pg, _ = _new_process_group_helper(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2025-01-01 17:15:07,348] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 34080) of binary: C:\Users\Horan\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\accelerate.EXE_main.py", line 7, in
sys.exit(main())
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\run.py", line 797, in run
elastic_launch(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-01-01_17:15:07
host : HOST
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 34080)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
17:15:08-714423 INFO Training has ended.