Training failed exitcode : 1 (pid: 34080) / train_db.py FAILED (Multi-GPU failure) #3033
Comments
Hi, I have some questions about your script:
I suspect your accelerate config is wrong, because the IP address is very weird.
command_file: null
CPU: AMD Ryzen 9 5900X 12-Core Processor (24 CPUs) ~3.7 GHz
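For reference, on a recent accelerate version a freshly generated config for a single machine with two GPUs looks roughly like the sketch below. The keys come from `accelerate config`; the values are assumptions for this setup, not a verified fix. The kubernetes.docker.internal address in the log further down suggests the rendezvous is resolving to a Docker Desktop alias rather than localhost.

```yaml
# Rough sketch of what `accelerate config` writes for a local two-GPU machine.
# Exact keys vary by accelerate version; values below are illustrative assumptions.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU   # a single-GPU run would use 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2              # one process per GPU
rdzv_backend: static
same_network: true
use_cpu: false
```

Even with a clean config, note the RuntimeError further down ("Distributed package doesn't have NCCL built in"): Windows builds of PyTorch do not ship the NCCL backend, so a MULTI_GPU launch can still fail at that point regardless of the addresses in the config.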
Update! I attempted a single-GPU config and, aside from an out-of-memory error, it started training! It seems something is wrong with attempting to use 2 GPUs.
Update 2: Did some config changes, and for some reason it refuses to use the 4090 in slot 0 and will only train using the 3090 in slot 1... Strange :/
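In case it helps narrow that down: CUDA's device numbering does not have to match the physical slot order, and the launcher only uses the devices it is told about. One way to pin the run to a specific card is the `gpu_ids` key in the accelerate config (equivalently, the `--gpu_ids` launch flag or the CUDA_VISIBLE_DEVICES environment variable). A minimal sketch; the index 0 here is an assumption, so check which index the 4090 actually gets on your machine first:

```yaml
# Illustrative only: run a single process on one specific card.
# Index 0 is an assumption; CUDA enumeration may not follow PCIe slot order.
distributed_type: 'NO'
gpu_ids: '0'
num_processes: 1
```

nvidia-smi shows one enumeration of the cards, but CUDA's default ordering can differ from it, so it may take a quick test to confirm which index is which.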
Great news!
Both GPUs are installed in the same motherboard on different PCIe ports. I managed to wiggle out of the OOM, figuring it was due to being overzealous with my training speed lol. Config is above. Of note, with this config I'm only using 16/24 GB of my 4090. I'd like to optimize it for 22-23 GB, so I suppose that is my new task for experimentation.

Update: On the OOM, it seems a batch size of 2 was too much; however, it will only utilize one card at a time, since multi-GPU seems to not want to function. I suppose that is the cause of the original error. On attempting to activate multi-GPU support, I get the same error as originally posted.
How do I go about finding my main process port? It defaulted to 0 in the config. Note: I have no idea what this is lol...
If your 4090 is in your PC then you don't need it; just add "--same_network" to the Extra Accelerate Launch arguments.
Still running into the same error as above with --same_network added to the argument line.
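For what it's worth, the main process port isn't something you look up; it's just the local TCP port the launcher uses for rendezvous between its worker processes (the log below shows it trying kubernetes.docker.internal:29500). If the config is part of the problem, one thing to try is pinning the rendezvous to localhost and a fixed port, either via the `--main_process_ip` / `--main_process_port` launch flags or the matching config keys. A sketch with illustrative values, not a confirmed fix:

```yaml
# Illustrative values only: keep the rendezvous on the local machine.
main_process_ip: 127.0.0.1
main_process_port: 29500
```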
Context: Attempting to train a LoRA for the first time. Cannot figure out what this error is...
Steps to Reproduce: Run gui.bat ... Load config ... run training...
Actual Result: Failure with the following log
Expected Result: Training to start / finish.
I also can't do multi-GPU... I get an error and the training doesn't finish.
17:14:31-800933 INFO SD v2 model selected. Setting --v2 and --v_parameterization parameters
17:14:45-206542 INFO Start training Dreambooth...
17:14:45-208074 INFO Validating lr scheduler arguments...
17:14:45-211137 INFO Validating optimizer arguments...
17:14:45-212669 INFO Validating H:/LoraTrainer/Namelessimpy/log existence and writability... SUCCESS
17:14:45-213691 INFO Validating H:/LoraTrainer/Namelessimpy/model existence and writability... SUCCESS
17:14:45-214712 INFO Validating stabilityai/stable-diffusion-2 existence... SKIPPING: huggingface.co model
17:14:45-215734 INFO Validating H:/LoraTrainer/Namelessimpy/img existence... SUCCESS
17:14:45-217266 INFO Folder 150_Impy By Impy: 150 repeats found
17:14:45-218288 INFO Folder 150_Impy By Impy: 180 images found
17:14:45-222372 INFO Folder 150_Impy By Impy: 180 * 150 = 27000 steps
17:14:45-223393 INFO Regulatization factor: 1
17:14:45-224925 INFO Total steps: 27000
17:14:45-225946 INFO Train batch size: 2
17:14:45-226968 INFO Gradient accumulation steps: 1
17:14:45-228499 INFO Epoch: 2
17:14:45-229521 INFO Max train steps: 1600
17:14:45-230543 INFO lr_warmup_steps = 0
17:14:45-244330 INFO Saving training config to H:/LoraTrainer/Namelessimpy/model\ImpyIllust0.1_20250101-171445.json...
17:14:45-246373 INFO Executing command:
C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 18 H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py --config_file H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445.toml
17:14:45-265268 INFO Command executed.
[2025-01-01 17:14:50,218] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
2025-01-01 17:14:59 INFO Loading settings from H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445.toml... train_util.py:4174
INFO H:/LoraTrainer/Namelessimpy/model/config_dreambooth-20250101-171445 train_util.py:4193
WARNING v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません train_util.py:3876
2025-01-01 17:14:59 INFO prepare tokenizer train_util.py:4665
tokenizer/tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 824/824 [00:00<00:00, 1.62MB/s]
C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\huggingface_hub\file_download.py:149: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Horan\.cache\huggingface\hub\models--stabilityai--stable-diffusion-2. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
tokenizer/vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 3.44MB/s]
tokenizer/merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 25.1MB/s]
tokenizer/special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 460/460 [00:00<?, ?B/s]
2025-01-01 17:15:01 INFO update token length: 75 train_util.py:4682
INFO prepare images. train_util.py:1815
INFO found directory H:\LoraTrainer\Namelessimpy\img\150_Impy By Impy contains 180 image files train_util.py:1762
INFO 27000 train images with repeating. train_util.py:1856
INFO 0 reg images. train_util.py:1859
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1864
INFO [Dataset 0] config_util.py:572
batch_size: 2
resolution: (1280, 1280)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 180/180 [00:00<00:00, 15329.75it/s]
INFO make buckets train_util.py:917
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます train_util.py:934
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:963
INFO bucket 0: resolution (448, 832), count: 150 train_util.py:968
[... INFO lines for buckets 1 through 75 omitted ...]
INFO bucket 76: resolution (2304, 704), count: 150 train_util.py:968
INFO mean ar error (without repeats): 0.029467308740607177 train_util.py:973
INFO prepare accelerator train_db.py:106
[W socket.cpp:663] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "H:\LoraTrainer\kohya_ss\sd-scripts\train_db.py", line 529, in
train(args)
File "H:\LoraTrainer\kohya_ss\sd-scripts\train_db.py", line 116, in train
accelerator = train_util.prepare_accelerator(args)
File "H:\LoraTrainer\kohya_ss\sd-scripts\library\train_util.py", line 4743, in prepare_accelerator
accelerator = Accelerator(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\accelerator.py", line 371, in init
self.state = AcceleratorState(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\state.py", line 758, in init
PartialState(cpu, **kwargs)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\state.py", line 217, in init
torch.distributed.init_process_group(backend=self.backend, **kwargs)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
default_pg, _ = _new_process_group_helper(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2025-01-01 17:15:07,348] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 34080) of binary: C:\Users\Horan\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts\accelerate.EXE_main.py", line 7, in
sys.exit(main())
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
args.func(args)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\run.py", line 797, in run
elastic_launch(
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\Horan\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
H:/LoraTrainer/kohya_ss/sd-scripts/train_db.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-01-01_17:15:07
host : HOST
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 34080)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
17:15:08-714423 INFO Training has ended.