How do I know if ZeRO stage 3 is working when using DeepSpeed? #6877

hwhyyds · 2024-12-16T14:18:07Z

I used accelerate to wrap the model across multiple GPUs, but the full model is copied to each GPU instead of being sharded across them. Why?

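import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from transformers import AutoModelForCausalLM

# model_path (the checkpoint location) is not shown in the original post
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="./configs/rl_ds_zero_3.json", zero_stage=3)

base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin, mixed_precision="bf16")
base_model = accelerator.prepare(base_model)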

rl_ds_zero_3.json

{
  "train_micro_batch_size_per_gpu": 1,
  "model_parallel_size": 3,
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-5,
      "betas": [0.9, 0.999],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e7,
    "stage3_param_persistence_threshold": 5e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true,
    "number_checkpoints": 4,
    "synchronize_checkpoint_boundary": false,
    "contiguous_memory_optimization": true
  }
}
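
For anyone landing here with the same question, one way to check whether parameters were actually partitioned is to look at the metadata DeepSpeed attaches to each parameter under ZeRO stage 3. The following is only a minimal sketch, not an official DeepSpeed or accelerate API: the helper name summarize_zero3_partitioning is hypothetical, and it assumes the ds_id, ds_numel, and ds_tensor attributes that DeepSpeed sets on stage-3 parameters, which are internal details and may differ between versions.

```python
def summarize_zero3_partitioning(named_params):
    """Summarize how many parameters carry ZeRO-3 partitioning metadata.

    named_params: iterable of (name, param) pairs, e.g. model.named_parameters().
    Assumption: a ZeRO-3 parameter has `ds_id`, `ds_numel` (full element count)
    and `ds_tensor` (the shard held by this rank).
    """
    summary = {
        "partitioned": 0,      # parameters with ZeRO-3 metadata
        "not_partitioned": 0,  # ordinary, fully materialized parameters
        "full_numel": 0,       # element count of the whole model
        "local_numel": 0,      # element count held by this rank
    }
    for _name, param in named_params:
        if hasattr(param, "ds_id"):
            summary["partitioned"] += 1
            summary["full_numel"] += param.ds_numel
            summary["local_numel"] += param.ds_tensor.numel()
        else:
            n = param.numel()
            summary["not_partitioned"] += 1
            summary["full_numel"] += n
            summary["local_numel"] += n
    return summary
```

Calling summarize_zero3_partitioning(base_model.named_parameters()) on every rank after accelerator.prepare gives a rough signal: if not_partitioned is nonzero for large weights, or local_numel is close to full_numel on each rank, the model was most likely replicated; if local_numel is roughly full_numel divided by the number of processes, sharding is in effect. Note that the script has to be started with a multi-process launcher for there to be more than one rank to shard across.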

tjruwase · 2024-12-18T11:21:17Z

@hwhyyds, can you please share more steps to reproduce, including logs, scripts, and command line?

loadams · 2025-01-03T15:49:30Z

@hwhyyds - closing as stale due to no response. If you still have questions feel free to comment and we can re-open the issue.

tjruwase added the training label Dec 18, 2024

loadams closed this as completed Jan 3, 2025
