gemma 2 convert_checkpoint uses more GPU RAM than needed #2647
Labels: bug (Something isn't working), Investigating, LLM API/Workflow, triaged (Issue has been triaged by maintainers)
System Info
A100
Who can help?
@kaiyux
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
The model is loaded once, so a single copy occupies GPU memory regardless of world size.
Actual behavior
The model is loaded world-size times, once per rank, multiplying GPU memory usage by the world size.
Additional notes
The problem is in lines 254 to 267 of https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/examples/gemma/convert_checkpoint.py: the checkpoint is loaded once per rank instead of once overall.
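A minimal sketch of the pattern being reported, with hypothetical stand-in functions (`load_model`, `convert_rank` are illustrations, not the actual names in convert_checkpoint.py). The assumption is that the script loads the checkpoint inside a per-rank loop, so memory scales with world size; hoisting the load out of the loop keeps a single copy:

```python
load_count = 0  # counts how many full-model loads happen

def load_model():
    """Stand-in for the expensive full-checkpoint load (hypothetical)."""
    global load_count
    load_count += 1
    return {"weights": "..."}  # placeholder for the loaded state dict

def convert_rank(model, rank, world_size):
    """Stand-in for per-rank weight conversion/sharding (hypothetical)."""
    return f"rank {rank}/{world_size} shard"

world_size = 4

# Reported pattern: one load per rank -> world_size copies of the model.
load_count = 0
for rank in range(world_size):
    model = load_model()  # reloads the full model on every iteration
    convert_rank(model, rank, world_size)
assert load_count == world_size

# Expected pattern: load once, reuse the same object for every rank.
load_count = 0
model = load_model()  # single load, single copy in memory
for rank in range(world_size):
    convert_rank(model, rank, world_size)
assert load_count == 1
```

With the load hoisted out of the loop, peak memory is one model copy plus per-rank conversion buffers, instead of world-size full copies.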