Is text-gen-webui able to load the new meta-llama_Llama-3.2-11B-Vision? & Cannot load multimodal ext #6412
Replies: 3 comments 7 replies
-
Hey? Does anybody read this?
-
Even if you could load it, wouldn't oobabooga also need to add support for importing images for it to do anything? As I understand it, the Llama 3.2 "vision" models do "image to text", basically the opposite of Stable Diffusion. So you'd drag a photo into the (hypothetical) web UI, and then you could ask the text engine questions about it.
-
You never gave your command line. Which multimodal pipeline are you trying to use? The wiki page lists command lines invoking the different pipelines, such as: Did you just retarget the model file like so? If so, did you first try the original command line from the wiki as a baseline, to see whether that worked?
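For reference, the wiki's multimodal examples follow this general shape. This is a hypothetical sketch, not a verified invocation for the Llama 3.2 model; the model directory and pipeline name below are placeholder values taken from the older LLaVA examples:

```shell
# Hypothetical example modeled on the multimodal extension wiki.
# The model directory and --multimodal-pipeline value are placeholders;
# substitute whatever your install and the wiki actually list.
python server.py \
  --model wojtab_llava-7b-v0-4bit-128g \
  --multimodal-pipeline llava-7b \
  --extensions multimodal
```

The point of trying the wiki's exact command first is to separate "the multimodal extension is broken in my install" from "this particular model isn't supported by any pipeline".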
-
I have the model (downloaded manually), but I cannot load it, because Transformers doesn't recognize its architecture, called "mllama". I understand the 'm' in "mllama" means multimodal, so I'd probably need the multimodal extension, but the multimodal extension won't load either, failing with these errors:
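For what it's worth, "architecture not recognized" errors like this are usually a version problem: the "mllama" architecture only exists in sufficiently new Transformers releases, so an older pinned install fails before any extension even runs. A minimal sketch of that check, assuming (based on the Llama 3.2 release timeframe) that v4.45.0 is the cutoff:

```python
# Sketch: decide whether an installed transformers version should recognize
# the "mllama" architecture. The cutoff (4, 45, 0) is an assumption based on
# the Llama 3.2 release timeframe, not a verified changelog entry.

def parse_version(v: str) -> tuple:
    """Turn a version string like '4.44.2' into (4, 44, 2) for comparison."""
    return tuple(int(part) for part in v.split(".")[:3])

MLLAMA_MIN = (4, 45, 0)  # assumed first release with mllama support

def supports_mllama(installed: str) -> bool:
    """True if the given transformers version should recognize 'mllama'."""
    return parse_version(installed) >= MLLAMA_MIN
```

In practice you'd compare `transformers.__version__` against the cutoff, and upgrade the package bundled with text-generation-webui if it's older.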