
[Question] Masked SDPA attention kernel? #1582

Closed

EricLBuehler opened this issue Nov 12, 2024 · 2 comments
@EricLBuehler

Hello MLX team!

Thank you for your excellent work here. We have been focusing on accelerating Candle performance on Metal and recently integrated some fast SDPA kernels from MLX!

I noticed that the scaled_dot_product_attention.metal file does not contain kernels for the causal, non-decode (prompt-processing) case. This seems to prevent using the fast kernels during the prompt step for models such as Llama, and the fallback code for the SDPA op seems to confirm this.

Are there any plans to add masking support to the full SDPA kernel, or are there existing methods for accelerating the prompt step?
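
For concreteness, here is a minimal sketch of the unfused path that masked prompt-step attention currently falls back to, assuming the usual `[batch, heads, seq, head_dim]` layout and an additive mask convention. The helper names (`causal_mask`, `sdpa_unfused`) are illustrative, not MLX internals:

```python
import mlx.core as mx

def causal_mask(T: int, dtype=mx.float32):
    # Additive causal mask: 0.0 on/below the diagonal, -inf above it,
    # so softmax assigns zero weight to future positions.
    idx = mx.arange(T)
    rows = idx.reshape(T, 1)
    cols = idx.reshape(1, T)
    return mx.where(rows >= cols, mx.array(0.0), mx.array(float("-inf"))).astype(dtype)

def sdpa_unfused(q, k, v, scale, mask=None):
    # Unfused reference path: scores -> (optional) mask -> softmax -> values.
    # q, k, v: [batch, heads, seq, head_dim]
    scores = (q * scale) @ mx.swapaxes(k, -1, -2)
    if mask is not None:
        scores = scores + mask  # mask broadcasts over batch and heads
    return mx.softmax(scores, axis=-1) @ v

# Usage: a causal prompt step over the whole sequence.
B, H, T, D = 1, 8, 128, 64
q = mx.random.normal((B, H, T, D))
k = mx.random.normal((B, H, T, D))
v = mx.random.normal((B, H, T, D))
out = sdpa_unfused(q, k, v, scale=D ** -0.5, mask=causal_mask(T))
```

Unlike this sketch, a fused kernel avoids materializing the full `[T, T]` score matrix, which is exactly why a masked variant matters for long prompts.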

awni (Member) commented Nov 12, 2024

> Are there any plans to add masking support to the full SDPA kernel, or are there existing methods for accelerating the prompt step?

This is a work in progress. @jagrit06 has been working on that. We're hoping to ship something soon, but I can't give you an exact timeline yet.

awni (Member) commented Nov 12, 2024

Closing as a duplicate of #129

awni closed this as completed Nov 12, 2024