Making this work with relative position bias from XTransformers #5
Is there a way to make this work with RelativePositionBias? Currently this produces an attention bias of size $B \times H \times N^2$, where $B$ is the batch size, $H$ is the number of heads, and $N$ is the input size. Can this be chunked and computed per chunk?
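To make the scale of that bias concrete (the numbers below are assumed for illustration, not taken from the issue): with $H = 8$ heads and $N = 8192$ tokens, a fully materialized fp32 bias of shape $(B, H, N, N)$ already costs about 2 GiB per batch element.

```python
# rough size of a fully materialized attention bias (assumed example values)
H, N, bytes_per_float = 8, 8192, 4
bias_bytes = H * N * N * bytes_per_float   # per batch element
print(bias_bytes / 2**30)                  # 2.0 -> ~2 GiB of fp32 just for the bias
```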
@pfeatherstone if you are working with 1d sequences, the best approach would be https://github.com/lucidrains/x-transformers#dynamic-positional-bias. The other alternative is the ALiBi positional embedding, which needs only to be materialized within each block, but may come with some limitations (unidirectional, forced local attending, etc.)
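As a rough sketch of what materializing the bias "within each block" could look like (illustrative PyTorch, not code from this repository; the block coordinates and helper names are invented), an ALiBi-style bias can be built only for the query/key positions of the current block, so the full $N \times N$ matrix never exists at once:

```python
import torch

def alibi_slopes(heads):
    # geometric ALiBi slopes, one per head (simplified: assumes heads is a power of two)
    return torch.tensor([2 ** (-8 * (i + 1) / heads) for i in range(heads)])

def chunked_alibi_bias(q_start, q_len, k_start, k_len, slopes):
    # build the bias only for this (query block, key block) pair
    q_pos = torch.arange(q_start, q_start + q_len)
    k_pos = torch.arange(k_start, k_start + k_len)
    rel = k_pos[None, :] - q_pos[:, None]                   # (q_len, k_len) relative offsets
    return -rel.abs()[None, :, :] * slopes[:, None, None]   # (heads, q_len, k_len)

# inside a chunked attention loop, each block only ever sees a small bias:
slopes = alibi_slopes(8)
bias = chunked_alibi_bias(q_start=1024, q_len=128, k_start=0, k_len=128, slopes=slopes)
print(bias.shape)  # torch.Size([8, 128, 128]) rather than (8, 8192, 8192)
```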
@pfeatherstone which module are you using from this repository? You should be using the CUDA implementation from here.
@lucidrains Actually, I've just realized, you can pass in
I need to use something that can be ONNX exported. I don't think memory-efficient attention is great, but because it yields the exact same result as normal attention, I can train with the memory-efficient option turned on, then export to ONNX using normal attention. Correct me if I'm wrong, but I don't think this will work with flash attention?
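A minimal sketch of the workflow described above, assuming a hypothetical attention module with an `efficient` flag (the class, flag, and chunking scheme are illustrative, not this repository's API): since the chunked path computes exactly what plain attention computes, one can train with chunking enabled and flip to the plain path before ONNX export.

```python
import torch

class ToyAttention(torch.nn.Module):
    # hypothetical module: `efficient` toggles between chunked and plain attention
    def __init__(self, dim, chunk_size=64, efficient=True):
        super().__init__()
        self.scale = dim ** -0.5
        self.chunk_size = chunk_size
        self.efficient = efficient

    def forward(self, q, k, v):
        if not self.efficient:
            # plain attention: straightforward to export to ONNX
            attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)
            return attn @ v
        # chunk over queries; each row's softmax still sees all keys,
        # so the output matches the plain path exactly
        outs = []
        for qc in q.split(self.chunk_size, dim=-2):
            attn = (qc @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)
            outs.append(attn @ v)
        return torch.cat(outs, dim=-2)

# train with the chunked path, then flip the flag and export the plain path
model = ToyAttention(dim=64, efficient=True)
q = k = v = torch.randn(1, 128, 64)
# ... training loop would go here ...
model.efficient = False
torch.onnx.export(model, (q, k, v), "attention.onnx", input_names=["q", "k", "v"])
```

The sketch only chunks over queries, which is enough to illustrate the exact-equivalence point; the full memory-efficient algorithm also chunks over keys and values using a running softmax.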
I've also kind of given up on the memory-efficient implementation; it is cripplingly slow to train.