Meshed-Memory Transformer for Image Captioning #8
harshraj22 started this conversation in Paper Review
Meshed-Memory Transformer (CVPR 2020)
This paper introduces an interesting idea: using transformers for image captioning. A set of image regions extracted with an object detector is used as input to the model. These regions contain information about the objects present in the image, but fail to capture the relationships between them, i.e. one region could contain a ball while another contains a person, yet inferring from these regions alone that baseball is being played would still be a challenge.
The paper introduces the concept of memory-augmented attention to address this. Learnable memory slots are appended to the key and value vectors inside the encoder's self-attention, and these slots are assumed to store prior knowledge about relationships between regions.
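A minimal sketch of this idea, assuming single-head attention for brevity; the class name, slot count, and dimensions are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, d_model=512, num_memory_slots=40):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable memory slots appended to keys and values,
        # intended to encode a priori relationships between regions.
        self.mem_k = nn.Parameter(torch.randn(num_memory_slots, d_model))
        self.mem_v = nn.Parameter(torch.randn(num_memory_slots, d_model))
        self.scale = d_model ** 0.5

    def forward(self, regions):  # regions: (batch, n_regions, d_model)
        b = regions.size(0)
        q = self.q_proj(regions)
        # Concatenate the memory slots after the projected region keys/values.
        k = torch.cat([self.k_proj(regions), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(regions), self.mem_v.expand(b, -1, -1)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # (batch, n_regions, d_model)
```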
They also make another change compared to the standard transformer: instead of attending only to the last encoder layer, each decoder layer cross-attends to the outputs of every encoder layer, combined through learned gates. The authors claim this meshed, multi-layer connectivity exploits both low-level and high-level visual features.
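A minimal sketch of the meshed connection, assuming three encoder layers and standard multi-head cross-attention; the gating here follows the idea of sigmoid-weighting each layer's contribution, but names, head counts, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MeshedCrossAttention(nn.Module):
    def __init__(self, d_model=512, num_enc_layers=3, num_heads=8):
        super().__init__()
        # One cross-attention module per encoder layer.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_enc_layers)
        )
        # One gate per encoder layer, computed from the decoder state
        # concatenated with that layer's attended output.
        self.gates = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(num_enc_layers)
        )

    def forward(self, dec_states, enc_outputs):
        # dec_states: (batch, n_words, d_model)
        # enc_outputs: list of (batch, n_regions, d_model), one per encoder layer
        meshed = 0
        for attn, gate, enc in zip(self.cross_attn, self.gates, enc_outputs):
            c, _ = attn(dec_states, enc, enc)  # cross-attend to encoder layer i
            alpha = torch.sigmoid(gate(torch.cat([dec_states, c], dim=-1)))
            meshed = meshed + alpha * c        # gated sum over encoder layers
        return meshed
```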
Loss Function: (pending)