Meshed-Memory Transformer for Image Captioning #8
harshraj22 started this conversation in Paper Review
Meshed-Memory Transformer (CVPR 2020)
This paper introduces an interesting idea: using transformers for image captioning. A set of image regions extracted with an object detector is used as input to the model. These regions contain information about the objects present in the image, but fail to capture the relationships between them, i.e. one region could contain a ball while another contains a person, yet inferring from these regions alone that baseball is being played would still be a challenge.
The paper introduces the concept of memory-augmented attention to address this. Learnable memory slots are appended to the key and value vectors inside the encoder's self-attention, and these slots are assumed to store prior knowledge about relationships between regions.
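A minimal sketch of this idea, assuming single-head attention for brevity; the class name, slot count, and dimensions are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, d_model=512, num_memory_slots=40):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable memory slots appended to keys and values,
        # intended to encode a priori relationships between regions.
        self.mem_k = nn.Parameter(torch.randn(num_memory_slots, d_model))
        self.mem_v = nn.Parameter(torch.randn(num_memory_slots, d_model))
        self.scale = d_model ** 0.5

    def forward(self, regions):  # regions: (batch, n_regions, d_model)
        b = regions.size(0)
        q = self.q_proj(regions)
        # Concatenate the memory slots after the projected region keys/values.
        k = torch.cat([self.k_proj(regions), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(regions), self.mem_v.expand(b, -1, -1)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # (batch, n_regions, d_model)
```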
They also make another change compared to the standard transformer: instead of attending only to the last encoder layer, each decoder layer cross-attends to the outputs of every encoder layer, combined through learned gates. The authors claim this meshed, multi-layer connectivity exploits both low-level and high-level visual features.
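A minimal sketch of the meshed connection, assuming three encoder layers and standard multi-head cross-attention; the gating here follows the idea of sigmoid-weighting each layer's contribution, but names, head counts, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MeshedCrossAttention(nn.Module):
    def __init__(self, d_model=512, num_enc_layers=3, num_heads=8):
        super().__init__()
        # One cross-attention module per encoder layer.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            for _ in range(num_enc_layers)
        )
        # One gate per encoder layer, computed from the decoder state
        # concatenated with that layer's attended output.
        self.gates = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(num_enc_layers)
        )

    def forward(self, dec_states, enc_outputs):
        # dec_states: (batch, n_words, d_model)
        # enc_outputs: list of (batch, n_regions, d_model), one per encoder layer
        meshed = 0
        for attn, gate, enc in zip(self.cross_attn, self.gates, enc_outputs):
            c, _ = attn(dec_states, enc, enc)  # cross-attend to encoder layer i
            alpha = torch.sigmoid(gate(torch.cat([dec_states, c], dim=-1)))
            meshed = meshed + alpha * c        # gated sum over encoder layers
        return meshed
```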
Loss Function: (pending)