A TensorFlow implementation of the image-to-text model described in the paper:
"Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge."
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.
IEEE transactions on pattern analysis and machine intelligence (2016).
Full text available at: http://arxiv.org/abs/1609.06647
The full version of this model can be found, together with many others, at the TensorFlow models github repo.
This is only an adapted and simplified version of the project for demo purposes, showing just the results at inference time. Using pre-trained models as described later on this README.
-
Download the pre-trained checkpoints and place them in
model/train
-
Fix the code depending on your TF version. (Recommended version 1.2)
python fix_ckpoints.py
This will generate the
checkpoint
files needed, pointing to the TF checkpoint files. -
Fix the python code. See fixes section
-
Run inference on an image:
./inference.sh
Modifying the paths as needed
The Show and Tell model is a deep neural network that learns how to describe the content of images. For example:
The Show and Tell model is an example of an encoder-decoder neural network. It works by first "encoding" an image into a fixed-length vector representation, and then "decoding" the representation into a natural language description.
The image encoder is a deep convolutional neural network. This type of network is widely used for image tasks and is currently state-of-the-art for object recognition and detection. Our particular choice of network is the Inception v3 image recognition model pretrained on the ILSVRC-2012-CLS image classification dataset.
The decoder is a long short-term memory (LSTM) network. This type of network is commonly used for sequence modeling tasks such as language modeling and machine translation. In the Show and Tell model, the LSTM network is trained as a language model conditioned on the image encoding.
Words in the captions are represented with an embedding model. Each word in the vocabulary is associated with a fixed-length vector representation that is learned during training.
The following diagram illustrates the model architecture.
In this diagram, {s0, s1, ..., sN-1} are the words of the caption and {wes0, wes1, ..., wesN-1} are their corresponding word embedding vectors. The outputs {p1, p2, ..., pN} of the LSTM are probability distributions generated by the model for the next word in the sentence. The terms {log p1(s1), log p2(s2), ..., log pN(sN)} are the log-likelihoods of the correct word at each step; the negated sum of these terms is the minimization objective of the model.
During the first phase of training the parameters of the Inception v3 model are kept fixed: it is simply a static image encoder function. A single trainable layer is added on top of the Inception v3 model to transform the image embedding into the word embedding vector space. The model is trained with respect to the parameters of the word embeddings, the parameters of the layer on top of Inception v3 and the parameters of the LSTM. In the second phase of training, all parameters - including the parameters of Inception v3 - are trained to jointly fine-tune the image encoder and the LSTM.
Given a trained model and an image we use beam search to generate captions for that image. Captions are generated word-by-word, where at each step t we use the set of sentences already generated with length t - 1 to generate a new set of sentences with length t. We keep only the top k candidates at each step, where the hyperparameter k is called the beam size. We have found the best performance with k = 3.
First ensure that you have installed the following required packages:
- Bazel (instructions)
- TensorFlow 1.0 or greater (instructions)
- NumPy (instructions)
- Natural Language Toolkit (NLTK):
- First install NLTK (instructions)
- Then install the NLTK data (instructions)
Your trained Show and Tell model can generate captions for any JPEG image! The following command line will generate captions for an image from the test set.
NOTE:
This file can be found already configured for this project in `inference.sh`
in the root directory.
# Path to checkpoint file or a directory containing checkpoint files. Passing
# a directory will only work if there is also a file named 'checkpoint' which
# lists the available checkpoints in the directory. It will not work if you
# point to a directory with just a copy of a model checkpoint: in that case,
# you will need to pass the checkpoint path explicitly.
CHECKPOINT_PATH="${HOME}/im2txt/model/train"
# Vocabulary file generated by the preprocessing script.
VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"
# JPEG image file to caption.
IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"
# Build the inference binary.
cd tensorflow-models/im2txt
bazel build -c opt //im2txt:run_inference
# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""
# Run inference to generate captions.
bazel-bin/im2txt/run_inference \
--checkpoint_path=${CHECKPOINT_PATH} \
--vocab_file=${VOCAB_FILE} \
--input_files=${IMAGE_FILE}
Example output:
Captions for image COCO_val2014_000000224477.jpg:
0) a man riding a wave on top of a surfboard . (p=0.040413)
1) a person riding a surf board on a wave (p=0.017452)
2) a man riding a wave on a surfboard in the ocean . (p=0.005743)
Note: you may get different results. Some variation between different models is expected.
Here is the image:
Depending on the TensorFlow version checkpoint tensor key values might need to be renamed.
This can be done with fix_ckpoints.py
More information on transitioning from TF 1.0 to 1.2 can be found here:
In addition, if running on Python 3.x tf.gfile.GFile
needs to read files in rb
mode.
(See im2txt/run_inference.py
line 74)