I reproduced the image caption generation system that Google presented at CVPR 2015, using Chainer. Given an image, it generates a description of that image. Here are some samples of generated captions. Note that the images are from the validation dataset, i.e., images unseen during training.
It's not perfect, but the captions make sense for some of the images.
How does it work? The idea is simple: it is like machine translation with deep learning. In machine translation, you give the system a sentence in one language and it translates it into another language. In caption generation, you give it an image and it "translates" it into a description. Specifically, I used GoogLeNet to represent the image as a 1024-dimensional vector (feature extraction). That vector then becomes the input to a recurrent LSTM.
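To make the data flow concrete, here is a minimal numpy-only sketch of one forward step: the image feature is projected and fed into an LSTM as if it were the first word, and the hidden state is mapped to word scores. This is only an illustration of the architecture described above, not the actual Chainer implementation; all weight matrices and names here are made up for the sketch.

```python
import numpy as np

np.random.seed(0)
FEAT = 1024   # GoogLeNet feature size
HID = 512     # LSTM hidden units
VOCAB = 8843  # vocabulary size after preprocessing

def lstm_step(x, h, c, W):
    """One LSTM step; W maps concatenated [x, h] to 4*HID gate pre-activations."""
    z = np.concatenate([x, h]) @ W
    i, f, o, g = np.split(z, 4)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# random stand-in parameters (a trained model would supply these)
W_img = np.random.uniform(-0.1, 0.1, (FEAT, HID))        # image -> LSTM input
W_lstm = np.random.uniform(-0.1, 0.1, (2 * HID, 4 * HID))
W_out = np.random.uniform(-0.1, 0.1, (HID, VOCAB))       # hidden -> word scores

feat = np.random.randn(FEAT)      # pretend this is the GoogLeNet output
h, c = np.zeros(HID), np.zeros(HID)

# step 0: the image feature plays the role of the first input token
h, c = lstm_step(feat @ W_img, h, c, W_lstm)
scores = h @ W_out
first_word_id = int(np.argmax(scores))  # index into the vocabulary
```

Subsequent steps would embed the previously generated word and feed it in the same way, reusing `h` and `c`.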
The code (including a pre-trained model) is available here: https://github.com/apple2373/chainer_caption_generation
Preparation and implementation note
The dataset is MSCOCO. It has 80,000 training images, 40,000 validation images, and 40,000 test images. Each image has at least five captions. I first extracted all the image features using pre-trained GoogLeNet, because feature extraction is time-consuming. I also preprocessed the captions: lowercasing all words, replacing words that appear fewer than five times with <UKN> (unknown), and adding <SOS> at the start of each sentence and <EOS> at the end. The final vocabulary size is 8843. Then I grouped the captions by their number of words. This is because I use mini-batch training: within a batch, the tensors must all have the same size, so differing sentence lengths are a problem. I solved this by using only sentences of the same length within a batch.
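The preprocessing steps above can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the repository's actual code; the toy captions and the `min_count=2` threshold are just for the demo (the post uses a threshold of five).

```python
from collections import Counter

def preprocess(captions, min_count=5):
    """Lowercase, replace rare words with <UKN>, and add <SOS>/<EOS>."""
    tokenized = [c.lower().split() for c in captions]
    counts = Counter(w for words in tokenized for w in words)
    vocab = {w for w, n in counts.items() if n >= min_count}
    return [['<SOS>'] + [w if w in vocab else '<UKN>' for w in words] + ['<EOS>']
            for words in tokenized]

def bucket_by_length(processed):
    """Group captions by length so each mini-batch has uniform tensor sizes."""
    buckets = {}
    for cap in processed:
        buckets.setdefault(len(cap), []).append(cap)
    return buckets

# toy example (threshold lowered to 2 so something survives)
caps = ["A dog runs", "a dog sleeps", "The cat runs"]
processed = preprocess(caps, min_count=2)
buckets = bucket_by_length(processed)
```

During training, each batch is then drawn from a single bucket, so every sentence in the batch has the same length.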
Training and Tuning
Because deep learning inherently solves a non-convex optimization problem, the choice of initial weights and biases (parameters) and of hyper-parameters is very important. For parameter initialization, I initialized most of the parameters from a uniform distribution on (-0.1, 0.1). I don't know the reason, but Chainer's sample uses this initialization. I tried other initializations, like the one introduced in this paper, but they did not work for the recurrent LSTM. Moreover, following this paper, I initialized the bias of the forget gate to one. For the hyper-parameters, I just used Adam with its default parameter values, and I clipped the gradient norm to one. The other hyper-parameters are: batch size: 256, number of hidden units: 512, gradient clipping size: 1.0. Here is the loss during training.
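The three tricks above (uniform initialization, forget-gate bias of one, gradient norm clipping) can be sketched in plain numpy. This is a conceptual sketch, not the Chainer code; in particular, the i, f, o, g gate ordering inside the bias vector is an assumption for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
HID = 512

# 1. initialize weights uniformly in (-0.1, 0.1)
W = rng.uniform(-0.1, 0.1, (2 * HID, 4 * HID))
b = np.zeros(4 * HID)

# 2. set the forget-gate bias to one
#    (assumed gate layout: [input | forget | output | cell])
b[HID:2 * HID] = 1.0

# 3. clip the global L2 norm of the gradients to a threshold
def clip_grad_norm(grads, threshold=1.0):
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

grads = [rng.randn(3, 3), rng.randn(5)]   # fake gradients for the demo
clipped = clip_grad_norm(grads, 1.0)
```

A forget-gate bias of one keeps the forget gate mostly open early in training, which helps gradients flow through time before the LSTM has learned what to forget.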
Generating Sentences and Evaluation
After training, I generated sentences by taking the top predicted word at each step until the model predicts <EOS> or the sentence reaches 50 words. I am not familiar with machine translation, but there seem to be several metrics for evaluating quality automatically. CIDEr seems to be the standard one used in the MSCOCO caption ranking. I just used the script here to evaluate the captions. My best model achieves a CIDEr of 0.66 on the MSCOCO validation dataset. To achieve a better score, introducing beam search for sentence generation is the first step (this is mentioned in the original paper, but I have not implemented it yet). I also think the CNN has to be fine-tuned. Here are some evaluation scores. The best CIDEr score is at epoch 35.
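The greedy decoding loop described above is simple enough to sketch directly. The `next_word` callable here is a toy stand-in for the trained LSTM's argmax prediction; the scripted predictor below is purely illustrative.

```python
def greedy_decode(next_word, max_len=50):
    """Generate words one at a time, taking the top prediction each step,
    until <EOS> is produced or the sentence reaches max_len words."""
    sentence = ['<SOS>']
    for _ in range(max_len):
        w = next_word(sentence)
        if w == '<EOS>':
            break
        sentence.append(w)
    return sentence[1:]  # drop the <SOS> marker

# toy predictor standing in for the trained model
script = ['a', 'dog', 'on', 'a', 'beach', '<EOS>']
pred = lambda sent: script[len(sent) - 1]
caption = greedy_decode(pred)  # ['a', 'dog', 'on', 'a', 'beach']
```

Beam search, mentioned above as the next improvement, would replace the single argmax at each step with the top-k partial sentences.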
One thing I do not understand is that the original Google paper reports a CIDEr of 85.5, which seems too high. I guess it is scaled differently from the script I used; I think it means 0.855.