Sunday, December 20, 2015

Image caption generation by CNN and LSTM

I reproduced the image caption generation system that Google presented at CVPR 2015, using Chainer. Given an image, the system generates a description of it. Here are some samples of generated captions. Note that the images are from the validation dataset, i.e., images unseen during training.



It's not perfect, but the captions make sense for some of the images.

How does it work? The idea is simple: it's like machine translation with deep learning. In machine translation, you give the system a sentence in one language and it translates it into another language. In caption generation, you give the system an image and it translates it into a description. Specifically, I used GoogleNet to represent an image as a 1024-dimensional vector (feature extraction). That vector then becomes the input to a recurrent LSTM.
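Here is a minimal sketch of the model structure (assuming Chainer 1.x, the version I used); the layer and function names below are only illustrative and are not necessarily identical to those in my repository.

import chainer
import chainer.functions as F

vocab_size, feature_dim, hidden_dim = 8843, 1024, 512

model = chainer.FunctionSet(
    img_feature2vec=F.Linear(feature_dim, hidden_dim),  # image feature -> hidden
    embed=F.EmbedID(vocab_size, hidden_dim),            # word id -> hidden
    l_x=F.Linear(hidden_dim, 4 * hidden_dim),           # input part of the LSTM gates
    l_h=F.Linear(hidden_dim, 4 * hidden_dim),           # recurrent part of the LSTM gates
    out=F.Linear(hidden_dim, vocab_size),                # hidden -> word scores
)

def one_step(x, state):
    # One LSTM step. x is the projected image feature on the first step and an
    # embedded word on every later step; the output is a score for every word.
    c, h = F.lstm(state['c'], model.l_x(x) + model.l_h(state['h']))
    return model.out(h), {'c': c, 'h': h}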

The code (including the pre-trained model) is available here: https://github.com/apple2373/chainer_caption_generation

Preparation and implementation note

The dataset is MSCOCO. It has 80,000 training images, 40,000 validation images, and 40,000 test images. Each image has at least five captions. I first extracted all the image features using the pre-trained GoogleNet, because feature extraction is time-consuming. I also preprocessed the captions: lowercasing all words, replacing words that appear fewer than five times with <UKN> (unknown), and adding <SOS> at the start and <EOS> at the end of each sentence. The final vocabulary size is 8843. I then grouped the captions by their number of words. This is because I use mini-batch training: within a batch the tensors must have the same size, so differing sentence lengths are a problem. I solved this by using only sentences of the same length within each batch.
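Here is a rough sketch of that preprocessing in plain Python (not the exact preprocessing script in my repository, just the idea):

from collections import Counter

def preprocess_captions(captions, min_count=5):
    # Lowercase and tokenize.
    tokenized = [caption.lower().split() for caption in captions]
    # Count word frequencies over the whole training set.
    counts = Counter(word for words in tokenized for word in words)
    processed = []
    for words in tokenized:
        # Replace rare words with <UKN> and add sentence boundary tokens.
        words = [w if counts[w] >= min_count else '<UKN>' for w in words]
        processed.append(['<SOS>'] + words + ['<EOS>'])
    # Group captions by length so every mini-batch has a uniform tensor shape.
    by_length = {}
    for words in processed:
        by_length.setdefault(len(words), []).append(words)
    return by_length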

Training and Tuning

Because deep learning inherently involves solving a non-convex optimization problem, the choice of the initial weights and biases (parameters) and of the hyper-parameters is very important. For parameter initialization, I initialized most of the parameters from a uniform distribution in (-0.1, 0.1). I don't know the reason, but Chainer's sample uses this initialization. I tried other initializations, such as the one introduced in this paper, but they did not work for the recurrent LSTM. Moreover, following this paper, I initialized the bias of the forget gate to one. For the hyper-parameters, I just used Adam with its default parameter values. I also clipped the norm of the gradient to one. The other hyper-parameters are: batch size 256, number of hidden units 512, gradient clipping size 1.0. Here is the loss during training.
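Here is a sketch of that setup (assuming Chainer 1.x and the FunctionSet sketched above; the commented-out forget-gate line is only illustrative, since the correct slice depends on Chainer's internal gate ordering in F.lstm):

import numpy as np
from chainer import optimizers

# Initialize (almost) all parameters from a uniform distribution in (-0.1, 0.1).
for param in model.parameters:
    param[:] = np.random.uniform(-0.1, 0.1, param.shape)

# model.l_x.b[2::4] = 1.0  # set the forget-gate bias to one (illustrative slice)

optimizer = optimizers.Adam()  # default Adam hyper-parameters
optimizer.setup(model)

def update(loss):
    # One parameter update with the gradient norm clipped to 1.0.
    optimizer.zero_grads()
    loss.backward()
    optimizer.clip_grads(1.0)
    optimizer.update()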

Generating Sentences and Evaluation

After training is done, I generate a sentence by repeatedly taking the top predicted word until the model predicts <EOS> or the sentence reaches 50 words. I am not familiar with machine translation, but there seem to be several metrics for evaluating the quality automatically. CIDEr seems to be the standard one used in the MSCOCO caption ranking. I just used the script here to evaluate the captions. My best model achieves a CIDEr of 0.66 on the MSCOCO validation dataset. To achieve a better score, introducing beam search for sentence generation is the first step (this is mentioned in the original paper, but I have not implemented it yet). I also think the CNN has to be fine-tuned. Here are some evaluation scores. The best CIDEr score is at epoch 35.
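Here is a sketch of that greedy decoding, reusing one_step() and the model from the sketch above (feature is assumed to be a float32 GoogleNet vector of shape (1024,); index2word maps word ids back to words and eos_id is the id of <EOS>, both from the vocabulary built during preprocessing):

import numpy as np
from chainer import Variable

def generate_caption(feature, index2word, eos_id, max_len=50):
    # Start from a zero LSTM state and feed the projected image feature first.
    state = {'c': Variable(np.zeros((1, 512), dtype=np.float32)),
             'h': Variable(np.zeros((1, 512), dtype=np.float32))}
    x = model.img_feature2vec(Variable(feature.reshape(1, -1)))  # (1, 1024) -> (1, 512)
    words = []
    for _ in range(max_len):
        y, state = one_step(x, state)
        word_id = int(np.argmax(y.data))  # greedy: take the top predicted word
        if word_id == eos_id:             # stop at <EOS> or after max_len words
            break
        words.append(index2word[word_id])
        x = model.embed(Variable(np.array([word_id], dtype=np.int32)))
    return ' '.join(words)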



One thing I do not understand is that the original Google paper reports a CIDEr of 85.5, which seems too high. I guess it is scaled differently than in the script I used; I think it means 0.855.

45 comments:

  1. I am using a pretrained CNN model in Caffe. I wish to pass the fc7 features (4096-dimensional vectors) to the LSTM to generate sentences as image descriptions. Can you please give me the steps for this?

    Replies
    1. See the comment below. I did not assume VGG would be used, so you might have to change the code quite a lot. Actually, I did not like VGG because it is memory-consuming.

      The general process is:
      pre-extract VGG features from the training set and validation set, then pickle them.

      modify train_caption_model.py to fit VGG: at least the feature dimension and the pre-extracted feature file. Then train it.

      After training, see the evalutation_script directory. You will also need to modify several lines and files. You will use generate_caption_val.py to generate captions for each model, and evalutate_caption_val.py to evaluate the generated captions so that you can finally select the best model.

      However, I suggest you use karpathy's implementation if you prefer VGG.

  2. I am using the 16-layer VGGnet caffemodel instead of the GoogleNet that you've used for feature extraction. Your code works with a 1024-dimensional feature vector as input to the LSTM, but the output of my CNN is a 4096-dimensional feature vector (from the fc7 layer).

    I have the following queries:
    1. How do I generate a .pkl (pickle) file for my caffemodel (e.g. VGGnet)?
    2. I tried your train_caption_model.py for retraining the LSTM model, but it failed, as your code works only with 1024-dimensional vectors extracted from GoogleNet. How do I train the LSTM model to work with a 4096-dimensional feature vector?

    Replies
    1. Direct answers to your questions:

      1. How do I generate a .pkl (pickle) file for my caffemodel (e.g. VGGnet)?
      There is an official Chainer way to load a caffe model, so you can use it and then pickle the result. http://docs.chainer.org/en/stable/reference/caffe.html

      Actually, I did this with GoogleNet, but I did not want to use this method every time because loading a caffe model is slow. That's why I pickled it.

      2. You need to change image_feature_dim to 4096.
      https://github.com/apple2373/chainer_caption_generation/blob/dfcdc91be084af4a16cd80bae23fb619e7e973ee/codes/train_caption_model.py#L53

      This cannot be specified on the command line in the current implementation.

      Other comments:
      I did not assume that others would replace GoogleNet with VGG, so you might need to change my code quite a lot.
      At least, you need to change this file for training.
      https://github.com/apple2373/chainer_caption_generation/blob/dfcdc91be084af4a16cd80bae23fb619e7e973ee/codes/train_caption_model.py#L49

      This is a dictionary (map) from image_id to feature. You will need to put a numpy array of the VGG feature for each image, which means you need to download the full MSCOCO dataset and prepare the VGG features. Also, you will need a lot of GPU memory to load VGG; VGG is very memory-consuming. That's why I used GoogleNet.
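      A rough sketch of what that dictionary looks like (extract_features and train_images here are hypothetical placeholders for whatever feature-extraction code and image list you use, and the 4096 dimension assumes VGG fc7 features):

      import numpy as np
      import pickle

      train_image_id2feature = {}
      for image_id, feature in extract_features(train_images):  # hypothetical helper
          # one feature vector per image, e.g. shape (4096,) for VGG fc7
          train_image_id2feature[image_id] = np.asarray(feature, dtype=np.float32)

      with open('train_image_id2feature.pkl', 'wb') as f:
          pickle.dump(train_image_id2feature, f, -1)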

      Personally, I suggest you consider the implementation by karpathy. He uses VGG instead of GoogleNet.
      https://github.com/karpathy/neuraltalk2

      This is a Torch implementation, but he organized the code very well, so you don't have to know Lua very much. I think you can train it just by setting command-line options. Also, Torch is faster than Chainer.

    2. Can you please give me the steps to pickle VGG.caffemodel?

    3. I just remembered that I used VGG before. Here is how to load VGG. You just have to pickle func.
      https://github.com/apple2373/chainer_stylenet/blob/82af0d7f20cd00c15dfc8eb252358093ded1aa9d/style_net.py#L133-L137

  3. If I want to train my own model (which uses VGGNET for feature extraction), do I need to make any changes in prepocess_captions.py?

    Replies
    1. No, you don't need to change prepocess_captions.py.

    2. In that case, do I need to make my own train_image_id2feature.pkl file?

    3. Yes, you are right. I just published the code I used to generate train_image_id2feature.pkl. Hope this helps.
      https://github.com/apple2373/chainer_caption_generation/blob/master/codes/pre_extract_googlenet_features.py

  4. Your code is really very well organized and you've done a nice job. Your project is really helpful for me :)

    I have the following queries:
    1. It looks like there are 581921 images with their corresponding 1024-dimensional feature vectors pickled into the train_image_id2feature.pkl file. So, which images are those? Because MSCOCO has only 80,000 images in its training set.

    2. I have to change the feature dimension from 1024 to 4096, so I have to retrain the model accordingly using train_caption_model. To create a new train_image_id2feature.pkl that would work for my features, I need those 581921 images. Where can I get them?

    Replies
    1. Thank you for the positive feedback (^_^)
      How did you find 581921?
      If you execute up to this line
      https://github.com/apple2373/chainer_caption_generation/blob/1a143eac0e64ec398ceb8dbe5901eb3da7a85ce9/codes/train_caption_model.py#L50
      then,
      len(train_captions) = 52
      len(train_caption_id2sentence) = 414113
      len(train_caption_id2image_id) = 414113
      len(train_image_id2feature) = 82783
      None of them is 581921.

      There are about 80,000 images (82783), but remember that each image has about five captions (414113 / 82783 ≈ 5).

  5. When I checked the contents of your train_new_iamge_id2feature.pkl, I got 581921 lines with feature vectors. So, what is that?

    Replies
    1. What do you mean by checking the contents? Did you get 581921 from len(train_image_id2feature)? I don't think so.

    2. I got it. Those were the COCO training image file names :D I thought that was the number of images.

  6. Can you explain the use and working of the forward_one_step_for_image() function? I get an error on this line:
    h0 = model.img_feature2vec(x)
    I don't know why, as my feature vector is 1024-D, i.e. the same as yours.

    Replies
    1. forward_one_step_for_image predicts the first word from the image feature input.

      h0 = model.img_feature2vec(x) is a linear transformation from the 1024-dimensional feature vector to a 512-dimensional vector (the number of hidden units).

    2. Oh, I just thought you might not know the rule that you always have to pass an np.array with a batch dimension, even if the batch size is one. This is Chainer's convention.

      So, you can't pass a (1024,) array directly. It should be (1, 1024) even if you don't use batch training.
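      A minimal sketch of what this means in practice (assuming a 1024-d GoogleNet feature and the model loaded as in my code):

      import numpy as np
      from chainer import Variable

      feature = np.zeros(1024, dtype=np.float32)  # one image feature, shape (1024,)
      x = Variable(feature.reshape(1, -1))        # add the batch dimension: shape (1, 1024)
      h0 = model.img_feature2vec(x)               # linear layer output: shape (1, 512)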

  7. One more doubt. Approximately how much time does it take to run the following files:

    1. train_caption_model to create a new caption model
    2. pre_extract_googlenet_features.py to create a pickle file of training and test image features

    I am using a normal CPU with 4 GB of RAM.

    Replies
    1. How much time did it take for you? What was your system configuration?

    2. I don't remember the exact time, but I think it took less than two hours to extract features and less than six hours to train 50 epochs. I used an Nvidia Tesla K80, a very good GPU.

      I don't know how long it would take in your environment, but I would not use that machine if I were you. It's not enough for deep learning.

  8. Do you have any tutorial for the Chainer implementation? I am new to Chainer, so I could use such a tutorial if there is one.

    Replies
    1. I think the official documentation is the first step.

      Then you might want to check:
      http://multithreaded.stitchfix.com/blog/2015/12/09/intro-to-chainer/
      https://github.com/stitchfix/Algorithms-Notebooks/blob/master/chainer-blog/Introduction_to_Chainer.ipynb

  9. Why have you included "pickle.dump(train_image_id2feature, open(savedir+"val_image_id2feature.pkl", 'wb'), -1)" again at the end of "pre_extract_googlenet_features.py"?? Wouldn't it overwrite the contents of the already generated "val_image_id2feature.pkl"??

    Replies
    1. Oh, yeah, that was a mistake when I copied and pasted from another file. I fixed it. Thank you for the notification.

      As I said, I did not plan to make the feature pre-extraction code available. I just made an ad-hoc one for you because you asked.

  10. Question 3: After completing 200 epochs in train_caption_model.py, I got 200 chainer_models as well as optimizers in the experiment1 folder. Which one should be used as the final LSTM caption generation model in generate_captions.py?

    Replies
    1. The chainer_models files are the ones you use for generating captions. The optimizer files hold the Adam parameters, which can be used when you want to resume training.

    2. Okay. But which of those 200 chainer_models? I need only one file out of the 200, right?

    3. You can generate captions from any of the models. From a machine learning perspective, you should generate captions from all of them on the validation dataset and pick the best. But judging from my experiment, I think it will be OK if you choose a model from after epoch 30.

  11. Can you give me the steps to generate a pickle file of CaffeModel?

    Replies
    1. After importing the caffe model, you just have to pickle func. That's it.
      https://github.com/apple2373/chainer_stylenet/blob/82af0d7f20cd00c15dfc8eb252358093ded1aa9d/style_net.py#L133-L137

  12. We already tried this method, but it takes a lot of time to load. So we wanted to know the command you used to pickle your caffe model.

    Replies
    1. from chainer.functions import caffe
      import pickle

      func = caffe.CaffeFunction('VGG_ILSVRC_19_layers.caffemodel')
      with open('func.pickle', 'wb') as f:
          pickle.dump(func, f)

      That's it.

      I won't be surprised if it takes a lot of time if you use VGG. VGG is very heavy.
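
      Loading it back later is just the reverse (a sketch of the idea; unpickling the CaffeFunction is much faster than re-parsing the .caffemodel every time, which is why I pickled it):

      with open('func.pickle', 'rb') as f:
          func = pickle.load(f)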

  13. I have some doubts and need your help.

    1. I am using VGG with your code, but we are getting the same sentence output for different images.
    2. Do I need to change forward_one_step_for_image()?
    3. In the file generate_caption.py,
    y, = func(inputs={'data': x_chainer_variable}, outputs=['pool5/7x7_s1'],disable=['loss1/ave_pool', 'loss2/ave_pool','loss3/classifier'],train=False)
    What is the meaning of the 'disable' argument? For my VGG (16-layer), what should the value of disable be? I have used disable=[] currently. Also, I am using outputs=['fc7']. What am I doing wrong here?

    Replies
    1. Sorry, my new semester started, and currently I do not have time to investigate the issue.

      As for 3, please ask on Chainer's mailing list. I think you need to learn how to use a caffe model in Chainer; inputs, outputs, disable, etc. all have meanings.

      Again, if you want to use VGG, I strongly recommend karpathy's implementation. He uses VGG instead of GoogleNet, and you can train on your own data.
      https://github.com/karpathy/neuraltalk2

      He is much, much better at deep learning. He is a very famous figure.

  14. I executed your test code, but the result has a problem:

    python generate_caption.py -i ../images/test_image.jpg
    loading vocab
    loading caffe models
    done
    preparing caption generation models
    done
    /usr/local/lib/python2.7/dist-packages/chainer/functions/activation/lstm.py:15: RuntimeWarning: overflow encountered in exp
    return 1 / (1 + numpy.exp(-x))
    sentence generation started
    (446, 446, 3)
    (224, 224, 3)
    ---genrated_sentence--

    a
    person
    flying
    a
    kite
    in
    the
    sky

    0.751564

    why??

    Replies
    1. python generate_caption_beam.py -b 3 -i ../images/test_image.jpg
      /usr/local/lib/python2.7/dist-packages/chainer/functions/activation/lstm.py:15: RuntimeWarning: overflow encountered in exp
      return 1 / (1 + numpy.exp(-x))
      a person flying a kite in the sky 0.0026689476398
      a person flying a kite in the air 0.00173693987611
      a person is flying a kite in the sky 0.00191272892393

      In my test, every image gives the same result:

      a
      person
      flying
      a
      kite
      in
      the
      sky

    2. I don't know why; something is wrong. It should generate a caption for ../images/test_image.jpg

      You could try this notebook, which is more organized:
      https://github.com/apple2373/chainer_caption_generation/blob/master/codes/sample_code.ipynb

  15. I solved it.

    im = skimage.transform.resize(im, (224, w*224/h), preserve_range=True)
    raised an error:
    TypeError: resize() got an unexpected keyword argument 'preserve_range'

    So I modified it to im = skimage.transform.resize(im, (224, w*224/h)),
    but then the problem above appeared again!! ㅠㅠ

    Adding im = img_as_ubyte(im) fixed it.

    Finally, success!!

    Here are the changes to your source:

    MEAN_VALUES = np.array([104, 117, 123]).reshape((3,1,1))
    def image_read_np(file_place):
        im = imread(file_place)
        if len(im.shape) == 2:
            im = im[:, :, np.newaxis]
            im = np.repeat(im, 3, axis=2)
        # Resize so smallest dim = 224, preserving aspect ratio
        h, w, _ = im.shape

        print im.shape

        if h < w:
            im = skimage.transform.resize(im, (224, w*224/h))
        else:
            im = skimage.transform.resize(im, (h*224/w, 224))

        im = img_as_ubyte(im)

        ....

  16. Can you please tell me how to resume LSTM training using optimiser?

    Replies
    1. If you run my code, you will get files named caption_model*.chainer and optimizer*.chainer. The first is the serialized model (it holds the model parameters) and the second is the serialized optimizer (it holds optimizer state such as the learning rate). You can load them again using serializers.load_hdf5.

      An example is in Chainer's repository.
      https://github.com/pfnet/chainer/blob/fbf7ea3a270f5e63cb06c9da9a1c610a68516d9f/examples/mnist/train_mnist.py#L68-L73

      If you want more general information about serialization, please ask on Chainer's mailing list.
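
      A minimal sketch of resuming, following the serializers.load_hdf5 approach above (the file names are just examples for epoch 35, and model/optimizer are assumed to be the objects built by the training script):

      from chainer import serializers

      serializers.load_hdf5('caption_model35.chainer', model)    # restore model parameters
      serializers.load_hdf5('optimizer35.chainer', optimizer)    # restore Adam state
      # ...then continue the usual training loop from here.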

  17. For how many epochs has the model available on GitHub been trained?

    Replies
    1. It was trained for 60 epochs, and the published model is the one from epoch 35.

  18. I wish to check how the model accuracy changes as the training data size changes. Can I train it on 10k MSCOCO images?
