Monday, June 20, 2016

Pre-deep aspect of deep learning

When you play with deep learning, you need data. In most examples, it’s prepared for you (e.g. MNIST or CIFAR-10). For example, in the TensorFlow tutorial:

for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

“mnist.train.next_batch(100)”!! This is great. But how do you do this on your own data? That’s the issue I call “pre-deep”. It’s ironic when you realize you are spending more time on the preprocessing part when you actually want to enjoy deep learning on your data. It shouldn’t be like that. My philosophy is that most of the time should be spent on the “deep” part: finding the best model architectures and parameters. Preprocessing should be quick and easy.

I summarize my thoughts on the issue here. This is a kind of brainstorming to determine what kind of library I want. If developers of deep learning libraries see this, please consider including these features. They are what most users wonder about when they actually try their own data. Perhaps it’s better to separate the preprocessing library from the framework so that users can also use it with other deep learning frameworks.

Let’s use image classification as an example. Let x be a Python instance that hides everything (i.e. I want to do “x.train.next_batch(100)” like in the TensorFlow example). The library has the following functions.

User Input: We need to fix the input data format used to initialize x with our dataset. I can think of two ways:

 1: Text file with path and class, i.e. each text line is
[path/to/image.jpeg], [label]
 2: Directory by classes. The directory will look like:
[label]/[image.jpeg]
 What users can tune:

  • Network Input Size: fixed (e.g. 224×224), flexible up to a maximum size, or only the aspect ratio fixed.
  • Mean Image Subtraction: whether to precompute the mean image in advance and subtract it.
  • Dataset Split: split the dataset into train, validation, and test. Sometimes the data is already divided by the provider; in that case, the user can specify the split when inputting the data.
  • Shuffle Dataset: after each epoch, the order of the data should be shuffled.
  • Data Augmentation: the user can designate augmentations (flip, rotate, noise, etc.) to be applied either on the fly or in a preprocessing pass. Keras seems to have a similar feature.
  • In Memory or Not: really large datasets do not fit in memory. There should be an option for online loading.
  • Image Storing Methods: storing raw image files on disk is not so efficient. Some people use the lmdb or hdf5 format, so the library should be able to use them internally.
  • Feature Extraction: sometimes we only need features from middle layers, such as after conv5_3 in VGG or before the softmax in GoogleNet. The library should have a function to precompute these features.
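As a concrete sketch of what I mean, here is a minimal, hypothetical wrapper over the text-file input format (option 1 above) that supports dataset splitting, shuffling after each epoch, and next_batch. All names are illustrative; real image loading, mean subtraction, and augmentation would hook into next_batch:

```python
import random

class DataSplit:
    """Holds one split (train/validate/test) and serves shuffled mini-batches."""
    def __init__(self, pairs):
        self.pairs = list(pairs)          # list of (image_path, label)
        self.pos = 0
        random.shuffle(self.pairs)

    def next_batch(self, n):
        if self.pos + n > len(self.pairs):  # epoch finished:
            random.shuffle(self.pairs)      # reshuffle and restart
            self.pos = 0
        batch = self.pairs[self.pos:self.pos + n]
        self.pos += n
        paths, labels = zip(*batch)
        # image decoding / mean subtraction / augmentation would happen here
        return list(paths), list(labels)

class Dataset:
    """x = Dataset(lines); then x.train.next_batch(100) works."""
    def __init__(self, lines, ratios=(0.8, 0.1, 0.1)):
        pairs = [tuple(line.rsplit(',', 1)) for line in lines]
        random.shuffle(pairs)
        n = len(pairs)
        a = int(n * ratios[0])
        b = a + int(n * ratios[1])
        self.train = DataSplit(pairs[:a])
        self.validate = DataSplit(pairs[a:b])
        self.test = DataSplit(pairs[b:])
```

With this, the TensorFlow-style training loop at the top of the post works unchanged on your own label file.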

Note this is just the simplest case, with only one label per image. But I still couldn’t find an easy library to do even that. I know caffe has some of the features I mentioned above, but it’s highly customized for caffe. Moreover, caffe is too complicated to install; especially if you do not have root privileges, as on a university server, it’s almost impossible.

There are more complicated situations in reality. I assumed image classification with one label here because the purpose of this post is to raise the need. Once we have a library for that, we can think about extensions to deal with other situations: localization, multiple labels, multiple localization (object detection), or even detection with multiple labels. Moreover, you might have captions or question-answer pairs on images.

Another story is text processing with RNNs. One famous issue is how to make a batch from sequences of different lengths (e.g. Keras has a padding function). But padding sequences of completely different sizes is not good, because the padded part is a waste of computational resources. So common practice is to make batches from sequences of almost the same length, or sometimes even only of exactly the same length. This requires pre-indexing by sequence size. Moreover, after training, we need beam search to generate sequences. These parts should also be included in the library.
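The strict version of this bucketing idea can be sketched like this (a toy illustration, not any particular library’s API): pre-index sequences by length, then batch only sequences of exactly the same length so no padding is needed at all:

```python
import random
from collections import defaultdict

def length_batches(sequences, batch_size):
    """Group sequences of exactly the same length into mini-batches,
    so no padding (and no wasted computation) is needed."""
    buckets = defaultdict(list)
    for seq in sequences:            # pre-index by sequence length
        buckets[len(seq)].append(seq)
    batches = []
    for same_len in buckets.values():
        random.shuffle(same_len)
        for i in range(0, len(same_len), batch_size):
            batches.append(same_len[i:i + batch_size])
    random.shuffle(batches)          # don't present lengths in a fixed order
    return batches
```

The relaxed version (batches of *almost* the same length, padded to the batch maximum) would sort by length instead of bucketing exactly.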

Lastly, I know this is not research, just an engineering product. But it is true that researchers’ time is wasted on this routine part. I want to argue that there should be a unified library that lets you do preprocessing without thinking about the details, just like we rarely implement backpropagation ourselves now.

Feel free to comment with more features that you think a preprocessing library should have.

Friday, January 1, 2016

So, I machine-translated the training data for English caption generation (MS COCO) and trained on it. The algorithm is exactly the same as for English. Since the sentences are short, machine translation seems to yield reasonably good captions.

The model can be downloaded via the same GitHub repository as the English one. For details, see the section "I want to generate Japanese caption."


Sunday, December 20, 2015

Image caption generation by CNN and LSTM

I reproduced the image caption generation system presented at CVPR 2015 by Google, using chainer. If you give it an image, a description of the image is generated. Here are some samples of generated captions. Note the images are from the validation dataset, i.e. unseen during training.

It's not perfect, but some of them make sense.

How does it work? The idea is simple: it’s like machine translation by deep learning. In machine translation, you give the system a sentence in one language and it translates it into another language. In caption generation, you give it an image and the system translates it into a description. Specifically, I used GoogleNet to represent the image as a 1024-dimensional vector (feature extraction). Then the vector becomes the input to a recurrent LSTM.

The code (including a pre-trained model) is available here:

Preparation and implementation note

The dataset is MSCOCO. It has 80,000 training images, 40,000 validation images, and 40,000 test images. Each image has at least five captions. I first extracted all the image features using the pre-trained GoogleNet, because feature extraction is time-consuming. I also preprocessed the captions: making words lower case, replacing words that appear fewer than five times with <UKN> (unknown), and adding <SOS> at the start of each sentence and <EOS> at the end. The final vocabulary size is 8843. Then I organized the captions by their number of words. This is because I use mini-batch training: for each batch, the tensor sizes must be the same, so differing sentence lengths are a problem. I solved the issue by using only sentences of the same length within a batch.
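The caption preprocessing can be sketched roughly like this (illustrative code, not my actual implementation; the token names follow the post):

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Lower-case the captions and keep only words appearing >= min_count times;
    everything else will map to <UKN>."""
    counts = Counter(w for c in captions for w in c.lower().split())
    vocab = {'<UKN>': 0, '<SOS>': 1, '<EOS>': 2}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(caption, vocab):
    """Wrap the caption in <SOS> ... <EOS> and convert words to ids."""
    words = ['<SOS>'] + caption.lower().split() + ['<EOS>']
    return [vocab.get(w, vocab['<UKN>']) for w in words]
```

After encoding, the captions are grouped by length so each mini-batch contains sentences of one length only.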

Training and Tuning

Because deep learning inherently solves a non-convex optimization problem, the choice of initial weights and biases (parameters) and of hyper-parameters is very important. For parameter initialization, I initialized most of the parameters from a uniform distribution on (-0.1, 0.1). I don’t know the reason, but chainer’s sample uses this initialization. I tried other initializations, like the one introduced in this paper, but they did not work for the recurrent LSTM. Moreover, following this paper, I initialized the bias of the forget gate to one. For the hyper-parameters, I just used Adam with its default parameter values, and clipped the gradient norm to one. The other hyper-parameters are: batch size: 256, number of hidden units: 512. Here is the loss during training.
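The initialization and clipping recipe above can be sketched in plain NumPy (just to illustrate what each piece does; the actual training uses chainer’s built-in equivalents):

```python
import numpy as np

rng = np.random.RandomState(0)

def init_uniform(shape, scale=0.1):
    """Initialize parameters uniformly in (-0.1, 0.1), as chainer's sample does."""
    return rng.uniform(-scale, scale, shape)

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients together so the global L2 norm is at most max_norm."""
    norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads

b_forget = np.ones(512)  # forget-gate bias initialized to one
```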

Generating Sentences and Evaluation.

After training, I generated sentences by taking the top predicted word at each step until the model predicts <EOS> or reaches 50 words. I am not familiar with machine translation, but there seem to be several metrics to evaluate quality automatically. CIDEr seems to be the standard one used in the MSCOCO caption ranking. I just used the script here to evaluate the captions. My best model achieves a CIDEr of 0.66 on the MSCOCO validation dataset. To achieve a better score, introducing beam search for sentence generation is the first step (this is mentioned in the original paper, but I have not implemented it yet). Also, I think the CNN has to be fine-tuned. Here are some evaluation scores. The best CIDEr score is at epoch 35.
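The greedy generation loop can be sketched like this (a toy illustration; step stands in for the trained LSTM’s single-step function, and the id names are placeholders):

```python
import numpy as np

def greedy_decode(step, state, sos_id, eos_id, max_len=50):
    """Repeatedly feed the top predicted word back in until <EOS> or 50 words.
    `step(word_id, state) -> (scores, state)` is one step of the trained LSTM."""
    word, out = sos_id, []
    for _ in range(max_len):
        scores, state = step(word, state)
        word = int(np.argmax(scores))   # greedy: take the single best word
        if word == eos_id:
            break
        out.append(word)
    return out
```

Beam search would replace the single argmax with the top-k candidates kept at every step, which is why it tends to score better.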

One thing I do not understand is that the original Google paper says it achieved a CIDEr of 85.5, which seems too high. I guess this is scaled differently from the script I used, and actually means 0.855.

Sunday, September 6, 2015

A Neural Algorithm of Artistic Style with Chainer

See this picture. Isn't it awesome?
I implemented the algorithm, which synthesizes a painting and a photo so as to render the photo in the painting’s style. Here are more images showing how the learning progresses.

Well, it’s been two weeks since classes started. If I summarize my Ph.D. student life so far in a word, it’s busy. I have far more reading and writing than in Japanese universities. However, if I only do my work, I will get depressed. I need to do something I can enjoy.

So, what do I like? I love technology! That’s why I implemented an interesting algorithm proposed recently, from a paper called A Neural Algorithm of Artistic Style. If you are not a computer science student, this news article is helpful for understanding what I implemented.

Some thoughts when playing:
  • The paper didn’t explicitly mention how they optimized, so I used simple gradient descent at first, but it was too slow. Then I used Adam to minimize the loss, which I think is one of the latest optimization methods. I think gradient descent with momentum would also be fine.
  • One difference from the paper is that I could not use average-pooling instead of max-pooling in the VGG pooling layers. I wanted to change max-pooling to average-pooling, but I did not know how to change part of a layer of the imported caffe model in chainer. I need to figure out how to change the network structure of an imported caffe model in chainer.
  • There are several parameters that change the output, but I am still not sure which parameter values work best across all paintings and photos.
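For reference, the core of the loss the algorithm minimizes is easy to state: the style of a layer is its Gram matrix (correlations between feature maps), and the total loss trades off matching the photo’s features against matching the painting’s Gram matrices. A NumPy sketch (the weights and normalization here are illustrative, not the paper’s exact constants):

```python
import numpy as np

def gram_matrix(feature):
    """Style representation of one layer: correlations between feature maps.
    feature: (channels, height, width) activations of one VGG layer."""
    c, h, w = feature.shape
    f = feature.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_content_loss(feat, content_feat, style_gram, alpha=1.0, beta=1000.0):
    """Weighted sum of content loss (match features directly) and style loss
    (match Gram matrices). The ratio alpha/beta is one of the knobs I
    mentioned above."""
    content = ((feat - content_feat) ** 2).mean()
    style = ((gram_matrix(feat) - style_gram) ** 2).mean()
    return alpha * content + beta * style
```

The synthesized image is then obtained by running the optimizer (Adam, in my case) on the pixels themselves to minimize this loss summed over the chosen VGG layers.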

Here is the code on github:
You can easily run it on both GPU and CPU. I recommend using a GPU if available; it is so much faster.

This was the first time I played with computer vision. To me, it seems like the progress of deep learning in computer vision is already past its peak, and the research is now shifting to natural language processing. Although I am not a Ph.D. student specializing in deep learning, I look forward to the progress!

Saturday, June 20, 2015

Follow-up to pixel to pixel MNIST by IRNN

This post is a follow-up of sorts to my previous post.

Firstly, my blog caught the attention of chainer’s developers and was introduced on their official Twitter account. Thanks, PFI.
Also, there is a Reddit thread on the paper I implemented. Richard Socher had already invented the identity-matrix initialization trick before the paper. The first version of Hinton’s paper did not cite Socher’s paper, but the latest version does; you can check by contrasting versions 1 and 2. It is surprising that god-like researchers (Hinton) failed to recognize the work from Stanford. I think this means deep learning is making progress at such a fast pace that even top researchers sometimes cannot keep track of famous work (like the paper from Stanford).

Last but not least, my friend (@masvsn) refined my code and achieved 94% test accuracy. It seems Adam makes the difference. Thank you, @masvsn. I can learn more implementation techniques from his refined code.
He is a Ph.D. student in computer vision. In fact, I know him in person (but he does not identify himself online, so I will keep his name anonymous), and I have been sitting in on his weekly seminar on machine learning with an emphasis on deep learning. I knew a little about deep learning and general feed-forward neural networks, but it was he who “deeply” taught me the latest deep learning progress with useful and practical techniques. In particular, his implementation exercises helped me a lot in enhancing my implementation skills. Perhaps I could not have implemented the IRNN if I had not attended his seminar. I really appreciate him.

One thing I am wondering is why the IRNN has shown no sign of overfitting so far. This is the latest plot from my own implementation. I have been running it for a week. Yeah, it is still learning, very slowly.

Both his plot and mine show no symptom of overfitting. I cannot judge this from the original paper because it only plots test accuracy, not training accuracy. However, if this also holds for other problems, the IRNN has a great ability to generalize. If anybody knows, please let me know in the comments.

I finish this post with his code. Thank you @masvsn, again.

Monday, June 15, 2015

Implementing Recurrent Neural Net using chainer!

I just started to study deep learning, which is a huge boom both in academia and industry. A week ago, a Japanese company called Preferred Infrastructure (PFI) released a new deep learning framework, "chainer"! This framework is really great. I was able to implement a recurrent net in less than 100 lines of Python code.

Specifically, I tried a new recurrent neural network (RNN) called the IRNN, described in Hinton's recent paper "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units." It used to be difficult to train an RNN to learn such long dependencies, but the IRNN overcomes this by initializing the recurrent weights with the identity matrix and using ReLU as the activation function. Awesome!

In this post, I will write about my experiment with the IRNN recognizing MNIST digits by feeding the 784 pixels to the recurrent net in sequential order (an experiment from section 4.2 of the paper).

The techniques and best parameter values in the paper are:
  • Initialize the recurrent weight matrix with the identity matrix
  • Initialize the other weight matrices from a Gaussian distribution with mean 0 and standard deviation (std) 0.001
  • Activation function: ReLU
  • Train the network using SGD
  • Learning rate: 10^-8, gradient clipping value: 1, mini-batch size: 16
But I did not use exactly the same settings, because the net seemed to learn faster, at least over the first few epochs, with these changes:

  • Initialize the other weight matrices from a Gaussian distribution with mean 0 and standard deviation (std) 0.01
  • No gradient clipping
  • The other settings are the same as in the paper.
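The essence of the IRNN is easy to show in NumPy (a sketch of one recurrent step under my settings above, not the actual chainer code; the hidden size here is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
n_in, n_hidden = 1, 100          # one pixel per time step

W_xh = rng.normal(0.0, 0.01, (n_hidden, n_in))  # Gaussian, std 0.01 (my change)
W_hh = np.eye(n_hidden)                          # recurrent weights = identity
b_h = np.zeros(n_hidden)

def irnn_step(x, h):
    """One IRNN step: identity-initialized recurrence with ReLU activation."""
    return np.maximum(0.0, W_xh @ x + W_hh @ h + b_h)

# Feed the 784 pixels of one (dummy) MNIST image in sequential order.
state = np.zeros(n_hidden)
for pixel in rng.rand(784):
    state = irnn_step(np.array([pixel]), state)
```

The identity initialization means that, before training, the hidden state is simply carried forward unchanged (plus the new input), which is what lets gradients survive across 784 steps.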

The problem is that it takes 50 minutes to run each epoch (forward and backpropagation over the whole dataset once) in my local environment (CPU). Perhaps it's better to buy a GPU or use an AWS GPU instance. Anyway, I have been running it on the CPU for two days so far! The result is shown in the following figure. Though the learning is very slow, the net definitely learns! Cool! I will continue my experiment.

In the paper, they continue learning for up to 1,000,000 steps. At first, I thought one step was just one parameter update from 16 examples (one batch), but after trying it myself, I started to think a step is not one update but the updates over a whole dataset. I am not sure what the word (step or iteration in the paper) literally means, but if it is just one update, my plot above cannot be explained when compared with the plot in the paper.

Sometimes technical words in deep learning or machine learning are confusing to foreigners like me who studied science in a language other than English. I am sometimes not sure whether epoch, iteration, or step means just one batch update or all the updates over a whole dataset. Does it depend on the situation, or is there a clear distinction? Does anybody know?

Anyway, I successfully and relatively easily implemented the IRNN for pixel-by-pixel MNIST. I think chainer made a huge difference. Implementing a recurrent net is easier with chainer than with other popular deep learning libraries such as Theano or Torch.

I finish this post with my implementation code. In the next post, I may explain how to set up chainer (though it's super easy) and describe the code.

Friday, June 12, 2015

First Post

Hello, my name is Satoshi, a Japanese student who will be a Ph.D. student at IU Bloomington starting this fall. I am currently in Tokyo and will move to Bloomington in mid-July. Looking forward to my new life in the U.S.!

I will write both about daily life and about technical matters.

I will mainly write in English, but sometimes in Japanese.