Saturday, June 20, 2015

Follow-up to pixel to pixel MNIST by IRNN

This post is a short follow-up to my previous one.

Firstly, my blog caught the attention of chainer's developers and was introduced on their official Twitter account. Thanks, PFI.
Also, there is a Reddit thread on the paper I implemented. It turns out Richard Socher had already invented the initialization trick with the identity matrix before this paper. The first version of Hinton's paper did not cite Socher's work, but the latest version does; you can confirm this by contrasting versions 1 and 2. It is surprising that even a researcher as eminent as Hinton failed to recognize the work from Stanford. To me this suggests that deep learning is progressing so fast that even the top researchers sometimes overlook famous results (like the paper from Stanford).

Last but not least, my friend @masvsn refined my code and achieved 94% test accuracy. It seems Adam makes the difference. Thank you, @masvsn. I can learn more implementation techniques from his refined code.
He is a Ph.D. student in computer vision. In fact, I know him in person (but since he does not identify himself online, I will keep his name anonymous), and I have been sitting in on his weekly seminar on machine learning with an emphasis on deep learning. I knew a little about deep learning and general feed-forward neural networks, but it was he who "deeply" taught me the latest progress in deep learning, along with useful and practical techniques. In particular, his implementation exercises did a lot to sharpen my implementation skills. I probably could not have implemented IRNN had I not attended his seminar. I really appreciate him.
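Since Adam seems to be what made the difference, here is a minimal NumPy sketch of the Adam update rule for reference. This is my own sketch of the standard algorithm (Kingma & Ba's paper), not @masvsn's actual code:

```python
import numpy as np

def adam_update(param, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step. `state` carries the running moment estimates
    m, v and the step counter t, and is updated in place."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad        # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2   # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])           # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

state = {"m": 0.0, "v": 0.0, "t": 0}
p = adam_update(0.0, 1.0, state)  # first step moves the parameter by about lr
```

Unlike plain SGD, the per-parameter step size adapts to the gradient history, which may be why it trains so much faster here.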

One thing I am wondering about is why IRNN has shown no sign of overfitting so far. Below is the latest plot from my own implementation. I have been running it for a week, and yes, it is still learning, very slowly.

Neither his plot nor mine shows any symptom of overfitting. I cannot judge this from the original paper because it only plots test accuracy, not training accuracy. If the same holds for other problems, though, IRNN has a great ability to generalize. If anybody knows more, please tell me in the comments.

I will finish this post with his code. Thank you again, @masvsn.

Monday, June 15, 2015

Implementing Recurrent Neural Net using chainer!

I just started studying deep learning, which is a huge boom in both academia and industry. A week ago, a Japanese company called Preferred Infrastructure (PFI) released a new deep learning framework, "chainer"! This framework is really great: I was able to implement a recurrent net in less than 100 lines of Python code.

Specifically, I tried a new recurrent neural network (RNN) called IRNN, described in the recent Hinton paper "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units." It used to be difficult to train an RNN to learn such long dependencies, but IRNN overcomes this by initializing the recurrent weights with the identity matrix and using ReLU as the activation function. Awesome!

In this post, I will write about my experiment using IRNN to recognize MNIST digits by feeding all 784 pixels to the recurrent net in sequential order (the experiment in section 4.2 of the paper).
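To give a feel for the setup, here is a plain NumPy sketch of the forward pass (not my actual chainer code; the hidden size and random inputs are placeholders). The recurrent matrix starts as the identity and the activation is ReLU, per the IRNN trick, and each of the 784 pixels is fed in as one time step:

```python
import numpy as np

n_hidden, n_classes = 100, 10
rng = np.random.default_rng(0)

# IRNN trick: recurrent weights start as the identity, other weights as small Gaussians
W_xh = rng.normal(0.0, 0.001, size=(1, n_hidden))   # each time step sees one pixel
W_hh = np.eye(n_hidden)                             # recurrent: identity matrix
W_hy = rng.normal(0.0, 0.001, size=(n_hidden, n_classes))
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_classes)

def forward(image):
    """Run the RNN over one 28x28 image, one pixel per step (784 steps),
    and return class scores from the final hidden state."""
    h = np.zeros(n_hidden)
    for pixel in image.reshape(-1):                 # sequential pixel feed
        h = np.maximum(pixel * W_xh[0] + h @ W_hh + b_h, 0.0)  # ReLU
    return h @ W_hy + b_y

scores = forward(rng.random((28, 28)))
```

With the identity initialization, the hidden state is simply copied forward (plus the tiny input contribution) at the start of training, which is what lets gradients survive 784 steps.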

The techniques and best parameter values in the paper are:
  • Initialize the recurrent weight matrix with the identity matrix
  • Initialize the other weight matrices from a Gaussian distribution with mean 0 and standard deviation (std) 0.001
  • Use ReLU as the activation function
  • Train the network using SGD
  • Learning rate: 10^-8, gradient clipping value: 1, mini-batch size: 16
But I did not use exactly the same settings, because the net seemed to learn faster with mine, at least over the first few epochs. I used:

  • The other weight matrices initialized from a Gaussian distribution with mean 0 and standard deviation (std) 0.01
  • No gradient clipping
  • All other settings the same as in the paper
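The update rule itself is then very simple. Here is a sketch of SGD with gradient clipping as in the paper's settings; I assume clipping by global gradient norm, since the paper does not spell out the exact clipping form. Passing clip=None corresponds to my modified setup:

```python
import numpy as np

def sgd_update(param, grad, lr=1e-8, clip=1.0):
    """One SGD step. If `clip` is set, rescale the gradient so its
    global norm is at most `clip` before applying the update."""
    if clip is not None:
        norm = np.linalg.norm(grad)
        if norm > clip:
            grad = grad * (clip / norm)
    return param - lr * grad

p = np.ones(3)
g = np.full(3, 10.0)        # norm ~ 17.3, so it gets clipped down to 1
p_new = sgd_update(p, g)
```

Clipping guards against the exploding gradients that ReLU RNNs are prone to; with the tiny 10^-8 learning rate it may matter less, which perhaps explains why dropping it did not hurt in my runs.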

The problem is that each epoch (one forward and backward pass over the whole dataset) takes 50 minutes in my local environment (CPU). Perhaps it would be better to buy a GPU or use an AWS GPU instance. Anyway, I have been running it on the CPU for two days so far! The result is shown in the following figure. Though the learning is very slow, the net definitely learns! Cool! I will continue my experiment.

In the paper, they train for up to 1,000,000 steps. At first, I thought one step was just one parameter update on 16 examples (one mini-batch), but after trying it myself, I started to think a step is not one update but the updates over a whole dataset. I am not sure what the word ("step" or "iteration" in the paper) literally means, but if it is just one update, my plot above cannot be explained when compared with the plot in the paper.
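If a step really is one mini-batch update, simple arithmetic shows the scale involved:

```python
train_size = 60_000   # MNIST training examples
batch_size = 16

updates_per_epoch = train_size // batch_size      # 3750 updates per epoch
epochs_for_1m = 1_000_000 / updates_per_epoch     # about 267 epochs
print(updates_per_epoch, round(epochs_for_1m))    # 3750 267
```

At roughly 50 minutes per epoch on my CPU, 267 epochs would take more than nine days, which is why the interpretation of "step" matters so much for reading their plots.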

Technical words in deep learning and machine learning are sometimes confusing to foreigners like me who did not study science in English. I am sometimes unsure whether "epoch," "iteration," or "step" means a single mini-batch update or the updates over a whole dataset. Does it depend on the situation, or is there a clear distinction? Does anybody know?

Anyway, I successfully, and relatively easily, implemented IRNN for pixel-to-pixel MNIST. I think chainer made a huge difference: implementing a recurrent net is easier with chainer than with other popular deep learning libraries such as Theano or Torch.

I will finish this post with my implementation code. In the next post, I may explain how to set up chainer (though it is super easy) and describe the code.

Friday, June 12, 2015

First Post

Hello, my name is Satoshi, a Japanese student who will start a Ph.D. at IU Bloomington this fall. I am currently in Tokyo and will move to Bloomington in mid-July. I am looking forward to my new life in the U.S.!

I will write about both daily life and technical matters.

I will mainly write in English, but sometimes in Japanese.