Monday, June 15, 2015

Implementing Recurrent Neural Net using chainer!

I just started studying deep learning, which is booming in both academia and industry. A week ago, a Japanese company called Preferred Infrastructure (PFI) released a new deep learning framework, "chainer"! This framework is really great. I was able to implement a recurrent net in fewer than 100 lines of Python code.

Specifically, I tried a new recurrent neural network (RNN) called IRNN, described in a recent paper co-authored by Hinton, "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units." It has been difficult to train RNNs to learn long-range dependencies, but IRNN overcomes this by initializing the recurrent weights with the identity matrix and using ReLU as the activation function. Awesome!

In this post, I will write about my experiment using IRNN to recognize MNIST digits by feeding the 784 pixels (28 x 28) of each image to the recurrent net in sequential order (an experiment in Section 4.2 of the paper).

The technique and best parameter values in the paper are:
  • Initialize the recurrent weight matrix with the identity matrix
  • Initialize the other weight matrices by sampling from a Gaussian distribution with mean 0 and standard deviation (std) 0.001
  • Use ReLU as the activation function
  • Train the network using SGD
  • Learning rate: 10^-8, gradient clipping value: 1, mini-batch size: 16
But I did not use exactly the same settings, because with the changes below the net seems to learn faster, at least in the first few epochs. I used:

  • Initialize the other weight matrices by sampling from a Gaussian distribution with mean 0 and standard deviation (std) 0.01
  • No gradient clipping
  • The other settings are the same as in the paper.
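The settings listed above can be sketched in plain NumPy. This is only an illustration of the initialization and the forward pass, not the chainer implementation: the hidden size of 100, the helper names `init_irnn` and `forward`, and feeding each pixel as a scalar scaled to [0, 1] are all assumptions made for this sketch.

```python
import numpy as np

def init_irnn(n_in, n_hidden, n_out, std=0.01, seed=0):
    """Identity recurrent matrix, small Gaussian weights elsewhere, zero biases."""
    rng = np.random.default_rng(seed)
    return {
        "W_xh": rng.normal(0.0, std, (n_in, n_hidden)).astype(np.float32),
        "W_hh": np.identity(n_hidden, dtype=np.float32),  # identity initialization
        "b_h": np.zeros(n_hidden, dtype=np.float32),
        "W_hy": rng.normal(0.0, std, (n_hidden, n_out)).astype(np.float32),
        "b_y": np.zeros(n_out, dtype=np.float32),
    }

def forward(params, images):
    """images: (batch, n_pixels) floats in [0, 1]; returns logits of shape (batch, n_out)."""
    batch = images.shape[0]
    h = np.zeros((batch, params["W_hh"].shape[0]), dtype=np.float32)
    for pixel in images.T:               # one pixel per time step
        x = pixel.reshape(batch, 1)      # (batch, 1)
        pre = x @ params["W_xh"] + h @ params["W_hh"] + params["b_h"]
        h = np.maximum(0.0, pre)         # ReLU
    return h @ params["W_hy"] + params["b_y"]

params = init_irnn(n_in=1, n_hidden=100, n_out=10)
dummy = np.random.default_rng(1).random((16, 784)).astype(np.float32)  # fake mini-batch
logits = forward(params, dummy)
print(logits.shape)  # (16, 10)
```

Training (SGD with learning rate 10^-8 and mini-batches of 16) is left out; the sketch only shows where the identity matrix and ReLU enter the recurrence.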

    The problem is that each epoch (one forward and backward pass over the whole dataset) takes 50 minutes in my local environment (CPU). Perhaps it's better to buy a GPU or use an AWS GPU instance. Anyway, I have been running it on the CPU for two days so far! The results are shown in the following figure. Though learning is very slow, the net is definitely learning! Cool! I will continue my experiment.

    In the paper, they continue training for up to 1,000,000 steps. At first, I thought one step was just one parameter update from 16 examples (a batch), but after trying it myself, I started to think that a step is not a single update but all the updates over the whole dataset. I am not sure what the word (step or iteration in the paper) literally means, but if it is just one update, I cannot explain my plot above when I compare it with the plot in the paper.
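To make the two readings of "step" concrete, here is a small back-of-the-envelope calculation. It assumes the standard MNIST training set of 60,000 images, which is not stated above; the batch size and the time per epoch are the values from this post.

```python
num_train = 60000        # assumption: standard MNIST training set size
batch_size = 16
minutes_per_epoch = 50   # as reported above for the CPU run

# Number of parameter updates in one pass over the training data.
updates_per_epoch = num_train // batch_size
print(updates_per_epoch)  # 3750

# If one "step" in the paper were a single mini-batch update:
paper_steps = 1_000_000
epochs_if_step_is_update = paper_steps / updates_per_epoch
print(round(epochs_if_step_is_update, 1))  # 266.7

# Wall-clock time for that many epochs at 50 minutes each.
days = epochs_if_step_is_update * minutes_per_epoch / 60 / 24
print(round(days, 1))  # 9.3
```

Under the other reading (one step is one full pass over the data), 1,000,000 steps would mean 1,000,000 epochs, so the two interpretations differ by a factor of 3,750 under this assumption.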

    Sometimes technical terms in deep learning or machine learning are confusing to foreigners like me who studied science in a language other than English. I am sometimes not sure whether epoch, iteration, or step means a single batch update or all the updates over the whole dataset. Does it depend on the situation, or is there a clear distinction? Does anybody know?

    Anyway, I successfully and relatively easily implemented IRNN for pixel-by-pixel MNIST. I think chainer made a huge difference. Implementing a recurrent net is easier with chainer than with other popular deep learning libraries such as Theano or Torch.

    I will finish this post with my implementation code. In the next post, I may explain how to set up chainer (though it's super easy) and describe the code.


    1. Wow, great work :) If you find out exactly what epoch means, please explain it to me too!

      Good luck.

      1. From what I have read so far, I am now sure that an epoch means the parameter updates over the whole training data, not just one update from a single batch.

        I am still not sure about iteration or step. At least in this paper, I guess it is the same as an epoch, judging from my plot and the plot in the paper.

    2. An epoch is a complete pass over all the available data. If you divide your dataset into N batches of size M, all M * N examples must be used in training to constitute one epoch. It is basically the number of times you show a particular example to your model.

    3. Hi
      Thank you for your excellent code. I have some questions about it and need your help.

      1) I have a dataset with the following specification:
      x_all is a 3*5000 matrix of float data like [1.660513 4.532905 0.13058 ; 5.107513 6.365503 4.571937 ; ...]
      y_all is a 1*5000 matrix containing only 0 or 1, like [1;1;0;0;0; ...]
      and there is no test dataset.
      I've changed the following lines in the code:
      1) x_train = np.array(data).astype(np.float32)
      2) y_train = np.array(label).astype(np.int32)

      3) model.pixel_to_h = F.Linear(3,100)
      4) model.pixel_to_h.W = np.random.normal(0, 0.01,(3, 100)).astype(np.float32)
      5) model.h_to_h = F.Linear(100,100)
      6) model.h_to_h.W=np.identity(100).astype(np.float32)
      7) model.h_to_y = F.Linear(100,1)

      8) batchsize = 128
      9) num_train_data=5000

      But I get the following error:

      File "E:/", line 46, in forward
      h = F.relu(model.pixel_to_h(pixel) + model.h_to_h(h))
      File "E: \python\third party\chainer-master\chainer\", line 172, in __call__
      File "e:\ python\third party\chainer-master\chainer\", line 199, in _check_data_type_forward
      File "e: \python\third party\chainer-master\chainer\functions\connection\", line 101, in check_type_forward
      type_check.Variable(self.W.shape[1], 'W.shape[1]')),
      File "e: \python\third party\chainer-master\chainer\utils\", line 457, in expect
      File "e: \python\third party\chainer-master\chainer\utils\", line 428, in expect
      '{0} {1} {2}'.format(left, self.inv, right))
      chainer.utils.type_check.InvalidType: Expect: in_types[0].ndim >= 2
      Actual: 1 < 2

      Also, do I need to use Embed with this dataset? And in general, why do we use Embed?

      Thank you very much
      Best Regards,

      1. I tried to run the code again on the latest chainer (1.3), but I couldn't. Something was wrong with the type check. I wrote this code on the first version of chainer (1.0, I think), before type checking was introduced.

        I have now modified the code. It works on chainer 1.3 on CPU.
        You can check revisions here:

        As for the choice of function, I intentionally chose F.EmbedID to make the coding easier.
        The reason is as follows.

        F.Linear
        This is the standard linear transformation function, y = Wx + b.

        F.EmbedID
        This is also a linear function, but it differs from F.Linear in that
        it takes an integer (int32) input and builds a one-hot vector representation internally.
        For example, if the input integer ranges from 0 to 2, the vector lives in a three-dimensional space. So,
        0 becomes (1, 0, 0)
        1 becomes (0, 1, 0)
        2 becomes (0, 0, 1)
        Note that the input of this IRNN is a pixel, which is an integer from 0 to 255.
        Let the one-hot vector be x.
        Then the output of the function is y = Wx, with
        no bias b.

        Based on the description above, let's think about the IRNN in the paper.

        The point is that the input is a pixel, which is an integer from 0 to 255. So you can't put the integer value into F.Linear directly; you need to convert it into a one-hot vector representation. Also, according to the paper, there is no bias b on the input.

        As you can see, F.EmbedID automatically meets these two requirements (conversion to a one-hot vector, no bias). So I used F.EmbedID. Of course, you can use F.Linear if you want, but then you need to convert each pixel into a one-hot vector and fix the bias to the zero vector.

        Sorry, I didn't try your modification, but I hope this reply helps.
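The equivalence described in this reply (an embedding lookup gives the same result as a one-hot vector multiplied by a weight matrix with no bias) can be checked in plain NumPy. This is a sketch of the idea only, not chainer's implementation; the hidden size of 100 and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_values, n_hidden = 256, 100      # pixel values 0..255; hidden size is an assumption
W = rng.normal(0.0, 0.01, (n_values, n_hidden))

pixels = np.array([0, 2, 255], dtype=np.int32)   # a tiny batch of integer inputs

# Embedding lookup: pick row W[i] for each integer input i.
embedded = W[pixels]

# Same computation with explicit one-hot vectors and a bias-free linear map.
one_hot = np.eye(n_values)[pixels]  # shape (3, 256), a single 1 per row
linear = one_hot @ W

print(np.allclose(embedded, linear))  # True
```

The lookup avoids building the 256-dimensional one-hot vectors, which is why an embedding function is the more convenient choice for integer inputs.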

      2. This comment has been removed by the author.

      3. Thank you very much. I really appreciate your help.

        My code and my dataset are at the following link:

        I fixed my error.
        Because my dataset consists of floating-point values, I didn't use EmbedID. I hope that my code is correct.

        Now, my question is:
        As you said, the input of your IRNN is a pixel, so you use the following code in the "forward" method:

        def forward
        for pixel_data in images.T:
        pixel = Variable(np.int32(pixel_data),volatile=volatile)
        h = F.relu(model.pixel_to_h(pixel) + model.h_to_h(h))

        y = model.h_to_y(h)

        But when I use this section in my code, I get the following error:
        File "E: \chainer-master\chainer\functions\connection\", line 111, in forward
        Wx =
        ValueError: shapes (128,1) and (3,100) not aligned: 1 (dim 1) != 3 (dim 0)

        For this reason, I use the whole "images" array inside F.relu, that is:

        def forward
        dataVar = Variable(images,volatile=volatile)
        h = F.relu(model.pixel_to_h(dataVar) + model.h_to_h(h))
        y = model.h_to_y(h)

        Is it correct?

        The reason for my question is that the loss value in each epoch is too high, about 0.7!
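The two shape errors quoted above are consistent with a mismatch between the per-step input and the input weight matrix. The NumPy sketch below assumes, hypothetically, that the three float features are meant to be fed one per time step; in that case each step's input must be a 2-D array of shape (batch, 1) and the input weights must have shape (1, 100) rather than (3, 100). It also computes the softmax cross-entropy of a two-class model that always predicts 50/50, as a reference point for a loss near 0.7. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((128, 3)).astype(np.float32)   # 128 examples, 3 float features

W_in = rng.normal(0.0, 0.01, (1, 100)).astype(np.float32)  # one input value per step
W_hh = np.identity(100, dtype=np.float32)                  # identity recurrent weights

h = np.zeros((128, 100), dtype=np.float32)
for feature in batch.T:              # feature has shape (128,), which is 1-D
    x = feature.reshape(-1, 1)       # (128, 1): a 2-D batch, one value per example
    h = np.maximum(0.0, x @ W_in + h @ W_hh)
print(h.shape)  # (128, 100)

# Cross-entropy of a 2-class classifier that outputs probability 0.5 for each class.
chance_loss = -np.log(0.5)
print(round(float(chance_loss), 3))  # 0.693
```

Passing the whole (128, 3) array in a single call, as in the snippet above, applies the recurrence for only one step, so it behaves like an ordinary feedforward layer. A loss that stays near 0.693 on binary labels is what a model would produce if it had not yet learned anything, though other causes are possible.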

      4. Another question:

        Based on the IRNN paper: initialize the recurrent weight matrix with the identity matrix.

        But suppose I use, for example, the following layers for my network (3-100-50-2); in other words, the input layer is 3-100, the recurrent layer is 100-50, and the output layer is 50-2. In this case, how can I use an identity matrix to initialize the recurrent weights?

        Thanks in advance for any help with this.
        Best Regards,
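One way to read the 3-100-50-2 layout: a recurrent weight matrix maps a hidden state to the same hidden state at the next time step, so it is always square and can be set to the identity. The sketch below assumes a stacked arrangement that the paper does not describe, with two recurrent layers (100 and 50 units) joined by an ordinary 100-to-50 feedforward matrix. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 3, 100, 50, 2

W_in = rng.normal(0.0, 0.01, (n_in, n_h1))     # input -> hidden 1
W_h1h1 = np.identity(n_h1)                     # recurrent, layer 1: 100 x 100
W_h1h2 = rng.normal(0.0, 0.01, (n_h1, n_h2))   # feedforward, hidden 1 -> hidden 2
W_h2h2 = np.identity(n_h2)                     # recurrent, layer 2: 50 x 50
W_out = rng.normal(0.0, 0.01, (n_h2, n_out))   # hidden 2 -> output

def step(x, h1, h2):
    """One time step of a two-layer ReLU RNN with identity recurrent weights."""
    h1 = np.maximum(0.0, x @ W_in + h1 @ W_h1h1)
    h2 = np.maximum(0.0, h1 @ W_h1h2 + h2 @ W_h2h2)
    return h1, h2

x_seq = rng.random((5, 8, n_in))   # 5 time steps, batch of 8
h1 = np.zeros((8, n_h1))
h2 = np.zeros((8, n_h2))
for x in x_seq:
    h1, h2 = step(x, h1, h2)
logits = h2 @ W_out
print(logits.shape)  # (8, 2)
```

The non-square 100-to-50 matrix is not a recurrent connection in this reading, so it gets the small Gaussian initialization like the other non-recurrent weights.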