Specifically, I tried a new kind of recurrent neural network (RNN) called IRNN, described in Hinton's recent paper "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units." It has been difficult to train RNNs to learn such long dependencies, but IRNN overcomes this by initializing the recurrent weights with the identity matrix and using ReLU as the activation function. Awesome!

In this post, I will write about my experiment with IRNN to recognize MNIST digits by feeding 784 pixels to the recurrent net in sequential order (the experiment in section 4.2 of the paper).

The techniques and best parameter values in the paper are:

- Initialize the recurrent weight matrix with the identity matrix
- Initialize the other weight matrices with samples from a Gaussian distribution with mean 0 and standard deviation (std) of 0.001
- Use ReLU as the activation function
- Train the network using SGD
- Learning rate: 10^-8, gradient clipping value: 1, and mini-batch size: 16

The other settings are the same as in the paper.
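Translated into code, the initialization scheme above can be sketched roughly like this in numpy (the variable names and sizes here are my own illustration, not the actual implementation):

```python
import numpy as np

hidden_size = 100  # hidden units used in the paper's MNIST experiment
n_classes = 10     # ten digit classes

# Recurrent weights: identity matrix; recurrent bias: zero
W_hh = np.identity(hidden_size).astype(np.float32)
b_h = np.zeros(hidden_size, dtype=np.float32)

# Other weights: Gaussian with mean 0 and std 0.001
W_hy = np.random.normal(0, 0.001, (n_classes, hidden_size)).astype(np.float32)

def relu(x):
    return np.maximum(x, 0)

# One recurrent step is h_t = ReLU(W_xh x_t + W_hh h_{t-1} + b_h);
# with W_hh = I, the hidden state is initially carried forward unchanged.
```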

The problem is that it takes 50 minutes to run each epoch (one forward and backward pass over the whole dataset) on my local environment (CPU). Perhaps it's better to buy a GPU or use an AWS GPU instance. Anyway, I have been running it on the CPU for two days so far! The results are shown in the following figure. Though the learning is very slow, the net definitely learns! Cool! I will continue my experiment.

In the paper, they continue learning for up to 1,000,000 steps. At first, I thought one step was just one update of the parameters by 16 examples (one batch), but after trying it myself, I started to think that a step is not just one update but the updates over a whole dataset. I am not sure what the word (step or iteration in the paper) literally means, but if it is just one update, my plot above cannot be explained when compared with the plot in the paper.

Sometimes technical words in deep learning or machine learning are confusing to foreigners like me who did not study science in English. I am sometimes not sure whether epoch, iteration, or step means just one batch update or all the updates over a whole dataset. Does it depend on the situation, or is there a clear distinction? Does anybody know?

Anyway, I successfully and relatively easily implemented IRNN for pixel-by-pixel MNIST. I think chainer made a huge difference. Implementing a recurrent net is easier with chainer than with other popular deep learning libraries such as Theano or Torch.

I'll finish this post with my implementation code. In the next post, I may explain how to set up chainer (though it's super easy) and describe the code.

Wow, great work :) If you find out what epoch exactly means, please explain it to me too!

Good luck.

From what I have read so far, I am now sure that an epoch must be the updates of parameters over the whole training data, not just one update from a single batch.

I am still not sure when it comes to iteration or step. At least in this paper, I guess it's the same as epoch, comparing my plot and the plot in the paper.

An epoch is a complete pass over all available data. If you get N batches when you divide your dataset into batches of size M, all M*N examples must be used in training to constitute one epoch. It is basically the number of times you show a particular example to your model.
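To make that concrete with the numbers from this post (MNIST's 60,000 training examples, which I am assuming here, and the paper's mini-batch size of 16):

```python
num_examples = 60000   # MNIST training set size (my assumption)
batchsize = 16         # mini-batch size from the paper

# One iteration = one parameter update from one mini-batch.
# One epoch = enough iterations to visit every example once.
iterations_per_epoch = num_examples // batchsize
print(iterations_per_epoch)  # 3750
```

So if "step" in the paper meant a single update, 1,000,000 steps would be fewer than 300 epochs under these numbers.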

Thanks!

Hi

Thank you for your excellent code. I have some questions about it and I need your help.

1) I have a dataset with the following specification:

x_all is a matrix of dimension 3*5000 with float data like [1.660513 4.532905 0.13058 ; 5.107513 6.365503 4.571937 ; ...]

y_all is a matrix of dimension 1*5000 with only 0 or 1 digits, like [1;1;0;0;0; ...]

and there is no test dataset.

I've changed the following lines in the code:

1) x_train = np.array(data).astype(np.float32)

2) y_train = np.array(label).astype(np.int32)

3) model.pixel_to_h = F.Linear(3,100)

4) model.pixel_to_h.W = np.random.normal(0, 0.01,(3, 100)).astype(np.float32)

5) model.h_to_h = F.Linear(100,100)

6) model.h_to_h.W=np.identity(100).astype(np.float32)

7) model.h_to_y = F.Linear(100,1)

8) batchsize = 128

9) num_train_data=5000

But I get the following error:

File "E:/RecurrentNN_MNIST_Test.py", line 46, in forward
    h = F.relu(model.pixel_to_h(pixel) + model.h_to_h(h))
File "E:\python\third party\chainer-master\chainer\function.py", line 172, in __call__
    self._check_data_type_forward(in_data)
File "e:\python\third party\chainer-master\chainer\function.py", line 199, in _check_data_type_forward
    self.check_type_forward(in_type)
File "e:\python\third party\chainer-master\chainer\functions\connection\linear.py", line 101, in check_type_forward
    type_check.Variable(self.W.shape[1], 'W.shape[1]')),
File "e:\python\third party\chainer-master\chainer\utils\type_check.py", line 457, in expect
    expr.expect()
File "e:\python\third party\chainer-master\chainer\utils\type_check.py", line 428, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
chainer.utils.type_check.InvalidType: Expect: in_types[0].ndim >= 2
Actual: 1 < 2

Also, do I need to use EmbedID with this dataset? And generally, why do we use the embedding?

Thank you very much.

Best Regards,

Lida

I tried to run the code again on the latest chainer (1.3), but I couldn't. Something is wrong with the type check. I wrote this code for the first version of chainer (1.0, I think), before type checking was introduced.

Now I have modified the code. It works on chainer 1.3 with CPU.

You can check revisions here:

https://gist.github.com/apple2373/a4753b26672fc36f58d9/revisions

As for the choice of function, I intentionally chose F.EmbedID to make the coding easier.

The reason is as follows.

F.Linear

http://docs.chainer.org/en/stable/reference/functions.html?highlight=linear#chainer.functions.Linear

This is the standard linear transformation function:

Y = X.dot(W.T) + b

F.EmbedID

http://docs.chainer.org/en/stable/reference/functions.html?highlight=chainer.functions.embedid#chainer.functions.EmbedID

This is also a linear function, but it differs from F.Linear in that it takes integer (int32) input and makes a one-hot-vector representation internally.

For example, if the input integers range from 0 to 2, the vectors will be in three-dimensional space. So,

0 becomes (1, 0, 0)

1 becomes (0, 1, 0)

2 becomes (0, 0, 1)

Note that the input of this IRNN is a pixel, which is an integer from 0 to 255.

Let the one-hot vector be x.

Then the output of the function is:

Y = x.dot(W.T)

with no bias b.

Based on the description above, let's think about the IRNN in the paper.

The point is that the input is a pixel, which is an integer from 0 to 255. So you can't put an integer value into F.Linear directly; you need to convert it into a one-hot-vector representation. Also, based on the paper, there is no bias b on the input.

Now you see that F.EmbedID automatically meets these two requirements (conversion to a one-hot vector, no bias), so I used F.EmbedID. Of course, you can use F.Linear if you want, but then you need to convert each pixel into a one-hot vector and fix the bias at the zero vector.
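The equivalence described above can be checked with a small numpy sketch (the sizes and the embedding table here are my own illustration; F.EmbedID stores its table internally):

```python
import numpy as np

vocab = 256    # pixel intensities 0..255
hidden = 100   # hidden-layer size (my assumption for illustration)

# Embedding table: row i is the output for input integer i
W = np.random.normal(0, 0.001, (vocab, hidden)).astype(np.float32)

def one_hot(ids, size):
    out = np.zeros((len(ids), size), dtype=np.float32)
    out[np.arange(len(ids)), ids] = 1.0
    return out

pixels = np.array([0, 17, 255])           # a small batch of integer pixels
y_linear = one_hot(pixels, vocab).dot(W)  # linear layer on one-hot input, no bias
y_embed = W[pixels]                       # EmbedID-style direct row lookup
print(np.allclose(y_linear, y_embed))     # True
```

The dot product with a one-hot vector just selects one row of the table, which is why the embedding lookup gives the same result with no bias term.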

Sorry, I didn't try your modification, but I hope this reply helps.


Thank you very much. I really appreciate your help.

My code (chainer-test2.py) and my dataset are at the following links:

https://drive.google.com/file/d/0B5QzBOKhSsdgZjdoQWVrRkVVS3c/view?usp=sharing

https://drive.google.com/file/d/0B5QzBOKhSsdgb0I1MG9EenR0dmc/view?usp=sharing

I fixed my error.

Because my dataset contains floating-point values, I didn't use EmbedID. I hope my code is correct.

Now, my question is:

As you said, the input of your IRNN is a pixel, so you use the following code in the "forward" method:

def forward
……
    for pixel_data in images.T:
        pixel = Variable(np.int32(pixel_data), volatile=volatile)
        h = F.relu(model.pixel_to_h(pixel) + model.h_to_h(h))
    y = model.h_to_y(h)

But when I use this section in my code, I get the following error:

File "E:\chainer-master\chainer\functions\connection\linear.py", line 111, in forward
    Wx = x.dot(self.W.T)
ValueError: shapes (128,1) and (3,100) not aligned: 1 (dim 1) != 3 (dim 0)

For this reason, I use the whole "images" in F.relu, that is:

def forward
……
    dataVar = Variable(images, volatile=volatile)
    h = F.relu(model.pixel_to_h(dataVar) + model.h_to_h(h))
    y = model.h_to_y(h)

Is it correct?

The reason for my question is that the loss value in each epoch is too high, about 0.7!
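The shape arithmetic behind that "not aligned" error can be sketched in plain numpy (assuming chainer 1.x stores a Linear layer's weight as (out_size, in_size), so F.Linear(3, 100) holds a (100, 3) matrix):

```python
import numpy as np

batchsize = 128

# F.Linear(3, 100) stores its weight as (out_size, in_size) = (100, 3)
W = np.random.normal(0, 0.01, (100, 3)).astype(np.float32)

# Slicing one feature column per timestep yields shape (128, 1);
# (128, 1).dot((3, 100)) is exactly the "not aligned" error above.

# Feeding all three features at once gives shape (128, 3), which aligns:
x_batch = np.random.rand(batchsize, 3).astype(np.float32)
h_in = x_batch.dot(W.T)   # (128, 3) . (3, 100) -> (128, 100)
print(h_in.shape)         # (128, 100)
```

So with three features per example, each input to the layer must be a (batchsize, 3) matrix rather than a single column.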

My other question is:

Based on the IRNN paper: initialize the recurrent weight matrix with the identity matrix.

But if I use, for example, the following layers for my network (3-100-50-2), in other words, the input layer is 3-100, the recurrent layer is 100-50, and the output layer is 50-2, how do I use the identity matrix to initialize the recurrent network in this case?

Thanks in advance for any help with this.

Best Regards,

Lida