Monday, June 20, 2016

Pre-deep aspect of deep learning

When you play with deep learning, you need data. In most examples, it is prepared for you (e.g., MNIST or CIFAR-10). For example, in the TensorFlow tutorial:

# From the TensorFlow MNIST tutorial; x, y_, train_step, and sess
# are defined earlier in that tutorial.
for i in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)  # draw a mini-batch of 100 examples
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

“mnist.train.next_batch(100)”!! This is great. But how do you do this with your own data? That’s the issue I call “pre-deep”. It’s ironic to realize you are spending most of your time on the preprocessing part when what you actually want is to enjoy deep learning on your data. It shouldn’t be like that. My philosophy is that most of the time should be spent on the “deep” part: finding the best model architectures and parameters. Preprocessing should be quick and easy.

I summarize my thoughts on the issue here. This is a kind of brainstorm to determine what kind of library I want. If developers of deep learning libraries see this, please consider including these features; they are what most users puzzle over when they actually try their own data. Perhaps it’s better to separate the preprocessing library from the framework, so that users can also use it with other deep learning frameworks.

Let’s use image classification as an example. Let x be a Python instance that hides everything (i.e., I want to write “x.train.next_batch(100)” like in the TensorFlow example; a minimal sketch of such an object appears after the feature list below). The library would have the following functions.

User Input: We need to fix the input data format used to initialize x with our dataset. I can think of two ways (a parsing sketch follows the list):

 1: A text file with a path and a label, i.e., each line is
[path/to/image.jpeg], [label]
 2: A directory per class. The directory tree will look like:
root_dir/
    dog/
        dog001.jpg
        dog002.jpg
    cat/
        cat001.jpg
        cat002.jpg
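
As a concrete example of the parsing step, here is a minimal sketch in Python. The function name build_index and its arguments are my own invention, not an existing API; it just turns either input format into a list of (path, label) pairs.

import os

def build_index(source, from_directory=True):
    # Hypothetical helper: turn either input format into (path, label) pairs.
    samples = []
    if from_directory:
        # Format 2: one sub-directory per class under the root directory.
        for label in sorted(os.listdir(source)):
            class_dir = os.path.join(source, label)
            if not os.path.isdir(class_dir):
                continue
            for fname in sorted(os.listdir(class_dir)):
                samples.append((os.path.join(class_dir, fname), label))
    else:
        # Format 1: a text file with "path, label" on each line.
        with open(source) as f:
            for line in f:
                path, label = [s.strip() for s in line.rsplit(',', 1)]
                samples.append((path, label))
    return samples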
 What users can tune:

  • Network Input Size: fixed (e.g., 224×224), flexible up to a maximum side length, or only a fixed aspect ratio.
  • Mean Image Subtraction: whether to precompute the mean image in advance and subtract it.
  • Dataset Split: split the dataset into train, validation, and test sets. Sometimes the data is already split by the provider; in that case, the user can specify the split when loading the data.
  • Shuffle Dataset: after each epoch, the order of the data should be reshuffled.
  • Data Augmentation: the user can specify augmentations (flip, rotate, noise, etc.) applied either on the fly or in a preprocessing pass. Keras seems to have a similar feature.
  • In Memory or Not: a really large dataset does not fit in memory, so there should be an option for online loading.
  • Image Storing Methods: storing raw image files on disk is not very efficient; some people use the LMDB or HDF5 format, so the library should be able to use them internally.
  • Feature Extraction: sometimes we only need features from intermediate layers, such as after conv5_3 in VGG or before the softmax in GoogLeNet. The library should have a function to precompute these features.
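
To make the wishlist concrete, here is a rough sketch of what the object behind x.train might look like, covering a few of the items above: a fixed input size, mean image subtraction, reshuffling at each epoch boundary, and online loading from disk. The class name, arguments, and behavior are all my assumptions, not any existing library’s API.

import random
import numpy as np
from PIL import Image

class DatasetSplit(object):
    # Hypothetical sketch of the object behind x.train / x.val / x.test.
    def __init__(self, samples, size=(224, 224), mean=None, shuffle=True):
        self.samples = list(samples)  # (path, label) pairs, e.g. from build_index
        self.size = size              # fixed network input size
        self.mean = mean              # precomputed mean image, or None
        self.shuffle = shuffle
        self.cursor = 0
        if self.shuffle:
            random.shuffle(self.samples)

    def _load(self, path):
        # Online loading: read and resize one image only when it is needed.
        img = Image.open(path).convert('RGB').resize(self.size)
        img = np.asarray(img, dtype=np.float32)
        if self.mean is not None:
            img -= self.mean          # mean image subtraction
        return img

    def next_batch(self, n):
        if self.cursor + n > len(self.samples):
            self.cursor = 0           # epoch boundary: reshuffle and restart
            if self.shuffle:
                random.shuffle(self.samples)
        batch = self.samples[self.cursor:self.cursor + n]
        self.cursor += n
        xs = np.stack([self._load(p) for p, _ in batch])
        ys = np.array([label for _, label in batch])  # map labels to int ids in practice
        return xs, ys

Usage with the earlier helper would then be train = DatasetSplit(build_index('root_dir/')) followed by xs, ys = train.next_batch(100), which is exactly the interface the TensorFlow tutorial shows for MNIST.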

Note that this is just the simplest case, with only one label per image, but I still couldn’t find an easy library for it. I know Caffe has some of the features I mentioned above, but they are highly customized for Caffe. Moreover, Caffe is too complicated to install; especially if you do not have root privileges, as on a university server, it’s almost impossible.

There are more complicated situations in reality. I assumed image classification with one label for now because the purpose of this post is to point out the need. Once we have a library for that, we can think about extensions to handle other settings: localization, multiple labels, multiple localizations (i.e., object detection), or even detection with multiple labels. Moreover, you might have captions or question-answer pairs attached to images.

Another story is text processing with RNNs. One famous issue is how to build a batch from sequences of different lengths (e.g., Keras has a padding function). But padding sequences of completely different sizes is not good, because the padded part is a waste of computational resources. So the common practice is to build each batch from sequences of almost the same length, or sometimes only from sequences of exactly the same length. This requires pre-indexing by sequence length. Moreover, after training, we need beam search to generate sequences. These parts should also be included in the library.
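
As an illustration of that pre-indexing, here is a small sketch that buckets sequences by length and pads only within each batch, so little computation is wasted on padding. The function and its parameters are hypothetical; sequences are assumed to be lists of token ids.

import random
from collections import defaultdict

def bucketed_batches(sequences, batch_size, bucket_width=5, pad=0):
    # Pre-index sequences by length so each batch mixes only similar lengths.
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq) // bucket_width].append(seq)
    batches = []
    for group in buckets.values():
        random.shuffle(group)
        for i in range(0, len(group), batch_size):
            chunk = group[i:i + batch_size]
            max_len = max(len(s) for s in chunk)
            # Pad only up to the longest sequence in this batch.
            batches.append([s + [pad] * (max_len - len(s)) for s in chunk])
    random.shuffle(batches)  # avoid feeding buckets in a fixed order
    return batches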

Lastly, I know this is not research, just an engineering product. But it is true that researchers’ time is wasted on this routine part. I want to argue that there should be a unified library that lets you do preprocessing without thinking about the details, just as we rarely implement backpropagation ourselves these days.

Feel free to comment with more features that you think a preprocessing library should have.

Friday, January 1, 2016

Released a Japanese image caption generation model

Happy New Year. It’s been a while since I last wrote in Japanese.

I machine-translated the training data for English caption generation (MS COCO) and then trained on it. The algorithm is exactly the same as for English. Since captions are short sentences, machine translation seems to yield reasonably good captions.

The model can be downloaded via GitHub, the same as the English one. For details, see the section "I want to generate Japanese caption."
 https://github.com/apple2373/chainer_caption_generation

Samples