Scene Classification


This project report was compiled as part of the course requirements of CS 6476 Computer Vision under Prof. James Hays.

In this project we use convolutional neural networks to classify scenes into categories. In Part 1, we train a neural network from scratch; in Part 2, we take the pretrained VGG network and fine-tune it on our data.

Part 1

To start off, we build a CNN from the ground up and train it.

Data Jittering

Because we do not have enough data, we augment the data set by jittering the images (a minimal sketch follows the list):

  • Image flipping: flip the image horizontally
  • Random image rotation: rotate the image slightly
  • Random image scaling: scale the image slightly
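
A minimal MATLAB sketch of this jittering, applied to a training image as it is fetched. The helper name and the jitter magnitudes are illustrative assumptions, not the exact values used in the experiments:

function im = jitter_image(im)
% Randomly flip, rotate, and scale one image (illustrative magnitudes).
    if rand > 0.5
        im = flip(im, 2);                      % horizontal flip
    end
    ang = 10 * (rand - 0.5);                   % rotation in [-5, 5] degrees
    im  = imrotate(im, ang, 'bilinear', 'crop');
    [h, w, ~] = size(im);
    s  = 1 + 0.2 * (rand - 0.5);               % scale factor in [0.9, 1.1]
    im = imresize(imresize(im, s), [h w]);     % scale, then restore original size
end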

Zero Centering

To normalize the data set, we subtract the mean image of the data set from each image.
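
A two-line sketch of this step, assuming the images are stored as an H x W x C x N single array as in the project's imdb structure:

mean_image = mean(images, 4);                 % average over all N images
images = bsxfun(@minus, images, mean_image);  % subtract it from every image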

Network Regularization

Regularization is used to fight overfitting of the model to the training data. We do this with a dropout layer, which randomly switches connections between layers on and off during training. This prevents a unit in one layer from relying too strongly on any single unit in the previous layer.
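
In MatConvNet's SimpleNN format this is a single layer (rate 0.5, matching the architecture table below):

% Randomly zero each activation with probability 0.5 during training;
% dropout is bypassed in test mode.
net.layers{end+1} = struct('type', 'dropout', 'rate', 0.5);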

Network Architecture

We train a network to categorize images into the 15 scene categories. The following is the architecture of the network:

Layer  Type      Details                                Output size
0      Input     input image                            64x64x3
1      Conv      filter size 9, stride 1, 10 filters    56x56x10
2      BNorm     -                                      56x56x10
3      MaxPool   pool size 6, pad 5, stride 6           11x11x10
4      ReLU      -                                      11x11x10
5      Conv      filter size 38, stride 6, 15 filters   7x7x15
6      BNorm     -                                      7x7x15
7      MaxPool   pool size 6, pad 5, stride 6           3x3x15
8      ReLU      -                                      3x3x15
9      Dropout   rate 0.5                               3x3x15
10     FullConv  filter size 3, 15 filters              1x1x15
11     SoftMax   -                                      1x1x15
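
As a reference for how this table maps to code, here is a minimal sketch of layers 1-4 in MatConvNet's SimpleNN format (the second conv/bnorm/pool/relu stage is built the same way). The initialization scale f and the bnorm weight layout follow MatConvNet's example networks and are assumptions, not the exact values used here:

f = 1/100;                                     % assumed initialization scale
net.layers = {};
net.layers{end+1} = struct('type', 'conv', ... % layer 1: 9x9 conv, 10 filters
    'weights', {{f*randn(9,9,3,10,'single'), zeros(1,10,'single')}}, ...
    'stride', 1, 'pad', 0);
net.layers{end+1} = struct('type', 'bnorm', ...% layer 2: batch normalization
    'weights', {{ones(10,1,'single'), zeros(10,1,'single'), zeros(10,2,'single')}});
net.layers{end+1} = struct('type', 'pool', ... % layer 3: 6x6 max pool
    'method', 'max', 'pool', [6 6], 'stride', 6, 'pad', 5);
net.layers{end+1} = struct('type', 'relu');    % layer 4: ReLU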

Result

The given network produces an accuracy of 57.8% with a learning rate of 0.01 over 50 epochs.


Architecture Experimentation

The following networks are modified versions of the network specified in the project description. A constant learning rate of 0.001 was used for each of them. The networks differ only in the number of filters in the second conv layer (layer 5), which sets the input depth of the FullConv layer (layer 10, Full Conv1); this is the "resolution" referred to in the headings below.

Full Conv Layer Resolution: 8

This variant is identical to the base network above, except that the layer-5 conv uses 8 filters, so layers 5 through 9 produce feature maps of depth 8 (7x7x8 down to 3x3x8) feeding the 15-way FullConv layer.

Results

Lowest validation error for the above network: 0.494000


Full Conv Layer Resolution: 10

This variant uses 10 filters in the layer-5 conv (feature maps of 7x7x10 down to 3x3x10); all other layers are unchanged.

Results

Lowest validation error for the above network: 0.457333


Full Conv Layer Resolution: 12

This variant uses 12 filters in the layer-5 conv (feature maps of 7x7x12 down to 3x3x12); all other layers are unchanged.

Results

Lowest validation error for the above network: 0.436000


Full Conv Layer Resolution: 18

This variant uses 18 filters in the layer-5 conv (feature maps of 7x7x18 down to 3x3x18); all other layers are unchanged.

Results

Lowest validation error for the above network: 0.412667


Full Conv Layer Resolution: 19

This variant uses 19 filters in the layer-5 conv (feature maps of 7x7x19 down to 3x3x19); all other layers are unchanged.

Results

Lowest validation error for the above network: 0.416667


Full Conv Layer Resolution: 20

This variant uses 20 filters in the layer-5 conv (feature maps of 7x7x20 down to 3x3x20); all other layers are unchanged.

Results

Lowest validation error for the above network: 0.436000


Collated Results

The following results are with a learning rate of 0.001 over 50 epochs. Note that the two middle columns report error rates, not accuracies.

Layer-5 filters   Lowest top-1 error   Final top-5 error   Execution time (s)
8                 0.494                0.103               245.84
10                0.457                0.094               241.91
12                0.436                0.105               239.26
18                0.413                0.083               238.09
19                0.417                0.097               240.48
20                0.436                0.085               240.52

Learning Rate Experiments

Once the best number of layer-5 filters was found (18, from the table above), the learning rate was chosen by experimenting with multiple values. (The learning-rate curve plots are not reproduced here.)

Learning rate   Lowest validation error
10^-5           0.758667
10^-4           0.509333
10^-3           0.412667
10^-2           0.412667
10^-1           0.412667

Part 2: Transfer Learning

A notable feature of deep learning is that networks trained for one task can be repurposed for other, similar tasks; with minor changes and fine-tuning, they give good results on the "transferred" task.

This can be especially useful when the data set available for a certain task is too small.

We modify and fine-tune the pretrained VGG network to classify our 15-category data set.

The base model is the original VGG-16 network.

Architecture 1

The following is the first architecture I tried for Part 2. The architecture is based on the VGG network. To build it, the following changes were made to the base network (an input-preparation sketch follows the list).

  • VGG expects 224 x 224 images, so resize the inputs to this resolution.
  • VGG expects 3-channel (RGB) images, so update the input code to repeat the grayscale image in all three channels.
  • Normalize the images.
  • Replace the existing fully conv layer in the last stage of VGG so that it returns a 1 x 15 vector instead of the 1 x 1000 vector.
  • Replace the softmax layer (so that it matches the new 15-way output rather than the pretrained 1000-way one).
  • Add dropout layers between fc6 - fc7 and fc7 - fc8 (i.e. the new fc8).
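
A sketch of the input preparation, assuming im is a grayscale image and net is the loaded MatConvNet VGG-16 model (the meta field layout follows recent MatConvNet releases; older releases expose net.normalization instead):

im = single(im);
im = imresize(im, net.meta.normalization.imageSize(1:2));     % resize to 224 x 224
im = repmat(im, [1 1 3]);                                     % repeat gray channel -> RGB
im = bsxfun(@minus, im, net.meta.normalization.averageImage); % zero-center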

The results of this architecture follow.

The fully convolutional layer is replaced by a new (randomly initialized) FC layer with weights of size 4096 x 15.
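
A sketch of the swap in SimpleNN form (in the pretrained model the last two layers are fc8 and the softmax; the 0.01 initialization scale is an assumption):

net.layers{end-1} = struct('type', 'conv', ...    % new randomly initialized 15-way fc8
    'weights', {{0.01*randn(1,1,4096,15,'single'), zeros(1,15,'single')}}, ...
    'stride', 1, 'pad', 0);
net.layers{end} = struct('type', 'softmaxloss');  % loss layer used during training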

Result

The above network was retrained with a backprop depth of 9 (only the last 9 layers were updated), and an accuracy of 87.6% was achieved in 5 epochs with a constant learning rate of 0.001.
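
The retraining call looks roughly like the following (cnn_train is MatConvNet's reference trainer, getBatch is the project's batch loader, and backPropDepth limits updates to the trailing layers; exact option names can vary across MatConvNet versions):

[net, info] = cnn_train(net, imdb, @getBatch, ...
    'learningRate', 0.001, ...   % constant learning rate
    'numEpochs', 5, ...
    'backPropDepth', 9);         % update only the last 9 layers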


Architecture 1 (with image size jittering)

The above architecture was used with image size jittering to produce the following results:

Result

I added some image size jittering in this part, hoping to gain an increase in accuracy, but it didn't affect the results much.

Accuracy: 86.87% in 5 epochs and 87.3% in 10 epochs


Architecture 2

I tried to add normalization after each conv layer, but this performed far worse.

Result

Validation Accuracy: 50.6%


Architecture 3 with Backprop Depth: 5

Replace the last fully convolutional layer (FC-8) with a new (randomly initialized) FC layer with weights of size 1024 x 15, and replace FC-7 with a new (randomly initialized) FC layer with weights of size 4096 x 1024.

Result

Validation Accuracy in 5 epochs: 87.7% (756.40 secs)


Architecture 3 with Backprop Depth: 6

Result

Validation Accuracy: 86.2% in 4 epochs (537.44 seconds)


Architecture 4

Replaced all fully convolutional layers with a single fully convolutional layer.

Result

Validation Accuracy: 82.66% in 5 epochs (588.709 seconds)


Collated Results

The following results are with a learning rate of 0.001 over 5 epochs.

Network                                  Accuracy
Architecture 1                           87.6%
Architecture 1 (image size jittering)    86.8%
Architecture 2                           50.6%
Architecture 3 (backprop depth 5)        87.7%
Architecture 3 (backprop depth 6)        86.2%
Architecture 4                           82.6%

Extra credit

Image Sketch Token Recognition

We use the data set provided as part of the article: How Do Humans Sketch Objects?

Setup

  • Download the image dataset; it contains 250 image categories with 80 images in each category.
  • The images are named as numbers.
  • To split it into training and test sets, I duplicated the dataset folder into train and test copies.
  • In train, I delete all images whose filenames end in an odd digit; in test, all those ending in an even digit.

The following shell script can be used to split the dataset into the training and test sets:

cp -r dataset train
mv dataset test

# keep only even-numbered sketches for training
find train/ -name "*[13579].png" -delete

# keep only odd-numbered sketches for testing
find test/ -name "*[02468].png" -delete

I added proj6_extra_credit_setup_data.m, proj6_extra_credit.m, and proj6_extra_credit_cnn_init.m for the extra credit section.

I used the VGG network from Part 2 for this task, with minor modifications:

  • Data set: As opposed to the given dataset, which had 15 categories with 200 images each, the sketch tokens data set has 250 categories with 80 images each. VGG requires that each image be 224 x 224 pixels. Due to memory constraints, I chose 50 categories at random and train/test my network on these categories.
  • Network changes: I used the exact same network from Part 2, but changed the final layer to predict 250 categories instead of 15.

Results

Validation Accuracy: 74.75%

Hyper Parameters

  • Backprop depth: 5
  • Learning Rate: 0.001
  • Epochs: 10
  • Batch Size: 50
