This project report was compiled as part of the course requirements of CS 6476 Computer Vision under Prof. James Hays.
In this project we use convolutional neural networks to classify scenes into categories. In Part 1, we train a neural network from scratch, and in Part 2, we fine-tune the pretrained VGG network on our data.
Part 1
To start off, we build a CNN from the ground up and train it.
Data Jittering
As we do not have enough data, we perform data jittering to augment our data set (a short code sketch follows the list). We jitter by:
- Image flipping: Flip the image horizontally
- Random Image rotation: Rotate the image slightly
- Random Image scaling: Scale the image slightly
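A minimal sketch of how this jittering might be applied to a batch of images in MATLAB; the function name, the batch layout, and the rotation/scaling ranges are assumptions for illustration, not the project's exact code:

```matlab
function im = jitter_batch(im)
% im: H x W x C x N single array holding one batch of training images.
% Each jitter is applied independently per image with probability 0.5.
[h, w, ~, n] = size(im);
for i = 1:n
    if rand > 0.5                                % horizontal flip
        im(:, :, :, i) = flip(im(:, :, :, i), 2);
    end
    if rand > 0.5                                % slight random rotation
        theta = 20*rand - 10;                    % +/- 10 degrees (assumed range)
        im(:, :, :, i) = imrotate(im(:, :, :, i), theta, 'bilinear', 'crop');
    end
    if rand > 0.5                                % slight random up-scaling
        s = 1 + 0.1*rand;                        % scale in [1, 1.1] (assumed range)
        big = imresize(im(:, :, :, i), s);
        r = floor((size(big, 1) - h)/2) + 1;     % central crop back to H x W
        c = floor((size(big, 2) - w)/2) + 1;
        im(:, :, :, i) = big(r:r+h-1, c:c+w-1, :);
    end
end
end
```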
Zero centering
To normalize the data set, we subtract the average image of the data set from each image.
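With a MatConvNet-style imdb structure (the field names follow MatConvNet's examples and are an assumption here), this is two lines:

```matlab
% Subtract the average image of the data set from every image.
mean_image = mean(imdb.images.data, 4);
imdb.images.data = bsxfun(@minus, imdb.images.data, mean_image);
```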
Network Regularization
Regularization is used to fight overfitting of the model to the dataset. We do this with a dropout layer, which randomly zeroes unit activations during training. This prevents a unit in one layer from relying too strongly on a single unit in the previous layer.
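In MatConvNet's SimpleNN format, the dropout layer used here is declared as follows (the 0.5 rate matches the architecture table below):

```matlab
% Dropout: during training each activation is zeroed with probability
% 'rate'; at test time the layer acts as a pass-through.
net.layers{end+1} = struct('type', 'dropout', 'rate', 0.5);
```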
Network Architecture
We train a network to categorize images into the 15 scene categories. The following is the architecture of the network (a partial code sketch follows the table):

layer | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Type | Input | Conv | BNorm | MaxPool | ReLU | Conv | BNorm | MaxPool | ReLU | Dropout | FullConv | Soft Max |
Details | Input Image | Filter Size: 9, Stride: 1, Filters: 10 | | Pool Size: 6, Pad: 5, Stride: 6 | | Filter Size: 38, Stride: 6, Filters: 15 | | Pool Size: 6, Pad: 5, Stride: 6 | | Rate: 0.5 | Filter Size: 3, Filters: 15 | |
Dimensions | 64x64x3 | 56x56x10 | 56x56x10 | 11x11x10 | 11x11x10 | 7x7x15 | 7x7x15 | 3x3x15 | 3x3x15 | 3x3x15 | 1x1x15 | 1x1x15 |
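As a flavor of how rows of this table map onto MatConvNet's SimpleNN format, here is a sketch of layers 1-4 only; the weight initialization scale f is an assumption, not taken from the project code:

```matlab
f = 0.01;                                        % init scale (assumption)
net.layers = {};
% Layer 1: conv, 9x9 filters, stride 1, 10 filters (64x64x3 -> 56x56x10)
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{f*randn(9, 9, 3, 10, 'single'), zeros(1, 10, 'single')}}, ...
    'stride', 1, 'pad', 0);
% Layer 2: batch normalization over the 10 channels
% (multipliers, biases, and moments, as bnorm expects)
net.layers{end+1} = struct('type', 'bnorm', 'weights', ...
    {{ones(10, 1, 'single'), zeros(10, 1, 'single'), zeros(10, 2, 'single')}});
% Layer 3: 6x6 max pooling, stride 6, pad 5 (56x56x10 -> 11x11x10)
net.layers{end+1} = struct('type', 'pool', 'method', 'max', ...
    'pool', [6 6], 'stride', 6, 'pad', 5);
% Layer 4: ReLU
net.layers{end+1} = struct('type', 'relu');
```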
Result
The given network produces an accuracy of 57.8% with a learning rate of 0.01 over 50 epochs.
Architecture Experimentation
The following networks are modified versions of the network specified in the project description. A constant learning rate of 0.001 was used for each. The networks differ in the number of filters in the second conv layer (layer 5), which sets the depth of the feature map reaching the full conv layer (layer 10).
Full Conv Layer Resolution: 8
Identical to the base network above, except the second conv layer (layer 5) has 8 filters, so the feature maps from layer 5 through the dropout layer are 7x7x8 and 3x3x8, and the full conv layer maps 3x3x8 to 1x1x15.
Results
Lowest validation error for the above network: 0.494000
Full Conv Layer Resolution: 10
Identical to the base network above, except the second conv layer (layer 5) has 10 filters, so the feature maps from layer 5 through the dropout layer are 7x7x10 and 3x3x10, and the full conv layer maps 3x3x10 to 1x1x15.
Results
Lowest validation error for the above network: 0.457333
Full Conv Layer Resolution: 12
Identical to the base network above, except the second conv layer (layer 5) has 12 filters, so the feature maps from layer 5 through the dropout layer are 7x7x12 and 3x3x12, and the full conv layer maps 3x3x12 to 1x1x15.
Results
Lowest validation error for the above network: 0.436000
Full Conv Layer Resolution: 18
Identical to the base network above, except the second conv layer (layer 5) has 18 filters, so the feature maps from layer 5 through the dropout layer are 7x7x18 and 3x3x18, and the full conv layer maps 3x3x18 to 1x1x15.
Results
Lowest validation error for the above network: 0.412667
Full Conv Layer Resolution: 19
Identical to the base network above, except the second conv layer (layer 5) has 19 filters, so the feature maps from layer 5 through the dropout layer are 7x7x19 and 3x3x19, and the full conv layer maps 3x3x19 to 1x1x15.
Results
Lowest validation error for the above network: 0.416667
Full Conv Layer Resolution: 20
Identical to the base network above, except the second conv layer (layer 5) has 20 filters, so the feature maps from layer 5 through the dropout layer are 7x7x20 and 3x3x20, and the full conv layer maps 3x3x20 to 1x1x15.
Results
Lowest validation error for the above network: 0.436000
Collated Results
The following results are with a learning rate of 0.001 over 50 epochs. Note that the values reported are errors, not accuracies.

Full Conv Layer Resolution | Lowest top-1 error | Final top-5 error | Execution Time (s) |
---|---|---|---|
8 | 0.494 | 0.103 | 245.84 |
10 | 0.457 | 0.094 | 241.91 |
12 | 0.436 | 0.105 | 239.26 |
18 | 0.413 | 0.083 | 238.09 |
19 | 0.417 | 0.097 | 240.48 |
20 | 0.436 | 0.085 | 240.52 |
Learning Rate Experiments
Once the best resolution for the full conv layer was found, the learning rate was chosen by experimenting with multiple values (the sweep sketch follows the table).

Learning Rate | Lowest validation error |
---|---|
10^-5 | 0.758667 |
10^-4 | 0.509333 |
10^-3 | 0.412667 |
10^-2 | 0.412667 |
10^-1 | 0.412667 |

(The learning rate curve plots are omitted.)
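The sweep itself is a loop over cnn_train calls; a sketch, assuming imdb, getBatch, and cnn_init come from the project scaffolding:

```matlab
for lr = [1e-5 1e-4 1e-3 1e-2 1e-1]
    % Train a fresh network at each learning rate and save it separately.
    [net, info] = cnn_train(cnn_init(), imdb, @getBatch, ...
        'learningRate', lr, 'numEpochs', 50, ...
        'expDir', fullfile('data', sprintf('scenes-lr%g', lr)));
    % info.val holds the per-epoch validation errors (exact field names
    % vary slightly across MatConvNet versions).
end
```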
Part 2: Transfer Learning
A notable feature of deep learning is that networks trained for one task can be repurposed for other (similar) tasks; with minor changes and fine-tuning, they give good results on the "transferred" task.
This is especially useful when the data set available for a given task is small.
We fine-tune and modify the trained VGG network to classify our 15-category dataset.
The original VGG-16 network (architecture figure omitted).
Architecture 1
The following is the first architecture I tried for Part 2. It is based on the VGG network; to build it, the following changes were made to the base network (a code sketch follows the list).
- VGG expects 224 x 224 images, so the inputs are resized to that resolution.
- VGG expects 3-channel (RGB) images, so the input code is updated to repeat the grayscale image across all three channels.
- Normalize the images.
- Replace the existing fully conv layer in the last stage of VGG so that it returns a 1 x 15 vector instead of the 1 x 1000 vector.
- Replace the softmax layer (to discard the node values of the pretrained layer).
- Add dropout layers between fc6 and fc7, and between fc7 and fc8 (i.e., the new fc8).
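A sketch of this surgery in MatConvNet SimpleNN form; the model file name, the 0.01 init scale, and the splice indices are assumptions:

```matlab
net = load('imagenet-vgg-verydeep-16.mat');      % pretrained VGG-16

% Replace fc8 with a randomly initialized 4096 -> 15 classifier.
net.layers{end-1} = struct('type', 'conv', ...
    'weights', {{0.01*randn(1, 1, 4096, 15, 'single'), zeros(1, 15, 'single')}}, ...
    'stride', 1, 'pad', 0, 'name', 'fc8new');

% Replace the final softmax with a softmax log-loss for training.
net.layers{end} = struct('type', 'softmaxloss');

% A new dropout layer is spliced in similarly, e.g. before the new fc8:
drop = struct('type', 'dropout', 'rate', 0.5);
net.layers = [net.layers(1:end-2), {drop}, net.layers(end-1:end)];

% Grayscale inputs are repeated across the three channels in getBatch,
% e.g. im = repmat(im, [1 1 3]);
```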
The fully convolutional layer is replaced by a new (randomly initialized) FC layer with output size 4096 x 15. The results of this architecture follow.
Result
The above network was retrained on the last 9 layers, and an accuracy of 87.6% was achieved in 5 epochs with a constant learning rate of 0.001.
Architecture 1 (with image size jittering)
The above architecture was used with image size jittering to produce the following results:
Result
I added some image size jittering in this part, hoping for an increase in accuracy, but it didn't affect the results much.
Accuracy: 86.87% in 5 epochs and 87.3% in 10 epochs
Architecture 2
I tried adding normalization after each conv layer, but it failed miserably.
Result
Validation Accuracy: 50.6%
Architecture 3 with Backprop Depth: 5
Replace the last fully convolutional layer (FC-8) with a new (randomly initialized) FC layer of output size 1024 x 15, and replace FC-7 with a new (randomly initialized) FC layer of output size 4096 x 1024.
Result
Validation Accuracy in 5 epochs: 87.7% (756.40 secs)
Architecture 3 with Backprop Depth: 6
Result
Validation Accuracy: 86.2% in 4 epochs (537.44 seconds)
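The backprop depth is just a training option in cnn_train: gradients are propagated only through the last N layers, so everything earlier keeps its pretrained ImageNet weights. A minimal sketch:

```matlab
% Fine-tune only the last 5 layers of the modified VGG network.
[net, info] = cnn_train(net, imdb, @getBatch, ...
    'learningRate', 1e-3, 'numEpochs', 5, 'backPropDepth', 5);
```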
Architecture 4
Replaced all the fully convolutional layers with a single fully convolutional layer (a code sketch follows).
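A sketch of this surgery, assuming MatConvNet's VGG-16 layout where pool5 is layer 31 and outputs a 7x7x512 volume (both assumptions worth verifying against the loaded model):

```matlab
% Keep conv1_1 through pool5; drop fc6, fc7, fc8, and the softmax.
net.layers = net.layers(1:31);
% A single conv layer maps the 7x7x512 pool5 output straight to 15 classes.
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{0.01*randn(7, 7, 512, 15, 'single'), zeros(1, 15, 'single')}}, ...
    'stride', 1, 'pad', 0);
net.layers{end+1} = struct('type', 'softmaxloss');
```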
Result
Validation Accuracy: 82.66% in 5 epochs (588.709 seconds)
Collated Results
The following results are with a learning rate of 0.001 over 5 epochs.

Network | Accuracy |
---|---|
Architecture 1 | 87.6% |
Architecture 1 (image size jittering) | 86.87% |
Architecture 2 | 50.6% |
Architecture 3 (Backprop Depth 5) | 87.7% |
Architecture 3 (Backprop Depth 6) | 86.2% |
Architecture 4 | 82.66% |
Extra credit
Image Sketch Token Recognition
We use the data set provided as part of the article "How Do Humans Sketch Objects?" (see References).
Setup
- Download the image dataset from here. The dataset contains 250 image categories with 80 images in each category.
- The images are named with numbers.
- To split into train and test sets, I duplicated the dataset folder into train and test copies.
- In train, I delete all images whose names end in an odd digit; in test, all images whose names end in an even digit.
The following shell script splits the dataset into the training and test sets:
```sh
# Duplicate the dataset, then prune each copy by the final digit of the
# file name: train keeps even-numbered images, test keeps odd-numbered ones.
cp -r dataset train
mv dataset test
find train/ -name "*[13579].png" -delete   # remove odd-numbered images
find test/ -name "*[02468].png" -delete    # remove even-numbered images
```
I added proj6_extra_credit_setup_data.m, proj6_extra_credit.m, and proj6_extra_credit_cnn_init.m for the extra credit section.
I used the VGG network from Part 2 for this task, with minor modifications:
- Data Set: As opposed to the given dataset, which had 15 categories with 200 images each, the sketch tokens data set has 250 categories with 80 images each. VGG requires each image to be 224 x 224 pixels. Due to memory constraints I chose 50 categories at random and train/test my network on these categories (a sketch of the selection follows the list).
- Network Changes: I used the exact same network from Part 2, but changed the final layer to predict 250 categories instead of 15.
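A sketch of the random category selection, assuming the train/test directory layout produced by the split script above:

```matlab
% Pick 50 of the 250 sketch categories at random.
d = dir('train');
cats = setdiff({d([d.isdir]).name}, {'.', '..'});
keep = cats(randperm(numel(cats), 50));
```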
Results
Validation Accuracy: 74.75%
Hyper Parameters
- Back prop depth: 5
- Learning Rate: 0.001
- Epochs: 10
- Batch Size: 50
References
- S. Liu and W. Deng, “Very deep convolutional neural network based image classification using small training sample size,” 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, 2015, pp. 730-734.
- Mathias Eitz, James Hays, and Marc Alexa. 2012. How do humans sketch objects?. ACM Trans. Graph. 31, 4, Article 44 (July 2012)