Scene Recognition with Bag of Words

Felix Christian Lv2

1. Setting

1.1 Objective

The goal of this project is to give you a basic introduction to image recognition. Specifically, we will examine the task of scene recognition starting with a very simple method, e.g., tiny images and nearest neighbor classification, and then move on to bags of quantized local features.

The dataset consists of images from 15 different categories. Each category contains multiple images, and the task is to classify these images into their respective categories.

1.2 Environment

  • OS: Linux (Ubuntu 22.04)
  • Lib: Python 3, scikit-learn, OpenCV, NumPy.

2. Algorithm

2.1 Tiny Image + KNN

The Tiny Image + KNN method works by resizing images to a small fixed size and flattening the pixel values into a vector. This vector is used as a feature representation for the image, which is then classified using the k-Nearest Neighbors algorithm.

  • Steps:
    1. Resize the images to a smaller size (e.g., 32x32 pixels).
    2. Flatten the resized image into a 1D vector of pixel intensities.
    3. Use KNN to classify the image based on the distance between feature vectors.

This method is simple but may not capture high-level patterns or features in the image, as it relies purely on raw pixel intensities.

2.2 Bag of SIFT + KNN

The Bag of SIFT + KNN method involves several key steps:

  1. SIFT Feature Extraction: Key points are detected in each image, and descriptors are computed for those key points.
  2. Clustering: SIFT descriptors are clustered into a predefined number of clusters (vocabulary size).
  3. Feature Representation: Each image is represented by a histogram of the number of occurrences of each visual word (cluster center).
  4. KNN Classification: The image's histogram is compared to the histograms of training images, and the class with the most neighbors is assigned to the image.

In my code, SIFT features are extracted and clustered into visual words using k-means clustering. The resulting vocabulary size is tested with different numbers of clusters (10, 30, 50, 70, 100).

3. Experiments

3.1 Results

The following table summarizes the classification accuracy for both the Tiny Image + KNN method and the Bag of SIFT + KNN method with different visual vocabulary sizes.

MethodVocabulary SizeAverage Accuracy
Tiny Image + KNNN/A0.2236
Bag of SIFT + KNN100.2624
Bag of SIFT + KNN300.3354
Bag of SIFT + KNN500.3550
Bag of SIFT + KNN700.3568
Bag of SIFT + KNN1000.3578

3.2 Accuracy Details

Here are the detailed accuracy results for each category in the dataset.

Tiny Image + KNN Results

CategoryAccuracy
coast0.4154
forest0.1053
highway0.5375
insidecity0.0673
mountain0.1606
office0.1826
opencountry0.3581
street0.5000
suburb0.3546
tallbuilding0.1719
bedroom0.1638
industrial0.0900
kitchen0.0818
livingroom0.1376
store0.0279

Bag of SIFT + KNN Results (Different Vocabulary Sizes)

CategoryVocabulary Size = 10Vocabulary Size = 30Vocabulary Size = 50Vocabulary Size = 70Vocabulary Size = 100
coast0.33080.34230.38460.37690.3654
forest0.72370.77630.82890.83770.8026
highway0.19380.28120.29380.31250.3375
insidecity0.13940.24520.27880.30290.2740
mountain0.20800.37230.45620.40510.4307
office0.26960.21740.33040.34780.3130
opencountry0.18710.25160.18390.22900.1935
street0.28120.32290.30730.28120.2448
suburb0.51060.70210.77300.73760.8014
tallbuilding0.17580.28520.30860.32420.3438
bedroom0.13790.22410.11210.16380.1466
industrial0.10900.16110.18010.23700.1848
kitchen0.14550.20910.16360.11820.1818
livingroom0.24870.29630.24870.22750.2487
store0.27440.34420.47440.45120.4977

4. Conclusion

  • The Tiny Image + KNN method results in lower accuracy (0.2236 on average) because it relies solely on raw pixel values, which do not capture the complex features and patterns of the images effectively.
  • Bag of SIFT + KNN performs much better, with the accuracy improving as the vocabulary size increases. The highest accuracy was obtained with a vocabulary size of 100, yielding an average accuracy of 0.3578.
  • Increasing the visual vocabulary size improves the model’s ability to classify images by capturing more detailed and distinct features from the SIFT descriptors.

In conclusion, Bag of SIFT + KNN is a more effective method for image classification compared to Tiny Image + KNN, especially when a larger vocabulary is used.

  • 标题: Scene Recognition with Bag of Words
  • 作者: Felix Christian
  • 创建于 : 2025-01-09 15:49:05
  • 更新于 : 2025-01-10 19:35:32
  • 链接: https://felixchristian.top/2025/01/09/14-SceneRecognition/
  • 版权声明: 本文章采用 CC BY-NC-SA 4.0 进行许可。
评论