Recently, I participated in the National Data Science Challenge (NDSC) 2019 by Shopee, together with three of my colleagues (one of whom recently published his Medium article here). We were in the Advanced Category of the competition, which required us to predict the attributes of products in three categories (namely beauty, fashion, and mobile) based on inputs such as product titles and images given in the dataset by Shopee. As my interest leans more towards computer vision (CV), I concentrated on working with the product images.
The Motivation behind Object Detection
While I was doing the standard data exploration, I noticed that the product images contained a significant amount of noise.
As seen in the example above, the image shows not only the product but also irrelevant things (e.g. people, words). This can cause a naive CV model (e.g. a Convolutional Neural Network model) to mistake such irrelevant stuff for features of the product.
Why is it undesirable for the CV model to treat people, words, etc as features? As a simple example, let’s say we have a set of training images where images for fashion product A contain European models and images for fashion product B contain Asian models (no offense intended towards any group of people). If a naive CV model trained on such a training set comes across a new image of product A containing an Asian model, it may think that the image shows product B and vice versa. This is because the model has mistakenly associated the product with the race of the person in the image, rather than the actual features of the product (color, shape, etc).
As such, there is a need to somehow separate the relevant object (i.e. the product itself) from all the irrelevant stuff. To describe this more technically, there is a need to improve the signal-to-noise ratio of the images before feeding them to a CV model for training. Object detection is one possible method for performing this job.
The Object Detection Model
The model architecture that I applied for object detection was the SSD300. SSD is the abbreviation for Single Shot Detector, a type of object detection model, while 300 indicates that the required input image dimensions are 300 pixels x 300 pixels. I came across this repository by GitHub user sgrvinod, which gives a detailed and useful explanation of how the SSD300 works and also contains the code for implementing the SSD300 in PyTorch. As such, I will not go into the details of how the whole model works.
In terms of the training and evaluation dataset for the SSD300, I sampled 500 images from each of the 3 categories and created XML files containing annotations in the Pascal Visual Object Classes (VOC) format. The VOC format is commonly used for the training and validation data of various object detection models, the SSD300 being one of them. The code snippet below is an example of what an annotation in VOC format looks like:
```xml
<annotation>
    <folder>mobile_image</folder>
    <filename>fd8e9d0609dfd43924731b646a3f8690.jpg</filename>
    <path>C:\Users\thefo\Downloads\bb_images\mobile_image\fd8e9d0609dfd43924731b646a3f8690.jpg</path>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>640</width>
        <height>640</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>mobile</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>1</difficult>
        <bndbox>
            <xmin>147</xmin>
            <ymin>147</ymin>
            <xmax>277</xmax>
            <ymax>315</ymax>
        </bndbox>
    </object>
    <object>
        <name>mobile</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>1</difficult>
        <bndbox>
            <xmin>10</xmin>
            <ymin>142</ymin>
            <xmax>171</xmax>
            <ymax>335</ymax>
        </bndbox>
    </object>
    <object>
        <name>mobile</name>
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>1</difficult>
        <bndbox>
            <xmin>1</xmin>
            <ymin>336</ymin>
            <xmax>125</xmax>
            <ymax>499</ymax>
        </bndbox>
    </object>
    <object>
        <name>misc</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>1</difficult>
        <bndbox>
            <xmin>131</xmin>
            <ymin>130</ymin>
            <xmax>292</xmax>
            <ymax>333</ymax>
        </bndbox>
    </object>
    <object>
        <name>misc</name>
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>1</difficult>
        <bndbox>
            <xmin>1</xmin>
            <ymin>130</ymin>
            <xmax>183</xmax>
            <ymax>348</ymax>
        </bndbox>
    </object>
    <object>
        <name>misc</name>
        <pose>Unspecified</pose>
        <truncated>1</truncated>
        <difficult>1</difficult>
        <bndbox>
            <xmin>1</xmin>
            <ymin>315</ymin>
            <xmax>152</xmax>
            <ymax>509</ymax>
        </bndbox>
    </object>
</annotation>
```
Thanks to some of my colleagues, I found a tool that helped me create the annotations very quickly and easily. With this software, I managed to annotate around 1,000 images within 8 hours (the typical amount of time I spend at work each day).
The annotation file contains information such as the image path, the image size, and the label(s) and corner coordinates of the bounding box(es). Python code, such as that by sgrvinod, can be written to extract the relevant information from the XML file via packages (e.g. the built-in ElementTree XML API).
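As a minimal sketch of this extraction step, using only the standard library (the function name and the returned dictionary layout here are my own, not sgrvinod's):

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    """Parse a Pascal VOC XML file into a list of labelled bounding boxes."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append({
            "label": obj.find("name").text,
            "difficult": int(obj.find("difficult").text),
            # VOC stores the top-left and bottom-right corners in pixels
            "box": tuple(int(bndbox.find(tag).text)
                         for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return boxes
```

Each dictionary in the returned list corresponds to one `<object>` element, so the annotation above would yield six boxes (three labelled `mobile` and three labelled `misc`).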
I made some modifications to modularize the processes of training, evaluation, and detection, as well as to load my own images and annotations rather than the dataset used by sgrvinod. The modified code can be found in my GitHub repository for NDSC 2019.
One SSD300 model was trained for each of the 3 product categories, as the products in each category have very different characteristics from those in the other categories. For each category, the corresponding dataset was bootstrapped (sampled with replacement) to obtain a larger amount of out-of-bag validation data while keeping the training dataset size constant. To prevent overfitting, the training dataset was put through data augmentation (random transformations on images to “create” new images) during the training phase.
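The bootstrap split described above can be sketched as follows (a minimal illustration with the standard library; the function and variable names are my own):

```python
import random

def bootstrap_split(image_ids, seed=42):
    """Sample with replacement to form a training set of the same size as the
    original; images never drawn become the out-of-bag validation set."""
    rng = random.Random(seed)
    train = [rng.choice(image_ids) for _ in image_ids]
    out_of_bag = [i for i in image_ids if i not in set(train)]
    return train, out_of_bag
```

For a sample of n images, each image is left out of the bootstrap with probability (1 - 1/n)^n, which approaches 1/e ≈ 36.8% for large n, so roughly a third of the 500 images per category end up as out-of-bag validation data while the training set stays at 500.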
After some training and evaluation, I fed one image per category from the validation set into the corresponding model to get a sense of how the object detection model proposes bounding box(es) and predicts the class of the object within each bounding box. The images with their proposed bounding boxes are shown below:
In terms of the evaluation metrics:
- Multibox Loss (a combination of regression loss for bounding box corner coordinates and classification loss for the class within the bounding box): The multibox loss was the lowest for the SSD300 for fashion products (at 1.527), followed by 2.607 for mobile products and 3.173 for beauty products.
- Average Precision: The average precision was the highest for the SSD300 for fashion products (at 0.90), followed by 0.83 for mobile products and 0.75 for beauty products.
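Both metrics rest on the intersection-over-union (IoU) between predicted and ground-truth boxes: under the standard VOC protocol, a predicted box counts as a true positive when its IoU with a ground-truth box of the same class exceeds 0.5. A minimal IoU computation, using the same (xmin, ymin, xmax, ymax) corner convention as the annotations:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (xmin, ymin, xmax, ymax) form."""
    # Width and height of the overlapping region (zero if the boxes are disjoint)
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, two identical boxes give an IoU of 1.0, disjoint boxes give 0.0, and a box shifted halfway across an identical box gives 1/3.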
The object detection model for fashion products seems to work the best, probably because fashion products (primarily clothing) are pretty much the same in terms of general shape. In contrast to this, beauty products (e.g. cosmetics, toiletries) come in all shapes and sizes, so it can be tougher for an object detection model to tell if one or more beauty products exist in an image. Nevertheless, the SSD300 models seem to work pretty well in the task of object detection, given that the amount of time needed to train these models is relatively short.
Ending the Story
Although my team didn’t win the top few prizes, NDSC 2019 was really memorable, at least for me. Through this competition, I realized that there was still much for me to learn in the field of machine learning. I also stepped out of my comfort zone by building CNN models in PyTorch (I used Keras prior to this competition). If there was any reward I got out of all this, it would be having object detection models that work surprisingly well with a small training set, and this reward is good enough for me.
This article was originally published here.