In this post, I’ll describe the Single-Shot MultiBox Detector designed for generalized object detection tasks and explain how our research team applied it to detect faces.
In object detection, the task is to determine both the location and the class of each object: the “localization task” and the “classification task”. Deep learning methods for object detection use convolutional neural networks to achieve this. The earliest methods separated the two tasks into two logical steps: 1) propose candidate regions using a region proposal network, and 2) classify them using a classifier. Because two neural networks share the detection work, these are termed “two-stage” networks; Faster R-CNN and its predecessors fall into this category. While two-stage networks achieve very high accuracy, they are also relatively heavy.
Newer methods complete both tasks in one neural network, which is why they are called single-shot detectors. Both YOLO (You Only Look Once) and SSD (Single-Shot MultiBox Detector) fall under this category. These networks are fundamentally similar, but differ in their architectural details.
The basic idea behind a single-shot detector is that it extracts features, proposes regions, and classifies them all in one forward pass. It does this through an architecture that uses feature-extracting backbones and detection layers with pre-defined anchor boxes. SSD’s architecture is shown below:
The original paper uses VGG-16 as the feature-extracting network, but it can be swapped out for another backbone, such as MobileNetV2. In the picture, this is labeled “VGG-16 through Conv5_3 layer”. After the base network, SSD appends additional convolutional feature layers, denoted “Extra Feature Layers” in the picture, until it produces a feature map of size 1x1x256.
All of the layers with a horizontal line attached (feature maps of sizes 38, 19, 10, 5, 3, and 1) are used for prediction. That is, anchor boxes at each of these scales serve as “proposed regions” for which the network predicts classification scores and bounding box coordinates. The fact that SSD predicts regions at so many feature scales makes it a great choice for detecting objects across a wide range of sizes. This is one fundamental difference between SSD and YOLO: YOLO produces only 98 detections in total, while SSD produces 8732. That count comes from (38×38×4) + (19×19×6) + (10×10×6) + (5×5×6) + (3×3×4) + (1×1×4) = 8732. Each layer has a designated number of anchors (4, 6, 6, 6, 4, and 4, respectively) with predefined aspect ratios to handle objects of varying shapes and scales.
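The anchor arithmetic above is easy to verify in a few lines. The feature map sizes, anchors-per-cell counts, and the w = s√a, h = s/√a anchor-shape rule all come from the original SSD paper; the scale value in the example is arbitrary:

```python
import math

# Feature map side lengths and anchors per cell for the six SSD300
# prediction layers (values from the original SSD paper).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
anchors_per_cell = [4, 6, 6, 6, 4, 4]

# Every cell of every prediction layer emits one box per anchor.
total_boxes = sum(s * s * a for s, a in zip(feature_map_sizes, anchors_per_cell))
print(total_boxes)  # 8732

# Each anchor's width and height follow from its scale s and aspect
# ratio a: w = s * sqrt(a), h = s / sqrt(a) (also from the paper).
def anchor_dims(scale, aspect_ratio):
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root

print(anchor_dims(0.1, 2.0))  # a wide 2:1 anchor
print(anchor_dims(0.1, 0.5))  # a tall 1:2 anchor
```

Note that all anchors sharing a scale also share the same area (s²); only their shapes differ.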
During training, since localization and classification are learned at the same time, a specialized loss function is required. MultiBox loss handles both tasks simultaneously: smooth L1 loss is used to regress the bounding box parameters, while categorical cross-entropy loss is used to learn the classification task.
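Here is a simplified per-anchor sketch of those two loss terms in plain Python. Real implementations vectorize this over all matched anchors and add hard negative mining; the function names and the `alpha` weighting parameter are my own labels for illustration:

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 (Huber) loss on one box coordinate, as used for
    bounding-box regression in SSD."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

def cross_entropy(class_probs, true_class):
    """Categorical cross-entropy for one anchor's class prediction."""
    return -math.log(class_probs[true_class])

def multibox_loss(pred_box, true_box, class_probs, true_class, alpha=1.0):
    """Combined loss for one matched anchor: classification term plus
    alpha-weighted localization term, summed over the 4 box parameters."""
    loc = sum(smooth_l1(p, t) for p, t in zip(pred_box, true_box))
    conf = cross_entropy(class_probs, true_class)
    return conf + alpha * loc
```

The smooth L1 term is quadratic for small errors and linear for large ones, which keeps the box regression stable when predictions are far off early in training.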
We trained two models on the WIDER FACE benchmark dataset: 1) the original SSD with the VGG-16 backbone, and 2) SSD-MobileNet, the same detection network but with a MobileNetV2 backbone. This dataset has over 12,000 images of faces at varying scales, with different levels of occlusion, pose, and lighting, making it a great dataset for training robust face detectors.
The original SSD and SSD-MobileNet produced some visibly different results. We trained both models for about 50 epochs and observed the results on one of the test images from the benchmark:
While both networks were able to detect faces, SSD-MobileNet converged much more quickly than the original SSD, most likely due to MobileNet's depthwise separable convolutions, which make it far more efficient at learning feature maps. The boxes were also a lot tighter with SSD-MobileNet. Based on this, and on the model sizes (switching VGG out for MobileNet reduced the model size by 84 MB!), we decided to move forward with training SSD-MobileNet for around 200 epochs total. By the end, with the confidence threshold set to 95%, we obtained the result shown in the first picture. Here’s another result:
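Setting a confidence threshold like this amounts to a simple filter over the raw detections. A minimal sketch, where the `(score, box)` tuples are a simplified stand-in for the model's actual output tensors:

```python
def filter_detections(detections, threshold=0.95):
    """Keep only detections whose confidence clears the threshold."""
    return [(score, box) for score, box in detections if score >= threshold]

# Hypothetical raw output: (confidence, (x1, y1, x2, y2)) per detection.
raw = [
    (0.99, (12, 30, 64, 88)),    # confident face detection
    (0.97, (140, 25, 190, 80)),  # another confident detection
    (0.40, (5, 5, 20, 20)),      # low-confidence noise, filtered out
]
print(filter_detections(raw))  # only the two high-confidence boxes remain
```

In practice this thresholding is applied after non-maximum suppression, which removes overlapping duplicates of the same face.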
Our research team was able to see some really good real time results as well. Once, while testing it in the lab on our Jetson Nano using a Raspberry Pi camera, the model detected all our faces robustly and tightly as usual, but we noticed a “phantom” floating box on the window nearby. We thought it was a mistake, but realized there was a reflection of one of our faces in the window that we couldn’t even pick up on with our human eyes. That was pretty amazing and attests to the quality of our model.
SSD is great for real-time detection of objects across a wide range of sizes, as we saw here when detecting faces at different scales. SSD also has different aspect ratios built in, which can be useful: giraffes are shaped very differently than cars. In object detection challenges with hundreds of classes, such as the ImageNet detection task, this is a huge asset. For detecting faces, though, it’s a different story, as most faces are shaped pretty similarly. While SSD does not take advantage of this, there is research on optimizing object detectors specifically for faces. Google’s BlazeFace uses anchor scheme optimization, among other ideas, to design a super fast face detector meant for close-up faces on mobile phones. I’ll get into that in my next post!
SSD: Single Shot MultiBox Detector: https://arxiv.org/abs/1512.02325
 BlazeFace: Sub-millisecond Neural Face Detection on Mobile GPUs: https://arxiv.org/abs/1907.05047
 WIDER FACE: A Face Detection Benchmark: http://shuoyang1213.me/WIDERFACE/