Abstract |
We describe the development of an image classification process for the recognition of traffic congestion. Transport for London manages a network of surveillance cameras which are used to monitor road junctions in the city. Images and video sequences (each a few seconds long) from over 1,200 cameras are collected and made available for third-party use in the development of new applications. The images and video sequences are released with a five-minute refresh rate. This paper focuses on processing the still images, which are 352×288 pixels. Imagery was collected over an extended time period and includes examples at all times of the day and night, in different location types and across the range of traffic and environmental conditions prevailing at collection time. Typical examples of images collected and used in this paper are shown below. These illustrate just some of the variation in lighting, traffic and road configuration present in the dataset. A subset of the imagery covering five locations and a 24-hour time period was selected for ground truth labelling. To carry out this step, an internet-accessible image annotation tool was developed which allows each image to be labelled as one of: uncongested, congested, unknown or broken. These labels allow both hard and soft (probabilistic) labels to be derived for the images, enabling their use for training and testing a classifier. Deep-learned classifiers have been designed in two ways: by transfer learning from GoogLeNet with a follow-on subnet, and as a bespoke network trained from scratch. For the former, the subnet is a five-layer fully-connected deep network which takes its inputs from the final convolutional layer of GoogLeNet. The subnet transforms the image features into the classification result we seek. The bespoke network comprises convolutional, max-pooling and fully-connected layers.
In both cases an Adam optimizer is used on stochastic mini-batches until the cross-entropy on a held-out validation image set stops improving. Results show that a classification accuracy of over 95% is achievable, and examination of the misclassified images suggests that many of them are borderline cases in which it is difficult to decide whether the scene is congested or not. The images above show examples of an image classified as: uncongested (left), congested (right) and one of the borderline cases (centre). Extracting intermediate outputs from the deep networks allows them to be used as a regression model, so that scenes can be ordered from low congestion to high congestion. The classifier is designed to be one of several inputs to a higher-level capability which combines data from different sources and modalities for the extraction of situational awareness. Future activities will add interpretability to our classification model. This will help address the issue of trust in deep-learned classifiers when sharing intelligence with other processes, and may help share justifications for decisions without having to release restricted input data. |
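The labelling and stopping criterion described above can be illustrated with a minimal sketch. The abstract does not say how soft labels are derived from the annotations, so the following assumes that each image receives several annotations and that the soft label is the fraction of "congested" votes among the usable (uncongested or congested) ones, with "unknown" and "broken" votes ignored. The function names, the 0.5 threshold and the `patience` parameter are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# The four annotation options offered by the labelling tool.
LABELS = ("uncongested", "congested", "unknown", "broken")


def soft_label(votes):
    """Return an assumed P(congested) for one image, or None if no usable votes.

    Assumption: 'unknown' and 'broken' votes carry no congestion information.
    """
    counts = Counter(votes)
    unexpected = set(counts) - set(LABELS)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    usable = counts["uncongested"] + counts["congested"]
    if usable == 0:
        return None
    return counts["congested"] / usable


def hard_label(votes, threshold=0.5):
    """Collapse the soft label to a single class (threshold is an assumption)."""
    p = soft_label(votes)
    if p is None:
        return None
    return "congested" if p >= threshold else "uncongested"


def cross_entropy(p_true, p_pred, eps=1e-12):
    """Binary cross-entropy between a soft target and a predicted probability."""
    p_pred = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p_true * math.log(p_pred) + (1.0 - p_true) * math.log(1.0 - p_pred))


def should_stop(val_losses, patience=3):
    """True once the last `patience` epochs fail to beat the best earlier loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


# Example: three usable votes and one unusable vote for a single image.
votes = ["congested", "congested", "uncongested", "unknown"]
print(soft_label(votes), hard_label(votes))
```

In a training loop, `should_stop` would be called after each epoch with the history of validation cross-entropy values. This is one simple reading of "until the cross-entropy on a held-out validation image set stops improving", not necessarily the rule used in the paper.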