Abstract
We propose an approach to detecting violence in CCTV feeds that is robust to new datasets and situations. The approach breaks with the traditional assumption that large amounts of representative training data are available. Detecting violence in CCTV feeds is a hard problem, and solving it is of paramount importance for effective situational understanding. Violence covers a wide spectrum of activities, from abuse to fighting to road accidents, which can take place in completely different environments: public buildings, underground stations, or roads, by day or by night. This breadth of activities and environments makes violence detection a hard classification task for machines. We show that there are specific, detectable, and measurable features of video feeds that correlate with, among other things, violence, and that, by fusing such features with semantic knowledge, we can in principle estimate which video sequences correlate with violence.