As the number of videos recorded and stored on mobile devices increases, efficient techniques for searching video content become increasingly important, especially for applications such as locating the moment of a crime or other specific actions. When a user issues a query for a specific action over a large amount of data, the goal is to respond both accurately and quickly. In this paper, we address the problem of responding in a timely manner to queries that search for specific actions in videos on mobile devices, using both visual and audio content processing. We build a system, called VidQ, which consists of several stages and uses various Convolutional Neural Networks (CNNs) and Speech APIs to respond to such queries. Because state-of-the-art computer vision and speech algorithms are computationally intensive, we use servers with GPUs to assist mobile users in the process. After a query has been issued, we first identify the processing stages that may be required, and then determine the order of the stages that make up the system. Finally, we distribute the processing among the available network resources to further improve performance by minimizing the processing time. Results show that VidQ reduces the completion time by at least 50% compared to other approaches.