With the increasing prevalence of cameras and the promotion of video sharing by social media platforms, users can now easily record and share massive amounts of video. Video analysis has therefore become one of the essential problems in computer vision and multimedia analysis. Thanks to the recent development of deep learning techniques, researchers have significantly improved performance on video analysis tasks such as action recognition, object tracking, video retrieval, and object segmentation. This progress has in turn opened vision-language research directions for analyzing video content. Among these, the task of “temporal action localization in videos via natural language query” is an important branch that has attracted wide attention from both industry and academia.
As shown in the figure, given a long, untrimmed video and a natural language query such as “The girl in orange walks by the camera”, the model should output the start and end times of the moment described by the query. Compared with the traditional temporal action localization task, this formulation allows not only an open set of activities but also the natural specification of additional constraints, including objects, their properties, and the relations between the involved entities. Natural language queries are therefore a desirable way to localize activities. The task is also useful in real-world scenarios such as robotic navigation, autonomous driving, and surveillance.
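The input and output of the task can be made concrete with a minimal sketch. The names `Moment` and `localize` are hypothetical and introduced only for illustration; the stub does not implement any localization method and simply returns the whole video, to show the contract of mapping a video and a query to a start and end time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Moment:
    """A temporal segment of a video, in seconds (hypothetical type)."""
    start: float
    end: float

    def __post_init__(self):
        # A valid moment starts at or after 0 and does not end before it starts.
        if not 0.0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")


def localize(video_duration: float, query: str) -> Moment:
    """Placeholder for a localization model (hypothetical interface).

    A real model would process the video frames and the query text.
    This stub only illustrates the contract by returning the full video.
    """
    if not query.strip():
        raise ValueError("query must be non-empty")
    return Moment(0.0, video_duration)


# Example call using the query from the text and an assumed 120-second video.
moment = localize(120.0, "The girl in orange walks by the camera")
print(moment.start, moment.end)
```

Any actual model would replace the body of `localize` while keeping the same output: a single pair of timestamps inside the untrimmed video.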