PhD proposal - Inria



INRIA Sophia Antipolis, STARS group
2004, route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France

Title: People Tracking using Deep Learning algorithms on embedded hardware

2. Scientific context

The STARS group works on automatic video sequence interpretation. The SUP ("Scene Understanding Platform") platform developed in STARS detects mobile objects, tracks their trajectories and recognises related behaviours predefined by experts. This platform contains several techniques for detecting people and for recognising human postures and gestures using video cameras. However, people detection in real-world scenes still raises scientific challenges: cluttered scenes, wrong or incomplete person segmentation, static and dynamic occlusions, low-contrast objects, moving contextual objects (e.g. chairs), etc. Moreover, new hardware components improving people detection have been released, for instance high-resolution cameras, GPUs and FPGAs, which are becoming more popular and accessible. The basic idea behind this new hardware is to embed real-time algorithms with a high demand in processing power, such as Deep Learning algorithms for People Detection, Pose Estimation or People Tracking. These algorithms are well suited to applications that aim at monitoring people (e.g. security, or monitoring Alzheimer patients in hospital).

3. Current state of People Detection and Tracking systems

The accuracy of People Tracking algorithms is directly linked to the performance of People Detection systems, of which there is a wide variety. According to [10], state-of-the-art People Detection systems achieve an impressive performance, with a miss rate of 22.5% on the Caltech dataset, the most widely used benchmark for People Detection, consisting of a very long video taken from a car and looking at pedestrians.
This state-of-the-art approach uses a random forest classifier and combines a number of machine learning strategies for People Detection. The false negatives of present systems primarily stem from contextual factors such as heavy illumination variations, low contrast, high-density crowds where delineating individual people is non-trivial, and uncommon poses and clothing that have little representation in the datasets. It is harder to comment on the causes of false positives, because that requires a detailed analysis of the feature representation and of the actual information encoded by the features and classifiers. [10] also points out that the major improvements in People Detection systems have resulted from better features and, to some extent, from the use of contextual information. The capacity of trained models to transfer expertise across datasets, however, still needs improvement.

It is important to mention that, depending on the nature of the datasets and the specific camera settings, a multitude of different methods have been proposed for People Detection. While much work has focused on detecting pedestrians in single-camera static images, a number of methods take advantage of temporal information (e.g. optical flow) and of stereo information (when available) to reach better performance figures. Since 2015, the new trend consists in designing Deep Convolutional Neural Network (CNN) architectures to build efficient People Detection algorithms. Deep Learning has become very popular because it removes the need to construct handcrafted features: a deep learning based system can be seen as a feature extractor. Deep learning based features have been shown to be more robust to object category variations, and several deep learning based methods have proved their usefulness for object detection.
Deep learning based models can encapsulate a large number of object categories and variations within the same model [11]. There are various deep learning based schemes, such as Convolutional Neural Networks (CNN), Restricted Boltzmann Machines (RBM), etc. For object detection, CNNs are the most commonly used, since they are the most intuitive to use and provide excellent results. A number of works following [11] have further improved the application of deep learning models to People Detection [12, 13], outperforming previous machine learning approaches. However, these CNNs have mostly been trained on videos taken from cars and still struggle to detect people at low resolution, with low contrast or under partial occlusion. Most of these studies are driven by the automotive industry, targeting self-driving cars. Video-surveillance conditions are less often taken into account; they include static video cameras, top views and large fields of view, which bring new challenges for People Detection algorithms. Extending People Detection CNNs, other architectures have been proposed for Posture Recognition, such as DeeperCut [14] and Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields [15], but they require good image resolution and high processing power.

As described in [1, 6], CNNs work well on images for object recognition, since there are large annotated datasets with good resolution (e.g. ImageNet) and architectures (e.g. Caffe) suitable for still images. However, for videos, capturing motion information and temporal coherency with a CNN remains a challenge. Several techniques have been tested for visual single-object tracking and for people re-identification [9], but little has been done for multi-object tracking [7].
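As a toy illustration of the tracking-by-detection idea discussed above, the sketch below greedily associates each existing track with the most similar detection by cosine similarity of appearance features. The feature vectors here are hypothetical hand-written stand-ins for real CNN activations, and `associate` is an illustrative name, not an existing library function:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def associate(tracks, detections, threshold=0.5):
    """Greedily match each track's appearance feature to the most
    similar unassigned detection feature; detections left unmatched
    would start new tracks."""
    matches, used = {}, set()
    for tid, tfeat in tracks.items():
        best, best_sim = None, threshold
        for i, dfeat in enumerate(detections):
            if i in used:
                continue
            sim = cosine(tfeat, dfeat)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None:
            matches[tid] = best
            used.add(best)
    new_tracks = [i for i in range(len(detections)) if i not in used]
    return matches, new_tracks

# two tracks, two detections in the next frame (hypothetical features)
tracks = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}
detections = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0]]
matches, new_tracks = associate(tracks, detections)  # -> {0: 1, 1: 0}, []
```

Real multi-object trackers typically replace this greedy loop with globally optimal assignment (e.g. the Hungarian algorithm) and combine appearance similarity with motion cues.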
So using the deep features extracted by People Detection algorithms in an efficient manner for People Tracking remains an unexplored topic. In conclusion, the open issues in People Detection can be summarized as follows:
- Integrating CNNs into an embedded system (compactness, power efficiency, accelerators).
- Extending People Detection CNNs to address challenging conditions:
  - low-resolution, low-contrast or partially occluded people;
  - generic camera viewpoints, including top views as in video surveillance;
  - crowded scenes.
- Designing People Tracking algorithms that take advantage of deep features.

4. General objectives of the PhD

This work consists first in evaluating the performance of Deep Convolutional Neural Network (CNN) algorithms applied to images or videos for security applications, such as People Detection and Posture, Gesture and Action Recognition algorithms [6]. Several types of accelerators will be tested: server processors with a large number of cores, FPGA accelerators and GPU accelerators. We will focus on the computation phases that run on the embedded computer. The rates, latencies and numbers of parallel events processed will be compared across hardware architectures, software frameworks and algorithms. The hardware architectures, software frameworks and related expertise will be provided by the Kontron Company, located in Toulon. DL algorithms are effective, but running them in real time on embedded architectures remains a challenge, so this work aims at designing DL algorithms that match the hardware constraints. The goal is to review the literature, evaluate existing Deep CNN libraries and propose new optimized algorithms.

The second part of the work consists in designing a new People Detection and Tracking approach that takes into account the latest advances in Deep CNNs, extending People Detection to the temporal domain. The goal will thus be first to study the latest people detection algorithms and extend them to improve detection reliability.
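The comparison of rates and latencies described above can be sketched with a minimal per-frame benchmark. Here `benchmark` and `dummy_detector` are illustrative names introduced for this sketch (the dummy stands in for a real CNN forward pass), not part of any existing framework:

```python
import time
import statistics

def benchmark(detector, frames, warmup=2):
    """Measure per-frame latency (ms) and throughput (fps) of a detector callable."""
    for f in frames[:warmup]:          # warm-up runs, excluded from the statistics
        detector(f)
    latencies_ms = []
    for f in frames:
        t0 = time.perf_counter()
        detector(f)                    # one forward pass on one frame
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    mean_ms = statistics.mean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(latencies_ms),
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }

def dummy_detector(frame):
    """Stand-in for a CNN people detector: returns one fake bounding box."""
    return [(0, 0, 10, 20)]

stats = benchmark(dummy_detector, frames=list(range(50)))
```

On real hardware the same harness would wrap the actual inference call on each target (CPU, GPU, FPGA), so the resulting latency and fps figures are directly comparable across platforms.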
A first research direction consists in applying classical Machine Learning techniques, such as Boosting, Decision Trees and Classifier Ensembles, to People Detection CNNs. As shown in [8], boosting a CNN can greatly increase detection accuracy. Detection performance can also be improved by structuring the training dataset into fine categories of positive and negative samples and organizing them into a tree hierarchy. Increasing the filter resolution is another technique to detect people at low resolution (under 40 pixels). A second research direction is to combine body parts to increase the likelihood of detecting the whole human body, as has been done with the Deformable Part Model (DPM) [16]. For instance, detecting people's heads can provide strong clues for detecting people in crowds.

The next objective will be to process the deep features extracted for each person and to track them throughout the video stream. A challenge will be to find features appropriate for both people detection and tracking. A first direction consists in using state-of-the-art trackers [17] and feeding them with deep features. A second direction is to design a Long Short-Term Memory (LSTM) architecture to handle the variation of the deep features extracted while tracking a target throughout the video. A third direction is to compute an invariant visual signature of the target, as has been done for the Re-Identification task [9].

To validate the work, we will assess the proposed algorithms on video-surveillance applications and on homecare videos from Nice Hospital and from public places. In particular, the People Detection and Tracking algorithms will be validated on the Caltech benchmark dataset [19] and the MOT Challenge benchmark [18].

There is a possibility of conducting an internship first, before the PhD thesis.

5. Pre-requisites

Master 2 (or Research Engineer) in Computer Vision, with a strong background in C++ programming, OpenCL, Linux, artificial intelligence, cognitive vision and Machine Learning.

6. Place of the PhD

Shared between Kontron (Toulon) and Inria Sophia Antipolis.

7. Schedule

1st year:
- Study the limitations of existing DL People Detection and Tracking algorithms on the specified architectures.
- Propose detailed research directions for People Detection and Tracking algorithms.

2nd year:
- Design and implement new DL People Detection algorithms to address in particular:
  - small people, by increasing the resolution of the filters within the convolution layers;
  - crowds, by combining head and full-body detection.
- Test these algorithms on the Caltech dataset and the MOT Challenge.
- Design novel Tracking algorithms taking as input the deep features from the People Detection CNN to build a visual signature of the target, taking into account the temporal coherency of the extracted features to get real-time performance.

3rd year:
- Evaluate, improve and optimize the proposed DL People Detection and Tracking algorithms by exploring the other research directions listed above.
- Oral presentation and writing of the PhD manuscript.

8. Bibliography

[1] Karen Simonyan, Ken Chatfield, Andrew Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC 2014.
[2] M. Koperski and F. Bremond. Modelling Spatial Layout of Features for Real World Scenario RGB-D Action Recognition. In Proceedings of the 13th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS 2016), Colorado Springs, Colorado, USA, 24-26 August 2016.
[3] F. Khan and F. Bremond. Unsupervised data association for Metric Learning in the context of Multi-shot Person Re-identification. In AVSS 2016, Colorado Springs, Colorado, USA, 24-26 August 2016.
[4] P. Bilinski and F. Bremond. Human Violence Recognition and Detection in Surveillance Videos. In AVSS 2016, Colorado Springs, Colorado, USA, 24-26 August 2016.
[5] E. Corvee and F. Bremond. Haar like and LBP based features for face, head and people detection in video sequences. In the International Workshop on Behaviour Analysis (Behave 2011), Sophia Antipolis, France, 23 September 2011.
[6] C. Roberto de Souza, A. Gaidon, E. Vig, and A. Lopez. Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition. ECCV 2016, Amsterdam, 11-14 October 2016.
[7] Anoop Sathyan, Joseph Cohen, and Manish Kumar. Deep Convolutional Neural Network For Human Detection And Tracking In FLIR Videos. AIAA Infotech @ Aerospace, AIAA SciTech Forum (AIAA 2016-1412).
[8] Liliang Zhang, Liang Lin, Xiaodan Liang, Kaiming He. Is Faster R-CNN Doing Well for Pedestrian Detection? ECCV 2016, Amsterdam, 11-14 October 2016.
[9] Sławomir Bak and Peter Carr. One-Shot Metric Learning for Person Re-identification. CVPR 2017, Hawaii, USA, 21-26 July 2017.
[10] Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Ten years of pedestrian detection, what have we learned? In Computer Vision - ECCV 2014 Workshops, pages 613-627. Springer, 2014.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, pages 1-9, 2012.
[12] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. SSD: Single Shot MultiBox Detector. arXiv preprint, pages 1-15, 2015.
[13] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. 2015.
[16] Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik. Deformable Part Models are Convolutional Neural Networks. CVPR 2015, USA.
[17] L.A. Nguyen, F.M. Khan, F. Negin and F. Bremond. Multi-Object tracking using Multi-Channel Part Appearance Representation. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS 2017), Lecce, Italy, 29 August - 1 September 2017.

Contact: serge.tissot@ / Francois.Bremond@inria.fr