National Chung Cheng University



YOLO Algorithm Based Object Detection with Object Tracking and Distance Estimation

National Chung Cheng University, Chiayi City, Taiwan
Department of Computer Science & Information Technology
Embedded Systems Laboratory

Guided by: Dr. Pao-Ann Hsiung, Professor
By: Michael Griffith, SWCU, Salatiga

Contents:
1. Introduction
2. Purposes
3. Layout and Design
4. Experimental Results
5. Comparison
6. Conclusion

1. Introduction

Before going deeper, this chapter gives definitions and explanations of several concepts that are used throughout this paper.

Artificial Intelligence

[Figure: Illustration of the relationship between AI, machine learning, and deep learning]

Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. In computer science, AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". (source: Wikipedia)

The main purpose of AI is to learn and/or solve problems as if it were a human mind working on the problem. There are many approaches to AI; one of them is machine learning.

Machine Learning

Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed. (source: Wikipedia)

In machine learning, the learner has the objective of generalizing from its learning "experience". There are many families of machine learning algorithms, such as decision tree learning, association rule learning, deep learning, and many others. Deep learning is the approach used in this paper.

Deep Learning

Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. (source: Wikipedia)

Deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design and board game programs, where they have produced results comparable to, and in some cases superior to, human experts. (source: Wikipedia)

Artificial Neural Network and Deep Neural Network

An Artificial Neural Network (ANN), or connectionist system, is a computing system inspired by the biological neural networks that constitute animal brains. Such systems learn (progressively improve their ability) to do tasks by considering examples, generally without task-specific programming. (source: Wikipedia)

A Deep Neural Network (DNN) is simply an artificial neural network with multiple hidden layers between the input and output layers.

[Figure: Illustration of an artificial neural network mimicking the biological brain]
[Figure: Illustration of an ANN and a DNN]

There are many types and classes of neural networks in deep learning; one of them is the Convolutional Neural Network (CNN).
The difference between a CNN and other types of neural network is that a CNN has one or more convolution layers. A convolution layer is simply a neural network layer that performs a convolution computation. Convolution is a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. The term convolution refers both to the resulting function and to the process of computing it. (source: Wikipedia)

Convolutional Neural Network

A Convolutional Neural Network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. (source: Wikipedia)

[Figure: Example of a CNN]

YOLO Algorithm

[Figure: YOLO architecture visualized (source: Medium, YOLO)]
[Figure: More details on the YOLO algorithm (source: arXiv)]

YOLO uses a single CNN for both classifying and localizing objects using bounding boxes. (source: Medium, YOLO Explained)

The full image is fed to a single neural network. The network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. (source: pjreddie)

[Figure: Illustration of how YOLO detection works (source: arXiv)]

Why use YOLO for this topic? YOLO has a very good balance between accuracy and speed of object detection. There are other neural networks such as R-CNN, Fast R-CNN, Faster R-CNN, and SSD MobileNet, but YOLO is a balanced network that is also fairly easy to use and learn, especially for this topic.

[Figure: Comparison between neural networks on the PASCAL VOC 2007 and 2012 datasets (source: Medium, Object Detection Comparison)]

As seen from the table, YOLO has a good balance between speed and accuracy. I had also run some tests beforehand, and my results agree with this table.

Object Detection

Object detection is a computer technology, related to computer vision and image processing, that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. (source: Wikipedia)

Object detection with the YOLO algorithm is done by calling a few functions in OpenCV. It should be noted that tutorials, examples, and documentation on running YOLO DNNs from Python through OpenCV (and other sources) are fairly limited, and quite a lot of time was spent finding out how to use YOLO object detection with OpenCV. Calling the functions from darknet itself is not helpful either, especially from Python, since there are no tutorials, examples, or documentation on how to call the YOLO detection functions. That is why OpenCV is preferred over using and importing the darknet library directly.
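Since documentation for this use case is sparse, the following is a minimal sketch of how a YOLO forward pass can be run through OpenCV's DNN module. It is not the exact script used in this work; the file names, input size, and confidence threshold are assumptions, and the snippet follows the cv2.dnn API as it exists in OpenCV 3.4.x.

    import cv2
    import numpy as np

    # Assumed file names; the YOLOv2 .cfg and .weights files can be downloaded from the pjreddie website.
    net = cv2.dnn.readNetFromDarknet("yolov2.cfg", "yolov2.weights")

    frame = cv2.imread("test.jpg")            # or a frame grabbed from the webcam
    frame_h, frame_w = frame.shape[:2]

    # Darknet networks expect a square, RGB, 0..1 normalized input blob.
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    # Run the forward pass up to the unconnected output (detection) layers.
    # The i[0] indexing matches the 3.4.x return format of getUnconnectedOutLayers().
    layer_names = net.getLayerNames()
    out_names = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    outputs = net.forward(out_names)

    # Each detection row is [cx, cy, w, h, objectness, class scores...], relative to the frame size.
    for output in outputs:
        for det in output:
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > 0.5:              # assumed threshold
                cx, cy, w, h = det[0:4] * np.array([frame_w, frame_h, frame_w, frame_h])
                x, y = int(cx - w / 2), int(cy - h / 2)
                print(class_id, confidence, (x, y, int(w), int(h)))

In practice the raw boxes would also be filtered, for example with non-maximum suppression, before being handed to the tracker.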
Distance Estimation

There are limited sources on what distance estimation is, so let us define it by first defining each word: what is the meaning of "distance", and what is the meaning of "estimation"?

Distance is a numerical measurement of how far apart objects are. (source: Wikipedia)

Estimation is the process of finding an approximation, which is a value that is usable for some purpose even if the input data may be incomplete, uncertain, or unstable. (source: Wikipedia)

Looking at both definitions, distance estimation roughly means the process of finding an approximation of a distance. In this case, it is the distance of an object that has been detected by the object detection.

2. Purposes

In this topic, I propose a method for estimating the distance of an object. The object is supposed to be detected before being tracked and having its distance predicted, which requires a program that can perform these functions simultaneously.

3. Layout and Design

The program was built as a Python 3 script and uses numpy, multithreading, and OpenCV v3.4.1 to handle the computer vision parts of the script, such as object detection and object tracking.

Object Detection

The YOLO algorithm is used for object detection. In this paper, object detection uses the YOLOv2 pretrained weights, which can be acquired from the pjreddie website. The detection is performed by calling the corresponding functions in OpenCV v3.4.1, which is how OpenCV is utilized for object detection.

Object Tracking

There are many object tracking algorithms available in OpenCV v3.4.1, such as BOOSTING, MIL (Multiple Instance Learning), KCF (Kernelized Correlation Filters), TLD (Tracking, Learning and Detection), MEDIANFLOW, GOTURN, and MOSSE (Minimum Output Sum of Squared Error). In this topic, the MedianFlow algorithm is used because, based on several tests, it gives a good balance between tracking accuracy and speed, so it does not put too much load on a CPU that is already doing object detection.

Distance Estimation

The proposed distance estimation uses a single camera, in this case a webcam. The estimate is calculated with basic trigonometry while exploiting a little of how the camera works.

[Figure: Obtaining the camera VFOV from the DFOV]

The technical specification of a camera usually lists its Diagonal Field of View (DFOV). For example, the camera used in testing was a Logitech C615 with a DFOV of 74°. With the formula given, the Vertical Field of View (VFOV) can be extracted and used in the distance estimation calculation.

[Figure: Simplified diagram of the proposed distance estimation]

The object distance can be determined with basic trigonometry if the angle of the object is known. This method works if the camera has a VFOV of less than 90°.

x_obj = h · tan(φ)

Where:
- x_obj = object distance
- h = camera height
- φ = object angle

The initial angle is called α, with the value:

α = θ + β

Where:
- α = initial angle
- θ = ½ of the VFOV
- β = camera tilt angle

[Figure: Illustration of the frame]

For example, this is what the frame looks like, where:
- y = the bottom pixel row of the object's bounding box
- y_max = the maximum pixel height of the frame

The object angle can then be obtained with the following formula:

φ = α - y · (2θ / y_max)

The camera tilt angle must meet the following condition for the method to work:

θ < β ≤ 90° - θ

Where:
- θ = ½ of the camera VFOV
- β = camera tilt angle

A few things must be known for this method to work:
- the camera height,
- the camera tilt angle, and
- the camera DFOV and/or VFOV.

So the working conditions of this method are as follows (a short numeric sketch of the calculation is given after this list):
- the camera tilt angle is known,
- the camera height is known,
- the camera DFOV and/or VFOV is known, and
- the object is on level ground.
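As a worked numeric sketch of the geometry above (not the exact script used in this work): assuming a rectilinear lens and a 16:9 frame for the DFOV-to-VFOV conversion, the distance follows from the camera height, tilt angle, and the bottom pixel row of the bounding box. The camera values below are illustrative assumptions, not the measurements from the experiments.

    import math

    def vfov_from_dfov(dfov_deg, width_px, height_px):
        """Derive the vertical FOV from the diagonal FOV, assuming a rectilinear lens."""
        diag_px = math.hypot(width_px, height_px)
        half_dfov = math.radians(dfov_deg) / 2
        return 2 * math.degrees(math.atan(math.tan(half_dfov) * height_px / diag_px))

    def object_distance(y_bottom, y_max, vfov_deg, tilt_deg, cam_height_m):
        """Distance to the ground contact point of a detected object (level ground assumed)."""
        theta = math.radians(vfov_deg) / 2            # half of the VFOV
        beta = math.radians(tilt_deg)                 # camera tilt angle
        if not (theta < beta <= math.pi / 2 - theta):
            raise ValueError("camera tilt angle must satisfy theta < beta <= 90 deg - theta")
        alpha = theta + beta                          # initial angle
        # y_bottom is the bounding box bottom row in image coordinates (0 at the top of the frame)
        phi = alpha - y_bottom * (2 * theta / y_max)  # object angle for this pixel row
        return cam_height_m * math.tan(phi)

    # Illustrative values only:
    vfov = vfov_from_dfov(74.0, 1280, 720)            # Logitech C615 DFOV is 74 degrees
    print(round(vfov, 1), "deg VFOV")
    print(round(object_distance(y_bottom=500, y_max=720, vfov_deg=vfov,
                                tilt_deg=60.0, cam_height_m=0.8), 2), "m")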
Program Design

The program is designed to perform its tasks in parallel. The working diagram of the program is shown below.

[Figure: Illustration of how the program works]

There are four main processes working separately in loops. They all update and/or read shared global variables such as the frames and the bounding box(es). A simple way of explaining the diagram is as follows (a threading sketch is given after this list):
1. "Read Video" reads a frame from the camera and updates the frame variable.
2. "Detection" takes the most recent frame and updates the bounding box(es).
3. "Tracking and Distance Prediction" takes the bounding box(es), keeps them tracked and updated, and estimates the object's distance.
4. "Show Frames" displays the frame, annotated with information such as the object's bounding box and distance.
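The following is a heavily simplified sketch of that structure: two of the four loops (frame grabbing, and tracking with distance estimation) shown as threads sharing global state, using the MedianFlow tracker from OpenCV 3.4.1 (contrib build). The locking strategy and helper names are assumptions, and the detection and display loops are omitted for brevity; the real program runs all four loops.

    import threading
    import time

    import cv2

    frame = None          # most recent camera frame (shared between loops)
    boxes = []            # most recent detections as (x, y, w, h); filled by the detection loop (omitted)
    lock = threading.Lock()
    running = True

    def read_video():
        """Loop 1: 'Read Video' keeps the global frame up to date."""
        global frame
        cap = cv2.VideoCapture(0)
        while running:
            ok, img = cap.read()
            if ok:
                with lock:
                    frame = img
        cap.release()

    def track_and_estimate():
        """Loop 3: 'Tracking and Distance Prediction' with OpenCV's MedianFlow tracker."""
        trackers = []
        while running:
            with lock:
                img = None if frame is None else frame.copy()
                new_boxes = list(boxes)
                boxes.clear()
            if img is None:
                time.sleep(0.01)
                continue
            if new_boxes:                                  # (re)initialize trackers on fresh detections
                trackers = [cv2.TrackerMedianFlow_create() for _ in new_boxes]
                for t, box in zip(trackers, new_boxes):
                    t.init(img, tuple(box))
            for t in trackers:
                ok, (x, y, w, h) = t.update(img)
                if ok:
                    bottom_row = int(y + h)   # feeds the distance formula sketched earlier
                    print("tracked box bottom row:", bottom_row)
            time.sleep(0.03)

    threading.Thread(target=read_video, daemon=True).start()
    threading.Thread(target=track_and_estimate, daemon=True).start()
    time.sleep(5)          # the "Detection" and "Show Frames" loops would run alongside these
    running = False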
4. Experimental Results

Tests were conducted in which the distance of an object was estimated three times, at actual distances of about 1.2 m, 1.8 m, and 2.1 m.

The first test was fairly accurate, with the object detected at 1.298 m (actual distance around 1.2 m). In the second test, the object was detected 1.885 m away from the camera (actual distance around 1.8 m). In the last test, the object was detected 2.689 m away from the camera (actual distance around 2.1 m).

Inaccurate measurement of the camera height, and especially of the camera tilt angle, causes the distances of objects located further away from the camera to be falsely estimated.

5. Comparison

There are not many publicly available papers regarding distance estimation using a monocular camera. There are four widely used methods of distance estimation:

Using Triangle Similarity for Object/Marker to Camera Distance

[Figure: Example]

This is one of the common methods for calculating an object's distance. Its drawbacks are as follows:
- the initial width or height of the object must be known, and
- the initial distance of the object must be known.

This method is not suitable for autonomous vehicles, since the initial distance of the object must be known beforehand, whereas the width or height of the object in the image can be obtained from object detection.

Using Object Size Information

The width of a vehicle in the image is proportional to the real width of the vehicle according to pinhole camera geometry. If the real width of a vehicle is known, the range d to the vehicle can be calculated as follows:

d = F_c · W_a / w_a

Where:
- d = object distance
- F_c = focal length of the camera
- W_a = real width of the vehicle
- w_a = width of the vehicle in the image

Applying this formula for range estimation without prior knowledge of the real vehicle width may introduce significant error, which can be as much as 30%. (source: hindawi)

This method has a few drawbacks:
- the real object width must be known, and
- the focal length of the camera must be known.

Using Position Information

[Figure: Illustration when the camera pitch angle is zero]
[Figure: Illustration when the camera pitch angle is not zero]

The distance between the bottom line of a vehicle and the horizon is inversely proportional to the range to the vehicle. When the camera pitch angle is negligibly small, the range d to the vehicle can be calculated as follows:

d = F_c · H_c / (y_b - y_h)

Where:
- d = object distance
- F_c = camera focal length
- H_c = camera height
- y_b = vertical coordinate of the vehicle bottom
- y_h = vertical coordinate of the horizon

When the camera pitch angle θ is considerably large, the range has to be calculated as follows:

d = (1 / cos²θ) · F_c · H_c / (y_b - y_h) - H_c · tan(θ)

Where:
- θ = camera pitch angle

When θ is not zero, the horizon no longer passes through the center of the image, and its position has to be determined. Small variations in the horizon position may result in large range errors. In a highway traffic environment, where the horizon varies within a small range, the range can be calculated with a fixed horizon determined by camera calibration. (source: hindawi)

In an urban traffic environment, where the horizon can vary considerably due to vehicle motion and road inclination, the horizon should be located at run-time. It can be determined by analyzing lane markings. However, this approach cannot be used when the road inclination varies continuously or lane markings are not visible. (source: hindawi)

This method has the following drawbacks:
- horizon variation may cause errors, and
- when the horizon is determined by analyzing lane markings, lane markings are required, and road inclination may make the calculation inaccurate.

Virtual Horizon Estimation

If the real width W_a of a vehicle is known, and both the size and position of the vehicle in the image are given, the vertical coordinate y_h of the horizon can be determined as follows:

y_h = y_b - H_c · w_a / W_a

Where:
- H_c = camera height

W_a can be represented as W_a = W_a_avg + ΔW_a, where W_a_avg is the average real width of vehicles and ΔW_a is the difference between the real width of a particular vehicle and that average. If sufficiently many vehicles are detected, ΔW_a converges to zero and can be ignored. As a result, the horizon can be determined only from the position and width of the detected vehicles, using a fixed average real width.

When several vehicles are detected in the object detection stage, the average horizon y_h_avg can be determined from the average of the vehicle positions and the average of the vehicle widths, with a fixed real width, as follows:

y_h_avg ≈ y_b_avg - H_c · w_a_avg / W_a_avg

Virtual horizons can always be located as long as vehicles are detected in the image, since the method estimates the horizon only from the size and position of vehicles in the image. Small variations in the horizon position may result in large range errors. Virtual horizons are always located above the bottom lines of the detected vehicles, since they are estimated from the vehicle positions, so the range error is bounded. (source: hindawi)

The estimated horizon position can fluctuate due to false detections and an insufficient number of detections, as well as pitch motion of the subject vehicle such as vibration and acceleration. Fluctuation of the horizon due to pitch motion of the subject vehicle can be ignored, because both the vehicle position and the horizon in the image are influenced by the pitch motion at the same time. The range error due to the pitch angle itself is negligible if the pitch angle is small, as described above. However, fluctuation of the horizon due to false detections and an insufficient number of detections needs to be removed. As the road inclination changes slowly in most cases, this fluctuation can be reduced by taking the previously estimated horizon into account. (source: hindawi)

The virtual horizon at image frame t can be estimated as follows:

y_h(t) = γ · y_h_avg(t) + (1 - γ) · y_h(t-1),  0 < γ ≤ 1

Where:
- y_h(t) = virtual horizon at frame t
- y_h(t-1) = virtual horizon at frame t-1
- y_h_avg(t) = average horizon at frame t, obtained from the previous formula
- γ = an experimentally determined constant

When t = 0, a default horizon position is used for y_h(t-1). Once the virtual horizon is determined, the range d can be calculated from the vehicle position with the virtual horizon, using d = F_c · H_c / (y_b - y_h).

False detections can be reduced by restricting the width of detected vehicles in the object detection stage. Applying the virtual horizon y_h(t-1) estimated at the previous image frame to y_h = y_b - H_c · w_a / W_a, the min/max width of a vehicle at position y_b in the image can be restricted as follows:

((y_b - y_h(t-1)) / H_c) · W'_a,min ≤ w_a ≤ ((y_b - y_h(t-1)) / H_c) · W'_a,max

Where:
- W'_a,min = minimum real vehicle width
- W'_a,max = maximum real vehicle width

A detected vehicle whose width in the image is outside these bounds should be regarded as a false detection. (source: hindawi)

This method has the following drawbacks:
- if only a few objects are detected, the virtual horizon may become inaccurate, and
- the real width of the object must be known in order to make an accurate distance estimate.
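For reference, the range and virtual-horizon relations quoted above can be restated compactly as code. This is only a sketch of the formulas as they are described in the cited Hindawi paper, with assumed variable names and an assumed placeholder value for γ; it is not an implementation of that paper and is not part of the proposed method.

    import math

    def range_from_width(f_c, W_a, w_a):
        """Range from object size: d = F_c * W_a / w_a."""
        return f_c * W_a / w_a

    def range_from_position(f_c, H_c, y_b, y_h, pitch_rad=0.0):
        """Range from position; the extra terms correct for a non-negligible pitch angle."""
        d = f_c * H_c / (y_b - y_h)
        if pitch_rad:
            d = d / math.cos(pitch_rad) ** 2 - H_c * math.tan(pitch_rad)
        return d

    def virtual_horizon(y_b, w_a, H_c, W_a_avg):
        """Horizon row implied by one detection: y_h = y_b - H_c * w_a / W_a."""
        return y_b - H_c * w_a / W_a_avg

    def smoothed_horizon(avg_horizon_t, horizon_prev, gamma=0.5):
        """Temporal smoothing: y_h(t) = gamma * y_h_avg(t) + (1 - gamma) * y_h(t-1).
        gamma is experimentally determined; 0.5 is only a placeholder."""
        return gamma * avg_horizon_t + (1.0 - gamma) * horizon_prev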
6. Conclusion

Compared with the other distance estimation methods, the proposed method has some advantages:
- the real width of the object does not need to be known in advance, and
- no lane markings or other reference points are needed to make an accurate estimate.

There are, however, some disadvantages to the proposed method:
- the camera height must be known,
- the camera tilt angle must be known,
- the camera specification (DFOV and/or VFOV) must be known, and
- it only works on level ground.

This method can be used in many applications, such as:
- collision warning (both rear and forward),
- 2D object mapping, and
- other applications that require distance estimation.

Sources:
Wikipedia. (n.d.). Artificial Intelligence.
Wikipedia. (n.d.). Deep Learning.
Wikipedia. (n.d.). Machine Learning.
Wikipedia. (n.d.). Convolutional Neural Network.
Aashay. (2017). YOLO - 'You only look once' for Object Detection explained. Medium.
Redmon, Joseph. (n.d.). YOLO: Real-Time Object Detection.
A.B. (n.d.). Darknet.
Redmon, Joseph. Divvala, Santosh. Girshick, Ross. Farhadi, Ali. (2016). You Only Look Once: Unified, Real-Time Object Detection.
Hui, Jonathan. (2018). Object detection: speed and accuracy comparison (Faster R-CNN, R-FCN, SSD, FPN, RetinaNet and YOLOv3). Medium.
Oliveira, Alessandro. (n.d.). YOLO DNNs.
Wikipedia. (n.d.). Object Detection.
Mallick, Satya. (2017). Object Tracking using OpenCV.
Wikipedia. (n.d.). Distance.
Wikipedia. (n.d.). Estimation.
Rosebrock, Adrian. (2015). Find distance from camera to object/marker using Python and OpenCV.
Park, Ki-Yeong. Hwang, Sun-Young. (2014). Robust Range Estimation with a Monocular Camera for Vision-Based Forward Collision Warning System.