
VISUAL LEARNING WITH WEAKLY LABELED VIDEO

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Kevin Tang

May 2015

© 2015 by Kevin Dechau Tang. All Rights Reserved. Re-distributed by Stanford University under license with the author. This dissertation is online at:


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Fei-Fei Li, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Daphne Koller, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Silvio Savarese

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

With the rising popularity of Internet photo and video sharing sites like Flickr, Instagram, and YouTube, a large amount of visual data is uploaded to the Internet on a daily basis. In addition to pixels, these images and videos are often tagged with the visual concepts and activities they contain, leading to a natural source of weakly labeled visual data, in which we are not told where within the images and videos these concepts or activities occur. By developing methods that can effectively utilize weakly labeled visual data for tasks that have traditionally required clean data with laborious annotations, we can take advantage of the abundance and diversity of visual data on the Internet.

In the first part of this thesis, we consider the problem of complex event recognition in weakly labeled video. In weakly labeled videos, it is often the case that the complex events we are interested in are not temporally localized, and the videos contain varying amounts of contextual or unrelated segments. In addition, the complex events themselves often vary significantly in the actions they consist of, as well as the sequences in which they occur. To address this, we formulate a flexible, discriminative model that is able to learn the latent temporal structure of complex events from weakly labeled videos, resulting in a better understanding of the complex events and improved recognition performance.
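As a rough illustration of what learning latent temporal structure can look like (a minimal sketch, not the specific model developed in this thesis), the following Python snippet scores a video by assigning its segments to an ordered chain of latent temporal states via dynamic programming. The number of states, feature dimensions, and function names are hypothetical.

import numpy as np

def score_video(segment_feats, state_weights):
    """Score a video under a chain of K latent temporal states.

    segment_feats: (T, D) array of per-segment features.
    state_weights: (K, D) array, one linear scorer per latent state.
    Returns the best score over monotone assignments of segments to
    states (segments must progress through states in order).
    """
    T, _ = segment_feats.shape
    K = state_weights.shape[0]
    seg_scores = segment_feats @ state_weights.T  # (T, K) per-segment, per-state scores
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = seg_scores[0, 0]                   # assignment must start in state 0
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]                   # remain in the same latent state
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf  # move to the next state
            dp[t, k] = seg_scores[t, k] + max(stay, advance)
    return dp[T - 1].max()

In a discriminative setting, the state weights would then be updated so that videos containing the event outscore those that do not under their best latent assignments.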

The second part of this thesis tackles the problem of object localization in weakly labeled video. Towards this end, we focus on several aspects of the object localization problem. First, using object detectors trained from images, we formulate a method for adapting these detectors to work well in video data by discovering and adapting them to examples automatically extracted from weakly labeled videos. Then, we explore separately the use of large amounts of negative and positive weakly labeled visual data for object localization. With only negative weakly labeled videos that do not contain a particular visual concept, we show how a very simple metric allows us to perform distributed object segmentation in potentially noisy, weakly labeled videos. With only positive weakly labeled images and videos that share a common visual concept, we show how we can leverage correspondence information between images and videos to identify and detect the common object.
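To make the negative-only idea concrete, here is an illustrative sketch (not necessarily the metric used in this thesis) of a distance-to-negatives score: a candidate region that looks unlike everything observed in negative videos, which are known not to contain the concept, is more likely to belong to the concept. The descriptor names are hypothetical.

import numpy as np

def negative_distance_score(region_feat, negative_feats):
    """Score a candidate region by its distance to a pool of negatives.

    region_feat: (D,) descriptor of a candidate region.
    negative_feats: (N, D) descriptors pooled from negative-only videos.
    Returns the distance to the nearest negative; a larger value suggests
    the region is more likely to contain the concept of interest.
    """
    dists = np.linalg.norm(negative_feats - region_feat[None, :], axis=1)
    return dists.min()

Because each region is scored independently against the negative pool, such a metric lends itself naturally to distributed computation over many videos.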

Lastly, we consider the problem of learning temporal embeddings from weakly labeled video. Using the implicit weak label that videos are sequences of temporally and semantically coherent images, we learn temporal embeddings for frames of video by associating frames with the temporal context that they appear in. These embeddings are able to capture semantic context, which results in better performance for a wide variety of standard tasks in video.
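One simple way to realize "associating frames with their temporal context" is a ranking objective that pulls a frame's embedding toward embeddings of its temporal neighbors and pushes it away from frames sampled from other videos. The sketch below is only an assumed, generic formulation (the margin value and names are hypothetical), not the exact objective developed in this thesis.

import numpy as np

def triplet_loss(anchor, context, negative, margin=1.0):
    """Hinge loss encouraging a frame embedding to be closer to its
    temporal context than to an unrelated frame.

    anchor, context, negative: (D,) embedding vectors for the frame,
    a temporally neighboring frame, and a frame from another video.
    """
    pos = np.linalg.norm(anchor - context)
    neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + pos - neg)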

