


Abstract

Segmenting all the objects in an arbitrary image and recognizing their types has long been a Holy Grail of computer vision. Yet, even though this task is simple and instinctive for a human, computers only manage it in very restricted cases (specific types of objects, objects without background, etc.). In this paper, we use context information to help the recognition of generic objects.

Introduction

[pic]

What do you see in the two images above? This is an easy question for a human: even in such blurred images you can still recognize a street, a building and a car in image a), and the same street and building with a pedestrian in the middle of the road in image b). At least, that is what you will recognize if you have not read Torralba’s article [3], where these images come from. In reality, the pedestrian in the second image is the car from image a), translated and simply rotated by 90 degrees! This example gives an idea of the importance of context in recognizing objects.

Indeed, most current object recognition systems [1, 8, etc.] focus only on the local properties of objects. However, humans usually recognize an object not only by its intrinsic properties but also by the surrounding information. For example, when entering a very dark office, we will probably guess that the cubic object in front of us is more likely a computer than a microwave! If we know which type of room we are in, we can restrict the set of possible objects. Furthermore, if we manage to recognize some objects (table, floor, wall, street, building, etc.), then we have an idea of the size, shape and relative location of the other objects in the image (e.g. computers are more likely to be located above a table than under it). So, in this project, the idea is to use local features to recognize some objects and then use all the context information we have acquired (objects that have already been recognized, type of room, etc.) to help us recognize the other objects.

Contrary to the Scale Invariant Feature Transform (SIFT), we try to recognize types of objects (and not specific objects). In contrast to many approaches, objects are learned in their context (and not in front of a uniform screen). And contrary to Torralba’s approach [2-5], we are not trying to recognize which type of room we are in. Here, we assume this information is already known (from a map, for example) and focus on the recognition and localization of the objects themselves.

More precisely, in this project we use SIFT-like features [8] (modified to be less specific and concatenated with other features) to recognize some objects, and then use a probabilistic model (learned by boosting) to improve our recognition rate by exploiting the context information.

We have taken 121 indoor images of various places (7 categories) containing 20 types of objects. We manually labeled those images to record which objects each image contains and their positions (a bounding box around each one). All the images were taken at the same height (84 cm).

We are starting to get some results with the SIFT feature algorithm (without context yet). However, those results are not very accurate. The Matlab implementation is simply too slow (about 6 minutes for a 1280*960 image!) and gives worse results than the C++ implementation. Even the C++ implementation is far from perfect. For specific objects (my computer, for instance) learned on a nearly uniform background, the results are really good (the object is recognized even if its orientation changes or if some of its parts are occluded). But the program is not able to recognize generic classes of objects in all cases. For instance, if the algorithm has learned my computer, it will be able to recognize only my own computer and not any other computer, not even itself if I change its wallpaper! We are currently trying to make those features less specific.

Concurrently, we are starting to implement the probabilistic model which should improve our recognition rate by adding the contextual information.

Literature Review:

Most previous object recognition techniques focus on the local properties of objects. B. Moghaddam and A. Pentland mapped training images into a higher-dimensional space, then used Principal Component Analysis (PCA) to obtain the most representative subspace and performed linear discriminant analysis [13]. Other object recognition systems use local feature properties. David Lowe’s SIFT algorithm [1] uses local feature properties to form “keys”, and then matches keys across images in order to find the closest feature points between them. However, those approaches rely on local appearance alone. The images in our daily life usually contain background information; we refer to this as contextual information in our project.

Recently, more and more work has focused on utilizing contextual information in images to perform object recognition. A. Torralba and P. Sinha use scene detection as a first step, and then incorporate the scene-object correlation as a prior for object recognition [5]. P. Perona uses local features (image blocks) as a training set and performs PCA to obtain the most representative subspace for the important features in the images [9]. K. Murphy et al. [2] use Gabor filters as preprocessing to extract informative properties of the objects, and then use statistical boosting to learn the contextual information.

All those techniques improve the recognition rate. However, we believe that by combining a state-of-the-art local feature detector with contextual information about the scene, we can improve the recognition rate even further. Thus, we have decided to use a modified SIFT together with a boosting learning model to achieve our goals.

Our approach:

The following diagram shows the learning model of our object recognition system. We first use an object recognition algorithm to segment the different objects from the image. At the end of this step, we get the location and number of each type of object and the confidence (probability given the observation) we have in those results. Then, we use contextual information (i.e. spatial and inter-object relationships and illumination) to increase or decrease those confidence levels. Once we have updated our beliefs, we can decide which objects were really detected (high confidence) and which ones are still not reliable. We can then use the detected objects to update our prior over the contextual information in the scene, and perform another loop with this new context information. We will now explain this algorithm in detail.

[pic]
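
A minimal Matlab sketch of this detect-and-update loop follows. The helper names (context_prior, detect_objects, update_with_context, update_prior) and the 0.8 confidence threshold are hypothetical placeholders of ours, not part of an existing library.

% Sketch of the iterative detect-and-update loop described above.
function detections = recognize_with_context(img, room_type, n_iters)
    prior = context_prior(room_type);                  % prior over objects given the room type
    for it = 1:n_iters
        [cand, conf] = detect_objects(img, prior);     % local-feature detection + confidence
        conf = update_with_context(cand, conf, prior); % re-weight with contextual cues
        reliable = conf > 0.8;                         % assumed confidence threshold
        prior = update_prior(prior, cand(reliable));   % detected objects refine the context
    end
    detections = cand(reliable);
end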

Feature recognition gives us a certain degree of confidence in the likelihood of a specific object. To utilize this information, we first consider the density of matching points appearing on the input and training images. We define a 2D point set Sk,j as the set of matching points on the input images (the index k denotes the image and the index j the object). The set Sk,j provides information on the probability that a specific region of the input image contains the object of interest. Under our assumption, the density of matching points decreases dramatically outside the real object location on the input test images. This gives us the first piece of evidence of how likely the object is to appear in a specific region. In addition, we can use the shape information of each object stored in our database to update those probabilities. For objects with strong geometrical configurations (e.g. computers, which are usually composed of edges), we can use Gabor filters to perform further detection.
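
As an illustration, a candidate region can be scored from the density of matching points roughly as in the sketch below. The box format [x1 y1 x2 y2] follows our labeling convention; comparing against the background match rate is an assumption for this sketch, not the final model.

% S: N-by-2 array of matched point coordinates (one row per match),
% box = [x1 y1 x2 y2] candidate region, img_size = [width height].
function score = match_density_score(S, box, img_size)
    in_box = S(:,1) >= box(1) & S(:,1) <= box(3) & ...
             S(:,2) >= box(2) & S(:,2) <= box(4);
    area_box = (box(3) - box(1)) * (box(4) - box(2));
    area_out = max(img_size(1) * img_size(2) - area_box, 1);
    density_in  = sum(in_box)  / area_box;                   % matches per pixel inside the box
    density_out = sum(~in_box) / area_out;                   % matches per pixel outside the box
    score = density_in / (density_in + density_out + eps);   % score in [0, 1]
end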

Aside from the intrinsic properties of objects, we also utilize the contextual information in the scene to perform object detection. We classify the contextual information in the images into spatial, inter-object and illumination aspects. For the spatial information, we focus on the coordinates and scale of the objects. Since our camera is mounted on a tripod at a fixed height with the optical axis parallel to the horizon, we can assume that there are strong correlations between the object centroids in the image plane and the object types. For instance, lights usually appear in the upper portion of the image, computers are almost always in the middle portion, and chairs are usually in the bottom portion of the images. Thus, we can incorporate this information into our first step of detection. We assume that the distribution of object centroids in the image plane follows a normal distribution. Furthermore, the scale of an object is also a strong clue for further recognition. We assume that objects of the same category are usually about the same size in the real world. For instance, the physical dimensions of laptops are usually about the same no matter where they are placed. However, this requires more information about the depth of the objects of interest in the images (the scale of an object in the image plane varies with depth). From the training set we can then estimate the following density functions. The probability that an object V with features θ will be detected can be written as P(V|θ). If we further incorporate the coordinates of the object centroid ω, we can also compute P(V|ω). Then we can add the scale of the object φ, if it is appropriately normalized, and compute P(V|φ). Assuming these cues are conditionally independent, we come to the following formula:

P(V|θ,ω,φ) = P(V|θ) P(V|ω) P(V|φ).
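
For instance, P(V|ω) under the normal-distribution assumption can be estimated from the labeled training boxes roughly as follows. This is only a sketch: mvnpdf comes from the Matlab Statistics Toolbox, and the variable names are ours.

% boxes: M-by-4 array [x1 y1 x2 y2] of training boxes for one object type.
% test_centroid: 1-by-2 centroid [x y] of a candidate detection.
centers = [(boxes(:,1) + boxes(:,3)) / 2, (boxes(:,2) + boxes(:,4)) / 2];
mu    = mean(centers);                       % estimated mean centroid location
Sigma = cov(centers);                        % estimated centroid covariance
p_omega = mvnpdf(test_centroid, mu, Sigma);  % density of P(V | omega) at the test centroid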

Spatial relationships also provide a strong clue for recognizing objects. For instance, the mouse and the keyboard usually appear near the computer. We define the probability of finding an object V at a normalized distance d from another object U as P(V|U,d). The spatial arrangement between objects is also incorporated in this step. For example, computers are more likely to appear above tables than under them. We define the probability of object V’s spatial arrangement parameters Ω relative to object U as P(V|U,Ω).
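
A rough way to estimate P(V|U,d) from the training set is a simple histogram over the normalized centroid distances observed between the two object types; the binning below is an arbitrary choice of ours.

% dists: vector of normalized distances between V and U centroids collected
% from the labeled training images.
edges  = 0:0.1:1.5;                        % assumed binning of the normalized distance
counts = histc(dists, edges);
p_given_d = counts / max(sum(counts), 1);  % empirical distribution over distance bins
% At test time, P(V | U, d) is read from the bin that contains the observed d.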

We further define the illumination condition parameters as δ (this includes the absolute illumination and contrast factors); P(V|δ) can also be derived from our training set. Combining all the terms, we obtain the following formula:

P(V|θ,ω,φ,U,d,Ω,δ) = P(V|θ) P(V|ω) P(V|φ) P(V|U,d) P(V|U,Ω) P(V|δ)

We can derive the above density function parameters by maximum likelihood estimation.

In the training step, we use a statistical boosting approach to combine the weak classifiers described in the previous paragraphs, i.e. the spatial, inter-object and illumination probabilistic models. The weak learners are the different probability distributions corresponding to the different parameter sets. These weak learners each provide a different piece of contextual information about the objects, learned from our training database.
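
As a sketch, once the boosting weights alpha have been learned (by a standard AdaBoost-style loop on the labeled database, not shown here), the weak contextual classifiers can be combined as a weighted vote. The variable names below mirror the factors defined above and are ours.

% Each weak learner outputs a score in [0, 1] for "object V is present";
% alpha is the 1-by-6 vector of boosting weights.
scores  = [p_theta, p_omega, p_phi, p_dist, p_arr, p_illum];  % weak-learner outputs
margins = 2 * scores - 1;                  % map [0, 1] scores to [-1, 1]
H = sum(alpha .* margins);                 % weighted vote of the weak learners
detected = H > 0;                          % strong-classifier decision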

By combining the feature detection and the probability models built above, we can construct the whole framework of our object recognition program, which utilizes the contextual information in the images.

Current results

Creation of the database

We used a digital camera to take 121 photos of the interior of the Gates Building. The photos have a size of 1280x960 pixels and were stored in JPEG format. All photos were taken with the camera mounted on a tripod at a height of 84 cm. We used various depths and degrees of pan (with fixed tilt) between the camera and the objects, in an attempt to identify generic objects by their context despite variations in scale or orientation.

All photos were taken in different environments inside the Gates Building: offices, libraries, conference rooms, clusters, classrooms, open spaces and kitchens. Among the objects found in these places, those we will try to recognize are: computers, tables, black/white boards, chairs, books, notes, shelves, microwaves, refrigerators, coffee makers, projectors, laptops, telephones, windows, cans, lights, humans, posters, bins and clocks. We expect some of these objects to be easier to identify than others. For instance, a computer should be identified more easily than a phone because of its particular shape and relative position to other objects, whereas phones are smaller and more variable in shape and color.

The next step was to label all the places and objects to be identified in each photo and to enter all these data into a database. The structure we chose is:

image (#of the image) .jpg

# of objects in the image

type of room

[object label] coorx-leftup coory-leftup coorx-rightdown coory-rightdown

[object label] coorx-leftup coory-leftup coorx-rightdown coory-rightdown



For each image we record (1) the number of objects in it (this makes it easier to manipulate the database), (2) the place where the image was taken, and (3) a list of the different objects in the image, each identified by a letter and four numbers giving the x and y coordinates of the upper-left and lower-right corners of the bounding box that best fits the object. An example follows:

image (7).jpg

7

office

c 310 301 599 577

t 1 580 1194 702

ch 517 563 931 842

n 142 558 364 592

n 987 552 1127 606

l 3 359 203 602

w 0 0 895 528

We wrote two MATLAB programs to create and check the database. The first one was used to fill the database with the box coordinates, and the second one was used to display those boxes in order to check that they were reliable.

[pic]

An example of labeling
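
A short Matlab sketch of reading one entry of the label format shown above (the struct field names and the file name are ours):

% Reads one entry of the label file described above.
fid = fopen('labels.txt', 'r');
entry.image     = fgetl(fid);                  % e.g. "image (7).jpg"
entry.n_objects = str2double(fgetl(fid));      % number of labeled objects
entry.room      = fgetl(fid);                  % e.g. "office"
entry.objects   = cell(entry.n_objects, 1);
for i = 1:entry.n_objects
    parts = strsplit(strtrim(fgetl(fid)));     % "[label] x1 y1 x2 y2"
    entry.objects{i}.label = parts{1};
    entry.objects{i}.box   = str2double(parts(2:5));  % [x1 y1 x2 y2]
end
fclose(fid);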

The Matlab implementation of the SIFT program.

As SIFT features are the state of the art in feature-based recognition, we decided to try to use them in the first part of our project. Therefore, the first thing we tried was the IBM Matlab implementation of the SIFT program. Routines for scale-peak detection were already implemented in the program, so we only added new routines for feature description and matching.

For the feature description, we implemented a histogram of the gradient orientations of the pixels surrounding each “scale peak”. To be less sensitive to rotation, this histogram is based on overlapping intervals and is normalized and shifted so that its biggest peak falls in the first interval. After some tests, we decided to use a window size of 9*9 (in the current scale coordinates), 20 intervals and an overlapping factor of 1.4 (which means that the size of each interval is 1.4*360/20 degrees). The orientations are computed at each level of the pyramid. All those values are far from optimal; they were selected by experiment under the assumption that they are independent of each other.
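
A sketch of this overlapping histogram follows, where g_mag and g_ori are the gradient magnitudes and orientations (in degrees) in the 9*9 window around a scale peak; the bin count, overlap factor and peak alignment are the values quoted above, and the variable names are ours.

n_bins  = 20;
overlap = 1.4;
half_width = overlap * 360 / n_bins / 2;         % half-size of each overlapping interval
centers = (0:n_bins-1) * 360 / n_bins;           % interval centers
gm = g_mag(:); go = g_ori(:);
h = zeros(1, n_bins);
for b = 1:n_bins
    d = mod(go - centers(b) + 180, 360) - 180;   % circular distance to the interval center
    h(b) = sum(gm(abs(d) <= half_width));        % magnitude-weighted count
end
h = h / (sum(h) + eps);                          % normalization
[~, peak] = max(h);
h = circshift(h, [0, 1 - peak]);                 % biggest peak moved to the first interval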

For the matching of two images (the learned object and an image where the object could be), we simply match the closest features between the two images. To do that, we take each scale peak of the first image and compute its Euclidean distance (in feature space) to all the scale peaks of the second image. We declare the closest feature a “match” if it is n% closer than the second closest feature (we currently use the same test in the C++ program). After some tests, 70% seems to be a good value (with 60% we do not get enough matches, whereas with 80% the added matches are too unreliable). We could use other types of threshold (a fixed threshold, etc.), but the real improvement will come when we take into account the spatial relations between features of the same object.
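
A Matlab sketch of this matching rule, where F1 and F2 are the descriptor matrices of the two images (one row per scale peak) and 0.7 is the 70% threshold mentioned above:

ratio = 0.7;
matches = zeros(0, 2);
for i = 1:size(F1, 1)
    d = sqrt(sum(bsxfun(@minus, F2, F1(i,:)).^2, 2));  % distances to all features of image 2
    [ds, idx] = sort(d);
    if ds(1) < ratio * ds(2)                           % nearest is n% closer than second nearest
        matches(end+1, :) = [i, idx(1)];               %#ok<AGROW> matched pair of indices
    end
end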

However, the results of this improved Matlab algorithm were far worse than those of Lowe’s library implementation (at least 5 times fewer matches, see the images on the next page, and about 4 times slower). A 916*675 pixel image took 39 seconds to analyze (31 if we only extract the scale peaks and do not use sub-pixel/scale accuracy), versus about 11 seconds with the Lowe program. Both programs use a large amount of memory (300 MB for a 1280*960 image) and thus become impossibly slow for big images (6 minutes for a 1280*960 image). For the final version of the program we will either use images smaller than 916*675 pixels or reduce the maximum level of the pyramid (for instance, the pyramid currently starts at level -1, which means that the image is enlarged once; we could remove this step).

[pic]

On the left, the Lowe C++ program; on the right, the Matlab implementation.

The Lowe program and its limitations

First and foremost, we all thank Prof. David G. Lowe, who agreed last Thursday (02/05/04) to give us the source code of his SIFT algorithm (under a research-purpose license). The source code is well commented, and the algorithm itself, and its results, are really inventive and impressive. SIFT is a great tool, and its applications are numerous. But, as we will see, those applications are based on the idea that we learn an object with the algorithm and then expect the algorithm to find it in any image, whatever its orientation and scale. For instance, it can be used to help disabled people with a robotic system: first, we learn some common objects that disabled people are likely to need, and then the vision system can recognize those objects and give instructions to the robotic system to fetch them. But SIFT was not designed to memorize categories of objects, and that is the use we require for our project.

Since Thursday, we have ported the code to Windows (Visual C++), added to it the matching algorithm (presented above) and started to run some tests on our training set of images.

As you can see in the images on the previous page, when we learn a specific object (like my laptop, with its current wallpaper, on a background without too much texture), we can easily recognize it in a whole image even if it has been rotated and is partially occluded. But if we present the algorithm with a different laptop, or even if we simply change the wallpaper of the laptop we have learned, the SIFT algorithm currently fails.

[pic]

SIFT algorithm in a case that does not work so well: two different objects of the same category.

There are two reasons why the SIFT program could fail when we ask it to recognize categories of objects: either the scale peaks are different from one object to another of the same category, or the local descriptor of each scale peak is too specific. In the first case, it means that using the scale peaks is more harmful than useful and that attaching our local features to random points would probably perform better! In this case we should probably use a simpler corner detector such as Harris and Stephens. In the second case, it means that we somehow have to make those descriptors less specific by relaxing some constraints or adding other features (color, etc.).

Actually, the second case makes sense: the richness of SIFT features (128 features per scale peak, which memorize the spatial organization of the gradient around each peak) could well be responsible for the absence of matches between objects of the same category. Indeed, this is one of the qualities of the SIFT algorithm when we want to recognize each object specifically (face recognition, etc.). However, if the main problem is the variability of the scale peaks themselves between objects of the same category, then using SIFT will probably degrade the performance of our system.

Our current results tend to confirm that this is the case. However, we cannot yet be totally categorical: we have only tried the algorithm on 4 categories of objects and on 17 images. Furthermore, the SIFT program works with 18 different thresholds, and we have not yet been able to test them all. But the results so far tend to lead to the conclusion that the scale peaks are not invariant from one object to another object of the same category.

The scale peak detector for two laptops: each peak is symbolized by an arrow representing its orientation and scale. For the last 2 images we have lowered the edge/corner threshold.

Let us illustrate this point with an example of two microwaves from the first and second floors of the Gates Building. In fact, those microwaves are so similar that at first glance we did not realize they were different ones! But the algorithm still tells them apart all too well!

The matching results for the same microwave (left) and for two types of microwave (right); in the right case we had to raise the matching threshold to get any matches.

Here we can see that the scale peak locations, orientations and scales vary more between two types of microwave than between two different orientations of the same microwave.

The scale peaks detected for two different microwaves (left) or for a different orientation of the same microwave (right).

Next Steps

As described in the previous sections, the SIFT feature algorithm works well for the recognition of a particular object. However, it is too specific for the recognition of object categories. Several approaches can be taken to address this. Among others, we would like to highlight the following:

a.- The use of information stemming from the surroundings of the object. For example, we know that computers are usually located on tables, and that there will normally be more light above them than below. This information coming from the context can be used to help the recognition of the computer.

b.- Taking into account the feature distribution (i.e. the spatial relationship between the keypoints of an object obtained with the SIFT method). This follows the idea of generic features from Pietro Perona (see references [8], [9], [10] and [11]). From any selected keypoint, which is taken as the origin, we draw vectors toward all the other keypoints. These vectors represent the spatial distribution of features of a particular object. By comparing this distribution of features across objects of the same type, we will try to find the keypoints that are most important to characterize a generic type of object (see the figure below and the short sketch after this list).

[pic]

We compute the spatial distribution on the set of computer keypoints, to try to find the most distinctive SIFT features.

c.- The use of statistical boosting or PCA (principal component analysis based on singular value decomposition) to select the most relevant features for each object. The number of times a particular feature is found and the weight it is given will allow us to identify those that are most discriminative for each object.

d.- The use of an intensity or spatial normalization of the image (this could explain the variability of the scale peaks).

e.- The use of other features such as color, texture, luminosity, shape, moments, segmentation, spatial organization and spatial filters (crosses, lines, etc.). The use of Gabor and steerable filters (as implemented in OpenCV) will also be considered.

f.- The use of a more restrictive threshold value in the SIFT algorithm, to avoid false positive recognitions. With the new threshold value, the SIFT algorithm will obtain fewer matches, but they will be more accurate.

g.- Finally, we would like to run our program on a robot, to try to find objects. The robot will have to maximize the probability of finding a particular object in a particular place. We are thinking of augmenting our database with more places (corridors...) or even taking photos outdoors. In parallel, Florian is developing a motion planning algorithm that could be used to find the best path to reach the queried object as fast as possible.

[pic]

Diagram showing the next steps.
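
A small Matlab sketch of the vector representation proposed in item b above, where K is the N-by-2 array of keypoint coordinates of one object instance; the choice of origin and the scale normalization are our own assumptions for this sketch.

origin = K(1, :);                                   % any keypoint can serve as origin
V = bsxfun(@minus, K(2:end, :), origin);            % vectors toward all the other keypoints
V = V / max(sqrt(sum(V.^2, 2)));                    % normalize by the largest vector length
% Comparing the distribution of V across several instances of the same object
% type should reveal which keypoints are the most stable for the category.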

Conclusion

The SIFT features are currently too specific to represent generic types of objects. So, even though we are very happy to have received Lowe’s implementation of the SIFT algorithm, we wonder more and more whether it can really be useful for our project. The first thing we are going to do this week is therefore to run more tests to clarify this point. Depending on the results, we will either improve the SIFT features or simply choose another corner detector to anchor our features. The next step will probably be the incorporation of the probabilistic model, which is the main part of the algorithm.

References

[1] D. G. Lowe. “Distinctive Image Features from Scale-Invariant Keypoints”. Draft submitted for publication, Jun 2003.

[2] A. Torralba, K. P. Murphy, W. T. Freeman, M. A. Rubin. “Context-based vision system for place and object recognition”.

[3] A. Torralba. “Contextual Priming for Object Detection”.

[4] K. Murphy, A. Torralba, W. T. Freeman. “Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes”. Draft.

[5] A. Torralba, P. Sinha. “Recognizing Indoor Scenes”. July 2001.

[6] R. Dugad, U. B. Desai. “A Tutorial on Hidden Markov Models”. Technical Report No. SPANN-96.1, May 1996.

[7] W. T. Freeman, E. H. Adelson. “The Design and Use of Steerable Filters”. IEEE Trans. Patt. Anal. And Machine Intell., Vol. 13, No. 9, pp. 891-906, Sept.,1991.

[8] R. Fergus, P. Perona, A. Zisserman. “Object Class Recognition by Unsupervised Scale-Invariant Learning”.

[9] M. Weber, M. Welling, P. Perona. “Unsupervised Learning of Models for Recognition”.

[10] M. Weber, M. Welling, P. Perona. “Unsupervised Learning of Models for Visual Object Class Recognition”.

[11] M. Weber, M. Welling, P. Perona. “Towards Automatic Discovery of Object Categories”.

[12] E. Trucco, A. Verri. “Introductory Techniques for 3-D Computer Vision”. Prentice Hall Inc., 1998.

[13] B. Moghaddam, A. Pentland. “Probabilistic Visual Learning for Object Representation”. IEEE Trans. Pattern Analysis and Machine Intelligence, 1997.
