Moving on from the previous lecture, we will now talk about how to describe entire images. So far we spoke about how to obtain descriptors for certain keypoints in images, of which there could be several hundreds in each image. How do you aggregate these descriptors to obtain a single descriptor for an image, and why do we need this? We talked about image panorama stitching, where you try to match keypoints; but you may now want to match images themselves, for example for image search and retrieval, or even to perform image classification, as we will see when we go through these slides. In this lecture we will focus on two specific kinds of descriptors: the bag of words descriptor, and an extension of it known as the VLAD descriptor. Once again, the lecture slides are based on Professor Yannis Avrithis's lectures at Inria.

So far we have spoken about obtaining keypoints from two different views of the same scene or object: you extract keypoints in both views, you get descriptors for those keypoints, you match them, and perhaps estimate the geometry of the configuration between them using methods such as RANSAC or the Hough transform, and so on. But the question we are asking here is: what if we want to match different instances of an object? The object is no longer the same. In the earlier slide it was the same object from different viewpoints; now the objects could be completely different. In this case it is a helicopter, but two completely different helicopters; they are not the same object. Then how do you match, or how do you search? Imagine a Google image search, where you want to search for helicopter-like images: you post an image and ask the system to retrieve similar images for you. How you do this is what we are going to talk about now.

The main observation here is that rigid transformations may not work anymore. So far, if you only wanted to match two views of the same object or scene taken from different perspectives, you could consider one to be a rigid transformation of the other: you could try to estimate affine parameters, rotation parameters, or translation parameters. But these are two completely different instances of an object, so it may not be a rigid transformation at all. Ideally, what we want to do now is to see if we can completely discard geometry itself, because the relationship is not going to be a rigid transformation. Can we discard the idea of geometry, try to match images in some other way beyond geometry, and then perhaps later, using other methods, bring certain elements of geometry loosely back into the mix?

Our first attempt at getting image-level descriptors is a method known as bag of words, an old method that has been used a lot for text; let's see how we use it for images. We start with a set of samples, a set of data points. Then you form a vocabulary from those data points. What do we mean by vocabulary? A set of common words that exist across images. And finally, based on the vocabulary, you get a histogram, which is simply a frequency count of words in a certain document; in our case, of visual words in a certain image. Let's describe that in more detail. If you had text information, your samples would simply be different sentences.
Your vocabulary would be the words that occur in these sentences, all the possible words occurring across your sentences. And finally, for each of these sentences, you would come up with a histogram of how many occurrences of each word in your vocabulary happened in that particular sentence. So for each of your sentences, you can look up your vocabulary or dictionary and count how many times each word in the vocabulary occurred in that sentence.

Let's translate that to images. Say you have three different images. You form a vocabulary. What does a vocabulary mean for images? For text it is perhaps simple: it is all the words that occur in your documents. But if you had images, what would your vocabulary be? The way we are going to define it is very simple. You take different parts of each of these images, group them together, and see which of them correspond to something common. For example, it could be that there are six kinds of visual words that occur in the first image, which all look similar; they may look similar in colour, in texture, or in whatever appearance representation you choose. Similarly, the second image has a set of visual words, and the third image has a set of visual words. Once again, we define visual words by extracting several parts from an image: if you have a set of images, you extract parts from all of those images, pool them in some way, and group them, saying "these parts seem to have similar properties, and I am going to group them into one particular word." We will see this a little more clearly over the next few slides. Once you have these visual words, similarly to text, you build a histogram of how many times each visual word occurred in a given image; that histogram is what is drawn here for each of the three images.

Let's see a more real-world example to understand this better. Before we go there, what could be the uses of bag of words? It could be used for image retrieval as well as image classification, as we just said. Why? Because it is a way to represent an entire image as a single vector. Let's see this with a more tangible example. Here we have a query image, and here is an image from a dataset. You could imagine that the dataset images are your entire repository of images, and the query image is like a Google image search, where you upload this image and ask the search engine to retrieve all images similar to it. For example, you could upload a cat image and ask the search engine to retrieve similar cat images. In this example, let's assume that the 15th image in your dataset contains an occurrence of a similar structure.

Now let's see what would happen with the earlier approach and how we would change things. Earlier, we would take keypoints in each of these images, compute descriptors around those keypoints, and do a pairwise matching between the descriptors in the two images. This is obviously time-consuming, because if you had many keypoints and many descriptors in each image, you would have to do a pairwise matching between every descriptor in the query image and every descriptor in the reference image.
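To make the cost of this brute-force baseline concrete, here is a minimal NumPy sketch of pairwise descriptor matching; the array shapes, variable names, and the function name are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def match_descriptors_bruteforce(desc_query, desc_ref):
    """Match every query descriptor to its nearest reference descriptor.

    desc_query: (M, D) array of descriptors from the query image
    desc_ref:   (N, D) array of descriptors from the reference image
    Returns, for each query descriptor, the index of the closest reference
    descriptor and its squared distance. The cost is O(M * N * D) per image
    pair, which is exactly what bag of words tries to avoid.
    """
    # Pairwise squared Euclidean distances, shape (M, N)
    diff = desc_query[:, None, :] - desc_ref[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)
    nearest = np.argmin(dists, axis=1)
    return nearest, dists[np.arange(len(nearest)), nearest]
```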
This can get computationally expensive. If you had another reference image, you would have to repeat the same process between the query image and that second reference image, and so on for all the reference images. So how do we resolve this problem?

What we say now is that if there are descriptors that are similar, descriptors that would match between the two images, then instead of matching them pairwise between the query image and the reference image, let's assume that similar descriptors will anyway correspond to similar representations in the representation space under a certain metric, say the Euclidean metric. So a group of visual words, or regions that are similar to each other, would have similar representations, and they would all group together in the representation space. Similarly, the regions marked in blue and the regions marked in red are similar among themselves, and they would match in the representation space; this is just to show how these representations match. How do we use this?

What we do now is come up with a common representation for all similar descriptors. How? We could simply take our entire repository of training images, pick visual words from them that are similar to each other, and obtain their mean in that space. That mean becomes the visual word corresponding to all of those similar regions in the training repository. You do this process offline, before retrieval starts: you construct your visual words offline and keep them stored and ready for when the search process actually begins. So these visual words are already stored. Given a query image, you take the descriptors corresponding to the keypoints in the query image and simply match them with the visual words, each of which is the mean of a group of similar regions in the repository images. You do not need to do pairwise matching with every descriptor in the reference images anymore; you have a set of visual words that are common across your repository of images, and you only need to match against them to get your answers. This makes the process feasible as well as effective: there is no pairwise matching required, and your visual words act as a proxy for the descriptors in your repository.

Now let's define this in a more formal way. Imagine that an image is represented by a vector z ∈ ℝ^K, where K is the size of the codebook, the number of visual words that you choose, which would be a parameter that the user sets for a particular dataset. Each element of z, say z_i, is given by z_i = w_i · n_i, where w_i is a fixed weight per visual word, which is optional (if you do not want weights, you can use uniform weights), and n_i is the number of occurrences of that particular visual word in the image. Given a set of N reference images in your repository, they are represented by a matrix Z ∈ ℝ^{K×N}: remember, there are K visual words, and for each of the N images you have a K-dimensional vector, so with N such images Z becomes a K × N matrix.
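Here is a minimal sketch of this offline codebook construction and of the encoding z_i = w_i · n_i, assuming k-means clustering via scikit-learn (the homework question at the end of the lecture hints at this connection to k-means); the shapes, the uniform weights, and the function names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_training_descriptors, k=100, seed=0):
    """Cluster local descriptors pooled from the training repository.

    all_training_descriptors: (num_descriptors, D) array, e.g. SIFT vectors
    Returns a fitted KMeans model; its cluster centres are the K visual words.
    """
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(all_training_descriptors)

def bow_vector(image_descriptors, codebook, weights=None):
    """Encode one image as z in R^K with z_i = w_i * n_i.

    image_descriptors: (P, D) descriptors of this image's keypoints
    weights: optional per-word weights w_i; uniform if None
    """
    k = codebook.n_clusters
    assignments = codebook.predict(image_descriptors)              # nearest visual word per descriptor
    counts = np.bincount(assignments, minlength=k).astype(float)   # n_i
    if weights is None:
        weights = np.ones(k)                                       # uniform w_i
    z = weights * counts
    return z / (np.linalg.norm(z) + 1e-12)                         # unit-normalise for cosine scoring
```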
For a query image, you simply compare the similarity between the query image's vector representation q, which counts the visual words in the query image, and the vector of visual word counts of each reference image. You take a dot product, or cosine similarity, between the two and get a set of scores; you sort the scores in descending order, and that gives you the best match, followed by the next best match, and so on. Note that when we take a cosine similarity between Z and q, what we are doing is a dot product: you take every column of Z, which corresponds to one specific reference image, and compute its individual dot product with the query vector q: the first reference image with the query, the second reference image with the query, and so on. Whichever has the highest dot product is the image you return as the closest match to the query image.

Just to point out a more general observation: taking a dot product is related to taking a Euclidean distance. The similarity you measure using a dot product, or cosine similarity, is complementary to the Euclidean distance. Why? Because ||z − q||² can also be written as 2(1 − zᵀq) when z and q are unit vectors; we will skip the derivation, work it out as homework. Which means that a high cosine similarity corresponds to a low Euclidean distance; in that sense, these two ways of comparing similarities or distances are complementary.
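A minimal sketch of this ranking step, assuming the columns of Z and the query q are already unit-normalised bag-of-words vectors as in the earlier sketch; variable names are placeholders. The last lines also check numerically that ||z − q||² = 2(1 − zᵀq) for unit vectors.

```python
import numpy as np

def rank_by_cosine(Z, q):
    """Rank reference images by cosine similarity to the query.

    Z: (K, N) matrix, one unit-normalised BoW column per reference image
    q: (K,)   unit-normalised BoW vector of the query image
    Returns reference indices sorted from best to worst match, and the scores.
    """
    scores = Z.T @ q                 # dot product of q with every column of Z
    order = np.argsort(-scores)      # descending similarity
    return order, scores[order]

# Sanity check of ||z - q||^2 = 2 (1 - z.q) for unit-norm z and q
rng = np.random.default_rng(0)
z = rng.random(100); z /= np.linalg.norm(z)
q = rng.random(100); q /= np.linalg.norm(q)
assert np.isclose(np.sum((z - q) ** 2), 2 * (1 - z @ q))
```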
An important observation here concerns what happens when K is much, much greater than P, where P is the number of features per image on average. Remember, K is the number of visual words in your codebook, and P is the average number of features in a given image; we saw with examples like SIFT that this could be several hundreds. If K is much greater than P, then Z and q are going to be sparse. Why? If you have a thousand visual words and, say, a hundred features in each image, then at least nine hundred entries for a given image will have a count of zero, because those visual words do not occur in that image. So if K ≫ P, both Z and q are going to be sparse.

To make the computation more efficient in this setting, rather than checking, for every word, whether it is present in a given image, which is what you would typically do even though each image is a very sparse vector over the possible set of words, we can invert the problem and check which images contain a particular word. We will see now that this can be a more efficient way to compute similarity in this scenario.

Let's see this a bit more intuitively. Imagine you have a query image that contains visual words 54, 67 and 72. Remember, you have a codebook of visual words, and every image has a certain number of occurrences of each of them. Let's assume there is exactly one occurrence each of visual words 54, 67 and 72 in the query image; 54, 67 and 72 are just arbitrary numbers chosen to explain the idea. Now suppose you have a repository of images that you want to retrieve from, and say they are indexed by a set of numbers in your dataset. You look at each of these images: say the 15th image in your repository contains the 72nd, the 67th and the 54th words, whereas the 13th image contains only the 67th word, the 17th image only the 72nd word, the 19th image the 67th and the 54th words, and so on.

Now what do you do? You take one of the visual words occurring in your query image and check which of your repository images also contain that particular visual word, and you add a count for each such image; similarly, the 19th image gets a count of one and the 21st gets a count of one. In other words, you are checking which repository images contain the same word that is present in your query image. You do the same for the 67th word and increase the counts of the images that contain it: the 21st image gets no increase, but the 13th image does. Similarly for the red visual word, the 72nd: this time the 15th image gets one more count, the 17th image gets a new count added, and the 22nd image gets a new count added. You can see from this that the 15th image has the highest count with respect to the query image, and you now rank all your repository images based on these counts over the query's visual words. You finally find that the 15th image is the best match, and the second best match is the 19th image, with two visual words in common. You can repeat this process to obtain the best matches from your repository of images. This is known as the image retrieval problem.
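The worked example above is what an inverted index (an inverted file, in retrieval terminology) computes. Below is a minimal sketch of that idea in plain Python; the toy postings and the per-occurrence increment of one are illustrative assumptions loosely matching the example, not the lecture's code.

```python
from collections import defaultdict

# Inverted index: visual word id -> list of repository image ids containing it.
# Toy postings roughly following the example above (words 54, 67, 72).
inverted_index = {
    54: [15, 19, 21],
    67: [13, 15, 19],
    72: [15, 17, 22],
}

def retrieve(query_words, index):
    """Score repository images by how many of the query's visual words they contain."""
    scores = defaultdict(int)
    for word in query_words:
        for image_id in index.get(word, []):   # only touch images that contain this word
            scores[image_id] += 1              # here every shared word counts as one
    # Rank images by descending score (ties kept in first-seen order)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(retrieve([54, 67, 72], inverted_index))
# -> [(15, 3), (19, 2), (21, 1), (13, 1), (17, 1), (22, 1)]
```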
Let's now ask how you would extend bag of words to classification. So far we have talked about retrieval, but now we want to classify a given image as belonging to a particular scene. For example, in the previous image we saw, that building may have a particular name; there are many monuments, and you want to classify an image as belonging to a particular monument. So you want to treat this as a classification problem. How do you adapt bag of words for classification? Once again, you represent an image by a vector z ∈ ℝ^K, where K is the number of visual words, very similar to how we did it for the retrieval problem. Once you do this, you can use a classifier: the histogram of visual word frequencies of every image now becomes its vector representation, and you would have a set of similar images in your training data, each with a class label associated with it, because classification is a supervised learning problem. So you could use a classifier such as, say, naive Bayes, a support vector machine, or any other classifier for that matter. If you are using naive Bayes, you would estimate the maximum posterior probability of a class c given an image z, assuming the features are independent, that is, assuming the presence or count of each visual word is independent; you would simply run a naive Bayes classifier in this scenario, and remember that naive Bayes becomes a linear classifier. You could equally use a support vector machine with an appropriate kernel function to perform the classification, or any other classifier.

An extension of bag of words is known as the vector of locally aggregated descriptors, or VLAD. It is very similar to bag of words, with a small but significant difference. Bag of words, as we just saw, gives you a scalar frequency: how many times each visual word, obtained by clustering many regions of your training or repository images, occurred in the query image. In some sense this gives you limited information: you have no geometry, and you do not keep the exact regions of the query image; you map them anyway to a common visual word, which is the average of similar regions in your repository. In VLAD, instead, you again have visual words, and all of that remains the same, but now, instead of a scalar frequency, you have a vector per visual word, which records how far each of the features in your query image is from the visual word. You have a visual word, which you obtained by clustering groups of features in your training images; now, given a new feature in your query image, you look at how far it is from the visual word and get a residual vector. Similarly, you take another feature that is mapped to the same visual word and see how far it is from that visual word. You add up all the residuals and get one residual vector, the sum of the residuals of all the features that got mapped to that visual word, and that vector corresponds to this visual word. So it is no longer a scalar frequency; it is a vector describing how far the features mapped to this visual word are from it. This gives you a little more information, which can be more effective.
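Below is a minimal sketch of this residual aggregation, reusing the hypothetical k-means codebook object from the earlier bag-of-words sketch; the shapes, names, and the choice of a simple ℓ2 normalisation are assumptions, and refinements used in practice are omitted.

```python
import numpy as np

def vlad_vector(image_descriptors, codebook):
    """Aggregate residuals of descriptors around their nearest visual words.

    image_descriptors: (P, D) local descriptors of one image (e.g. 128-dim SIFT)
    codebook: fitted k-means model with .cluster_centers_ of shape (K, D)
    Returns a flattened (K * D,) VLAD descriptor, unit-normalised.
    """
    centers = codebook.cluster_centers_                  # the K visual words
    k, d = centers.shape
    assignments = codebook.predict(image_descriptors)    # nearest word per descriptor

    vlad = np.zeros((k, d))
    for i, x in zip(assignments, image_descriptors):
        vlad[i] += x - centers[i]                        # accumulate the residual x - c_i

    vlad = vlad.reshape(-1)                              # one D-dim block per visual word
    return vlad / (np.linalg.norm(vlad) + 1e-12)         # l2-normalise to unit norm
```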
Let's see this a bit more concretely. With a bag of words representation: given a colour image, a three-channel RGB input, you first convert it to grayscale (you could adapt bag of words to colour too; we are taking a simpler example here). Once you have that, you get a set of, say, a thousand features: for each of those thousand keypoints you compute a 128-dimensional SIFT descriptor. Then you encode each of them against, say, a hundred visual words: you take each of the thousand features, map it to one of the K = 100 visual words, and count the occurrences with respect to each of those hundred visual words, giving you a set of frequencies. Then you do a global pooling and an ℓ2 normalization of that final histogram vector, and you get your bag of words representation.

With VLAD, you do the same thing: you convert the three-channel RGB input to a one-channel grayscale image, you again get a thousand features in your query image and compute a 128-dimensional representation for each of them, and you again assign each of those thousand features to one of the K visual words. But this is where the difference comes in: you are not going to get a scalar representation, a histogram; you are going to get a residual vector for each of those K visual words. This means the representation is now 128 × K dimensional rather than K-dimensional, because for each visual word, for all the keypoints that got mapped to it, you compute the residual along each of those 128 dimensions. Finally, you get a 128 × K representation, which you ℓ2-normalize and use in practice; to ℓ2-normalize is simply to ensure that the vector has unit norm. And that is the VLAD descriptor.

To conclude this lecture, please read chapters 14.3 and 14.4 of Szeliski's book. Something for you to think about at the end of this lecture: how is bag of words connected to the k-means clustering algorithm? We already spoke about it briefly during the lecture, but think about it and answer it more carefully. Assuming you have understood how bag of words is connected to k-means clustering, how are extensions of k-means clustering, such as hierarchical k-means or approximate k-means, relevant for the bag of words problem? Your hint is to look up what are known as vocabulary trees. And here are some references for this lecture.