Last lecture, we spoke about describing images using the bag-of-words approach and the VLAD approach. We will now take this forward and show how these descriptors can be used for matching between images. Before we go there, once again, an acknowledgement that these slides are taken from the excellent lectures of Prof. Avrithis at Inria Rennes.

We also left behind one question from last time: since bag of words is inherently dependent on k-means to define its cluster centers, can we consider extensions of k-means to improve how bag of words performs? One specific example could be hierarchical k-means, an extension of the k-means clustering algorithm where the cluster centers are organized in a hierarchical manner, starting from a root node all the way down to a set of leaf nodes. It happens that this was explored for bag of words using a method known as the vocabulary tree, way back in 2006. What that method does is take hierarchical k-means and build a hierarchical partition tree; each image descriptor then descends from the root to one of the leaves, level by level.

Recall that in bag of words, when you build your cluster centers, you pool all the images in your dataset together, take all the features and the descriptors around those features, cluster them using a method such as k-means, and then take those clusters as what are known as codebook entries, to which each keypoint is assigned. In the vocabulary tree, the image representation has elements x_i = w_i n_i, where w_i is the weight of node i in the tree and n_i is the number of keypoints assigned to that particular cluster centroid in the tree.

One evident issue here is that it is difficult to know how w_i should be assigned. One could argue that nodes further down the tree must have a higher weight because they represent a finer, better match; one could also argue otherwise, since depending on the setting, a match at a higher level of the tree perhaps matters more than a single leaf-level match. It depends on the application, and on what is important in a particular context. So a constraint of this method is that there is no principled way of defining w_i, although you could come up with heuristics on top of the method. The dataset is again searched using an inverted file index. And fundamentally, there is an issue here, which is that distortion is minimized only locally: any error you make, or any difference between images, is only local with respect to a particular cluster centroid. Distortion is not measured in a global sense; it is all with respect to each cluster centroid, comparing the occurrences of each cluster centroid across the two images being matched.

That is how k-means can be extended for bag of words. One could also consider other extensions of k-means and use them in such methods to describe images. A small sketch of the vocabulary-tree idea follows below.
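To make the vocabulary-tree idea concrete, here is a minimal sketch, assuming a recursive hierarchical k-means with a hypothetical branching factor and depth; the `VocabTreeNode` class, scikit-learn's `KMeans`, and the uniform node weights are illustrative choices, not the exact details of the original 2006 method.

```python
import numpy as np
from sklearn.cluster import KMeans

class VocabTreeNode:
    def __init__(self, center):
        self.center = center    # centroid of this node
        self.children = []      # child nodes; empty for a leaf
        self.weight = 1.0       # w_i: no principled choice, as noted above

def build_tree(descriptors, branching=4, depth=3):
    """Recursively partition descriptors with k-means to form the tree."""
    node = VocabTreeNode(descriptors.mean(axis=0))
    if depth == 0 or len(descriptors) < branching:
        return node
    km = KMeans(n_clusters=branching, n_init=5).fit(descriptors)
    for c in range(branching):
        child = build_tree(descriptors[km.labels_ == c], branching, depth - 1)
        child.center = km.cluster_centers_[c]
        node.children.append(child)
    return node

def descend(node, d):
    """Route one descriptor from root to leaf, returning the visited nodes."""
    path = [node]
    while node.children:
        node = min(node.children, key=lambda ch: np.linalg.norm(d - ch.center))
        path.append(node)
    return path
```

The image representation would then count, for each node i, how many of the image's descriptor paths pass through it (that is n_i), weighted by w_i.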
Let us now talk about how to match descriptors from two different images in a more principled way. Before we go there, let us assess what we know so far. One of the simplest methods we have explored is nearest neighbor matching, where you have one image with a set of feature points and another image with another set of feature points. At this point, we are not talking about any bag-of-words aggregation; it is simply a set of features in one image and a set of features in the other image. We take each feature in the first image, with its corresponding descriptor, and independently index into a feature from the second image: we simply do a nearest neighbor match based on the descriptors of the features in the two images.

Can you think of what could be a limitation of such an approach? An inherent limitation is that you could be ignoring useful information about co-occurrences. It is possible that one image contains multiple instances of the same feature; think of, say, the spots on a leopard, or the stripes on a zebra, and so on. You may then map a single feature in one image to multiple different features in the other image, because they all look similar. For example, one spot on a leopard in one image could get mapped to multiple spots on the leopard in a second image. Ideally, you would not want this to happen: you would want each spot on the leopard to get matched to one spot on the leopard in the other image, another spot to yet another spot, and so on. The example you see on the slide here is a giant panda, which again has features that repeat in its structure and could get mapped to the same instance in the second image, which is what you see illustrated here. One case is the ideal setting, where each feature gets mapped to an independent feature in the second image; the other is a scenario where two different instances of the same feature get mapped to the same feature in the second image, which is not desirable.

We have also seen the bag-of-words matching approach so far. Do you see any glaring limitation of that scheme? An evident limitation is that the bag-of-words approach is limited to full-image matching, not partial matching, because you are going to compare the histogram of occurrences of each cluster centroid in image one against such a histogram for image two. You would say two images match only when the complete histograms are fairly close to each other. If only a part of the second image matches the first image, these histograms will not match, and you will not get a good match under this scheme. In other words, you could say bag of words is an all-to-all matching, but we really want some sense of a one-to-one matching, of one part of an image matching another part, so that we can still match images that are only partially close to one another. Nearest neighbor matching was a one-to-one matching approach, but it has its own limitations; the sketch below illustrates its many-to-one failure mode.
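Here is a minimal sketch of one-directional nearest neighbor matching; the descriptor arrays, dimensions, and noise level are hypothetical placeholders chosen to reproduce the repeated-feature problem just described.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_match(query_desc, target_desc):
    """Match each query descriptor to its nearest neighbor in the target set."""
    tree = cKDTree(target_desc)
    dists, idx = tree.query(query_desc, k=1)
    return idx, dists

rng = np.random.default_rng(0)
desc1 = rng.normal(size=(5, 128))    # five SIFT-like descriptors in image 1
# Image 2 contains three near-copies of image 1's first feature (think
# repeated spots or stripes) plus two unrelated features.
desc2 = np.vstack([desc1[0] + 0.01 * rng.normal(size=(3, 128)),
                   rng.normal(size=(2, 128))])
idx, _ = nn_match(desc2, desc1)
print(idx[:3])   # all three repeated features collapse onto feature 0
```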
Now let us try to generalize this descriptor matching using an idea that you would have seen in machine learning: kernels. You may have heard about kernels in support vector machines; we are going to use a similar idea here to generalize matching between descriptors. Recall that in support vector machines, or for that matter in any other machine learning algorithm where kernels can be used, a kernel gives a sense of similarity between two data points. It is the same principle that is being used here.

To do that, let us first define the setting. You have an image described by n descriptors: X = {x_1, ..., x_n}, where each x_i is a D-dimensional vector. These could be individual features, or, as in the bag-of-words example, the descriptors are typically quantized using k-means clustering (or, for that matter, any other clustering method). Quantized means you do not simply take all the features in an image as they are; you see which cluster center each belongs to, and use that cluster center as the representation of that particular feature. The quantization function is q: R^D → C ⊂ R^D, where C is what is called the codebook, given by C = {c_1, ..., c_k}: the k possible cluster centers, which, as I just mentioned, you get by doing k-means clustering on the features from all images. For a single image, the quantizer q takes every feature to the cluster centroid nearest to it.

That is the setting. Now let us define the kernel. Given two images X and Y, the matching kernel K(X, Y) is given by

K(X, Y) = γ(X) γ(Y) Σ_{c ∈ C} M(X_c, Y_c)

where γ(·) are, as we will see soon, normalization functions (we will see why we need them in a moment), the sum runs over all the cluster centers in the codebook, and X_c and Y_c denote the descriptors of X and Y assigned to codeword c.
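Here is a minimal sketch of this framework, assuming a brute-force quantizer, a 1/√n normalization for γ, and the bag-of-words counting M previewed below; all of these concrete choices are illustrative, not the lecture's definitive ones.

```python
import numpy as np

def quantize(desc, codebook):
    """q: assign each descriptor to the index of its nearest codeword."""
    d2 = ((desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def match_kernel(X, Y, codebook, M):
    """K(X, Y) = gamma(X) gamma(Y) * sum over codewords c of M(X_c, Y_c)."""
    qx, qy = quantize(X, codebook), quantize(Y, codebook)
    gamma = lambda Z: 1.0 / np.sqrt(len(Z))   # one possible normalization
    total = 0.0
    for c in range(len(codebook)):
        Xc, Yc = X[qx == c], Y[qy == c]
        if len(Xc) and len(Yc):
            total += M(Xc, Yc)
    return gamma(X) * gamma(Y) * total

# Bag-of-words M (see below): the product of occurrence counts per codeword.
bow_M = lambda Xc, Yc: len(Xc) * len(Yc)
```

Plugging different M functions into this skeleton (counts, Hamming embedding, VLAD residuals) recovers the different matching schemes discussed next.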

Here, M is the matching kernel function we are talking about, applied per codeword: it operates on the occurrences of each cluster centroid in one image and the occurrences of that cluster centroid in the second image. We still have to define M; we will see a few examples of that as we go forward.

The γ(X) and γ(Y) are required here because you do not want to be biased by the number of features present in a given image. For example, it is possible that you detect, say, a thousand features in one image and just a hundred in the other. If you simply did the summation, you would always be biased toward an image that has a lot of features, because the count would go up, and that may not really reflect a good match. So γ(X) and γ(Y) are normalization functions; for instance, you can divide by the total number of features in the image, so that the matching is not biased simply by the number of features in an image.

Now let us see a few examples of how M would look for the methods we have seen so far. If you are doing bag-of-words matching, which is in a way a cosine similarity between the cluster-center histograms of the two images, the similarity we have seen so far can be defined by a matching kernel given by

M(X_c, Y_c) = |X_c| · |Y_c|

You take c, which is one of your codebook elements or cluster centroids, count how many occurrences of it there are in image X and how many in image Y, and multiply them. For one particular codebook entry, say c_3, if you had ten features in image X that corresponded to c_3 and three such features in image Y, the corresponding matching kernel value is simply ten into three, which becomes thirty. Put back into the double summation, this is simply the bag-of-words model from a kernel perspective; and remember, the normalization functions, as I said, take care of normalizing by the total number of features in the image. That is all that M is defined as here.

You could extend this to another approach known as Hamming embedding for matching. Assume that each descriptor can be binarized in some way; for example, you could take a descriptor and simply say that any dimension greater than a threshold is one and anything less than the threshold is zero. You could binarize any descriptor that way. Then you compute your matching kernel as

M(X_c, Y_c) = Σ_{x ∈ X_c} Σ_{y ∈ Y_c} 1[h(b_x, b_y) ≤ τ]

It is again similar to your bag-of-words kernel; the only difference is that you now count only the pairs whose Hamming distance h(b_x, b_y) between the binary vectors b_x and b_y is less than a threshold. Here h is the Hamming distance between the two binary vectors, and τ is a threshold that you have to specify to declare a match in this setting. A small sketch of this matching function follows below.
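Here is a minimal sketch of the Hamming-embedding matching function M just defined; the per-dimension median binarization and the τ value are illustrative assumptions.

```python
import numpy as np

def binarize(desc, thresholds):
    """b_x: one bit per dimension, 1 where the descriptor exceeds the threshold."""
    return (desc > thresholds).astype(np.uint8)

def hamming_M(Xc, Yc, thresholds, tau):
    """Count descriptor pairs whose binary codes lie within Hamming distance tau."""
    bx, by = binarize(Xc, thresholds), binarize(Yc, thresholds)
    h = (bx[:, None, :] != by[None, :, :]).sum(-1)   # h(b_x, b_y) for all pairs
    return int((h <= tau).sum())

# Usage: thresholds could be per-dimension medians estimated on training data.
rng = np.random.default_rng(1)
thresholds = np.median(rng.normal(size=(1000, 128)), axis=0)
Xc, Yc = rng.normal(size=(4, 128)), rng.normal(size=(6, 128))
print(hamming_M(Xc, Yc, thresholds, tau=48))
```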
You can also define VLAD matching in the same framework. Remember again that VLAD is similar to bag of words; the only difference is that you do not count how many features belong to a codebook element or cluster center. Rather, you take all the features that belong to a codebook element and accumulate their residual vectors. Recall that in VLAD you have cluster centers; for example, you could have two cluster centers. You have one set of features that are closest to the first cluster center, and another set of features closest to the second. For each feature, you take the difference between it and its cluster center, which is going to be a residual vector, and you add up all the residual vectors that belong to one particular cluster center. That sum becomes the representation for that cluster center, and similarly for the other cluster centers.

Now, if this is the representation, how do you do the matching kernel? The entire representation of image X is going to be a vector formed by the representations corresponding to each of your codebook elements, and the matching kernel is now given by

M(X_c, Y_c) = V(X_c)ᵀ V(Y_c)

which is an inner product between the representation for the c-th codebook entry in X and the c-th codebook entry in Y. This can be expanded, because V(X_c) is a summation of the residuals over all the x's that belong to that particular codebook entry; similarly for V(Y_c). Expanding both, the matching kernel becomes a summation over all the features of X that belong to that codebook entry, and a summation over Y for the same codebook entry; inside, you have an inner product of residuals, measuring how the residuals are aligned with each other.

To understand the intuition here: suppose you have a cluster center, say c_3, in image X, and the same cluster center c_3 in image Y. Let us say three features in image X belong to this cluster center, and similarly three features in image Y. You take each of these residuals and see how well it matches each residual on the other side. If the residuals match, you are going to get a high matching score. In other words, if the way the features are configured around the cluster center matches between the two images, it is a better match.

A more general way of combining all of these ideas is what is known as the Aggregated Selective Match Kernel, or ASMK, which combines a few ideas that you have seen so far: the non-linear selectivity function, which is in spirit like Hamming embedding (we will see that in a moment), and ideas from VLAD. The way this method works is that you take the VLAD vector, which is what you see as the argument inside, and you normalize it, which is what is denoted V̂. As you see at the bottom of the slide,

V̂(X_c) = V(X_c) / ||V(X_c)||

the point being that we are turning the VLAD vector, which is the sum of all the residuals, into a unit vector. You now take an inner product between the two, very similar to what we saw on the previous slide; because this is an inner product, its output u is going to be a scalar. Now you use a non-linear selectivity function σ_α on this scalar to get your final matching function. What is σ_α? For any input u,

σ_α(u) = sign(u) |u|^α  if u > τ,  and 0 otherwise.

The inner product is a measure of similarity: the higher the value, the better. So if u is greater than the threshold, you may want to weight it a bit more, which you can control using α. A sketch of this matching function follows below.
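Here is a minimal sketch of the ASMK per-codeword matching described above; the α and τ defaults are illustrative assumptions.

```python
import numpy as np

def vlad_residual(Xc, c):
    """V(X_c): sum of residuals of the descriptors assigned to codeword c."""
    return (Xc - c).sum(axis=0)

def asmk_M(Xc, Yc, c, alpha=3.0, tau=0.0):
    """sigma_alpha applied to the inner product of unit-normalized residual sums."""
    vx = vlad_residual(Xc, c)
    vy = vlad_residual(Yc, c)
    vx /= np.linalg.norm(vx) + 1e-12          # V-hat(X_c)
    vy /= np.linalg.norm(vy) + 1e-12          # V-hat(Y_c)
    u = float(vx @ vy)                        # scalar similarity in [-1, 1]
    # Selectivity: suppress weak matches, boost strong ones via the exponent.
    return np.sign(u) * abs(u) ** alpha if u > tau else 0.0
```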
And if u is less than the threshold, it is very similar to the Hamming-distance rule: there, when the Hamming distance was less than a threshold, we counted the pair, and if not, we did not count it. This is a very similar idea, although we use VLAD vectors to achieve the same goal. You can also see that if α = 1, this is just u itself, without any change, since sign(u) times the absolute value of u to the power one is u.

Let us see a few illustrations of this idea, for different choices of α and τ, to make it clearer. On the top left, α = 1, which, as we just said, gives u itself: simply the dot product of the normalized VLAD vectors, and τ in this case is zero, so anything greater than zero is kept with u itself as the score. In these figures, yellow corresponds to near-zero similarity and red corresponds to maximum similarity between the image pair. There are a few features that do not match at all, and a few that match very well, all shown in the top-left image. On the top right, again α = 1, but τ = 0.25. You can see now that a lot of the yellows have come down: we are saying the score must be at least 0.25 for us to consider it a match, and many of the false matches disappear because any low score is now disregarded.

The bottom row shows examples where α = 3. Remember that because you are normalizing your VLAD vectors, u is going to be a value lying between zero and one for matching pairs, so raising it to a high exponent α makes the values even smaller. That is what you see on the bottom left, where some of the lines have disappeared because they have gone closer to zero. On the bottom right, with α = 3 and τ = 0.25, a few more of the weak yellow correspondences are suppressed, because the exponent α = 3 reduces values that lie between zero and one. So you can see that larger selectivity removes false correspondences, which is what you see with τ = 0.25 in both images on the right. This entire approach replaces the hard thresholding we had in Hamming embedding and gives a smoother way of achieving a similar effect.

Here is another illustration of the results after applying the ASMK method. Each of the colors in these different images corresponds to the same visual word: the green, the yellow, the blue are the same visual word occurring in different images. If you take any particular example, say the pink or the red, you will see that the pink or red visual word corresponds to some corner or some pointed structure in each of these images.

All of these kernels can be generalized into efficient match kernels, where you define the similarity as some continuous function κ(x, y) between feature x in one image and feature y in the other, without using any codebooks at all. Recall
that building codebooks can be computationally intensive, and you then have to compute the residuals and so on. Instead, you could directly impose a kernel function between the individual features in one image and the individual features in the other image, once again very similar to how kernel functions are used in support vector machines or other machine learning algorithms that have a kernel component to them. Ideally, you would want this κ(x, y) to be decomposable into an inner product φ(x)ᵀ φ(y), where φ is a representation of each feature in a different space. Then you would define the image embedding

Φ(X) = γ(X) Σ_{x ∈ X} φ(x)

some normalization of X times the summation of the representations of all your features, and matching two images reduces to the inner product Φ(X)ᵀ Φ(Y). A small sketch of this idea follows below.
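Here is a minimal sketch of the efficient-match-kernel idea, assuming random Fourier features as the map φ approximating a Gaussian κ; that particular choice of φ, and the dimensions used, are illustrative assumptions, not the lecture's definitive construction.

```python
import numpy as np

def phi(desc, W, b):
    """Random Fourier feature map: phi(x) approximating a Gaussian kappa(x, y)."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(desc @ W + b)

def image_embedding(X, W, b):
    """Phi(X) = gamma(X) * sum over x of phi(x), with gamma(X) = 1/|X|."""
    return phi(X, W, b).sum(axis=0) / len(X)

rng = np.random.default_rng(2)
D, M_feat = 128, 512                     # descriptor dim, feature-map dim
W = rng.normal(size=(D, M_feat))         # kernel bandwidth folded into W's scale
b = rng.uniform(0.0, 2.0 * np.pi, size=M_feat)
X, Y = rng.normal(size=(40, D)), rng.normal(size=(60, D))
score = image_embedding(X, W, b) @ image_embedding(Y, W, b)   # K(X, Y)
print(float(score))
```

Once Φ(X) is precomputed per image, matching any pair of images costs a single inner product, which is where the efficiency in the name comes from.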